A fuzzy logic C++ library
slifis::DATA_SET Class Reference

This class is intended to hold all the loaded data points for further processing.

#include <data_set.hpp>


Public Member Functions

 DATA_SET ()
 Constructor.
size_t GetNbPoints () const
void Clear ()
 Clears the set.
void Print (FILE *f, const char *msg=0, bool PrintRaw=false) const
 Prints data set in file.
void ReadData (DATAFILE_INFO &dfi)
 Reads the whole dataset from file described by dfi, throws an error on failure.
const VALUE_PTR GetOutValue (size_t sample_idx) const
 Returns the scalar output value of sample sample_idx (starting from 0)
void GetInputValues (size_t sample_idx, std::vector< double > &values) const
 Returns a vector of all the *numerical* input values, for sample sample_idx (starting from 0)
size_t GetOutputIndex () const
size_t GetNbInputFields () const
 Returns nb of input fields.
size_t GetNbFields () const
 Returns the number of fields of the dataset.
EN_DATA_FIELD_TYPE GetFieldType (size_t i) const
 Returns field type.
void GetFieldTypeIndexes (EN_DATA_FIELD_TYPE ft, std::vector< size_t > &v) const
 Returns in v the indexes of fields that are of type ft.
Getting statistical information about set
const DATASET_PROPERTIES & GetProperties () const
 Returns properties of the dataset.
double GetInMMValue (EN_MinMaxValue mm, size_t i) const
 Returns min/max input value of data set, index is NOT related to original column in file.
double GetOutMMValue (EN_MinMaxValue mm) const
 Returns min/max value of all the output values of the dataset.
Retrieving and adding points
const DATA_POINT & GetDataPoint (size_t idx) const
DATA_POINT & GetDataPoint (size_t idx)
void AddDataPoint (const DATA_POINT &dpt)
 Add data point to the data set.
Description related functions
void AssignDescription (const DATA_DESCR &)
 Checks that the description descr is consistent with the dataset. If so, assigns the description to the dataset.
DATA_DESCR GetDescription () const
String-data related functions
const std::string & GetStringValue (size_t col_idx, size_t string_index) const
 Returns string value in case of relational string-handling, col_idx is the column index.
size_t GetStringCount (size_t col_idx, size_t string_elem) const
 Returns string counter value in case of relational string-handling, col_idx is the column index.
size_t GetNbClasses (size_t col_idx) const
 Returns the number of different values.
size_t AddStringItem (size_t pointfield_idx, size_t stringfield_idx, const std::string &str_value)
 Adds the string item str_value to the repository of string attributes, and returns the index on it.
Producing subsets
void GetSubset (const std::vector< size_t > &v_idx, const INPUT_SETS &inputsets, DATA_SET &subset) const
 Computes a subset of the dataset that contains only the data points lying in the support (fuzzy support) of the input membership functions defined by vector v_idx.
void GetSubset (const std::vector< size_t > &v_idx, const INPUT_SETS &inputsets, double threshold, DATA_SET &subset) const
 Computes a subset of the dataset that contains only the data points lying in the support (fuzzy support) of the input membership functions defined by vector v_idx.
void DivideSet (DATA_SET &subset_A, DATA_SET &subset_B, size_t IntervalIdx, size_t NbIntervals=5) const
 Separates the points of the set into two subsets: subset_A and subset_B, according to IntervalIdx and NbIntervals.

Private Member Functions

void p_ComputeProperties () const
void p_Init ()

Private Attributes

std::vector< std::vector< PAIR_STRING_COUNT > > _vv_StringData
 Will hold the string items of the dataset.
std::vector< int > _v_StringIndexes
 Holds the indexes of the strings related to the columns.
bool _HasAssignedDescription
 true: means that a description has been assigned to the dataset (i.e. it is not a "generic" description)
DATA_DESCR _data_descr
std::vector< DATA_POINT > _v_datapoint
 vector of values
bool _props_are_computed
DATASET_PROPERTIES _properties

Detailed Description

This class is intended to hold all the loaded data points for further processing.

It is mainly useful in the learning step for Takagi-Sugeno FIS, where it can be used to get a subset of the points that matches some input interval.

The data is a std::vector of output values, each associated with a vector of input values whose size equals the number of inputs of the FIS.

It can be filled by reading a file in CSV or ARFF (Weka) format.

Please note that lines of data have a maximum length of BUF_SIZE, defined in helper_functions.hpp.

For string attributes, the values are NOT stored in the DATA_POINT object, but in a separate vector of vectors, that holds the string values in a relational way: the data point only holds indexes on this vector.
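
The relational storage described above can be illustrated with a minimal, standalone sketch. It does not use the library: StringRepository, StringCount and Add() are hypothetical names chosen for the example, and the linear search is only one possible way to detect an already known string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the library's PAIR_STRING_COUNT: (value, occurrences).
using StringCount = std::pair<std::string, std::size_t>;

// One repository of distinct strings per string column (assumption for this sketch).
struct StringRepository {
    std::vector<std::vector<StringCount>> per_column;

    explicit StringRepository(std::size_t nb_string_columns)
        : per_column(nb_string_columns) {}

    // Stores 'value' for string column 'col', or increments its counter if it
    // is already known, and returns the index a data point would hold instead
    // of the string itself.
    std::size_t Add(std::size_t col, const std::string& value) {
        assert(col < per_column.size());
        std::vector<StringCount>& items = per_column[col];
        for (std::size_t i = 0; i < items.size(); ++i) {
            if (items[i].first == value) {
                ++items[i].second;
                return i;
            }
        }
        items.emplace_back(value, 1);
        return items.size() - 1;
    }
};
```

With such a layout, a data point carries only a small integer per string field, and each distinct string is stored once together with the number of times it was seen.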

See also Dataset handling

Related classes:


Constructor & Destructor Documentation

slifis::DATA_SET::DATA_SET ( )

Constructor.

References p_Init().


Member Function Documentation

size_t slifis::DATA_SET::GetNbPoints ( ) const [inline]

void slifis::DATA_SET::Clear ( )

Clears the set.

Referenced by DivideSet(), and GetSubset().

void slifis::DATA_SET::Print ( FILE *  f,
const char *  msg = 0,
bool  PrintRaw = false 
) const

Prints data set in file.

Recall that string fields are stored in a relational way, so a data point actually holds indexes. Even so, DATA_POINT::GetValue() returns the value itself, which is of type DT_NUMERIC or DT_STRING, unless it is called with 'true' as its second argument.

References __IN__, __OUT__, slifis::DT_NUMERIC, slifis::DT_STRING, slifis::DT_STRING_INDEX, slifis::ERR_DATA_BAD_TYPE, slifis::DATA_POINT::GetPointId(), slifis::DATA_POINT::GetValue(), and SLIFIS_ERROR_1.

Referenced by main().

void slifis::DATA_SET::ReadData ( DATAFILE_INFO &  dfi)

Reads the whole dataset from the file described by dfi, throws an error on failure.

If the dfi argument has been assigned a description, then

  • only the requested fields will be copied into the dataset
  • this description will be copied into the dataset and adjusted

Otherwise, a generic description will be generated for this dataset.

Medium Priority Todo:
the following statement is incorrect: the data file might have 10 string fields, but maybe we want only 2 of them

References __IN__, __OUT__, slifis::DATAFILE_INFO::_NbDataPts, slifis::DATAFILE_INFO::CloseFile(), slifis::DATA_DESCR::ComputeIndexesAfterLoading(), slifis::DFT_ARFF, slifis::DFT_CSV, slifis::ERR_IO_ERROR, slifis::DATAFILE_INFO::FileIsGood(), slifis::DATAFILE_INFO::GetDescription(), slifis::DATAFILE_INFO::GetFileType(), slifis::DATAFILE_INFO::GetNbStringFields(), slifis::DATAFILE_INFO::GetTotNbFields(), slifis::DATAFILE_INFO::HasDescription(), slifis::DATAFILE_INFO::IsSet(), slifis::DATAFILE_INFO::OpenFile(), slifis::DATA_POINT::ReadDataFields(), slifis::DATA_DESCR::SetOutputColumn(), SLIFIS_ERROR, SLIFIS_ERROR_1, SLIFIS_ERROR_2, SLIFIS_ERROR_LOG, slifis::ST_DATALINE, and slifis::ST_FAILURE.

Referenced by main().

const DATASET_PROPERTIES & slifis::DATA_SET::GetProperties ( ) const

Returns properties of the dataset.

References __IN__, and __OUT__.

Referenced by main(), and process_numeric().

double slifis::DATA_SET::GetInMMValue ( EN_MinMaxValue  mm,
size_t  i 
) const

Returns min/max input value of data set, index is NOT related to original column in file.

References __IN__, slifis::ERR_DATA_BAD_INDEX, slifis::MM_Max, slifis::MM_Min, and SLIFIS_ERROR_2.

Referenced by main().

double slifis::DATA_SET::GetOutMMValue ( EN_MinMaxValue  mm) const

Returns min/max value of all the output values of the dataset.

References slifis::MM_Max, and slifis::MM_Min.

const VALUE_PTR slifis::DATA_SET::GetOutValue ( size_t  sample_idx) const

Returns the scalar output value of sample sample_idx (starting from 0)

Data set must have a description

References __IN__, __OUT__, slifis::ERR_DATA_BAD_INDEX, slifis::DATA_POINT::GetValue(), SLIFIS_ERROR_2, and VECTOR_ELEM.

Referenced by slifis::SLIFIS::BuildTSRulesFromValues(), slifis::RULE_IDX::ComputeTSError(), and main().

void slifis::DATA_SET::GetInputValues ( size_t  sample_idx,
std::vector< double > &  values 
) const

Returns a vector of all the *numerical* input values, for sample sample_idx (starting from 0)

Warning:
At present, if an input field is not numerical, then its values will be 0

References __IN__, __OUT__, slifis::ERR_DATA_BAD_INDEX, slifis::DATA_POINT::GetValue(), and SLIFIS_ERROR_2.

Referenced by slifis::SLIFIS::BuildTSRulesFromValues(), and main().

size_t slifis::DATA_SET::GetOutputIndex ( ) const [inline]

References _data_descr, and slifis::DATA_DESCR::GetOutputIndex().

Referenced by main().

size_t slifis::DATA_SET::GetNbInputFields ( ) const [inline]

Returns nb of input fields.

Low Priority Todo:
What should happen here if there is no data loaded ?

References _data_descr, and slifis::DATA_DESCR::GetNbInputs().

Referenced by slifis::RULE_IDX::ComputeTSError().

size_t slifis::DATA_SET::GetNbFields ( ) const [inline]

Returns the number of fields of the dataset.

If the dataset has a description but no data loaded, it returns the expected number of fields, as specified by the description.

References __IN__, __OUT__, _data_descr, _HasAssignedDescription, _v_datapoint, slifis::DATA_DESCR::GetNbInputs(), and GetNbPoints().

Referenced by slifis::SLIFIS::BuildRuleBaseFromData(), main(), and slifis::DATASET_PROPERTIES::P_ComputeProps().

EN_DATA_FIELD_TYPE slifis::DATA_SET::GetFieldType ( size_t  idx) const [inline]

Returns field type.

Warning:
Points must have been read, as this information is retrieved from a point, and is not stored in class DATA_SET

References __IN__, __OUT__, slifis::ERR_DATA_NO_POINTS, GetDataPoint(), slifis::DATA_POINT::GetDataType(), GetNbPoints(), and SLIFIS_ERROR.

Referenced by main(), and slifis::DATASET_PROPERTIES::P_ComputeProps().

void slifis::DATA_SET::GetFieldTypeIndexes ( EN_DATA_FIELD_TYPE  ft,
std::vector< size_t > &  v 
) const

Returns in v the indexes of fields that are of type ft.

References __IN__, and __OUT__.

const DATA_POINT & slifis::DATA_SET::GetDataPoint ( size_t  idx) const [inline]
DATA_POINT & slifis::DATA_SET::GetDataPoint ( size_t  idx) [inline]
void slifis::DATA_SET::AddDataPoint ( const DATA_POINT &  dpt) [inline]

Add data point to the data set.

References _v_datapoint.

Referenced by DivideSet(), GetSubset(), and main().

void slifis::DATA_SET::AssignDescription ( const DATA_DESCR &  descr)

Checks that the description descr is consistent with the dataset. If so, assigns the description to the dataset.

Otherwise, it throws an error.

References __IN__, slifis::ERR_DATA_BAD_TYPE, slifis::DATA_DESCR::GetHighestIndex(), slifis::DATA_DESCR::P_CheckOutputNotInInputs(), SLIFIS_ERROR_1, and SLIFIS_ERROR_2.

Referenced by GetSubset(), and main().

const std::string & slifis::DATA_SET::GetStringValue ( size_t  col_idx,
size_t  string_index 
) const

Returns string value in case of relational string-handling, col_idx is the column index.

References __IN__, and __OUT__.

Referenced by main().

size_t slifis::DATA_SET::GetStringCount ( size_t  col_idx,
size_t  string_elem 
) const

Returns string counter value in case of relational string-handling, col_idx is the column index.

References __IN__, __OUT__, slifis::ERR_DATA_BAD_INDEX, and SLIFIS_ERROR_2.

Referenced by main().

size_t slifis::DATA_SET::GetNbClasses ( size_t  col_idx) const

Returns the number of different values.

Referenced by main().

size_t slifis::DATA_SET::AddStringItem ( size_t  pointfield_idx,
size_t  stringfield_idx,
const std::string &  str_value 
)

Adds the string item str_value to the repository of string attributes, and returns the index on it.

References __IN__, __OUT__, slifis::ERR_DATA_BAD_INDEX, and SLIFIS_ERROR_2.

void slifis::DATA_SET::GetSubset ( const std::vector< size_t > &  v_idx,
const INPUT_SETS &  inputsets,
DATA_SET &  subset 
) const

Computes a subset of the dataset that contains only the data points lying in the support (fuzzy support) of the input membership functions defined by vector v_idx.

Say we have a FIS with 3 inputs, and we request as vector v_idx the values 1,0,2. This means we request the values that will be inside MF(1) for first input, inside MF(0) for second input, and MF(2) for third input.

Motivation: this function is required for learning from data with a TS Fis type. It is needed to compute the TS coefficients value from only the subset of points that matches the requirements (expressed by the membership functions). See SLIFIS::BuildTSRulesFromValues()

The support is defined by having a fuzzy value higher than 0

See also twin function void GetSubset( const std::vector<size_t>& v_idx, const INPUT_SETS& inputsets, double threshold, DATA_SET& subset ) const; (using a different algorithm)

Bug:
(or feature...) This function considers only points that are higher than the lowest value of the membership functions or lower than the highest value of the membership functions. So if the MF is 0 over part of its range, points falling there are still considered valid and thus added to the subset.
Medium Priority Todo:
Needs to be tested!
Parameters:
v_idx      requested input vector combination
inputsets  input sets
subset     output dataset

References __IN__, __OUT__, AddDataPoint(), AssignDescription(), Clear(), slifis::DATA_POINT::FillWithInputValues(), slifis::MEMBFUNC::GetFirstPoint(), slifis::MEMBFUNC::GetLastPoint(), slifis::FUZZY_ROOT::GetMf(), slifis::INPUT_SETS::GetMfSet(), slifis::INPUT_SETS::GetNb(), slifis::FUZZY_ROOT::GetNbMf(), and slifis::MEMBFUNC::IsFinite().

Referenced by slifis::SLIFIS::BuildTSRulesFromValues().
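
The selection described above can be sketched without the library. The following standalone example is an interpretation, not the library's code: Support, MfSets, Point and SubsetBySupport() are hypothetical names, and each membership function is reduced to the two bounds of its support (the role GetFirstPoint() and GetLastPoint() appear to play in the references above).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical membership function reduced to the bounds of its support.
struct Support {
    double first;  // lowest abscissa of the membership function
    double last;   // highest abscissa of the membership function
};

// mf_sets[i][k] is the support of the k-th membership function of input i.
using MfSets = std::vector<std::vector<Support>>;
using Point  = std::vector<double>;  // input values only

// Keeps the points lying inside the support of MF v_idx[i] for every input i.
std::vector<Point> SubsetBySupport(const std::vector<Point>& points,
                                   const std::vector<std::size_t>& v_idx,
                                   const MfSets& mf_sets) {
    assert(v_idx.size() == mf_sets.size());
    std::vector<Point> subset;
    for (const Point& p : points) {
        assert(p.size() == v_idx.size());
        bool inside = true;
        for (std::size_t i = 0; i < v_idx.size() && inside; ++i) {
            const Support& s = mf_sets[i].at(v_idx[i]);  // throws on a bad MF index
            inside = (p[i] >= s.first && p[i] <= s.last);
        }
        if (inside) subset.push_back(p);
    }
    return subset;
}
```

Because the test only compares against the bounds, a point located where the membership value is actually 0 (for instance exactly on a bound) is still kept, which is the behaviour noted in the bug remark.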

void slifis::DATA_SET::GetSubset ( const std::vector< size_t > &  v_idx,
const INPUT_SETS &  inputsets,
double  threshold,
DATA_SET &  subset 
) const

Computes a subset of the dataset that contains only the data points lying in the support (fuzzy support) of the input membership functions defined by vector v_idx.

Say we have a FIS with 3 inputs, and we request as vector v_idx the values 1,0,2. This means we request the values that will be inside MF(1) for first input, inside MF(0) for second input, and MF(2) for third input.

The support is defined by having a fuzzy value higher than threshold

Motivation: this function is required for learning from data with a TS Fis type. It is needed to compute the TS coefficients value from only the subset of points that matches the requirements (expressed by the membership functions). See SLIFIS::BuildTSRulesFromValues()

See also twin function (uses a different algorithm): void GetSubset( const std::vector<size_t>& v_idx, const INPUT_SETS& inputsets, DATA_SET& subset ) const

Medium Priority Todo:
Needs to be tested!
Parameters:
v_idx      requested input vector combination
inputsets  input sets
threshold  fuzzy threshold. We don't use FUZZYVAL to reduce dependencies between the code handling data and the code related to fuzzy logic.
subset     output dataset

References __IN__, __OUT__, AddDataPoint(), AssignDescription(), Clear(), slifis::MEMBFUNC::Fuzzify(), slifis::DATA_POINT::GetInputValue(), slifis::FUZZY_ROOT::GetMf(), slifis::INPUT_SETS::GetMfSet(), and slifis::INPUT_SETS::GetNb().
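
A minimal standalone sketch of the threshold variant follows. It assumes triangular membership functions purely for illustration; Triangle, TriSets, InputPoint and SubsetByThreshold() are hypothetical names and do not belong to the library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical triangular membership function with feet a, c and peak b (a < b < c).
struct Triangle {
    double a, b, c;
    double Fuzzify(double x) const {
        if (x <= a || x >= c) return 0.0;
        return (x <= b) ? (x - a) / (b - a) : (c - x) / (c - b);
    }
};

// mfs[i][k] is the k-th membership function of input i.
using TriSets    = std::vector<std::vector<Triangle>>;
using InputPoint = std::vector<double>;

// Keeps the points whose membership degree in MF v_idx[i] is strictly higher
// than 'threshold' for every input i.
std::vector<InputPoint> SubsetByThreshold(const std::vector<InputPoint>& points,
                                          const std::vector<std::size_t>& v_idx,
                                          const TriSets& mfs,
                                          double threshold) {
    assert(v_idx.size() == mfs.size());
    std::vector<InputPoint> subset;
    for (const InputPoint& p : points) {
        assert(p.size() == v_idx.size());
        bool keep = true;
        for (std::size_t i = 0; i < v_idx.size() && keep; ++i) {
            keep = mfs[i].at(v_idx[i]).Fuzzify(p[i]) > threshold;
        }
        if (keep) subset.push_back(p);
    }
    return subset;
}
```

Passing the threshold as a plain double mirrors the remark in the parameter list: the data-handling code does not need to know about a dedicated fuzzy value type.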

void slifis::DATA_SET::DivideSet ( DATA_SET &  subset_A,
DATA_SET &  subset_B,
size_t  IntervalIdx,
size_t  NbIntervals = 5 
) const

Separates the points of the set into two subsets: subset_A and subset_B, according to IntervalIdx and NbIntervals.

If the set has n points, then we divide it into NbIntervals intervals and return in subset_A the part defined by IntervalIdx. subset_B will be filled with the rest of the points.

References __IN__, __OUT__, AddDataPoint(), Clear(), slifis::ERR_DATA_BAD_INDEX, and SLIFIS_ERROR_2.
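
The split can be sketched as follows. This standalone example assumes contiguous intervals of nearly equal size, which the description above does not state explicitly; DivideSketch() and its parameter names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Splits 'all' into 'nb_intervals' contiguous intervals; interval number
// 'interval_idx' goes to part_a, every other element goes to part_b.
template <typename T>
void DivideSketch(const std::vector<T>& all,
                  std::vector<T>& part_a,
                  std::vector<T>& part_b,
                  std::size_t interval_idx,
                  std::size_t nb_intervals = 5) {
    assert(nb_intervals > 0 && interval_idx < nb_intervals);
    part_a.clear();
    part_b.clear();
    const std::size_t n = all.size();
    // Integer arithmetic spreads the remainder over the intervals.
    const std::size_t begin = interval_idx * n / nb_intervals;
    const std::size_t end = (interval_idx + 1) * n / nb_intervals;
    for (std::size_t i = 0; i < n; ++i) {
        (i >= begin && i < end ? part_a : part_b).push_back(all[i]);
    }
}
```

Calling such a function once per value of interval_idx yields the test/training pairs of a cross-validation run, with part_a as the held-out part.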

void slifis::DATA_SET::p_ComputeProperties ( ) const [private]

References __IN__, and __OUT__.

void slifis::DATA_SET::p_Init ( ) [private]

Referenced by DATA_SET().


Member Data Documentation

std::vector< std::vector< PAIR_STRING_COUNT > > slifis::DATA_SET::_vv_StringData [private]

Will hold the string items of the dataset.

For example, if column 1 and 3 hold string values, then the string items from column 1 will be stored in _vv_StringData[0] (first element) and the string items from column 3 will be stored in _vv_StringData[1] (second element)

The stored type (a std::pair) makes it possible to store both the string value and the associated counter.
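
As an illustration of the column-to-element correspondence in the example above, here is a small standalone sketch. SlotOfColumn() is a hypothetical helper, and the assumption that string columns are listed in ascending order is made for the example only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// 'string_columns' lists, in ascending order, the column indexes that hold
// strings. Returns the position of 'column' in that list, i.e. the element of
// the outer vector in which its string items would be stored.
std::size_t SlotOfColumn(const std::vector<std::size_t>& string_columns,
                         std::size_t column) {
    auto it = std::find(string_columns.begin(), string_columns.end(), column);
    assert(it != string_columns.end());  // 'column' must be a string column
    return static_cast<std::size_t>(it - string_columns.begin());
}
```

With string columns 1 and 3, column 1 maps to element 0 and column 3 maps to element 1, as in the example above.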

std::vector< int > slifis::DATA_SET::_v_StringIndexes [private]

Holds the indexes of the strings related to the columns.

For example, say we have 6 columns (indexes 0 to 5), with column 2 and 5 holding string values. Then _v_StringIndexes[1]=0 (1: second column) and _v_StringIndexes[4]=1 (4: fifth column), while we will have

bool slifis::DATA_SET::_HasAssignedDescription [private]

true: means that a description has been assigned to the dataset (i.e. it is not a "generic" description)

Referenced by GetNbFields().

std::vector< DATA_POINT > slifis::DATA_SET::_v_datapoint [private]

vector of values

Referenced by AddDataPoint(), GetDataPoint(), GetNbFields(), and GetNbPoints().

bool slifis::DATA_SET::_props_are_computed [mutable, private]