Dataset handling

This page describes the provided code for handling data.

NAVIGATION

Summary : Users guide
Previous page : File Storage
Next page : Robustness and error handling

Page content:

Summary:

The user application can load data from a data file, stored as tabular data.
One can load either all the columns (fields) or only part of these, by providing a description.
One can load in memory all the lines at once, or read data one line at a time.
Data can be numerical or string type.
File formats can be CSV or ARFF.
Statistical information on loaded data is available.

Definitions: A datapoint is a set of related scalar values (string or numerical) of any size.

Usable file formats

At present, two formats are accepted for tabular input data:

CSV files ("comma separated values");
ARFF files, the format used by the Weka framework.

The file format is stored in the enum slifis::EN_DF_TYPE. It is detected automatically from the file extension.
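As a rough illustration of extension-based detection, here is a minimal standalone sketch. The names FileFormat and DetectFormat are invented for this example and are not part of the slifis API; comparing the extension case-insensitively is also an assumption.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical stand-in for slifis::EN_DF_TYPE (its real enumerators are not listed on this page).
enum class FileFormat { Csv, Arff, Unknown };

// Guess the format from the file extension, ignoring case (assumption).
inline FileFormat DetectFormat( const std::string& filename )
{
    const std::string::size_type dot = filename.find_last_of( '.' );
    if( dot == std::string::npos )
        return FileFormat::Unknown;

    std::string ext = filename.substr( dot + 1 );
    std::transform( ext.begin(), ext.end(), ext.begin(),
        []( unsigned char c ) { return static_cast<char>( std::tolower( c ) ); } );

    if( ext == "csv" )  return FileFormat::Csv;
    if( ext == "arff" ) return FileFormat::Arff;
    return FileFormat::Unknown;
}
```

With this sketch, DetectFormat( "myfile.csv" ) yields FileFormat::Csv, and a name without an extension yields FileFormat::Unknown.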

The fields (aka "attributes") can be of two types, numerical and string. This is stored using slifis::EN_DATA_FIELD_TYPE.

ARFF format

ARFF (.arff) is a common format for storing data. It is associated with Weka, a well-known machine learning tool. Some samples can be found in the folder bin/sample_data.

See the following links on the Weka/ARFF format:

CSV format

CSV files are a classic way of storing datapoints. The only special thing here is that the first commented line (the comment character is '#') may hold the names of the fields. If it does, you must give as many names as there are data fields below, or else an error will be generated. If no names are given, then the data fields will be unnamed.

The type of each field is recognized automatically from its content, using a very basic algorithm: if the first character is a digit, the field is considered numerical, otherwise it is a string (see DATAFILE_INFO::P_GetFileInfo_csv()).
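The rule can be sketched as follows. This is only an illustration of the rule as stated on this page, not the code of DATAFILE_INFO::P_GetFileInfo_csv(); the names FieldType and GuessFieldType are hypothetical.

```cpp
#include <cctype>
#include <string>

// Hypothetical stand-in for slifis::EN_DATA_FIELD_TYPE.
enum class FieldType { Numerical, String };

// Apply the rule literally: a leading digit means "numerical", anything else is a string.
inline FieldType GuessFieldType( const std::string& text )
{
    if( !text.empty() && std::isdigit( static_cast<unsigned char>( text[0] ) ) )
        return FieldType::Numerical;
    return FieldType::String;
}
```

Taken literally, the rule classifies a value such as "-2" or ".5" as a string, since its first character is not a digit. Whether the library handles such cases differently is not stated here.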

The delimiter for CSV files can be ',' (comma), but it is frequently a semicolon (';'). The user can select it with the static function DATAFILE_INFO::SetCSVDelim(); the default is the semicolon.
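To make the role of the delimiter and of the commented header line concrete, here is a minimal standalone sketch. SplitFields and ParseHeaderNames are hypothetical helpers written for this page, not slifis functions, and the sketch assumes the names follow the '#' directly, without surrounding spaces.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split a line into fields on a single-character delimiter.
inline std::vector<std::string> SplitFields( const std::string& line, char delim )
{
    std::vector<std::string> fields;
    std::istringstream iss( line );
    std::string field;
    while( std::getline( iss, field, delim ) )
        fields.push_back( field );
    return fields;
}

// Extract field names from a comment line such as "#x;y;label".
// Returns an empty vector if the line is not a comment line.
inline std::vector<std::string> ParseHeaderNames( const std::string& line, char delim )
{
    if( line.empty() || line[0] != '#' )
        return {};
    return SplitFields( line.substr( 1 ), delim );
}
```

A consistency check in the spirit of the rule above would compare ParseHeaderNames( header, ';' ).size() with SplitFields( firstDataLine, ';' ).size() and report an error when they differ.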

See http://en.wikipedia.org/wiki/Comma-separated_values for more info on CSV.

Retrieving information from a data file

The class DATAFILE_INFO can hold all the information about a data file. Say you have a data file, named "myfile.csv" (could be .arff type as well). You will be able to get all the needed information with the following code:

        DATAFILE_INFO dfi( "myfile.csv" );
        dfi.GetFileInfo();

The constructor doesn't actually do any work: it is DATAFILE_INFO::GetFileInfo() that reads the file and stores all the useful information about it. However, it does not load any data into memory. This needs to be done afterwards.

In case of failure (if it can't find or read the file), it will throw an exception, so it might be better to do this, for example:

        DATAFILE_INFO dfi( "myfile.csv" );
        try
        {
                dfi.GetFileInfo();
        }
        catch(...)
        {
                cout << "Error, cannot read file\n";
                exit(1);
        }

Once you have fetched this information, you can print the whole thing to a FILE stream:

        dfi.Print( stdout ); // on standard output

Or you can get individual information, as in the following example:

        cout << " -Nb of fields: "         << dfi.GetTotNbFields()     << endl;
        cout << " -Nb of numeric fields: " << dfi.GetNbNumericFields() << endl;
        cout << " -Nb of string fields: "  << dfi.GetNbStringFields()  << endl;
        cout << " -Nb of datapoints: "     << dfi.GetNbDataPts()       << endl;
        cout << " -Has attribute names: "  << dfi.HasAttribNames()     << endl;

You can get information about the file itself:

        cout << " -File name    : " << dfi.GetFileName() << endl;
        cout << " -Type of file : " << GetString( dfi.GetFileType() ) << endl;

Warning:: Calling one of these functions on a DATAFILE_INFO object that hasn't been filled (with a call to GetFileInfo()) will trigger a fatal error.

If the attributes are named (this is always the case with ARFF files, but not mandatory with CSV files), you can fetch the field names with:

        for( size_t i=0; i<dfi.GetTotNbFields(); i++ )
                cout << "field " << i << ": name = " << dfi.GetAttribName(i) << endl;

If called on a CSV file with unnamed attributes, it will simply return an empty string.

Reading all the data

Once this has been done, all the data (all lines and all columns) can be read by the following code:

        DATAFILE_INFO dfi( "my_data_file.csv" );
        dfi.GetFileInfo();
        DATA_SET dataset;
        dataset.ReadData( dfi );

If an I/O error occurs, the ReadData() function will throw an error, and an error message will be written to the log file.

Dataset description

The user application can provide a description of the data columns it needs. This is a very common situation: the data file holds many columns, but only a few of them are of interest. This is handled by the class DATA_DESCR.

Initialisation

For example, say you have a dataset made of 8 columns, where you are only interested in the third, fourth and sixth columns as input values, and the output value is in the second column. You can provide this information to an instance of class DATA_DESCR:

        DATA_DESCR desc( "2" /* output column */, "3;4;6" /* input columns */ );

This can also be done after construction:

        DATA_DESCR desc;
                // ... some lines of code
        desc.SetOutputColumn( 3 );  // or desc.SetOutputColumn( "3" );
        desc.SetInputColumns( "1;4;19;2" );

or also this way:

        DATA_DESCR desc;
                // ... some lines of code
        desc.SetOutputColumn( 3 );
        desc.AddInputColumns( 1 ); // or desc.AddInputColumns( "1" );
        desc.AddInputColumns( "4" );
        desc.AddInputColumns( 19 );
        desc.AddInputColumns( "2" );

Remarks:: The default value for the output column is 1. If no input columns have been assigned, the object is in an invalid state and will throw an error at the first attempt to use it.

You can always check the values with:

        desc.Print( stdout ); // on standard output

Or do it this way:

        cout << " - output index = " << desc.GetOutputIndex() << endl;
        for( int i=0; i<desc.GetNbInputs(); i++ )
                cout << " - input index " << i << " = " << desc.GetInputIndex( i ) << endl;

Warning:: As stated above, the last line will throw an error if no input indexes have been assigned.

After assigning the output and input columns, the user must make sure that no overlap remains between the indexes, or else an error will be triggered at the first use of this object. An overlap means that the output column is also listed among the input columns.

No overlap checking is done by the functions SetOutputColumn(), SetInputColumns(), and AddInputColumn() (the last two only check that a column isn't given twice), so initialisation can proceed without being interrupted by spurious error messages.
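The two checks mentioned here (a column given twice, and the output column overlapping the inputs) can be sketched in a standalone way. ParseColumnList and HasOverlap are hypothetical helpers, not members of DATA_DESCR, and the exception type used is an assumption.

```cpp
#include <algorithm>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Parse a list such as "3;4;6" into column numbers, rejecting duplicates.
inline std::vector<int> ParseColumnList( const std::string& text )
{
    std::vector<int> cols;
    std::istringstream iss( text );
    std::string item;
    while( std::getline( iss, item, ';' ) )
    {
        const int col = std::stoi( item );  // throws std::invalid_argument on non-numeric text
        if( std::find( cols.begin(), cols.end(), col ) != cols.end() )
            throw std::invalid_argument( "column given twice" );
        cols.push_back( col );
    }
    return cols;
}

// True if the output column also appears among the input columns.
inline bool HasOverlap( int outputCol, const std::vector<int>& inputCols )
{
    return std::find( inputCols.begin(), inputCols.end(), outputCol ) != inputCols.end();
}
```

Keeping the overlap test separate from the setters mirrors the behaviour described above: the duplicate check runs while columns are being added, while the overlap check is deferred until the description is first used.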

Usage

This information can be assigned:

to a previously loaded dataset, or
to a DATAFILE_INFO object, that will be later used to load some data.

The second situation is detailed in the next section; let's look at the first case first. For example, let's say you have loaded a whole dataset, but you want to try different columns as output value.

        DATA_DESCR desc( "2", "3;4;6" );
        dataset.AssignDescription ( desc );

Warning:: This function will fail and throw an error if the columns do not exist in the dataset.

Once the description has been assigned to the dataset, the dataset behaves as if it had only four columns. For example, this code:

        cout << "- nb of fields: " << dataset.GetNbFields() << endl;
        cout << "- output index: " << dataset.GetOutputIndex() << endl;

will produce the following output:

        - nb of fields: 4
        - output index: 2

Remarks:: A dataset object always has a description available: if you don't assign one yourself, a default description is generated. It states that the output values are in the first column, and it generates input indexes as required.

Loading of selected columns

This is useful in case you know in advance that some columns are useless, so you don't want to load these in memory. Only the requested columns will be loaded in the dataset.

For example, say you are only interested in the third, fourth and sixth columns of the file as input values, and that the output value is in the second column. You can provide this information to an instance of class DATA_DESCR, and assign it to the DATAFILE_INFO object before reading the data:

        DATAFILE_INFO dfi( "mydatafile.csv" );
        dfi.GetFileInfo();
        DATA_DESCR desc( "2", "3;4;6" );
        dfi.AssignDescription( desc );
        DATA_SET dataset;
        dataset.ReadData( dfi );

This will only load the 4 requested columns from each line of data, whatever the total number of fields that the data file has.

Please note that the description will be automatically copied into the dataset and adjusted after reading, so that:

the number of fields will be equal to the number of requested input columns, plus 1 for the output;
the output column will end up as the first data field of the data set;
the indexes will be adjusted.

For example, with the previous code, the following code:

        DATA_DESCR d = dataset.GetDescription();
        cout << "output index="       << d.GetOutputIndex() << endl;
        cout << "first input index="  << d.GetInputIndex(0) << endl;
        cout << "second input index=" << d.GetInputIndex(1) << endl;

will produce the following output:

        output index=0
        first input index=1
        second input index=2
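The index adjustment can be pictured with the following standalone sketch. RemapColumns is a hypothetical helper, not part of the library, and it assumes the input columns keep the order in which they were requested.

```cpp
#include <map>
#include <vector>

// Map each requested file column to its position in the loaded dataset:
// the output column comes first (index 0), then the inputs in the order given.
inline std::map<int,int> RemapColumns( int outputCol, const std::vector<int>& inputCols )
{
    std::map<int,int> fileToDataset;
    fileToDataset[outputCol] = 0;
    int next = 1;
    for( int col : inputCols )
        fileToDataset[col] = next++;
    return fileToDataset;
}
```

For the description ( "2", "3;4;6" ), this sketch maps file column 2 to index 0 and file columns 3, 4 and 6 to indexes 1, 2 and 3, which matches the output shown above.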

Statistical information of a data set

Once you have loaded a data set, you can fetch statistical information about it, using the class DATASET_PROPERTIES. For instance, you can print out the whole thing with:

        dataset.GetProperties().Print( stdout );

If you want information about a specific column of the dataset, you can get it like this:

        cout << "min value of first column = " << dataset.GetProperties().GetMinValue(0) << endl;
        cout << "max value of third column = " << dataset.GetProperties().GetMaxValue(2) << endl;

Please note that if the attribute is of string type, the min/max functions return 0.0.
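As an illustration of what such a per-column statistic involves, here is a minimal standalone sketch. ColumnMinMax is a hypothetical helper and says nothing about how DATASET_PROPERTIES is implemented; returning 0.0 for an empty column is an assumption chosen to echo the string-attribute case.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Return (min, max) of a numerical column; (0.0, 0.0) for an empty column.
inline std::pair<double,double> ColumnMinMax( const std::vector<double>& column )
{
    if( column.empty() )
        return { 0.0, 0.0 };
    const auto mm = std::minmax_element( column.begin(), column.end() );
    return { *mm.first, *mm.second };
}
```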

If that dataset has an associated description, then you will be able to get input/output min/max values.

        cout << " - min value of output column = " << dataset.GetMinOutValue() << endl;
        cout << " - max value of output column = " << dataset.GetMaxOutValue() << endl;

        cout << " - min value of first input = "  << dataset.GetMinInValue(0) << endl;
        cout << " - max value of first input = "  << dataset.GetMaxInValue(0) << endl;
        cout << " - min value of second input = " << dataset.GetMinInValue(1) << endl;
        cout << " - max value of second input = " << dataset.GetMaxInValue(1) << endl;

If there is no associated description, then these functions will throw an error.

Loading points one by one

If the data file is very large, you might not want to load it all in memory. You can load points one by one by passing a correctly initialized DATAFILE_INFO object to each read. This is illustrated in the following example:

        DATAFILE_INFO dfi( "mydatafile.csv" );
        dfi.GetFileInfo();

        try
        {
                dfi.OpenFile();
        }
        catch( ... )
        {
                cerr << "unable to open file\n"; exit(1);
        }

        do
        {
                DATA_POINT dpt;
                EN_READ_LINE_STATUS status = dpt.ReadDataFields( dfi );
                // ... process point dpt
        }
        while(  dfi.FileIsGood() );
        dfi.CloseFile();

The slifis::EN_READ_LINE_STATUS value returned by ReadDataFields() provides information about what was actually read, and it can be checked before processing the point.
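The pattern of reading one line at a time and checking a status can be sketched on a plain std::istream. The ReadStatus enumerators and the helpers ReadOneLine and CountDataLines below are hypothetical: the real enumerators of slifis::EN_READ_LINE_STATUS are not listed on this page.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical status values, for illustration only.
enum class ReadStatus { Ok, Comment, Empty, EndOfFile };

// Read one line and split it into fields. Comment lines (starting with '#')
// and blank lines are reported through the status and not parsed.
inline ReadStatus ReadOneLine( std::istream& in, char delim, std::vector<std::string>& fields )
{
    fields.clear();
    std::string line;
    if( !std::getline( in, line ) )
        return ReadStatus::EndOfFile;
    if( line.empty() )
        return ReadStatus::Empty;
    if( line[0] == '#' )
        return ReadStatus::Comment;

    std::istringstream iss( line );
    std::string field;
    while( std::getline( iss, field, delim ) )
        fields.push_back( field );
    return ReadStatus::Ok;
}

// Count the data lines of a stream, holding only one line in memory at a time.
inline std::size_t CountDataLines( std::istream& in, char delim )
{
    std::size_t count = 0;
    std::vector<std::string> fields;
    for( ;; )
    {
        const ReadStatus status = ReadOneLine( in, delim, fields );
        if( status == ReadStatus::EndOfFile )
            break;
        if( status == ReadStatus::Ok )
            ++count;    // a real application would process 'fields' here
    }
    return count;
}
```

The point of the status check is that only lines that actually produced data are processed; comment lines, blank lines and the end of the file are skipped explicitly.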

Accessing data

Extracting data points

Once a full dataset has been loaded into memory, the user app will need to access the datapoints, and to extract scalar values from these.

Accessing datapoints from a dataset is done with the following code:

        DATA_SET dataset;
        // ... read data
        for( size_t i=0; i<dataset.GetNbPoints(); i++ )
        {
                const DATA_POINT& dp = dataset.GetDataPoint( i );
                // ... use dp
        }

Once you have a datapoint, you can access the values it contains by its index. The values are returned through a VALUE_PTR object, which is actually a typedef over a shared pointer on a VALUE object. This is needed to enable polymorphism, as values can be numerical or string-type.

        DATA_POINT dp;
        // ... fill with some values
        VALUE_PTR out = dp.GetOutputValue();
        for( size_t j=0; j<dataset.GetNbFields(); j++ )
        {
                VALUE_PTR ptr = dp.GetInputValue(j);
        }

The numerical or string values can finally be extracted with the provided accessors:

        // ...
        VALUE_PTR ptr = dp.GetInputValue(j);
        std::string s = ptr->GetString();   // VALUE_PTR is a shared pointer, hence '->'
        double v      = ptr->GetFloat();

If the value is of the wrong type, the accessor returns a default value: an empty string for GetString(), or 0.0 for GetFloat().
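The behaviour described above (a shared pointer to a polymorphic value, with default results for the wrong type) can be sketched as follows. The classes Value, NumValue and StrValue and the alias ValuePtr are hypothetical and only mimic the roles of VALUE and VALUE_PTR; the real class layout is not shown on this page.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical minimal value hierarchy.
class Value
{
public:
    virtual ~Value() = default;
    virtual std::string GetString() const { return std::string(); } // default for non-string values
    virtual double      GetFloat()  const { return 0.0; }           // default for non-numerical values
};

class NumValue : public Value
{
public:
    explicit NumValue( double v ) : m_value( v ) {}
    double GetFloat() const override { return m_value; }
private:
    double m_value;
};

class StrValue : public Value
{
public:
    explicit StrValue( std::string s ) : m_value( std::move( s ) ) {}
    std::string GetString() const override { return m_value; }
private:
    std::string m_value;
};

using ValuePtr = std::shared_ptr<Value>; // plays the role of VALUE_PTR
```

With this layout, calling GetString() on a numerical value returns an empty string and calling GetFloat() on a string value returns 0.0, so the caller never has to test the type before calling an accessor.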

Shortcuts

Some shortcuts are provided, for example, you can do the following:

TO BE CONTINUED ***

Editing a dataset

Besides loading data from a file and fetching the values, the API provides additional functions to add values to a data set.

See DATA_SET::AddData()

TO BE CONTINUED ***

Handling of strings

Strings in datasets can be handled in two different ways, to increase performance. The basic way is for the data set to hold a std::string value directly for every datapoint. This is memory-inefficient because string attributes in datasets usually hold a limited number of different values. So the class DATA_SET provides an index-based string storage facility: all the distinct string values are stored apart from the data points, and each data point only holds an index used to recover the original value.
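The index-based idea can be sketched with a small string table. StringTable is a hypothetical class written for this page; it illustrates the principle and is not the storage actually used by DATA_SET.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Stores each distinct string once and hands out a small integer index for it.
class StringTable
{
public:
    // Return the index of 's', adding it to the table if it is new.
    std::size_t GetIndex( const std::string& s )
    {
        const auto it = m_index.find( s );
        if( it != m_index.end() )
            return it->second;
        const std::size_t idx = m_values.size();
        m_values.push_back( s );
        m_index.emplace( s, idx );
        return idx;
    }

    // Recover the original string from its index (throws std::out_of_range if invalid).
    const std::string& GetString( std::size_t idx ) const { return m_values.at( idx ); }

    // Number of distinct strings stored.
    std::size_t Size() const { return m_values.size(); }

private:
    std::vector<std::string>                     m_values; // index -> string
    std::unordered_map<std::string, std::size_t> m_index;  // string -> index
};
```

A data point then stores only the index returned by GetIndex(), so a column with thousands of rows but a handful of distinct labels keeps just a handful of strings in memory.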

Related classes:
