Data Preprocessing

So far, we have been working with synthetic data that arrived in ready-made tensors. However, to apply deep learning in the wild we must extract messy data stored in arbitrary formats, and preprocess it to suit our needs. Fortunately, Perl as a language can do much of the heavy lifting.

This section, while no substitute for knowing proper Perl will give you a crash course on some of the most common routines. Along with Perl, PDL itself has some user-friendly routines to do file operations that can be very useful.

These routines are part of the PDL::IO high level package that comes pre-installed with PDL.

Reading the Dataset

Comma-separated values (CSV) files are ubiquitous for the storing of tabular (spreadsheet-like) data. In them, each line corresponds to one record and consists of several (comma-separated) fields, e.g., “Albert Einstein,March 14 1879,Ulm,Federal polytechnic school,field of gravitational physics”. To demonstrate how to load CSV files with PDL::IO::*, we create a CSV file below ./pdl_data/house_tiny.csv. This file represents a dataset of homes, where each row corresponds to a distinct home and the columns correspond to the number of rooms (NumRooms), the roof type (RoofType), and the price (Price).

We use the Text::CSV_XS module to read and write a CSV file correctly, especially if your columns have arbitrary text in them. This module was installed as part of the Installation list of pre-requisites.

pdl> use Cwd qw(getcwd abs_path)
pdl> use Text::CSV_XS qw(csv)
pdl> mkdir './pdl_data',0755
pdl> print abs_path './pdl_data'
/home/myuser/pdl_data
pdl> $content = [[qw(NumRooms RoofType Price)], [undef, undef, 127500], [2,
undef, 106000], [4, 'Slate', 178100], [undef, undef, 140000]]
pdl> csv(in => $content, out => "./pdl_data/house_tiny.csv", encoding => "UTF-8") 

Let’s check if the file was written correctly. In a regular shell like bash do the following:

$ cat ./pdl_data/house_tiny.csv
NumRooms,RoofType,Price 
,,127500 
2,,106000 
4,Slate,178100 
,,140000

Now let’s load the dataset with csv again.

pdl> use Text::CSV_XS qw(csv)
pdl> use Data::Dumper
pdl> $incontent = csv(in => "./pdl_data/house_tiny.csv", blank_is_undef => 1,
empty_is_undef => 1)
pdl> print Dumper($incontent)
$VAR1 = [
          [
            'NumRooms',
            'RoofType',
            'Price'
          ],
          [
            undef,
            undef,
            '127500'
          ],
          [
            '2',
            undef,
            '106000'
          ],
          [
            '4',
            'Slate',
            '178100'
          ],
          [
            undef,
            undef,
            '140000'
          ]
        ];

If we want to directly read each numerical column into PDL objects, and the string columns into Perl arrays, we can use rcols from PDL::IO::Misc as shown below, where we ignore the header line, hence setting LINES => '1:' and set the Perl array column to be column 1 which is the RoofType column.

pdl> ($nrooms,$roof,$price) = rcols './pdl_data/house_tiny.csv', { LINES => '1:', COLSEP => ',', PERLCOLS=>[1], TYPES => [ushort,double] }
Reading data into ndarrays of type: [ Ushort Double ]
Read in  4  elements.
pdl> print $nrooms
[0 2 4 0]
pdl> print Dumper($roof)
$VAR1 = [
          undef,
          undef,
          'Slate',
          undef
        ];
pdl> print $price
[127500 106000 178100 140000]

Another option is to use rcsv1D from PDL::IO::CSV. This function cannot read string columns. However, it converts the empty or missing values into BAD values in PDL which are similar to NaN, unlike the rcols function which converts anything that is not a number to 0.

pdl> ($nrooms,$price) = rcsv1D './pdl_data/house_tiny.csv', [0,2], {type => [short,double], empty2bad => 1, header => 1}
pdl> print $nrooms
[BAD 2 4 BAD]
pdl> print $price
[127500 106000 178100 140000]

Now that we have seen how to use the rcols and rcsv1D functions to load columns into PDL objects or piddles, we will do the following to convert the data into a Data::Frame object.

To convert PDL objects into a Data::Frame object we can do the following:

pdl> ($nrooms,$roof,$price) = rcols './pdl_data/house_tiny.csv', { LINES => '1:', COLSEP => ',', PERLCOLS=>[1], TYPES => [ushort,double] }
pdl> $nrooms = $nrooms->setbadif($nrooms == 0)
pdl> print $nrooms
[BAD 2 4 BAD]
pdl> use Data::Frame
pdl> $df = Data::Frame->new(columns => [ 'NumRooms' => $nrooms, RoofType => $roof, Price => $price ])
pdl> print $df
-------------------------------
    NumRooms  RoofType  Price  
-------------------------------
 0  BAD                 127500 
 1  2                   106000 
 2  4         Slate     178100 
 3  BAD                 140000 
-------------------------------

Data::Frame comes with it’s own CSV file reader that does all of the above in a single call:

pdl> use Data::Frame
pdl> $df = Data::Frame->from_csv('./pdl_data/house_tiny.csv')
pdl> print $df
-------------------------------
    NumRooms  RoofType  Price  
-------------------------------
 0  BAD                 127500 
 1  2                   106000 
 2  4         Slate     178100 
 3  BAD                 140000 
-------------------------------

Data Preparation

In supervised learning, we train models to predict a designated target value, given some set of input values. Our first step in processing the dataset is to separate out columns corresponding to input versus target values. We can select columns either by name or via integer-location based indexing using the at function.

You might have noticed that Data::Frame automatically converts empty or blank values to BAD if the column is numeric, and leaves it as empty or blank if the column is of string type. Data::Frame creates a PDL::SV object for the string columns and a regular PDL object for the numerical ones.

BAD values are described here so we will not be describing it here, but PDL allows invalid values to be set to BAD which is different from NaN (not a number) which could be a valid value in the dataset.

Missing values or invalid values are important in data science, and must be handled correctly. Depending upon the context, missing values might be handled either via imputation or deletion. Imputation replaces missing values with estimates of their values while deletion simply discards either those rows or those columns that contain missing values.

Here are some common imputation heuristics. For categorical input fields, a blank or empty value can be treated as an unknown category. Since the RoofType column takes values Slate and blank, Data::Frame can convert this column into two columns RoofTypeIsSlate and RoofTypeIsUnknown. A row whose roof type is Slate will set values of RoofType_Slate and RoofType_unknown to 1 and 0, respectively. The converse holds for a row with a missing RoofType value.

pdl> $roof = $df->at("RoofType")->unpdl
pdl> print Dumper($roof)
$VAR1 = [
          '',
          '',
          'Slate',
          ''
        ];
pdl> $roof_is_slate = pdl(map { $_ eq 'Slate' ? 1 : 0 } @$roof)
pdl> print $roof_is_slate
[0 0 1 0]

### we can add this as a column now
pdl> $df->add_column("RoofTypeIsSlate", $roof_is_slate)
pdl> print $df
--------------------------------------------
    NumRooms  RoofType  Price   RoofTypeIsSlate 
--------------------------------------------
 0  BAD                 127500  0           
 1  2                   106000  0           
 2  4         Slate     178100  1           
 3  BAD                 140000  0           
--------------------------------------------

## we can also add a column to track unknown roof types
pdl> $df->add_column("RoofTypeIsUnknown", pdl(map { $_ eq '' ? 1 : 0 } @$roof))
pdl> print $df
-----------------------------------------------------------
    NumRooms  RoofType  Price   RoofTypeIsSlate  RoofTypeIsUnknown 
-----------------------------------------------------------
 0  BAD                 127500  0            1             
 1  2                   106000  0            1             
 2  4         Slate     178100  1            0             
 3  BAD                 140000  0            1             
-----------------------------------------------------------

For missing numerical values, one common heuristic is to replace the BAD or missing entries with the mean value of the corresponding column.

pdl> print $df->at("NumRooms")->avg
3
pdl> $df->at("NumRooms")->inplace->setbadtoval(3)
pdl> print $df
-----------------------------------------------------------
    NumRooms  RoofType  Price   RoofTypeIsSlate  RoofTypeIsUnknown 
-----------------------------------------------------------
 0  3                   127500  0            1             
 1  2                   106000  0            1             
 2  4         Slate     178100  1            0             
 3  3                   140000  0            1             
-----------------------------------------------------------

Let’s now remove the RoofType column for making the calculations easier.

pdl> $df->delete("RoofType")
pdl> print $df
-------------------------------------------------
    NumRooms  Price   RoofIsSlate  RoofIsUnknown 
-------------------------------------------------
 0  3         127500  0            1             
 1  2         106000  0            1             
 2  4         178100  1            0             
 3  3         140000  0            1             
-------------------------------------------------
pdl> $targets = $df->select_columns('Price')
pdl> print $targets
-----------
    Price  
-----------
 0  127500 
 1  106000 
 2  178100 
 3  140000 
-----------
pdl> $inputs = $df->select_columns(qw(NumRooms RoofIsSlate RoofIsUnknown))
pdl> print $inputs
-----------------------------------------
    NumRooms  RoofIsSlate  RoofIsUnknown 
-----------------------------------------
 0  3         0            1             
 1  2         0            1             
 2  4         1            0             
 3  3         0            1             
-----------------------------------------

Conversion to the Tensor Format

Now that all the entries in $inputs and $targets are numerical, we can load them into a tensor (recall).

pdl> $y = $targets->at(0)
pdl> print $y
[127500 106000 178100 140000]
pdl> $x = pdl($inputs->column("NumRooms"), $inputs->column("RoofIsSlate"), $inputs->column("RoofIsUnknown"))->transpose
pdl> print $x

[
 [3 0 1]
 [2 0 1]
 [4 1 0]
 [3 0 1]
]

Discussion

You now know how to partition data columns, impute missing variables, and load data into tensors using tools available in Perl, PDL and Data::Frame. In future sections, you will pick up some more data processing skills. While this crash course kept things simple, data processing can get hairy. For example, rather than arriving in a single CSV file, our dataset might be spread across multiple files extracted from a relational database. For instance, in an e-commerce application, customer addresses might live in one table and purchase data in another. Moreover, practitioners face myriad data types beyond categorical and numeric, for example, text strings, images, audio data, and point clouds. Oftentimes, advanced tools and efficient algorithms are required in order to prevent data processing from becoming the biggest bottleneck in the machine learning pipeline. These problems will arise when we get to computer vision and natural language processing.

Finally, we must pay attention to data quality. Real-world datasets are often plagued by outliers, faulty measurements from sensors, and recording errors, which must be addressed before feeding the data into any model. We will need data visualization tools such as PDL::Graphics::GnuPlot and HighCharts to inspect data and develop intuitions about the type of problems you may need to address.

Exercises

Try loading datasets, e.g., Abalone from the UCI Machine Learning Repository and inspect their properties. What fraction of them has missing values? What fraction of the variables is numerical, categorical, or text?
How large a dataset do you think you could load this way? What might be the limitations? Hint: consider the time to read the data, representation, processing, and memory footprint. Try this out on your laptop. What happens if you try it out on a server?
How would you deal with data that has a very large number of categories? What if the category labels are all unique? Should you include the latter?

Next - Linear Algebra

Previous - Data Manipulation

d2l-pdl