\chapter{Data preprocessing}
\label{chap:preprocess}
\glsresetall
\chapterprecishere{I find your lack of faith disturbing.
\par\raggedleft--- \textup{Darth Vader}, Star Wars: Episode IV -- A New Hope (1977)}
In this chapter, we discuss data preprocessing, which is the process of adjusting the
data to make it suitable for a particular learning machine or, at the least, to ease the
learning process.
Similarly to data handling, data preprocessing is done by applying a series of operations
to the data. However, some of the parameters of the operations are not fixed but rather
are fitted from a data sample. In the context of inductive learning, that sample is the
training set.
The operations depend on the chosen learning method. So, when planning the
solution in our project, we must consider the preprocessing tasks that are necessary to
make the data suitable for the chosen methods.
I present the most common data preprocessing tasks in three categories: data cleaning,
data sampling, and data transformation. For each task, I discuss the behavior of the data
preprocessing techniques in terms of fitting, adjustment of the training set, and
application of the preprocessor in production.
Finally, I discuss the importance of the default behavior of the model when the
preprocessing chain degenerates over a sample, i.e., when the preprocessor decides that it
has no strategy to adjust the data to make it suitable for the model.
\begin{mainbox}{Chapter remarks}
\boxsubtitle{Contents}
\startcontents[chapters]
\printcontents[chapters]{}{1}{}
\vspace{1em}
\boxsubtitle{Context}
\begin{itemize}
\itemsep0em
\item Tidy data is not necessarily suitable for modeling.
\item Parameters of the preprocessor are fitted rather than being fixed.
\end{itemize}
\boxsubtitle{Objectives}
\begin{itemize}
\itemsep0em
\item Understand the main data preprocessing tasks and techniques.
\item Learn the behavior of the preprocessing chain in terms of fitting, adjustment,
and application.
\end{itemize}
\boxsubtitle{Takeaways}
\begin{itemize}
\itemsep0em
\item Each learning method requires specific data preprocessing tasks.
\item Fitting the preprocessor is crucial to avoid leakage.
\item Default behavior of the model when the preprocessing chain degenerates must be
specified.
\end{itemize}
\end{mainbox}
{}
\clearpage
\section{Introduction}
In \cref{chap:data,chap:handling}, we discussed data semantics and the tools to
handle data. They provide the grounds for preparing the data as described in the
data sprint tasks in \cref{sub:workflow}. However, the focus there is to guarantee that the
data is tidy and in the observational unit of interest, not to prepare it for modeling.
As a result, although data might be appropriate for the learning tasks we described in
\cref{chap:slt} --- in the sense that we know what the feature vectors and the target
variable are ---, they might not be suitable for the machine learning methods we will use.
One simple example is the perceptron (\cref{sub:perceptron}), which assumes that all
input variables are real numbers. If the data contains categorical variables, we must
convert them to numerical variables before applying the perceptron.
For this reason, the solution sprint tasks in \cref{sub:workflow} include not only the
learning tasks but also the \emph{data preprocessing} tasks, which are dependent on the
chosen machine learning methods.
\begin{defbox}{Data preprocessing}{preprocessing}
The process of adjusting the data to make it suitable for a particular learning machine
or, at the least, to ease the learning process.
\end{defbox}
This is done by applying a series of operations to the data, like in data handling. The
difference here is that some of the parameters of the operations are not fixed; rather, they
are fitted from a data sample. Once fitted, the operations can be applied to
new data, sample by sample.
As a result, a data preprocessing technique acts in three steps:
\begin{enumerate}
\itemsep0em
\item \textbf{Fitting}: The parameters of the operation are adjusted to the training
data (which has already been integrated and tidied, represents well the phenomenon of
interest, and each sample is in the correct observational unit);
\item \textbf{Adjustment}: The training data is adjusted according to the fitted
parameters, possibly changing the sample size and distribution;
\item \textbf{Applying}: The operation is applied to new data, sample by sample.
\end{enumerate}
Understanding these steps and correctly defining the behavior of each of them is crucial
to avoid \gls{leakage} and to guarantee that the model will behave as expected in
production.
\subsection{Formal definition}
\label{sub:formal-preprocessing}
Let $T = (K, H, c)$ be a table that represents the data in the desired observational unit
--- as defined in \cref{sec:formal-structured-data}. In this chapter, without loss of
generality --- as the keys are not used in the modeling process ---, we can consider $K =
\{1, 2, \dots\}$ such that $\rowcard[i] = 0$ if, and only if, $i > n$. That means that
every row $r \in \{1, \dots, n\}$ is present in the table.
A data preprocessing strategy $F$ is a function that takes a table $T = (K, H, c)$ and
returns an adjusted table $T' = (K', H', c')$ and a fitted \emph{preprocessor} $f(z; \phi)
\equiv f_\phi(z)$ such that $$z \in \bigtimes_{h\, \in\, H} \domainof{h} \cup \{?\}$$ and $\phi$ are
the fitted parameters of the operation. Similarly, $z' = f_\phi(z)$, called the
preprocessed tuple, satisfies $$z' \in \bigtimes_{h'\, \in\, H'} \domainof{h'} \cup
\{?\}\text{.}$$ Note that we make no restrictions on the number of rows in the adjusted
table, i.e., preprocessing techniques can change the number of rows in the training table.
In practice, strategy $F$ is a chain of dependent preprocessing operations $F_1$, \dots,
$F_m$ such that, given $T = T^{(0)}$, each operation $F_i$ is applied to the table
$T^{(i-1)}$ to obtain $T^{(i)}$ and the fitted preprocessor $f_{\phi_i}$. Thus, $T' =
T^{(m)}$ and $$f(z; \phi = \{\phi_1, \dots, \phi_m\}) = \left(f_{\phi_m} \circ \dots \circ
f_{\phi_1}\right)(z)\text{,}$$ where $\circ$ is the composition operator, so that $f_{\phi_1}$ is applied first. I say that
they are dependent since none of the operations can be applied to the table without the
previous ones.
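To make the fitting, adjustment, and application steps concrete, the following sketch (in Python, an illustration only) shows one possible interface for a chain of dependent preprocessors; the \texttt{UnitConversion} and \texttt{Standardization} steps named in the final comment are hypothetical.
\begin{verbatim}
# A minimal sketch of a preprocessing chain: each step is fitted on the
# table produced by the previous step and later applied tuple by tuple.
class Chain:
    def __init__(self, steps):
        self.steps = steps           # list of preprocessing steps F_1, ..., F_m

    def fit_adjust(self, table):
        # Fitting + adjustment: each step is fitted on T^(i-1) and returns T^(i).
        for step in self.steps:
            table = step.fit_adjust(table)
        return table                 # T' = T^(m), used to train the model

    def apply(self, z):
        # Applying: the fitted steps are applied in the order they were fitted.
        for step in self.steps:
            z = step.apply(z)
        return z

# Hypothetical usage: chain = Chain([UnitConversion(...), Standardization()])
\end{verbatim}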
\subsection{Degeneration}
The objective of the fitted preprocessor is to adjust the data to make it suitable for the
model. However, sometimes it cannot achieve this goal for a particular input $z$. This
can happen for many reasons, such as unexpected values, information ``too incomplete'' to
make a prediction, etc.
Formally, we say that the preprocessor $f_\phi$ degenerates over tuple $z$ if it outputs
$z' = f_\phi(z)$ such that $z' = (?, \dots, ?)$. In practice, that means that the
preprocessor decided that it has no strategy to adjust the data to make it suitable for
the model. For the sake of simplicity, if any step $f_{\phi_i}$ degenerates over
tuple $z^{(i)}$, the whole preprocessing chain degenerates\footnote{Usually, this is
implemented as an exception or similar programming mechanism.} over $z = z^{(0)}$.
Consequently, in the implementation of the solution, the developer must choose a default
behavior for the model when the preprocessing chain degenerates over a tuple. It can
be as simple as returning a default value or as complex as redirecting the tuple to a
different pair of preprocessor and model. Sometimes, the developer can choose to
integrate this as an error or warning in the user application.
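As a rough sketch of how this can be implemented, degeneration can be signalled with an exception and caught at prediction time; the \texttt{DegenerateSample} exception and the \texttt{DEFAULT\_PREDICTION} value below are hypothetical names used only for illustration.
\begin{verbatim}
# Degeneration signalled as an exception, with a developer-chosen default.
class DegenerateSample(Exception):
    """Raised when a preprocessing step has no strategy for the input tuple."""

DEFAULT_PREDICTION = 0.0   # assumption: fallback value chosen by the developer

def predict(model, preprocessor, z):
    try:
        z_prime = preprocessor.apply(z)   # may raise DegenerateSample
    except DegenerateSample:
        # Default behavior: return a fixed value; alternatives include routing
        # z to another preprocessor/model pair or raising a warning/error.
        return DEFAULT_PREDICTION
    return model.predict(z_prime)
\end{verbatim}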
\subsection{Data preprocessing tasks}
The most common data preprocessing tasks can be divided into three categories:
\begin{itemize}
\itemsep0em
\item Data cleaning;
\item Data sampling; and
\item Data transformation.
\end{itemize}
In the next sections, I address some of the most common data preprocessing tasks
in each of these categories. I present them in the order they are usually applied in the
preprocessing, but note that the order is not fixed and can be changed according to the
needs of the problem.
\section{Data cleaning}
Data cleaning is the process of removing errors and inconsistencies from the data. This is
usually done to make the data more reliable for training and to avoid bias in the learning
process. Such errors and inconsistencies tend to ``confuse'' the learning machines
and can lead to poorly performing models.
Data cleaning also includes dealing with missing information, which most machine
learning methods cannot cope with. Solutions range from the simple removal of the
observations with missing data to the creation of new information to encode the missing data.
\subsection{Treating inconsistent data}
% TODO: move this somewhere when we talk about data handling and/or tidying
% Sometimes, during data collection, information is recorded using special codes. For
% instance, the value 9999 might be used to indicate that the data is missing. Such codes
% must be replaced with more appropriate values before modeling. If a single variable
% encodes more than one concept, new variables must be created to represent each concept.
There are a few, but important, tasks to be done during data preprocessing regarding
invalid and inconsistent data --- note that we assume that most of the issues with
the semantics of the data have been solved in the data handling phase. Especially in
production, the developer must be aware of the behavior of the model when it faces
information that is not supposed to be present in the data.
One of the tasks is to ensure that physical quantities are expressed in standard units. One must
check whether all columns that store physical quantities have the same unit of
measurement. If not, one must convert the values to the same unit. A summary of this
preprocessing task is presented in \cref{tab:unit-conversion}.
\begin{tablebox}[label=tab:unit-conversion]{Unit conversion preprocessing task.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Unit conversion}} \\
\midrule
% \textbf{Requirements} &
% A variable with the physical quantity and a variable with the unit of measurement. \\
\textbf{Goal} &
Convert physical quantities into the same unit of measurement. \\
\textbf{Fitting} &
None. User must declare the units to be used and, if appropriate, the conversion
factors. \\
\textbf{Adjustment} &
Training set is adjusted sample by sample, independently. \\
\textbf{Applying} &
Preprocessor converts the numerical values and drops the unit of measurement column. \\
\bottomrule
\end{tabular}
\end{tablebox}
Moreover, if one knows that a variable must fall within a specific range of values, one can check
whether the values are within this range. If not, one can replace the out-of-range values with
missing data or with the closest valid value. Alternatively, one can discard the
observation based on that criterion. Consult \cref{tab:range-check} for a summary of this
operation.
\begin{tablebox}[label=tab:range-check]{Range check preprocessing task.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Range check}} \\
\midrule
% \textbf{Requirements} &
% A numerical variable. \\
\textbf{Goal} &
Check whether the values are within the expected range. \\
\textbf{Fitting} &
None. User must declare the valid range of values. \\
\textbf{Adjustment} &
Training set is adjusted sample by sample, independently. If appropriate,
degenerated samples are removed. \\
\textbf{Applying} &
Preprocessor checks whether the value $x$ of a variable is within the range $[a,
b]$. If not, it replaces $x$ with: (a) missing value $?$, (b) the closest valid
value $\max(a, \min(b, x))$, or (c) degenerates (discards the observation). \\
\bottomrule
\end{tabular}
\end{tablebox}
Another common source of inconsistency is that the same category might be
represented by different strings. This is usually fixed by creating a dictionary that maps
the different names to a single one, by standardizing lower or upper case, by removing
special characters, or by applying more advanced fuzzy matching techniques --- see
\cref{tab:text-standardization}.
\begin{tablebox}[label=tab:text-standardization]{Category standardization preprocessing task.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Category standardization}} \\
\midrule
% \textbf{Requirements} &
% A categorical variable. \\
\textbf{Goal} &
Create a dictionary and/or function to map different names to a single one. \\
\textbf{Fitting} &
None. User must declare the mapping. \\
\textbf{Adjustment} &
Training set is adjusted sample by sample, independently. \\
\textbf{Applying} &
Preprocessor replaces the categorical variable $x$ with the mapped
value $f(x)$ that implements case standardization, special character removal, and/or
dictionary fuzzy matching. \\
\bottomrule
\end{tabular}
\end{tablebox}
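A minimal sketch of such a standardization, assuming a user-declared dictionary combined with lower-casing and special-character removal, could look like the following (the category names are made up).
\begin{verbatim}
import re

# User-declared mapping from known variants to a canonical category name.
CANONICAL = {"new york": "New York", "ny": "New York", "nyc": "New York"}

def standardize_category(x):
    # Lower-case and strip special characters before looking up the dictionary.
    key = re.sub(r"[^a-z0-9 ]", "", x.lower()).strip()
    # Unknown values are kept as-is (or could be sent to a fuzzy matcher).
    return CANONICAL.get(key, key)
\end{verbatim}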
Note that the parameters of these techniques are not fitted from the data, but rather are fixed
by the problem definition. As a result, these tasks could be performed in the data handling phase.
The reason we put them here is that new data in production usually comes with the
same issues. Having the fixes programmed into the preprocessor makes it easier to
guarantee that the model will behave as expected in production.
\subsection{Outlier detection}
Outliers are observations that are significantly different from the other observations.
They can be caused by errors or by the presence of different phenomena mixed in the data
collection process. In both cases, it is important to deal with outliers before modeling.
The standard way to deal with outliers is to remove them from the dataset. Assuming that
the errors or the out-of-distribution data appear randomly and rarely, this is a good
strategy.
Another approach is dealing with each variable independently. This way, one can replace
the outlier value with missing data. There are many ways to detect outlier values, but
the simplest one is probably a heuristic based on the \gls{iqr}.
Let $Q_1$ and $Q_3$ be the first and the third quartiles of the values in a variable,
respectively. The \gls{iqr} is defined as $Q_3 - Q_1$. The values that are less than
$Q_1 - 1.5\, \text{IQR}$ or greater than $Q_3 + 1.5\, \text{IQR}$ are considered outliers.
See \cref{tab:iqr-outlier}.
\begin{tablebox}[label=tab:iqr-outlier]{Outlier detection using the interquartile range.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Outlier detection using the IQR}} \\
\midrule
% \textbf{Requirements} &
% A numerical variable. \\
\textbf{Goal} &
Detect outliers using the IQR. \\
\textbf{Fitting} &
Store the values of $Q_1$ and $Q_3$ for each variable. \\
\textbf{Adjustment} &
Training set is adjusted sample by sample, independently. \\
\textbf{Applying} &
Preprocessor replaces the outlier values with missing data. \\
\bottomrule
\end{tabular}
\end{tablebox}
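A minimal sketch of this heuristic, assuming the variable is stored as a pandas \texttt{Series}, follows; the quartiles are stored during fitting and reused when applying the preprocessor.
\begin{verbatim}
import numpy as np
import pandas as pd

class IQROutlierToMissing:
    """Replace values outside [Q1 - 1.5 IQR, Q3 + 1.5 IQR] with missing data."""

    def fit(self, x: pd.Series):
        # Fitting: store the quartiles (and bounds) of the training values.
        q1, q3 = x.quantile(0.25), x.quantile(0.75)
        iqr = q3 - q1
        self.lower, self.upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        return self

    def apply(self, value):
        # Applying: outlier values become missing (NaN); others pass through.
        if value < self.lower or value > self.upper:
            return np.nan
        return value
\end{verbatim}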
More sophisticated methods can be used to detect samples that are outliers, such as
the definition of an outlier in DBSCAN\footfullcite{Ester1996}. However, this is not
enough to fit the parameters of the preprocessor. The reason is that descriptive methods
like DBSCAN --- in this case, a clustering method --- do not generalize to new data.
I suggest using methods like One-Class SVM\footfullcite{Scholkopf2001} to fit the
parameters of the preprocessor that detects outliers. Thus, any new data point can
be classified as an outlier or not.
As with any filtering operation in the pipeline, the developer must specify a default behavior
for the model when an outlier sample is detected in production. See
\cref{tab:outlier-removal}.
\begin{tablebox}[label=tab:outlier-removal]{Task of filtering outliers.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Outlier removal}} \\
\midrule
% \textbf{Requirements} &
% A dataset with outliers. \\
\textbf{Goal} &
Remove the observations that are outliers. \\
\textbf{Fitting} &
Parameters of the outlier classifier. \\
\textbf{Adjustment} &
Training set is adjusted sample by sample, independently, removing
degenerated samples. \\
\textbf{Applying} &
Preprocessor degenerates if the sample is classified as an outlier; otherwise, it
does nothing. \\
\bottomrule
\end{tabular}
\end{tablebox}
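The sketch below implements this task with scikit-learn's \texttt{OneClassSVM}; the hyperparameters are arbitrary, and the \texttt{DegenerateSample} exception is the hypothetical mechanism sketched earlier in this chapter.
\begin{verbatim}
from sklearn.svm import OneClassSVM

class OutlierFilter:
    def __init__(self, nu=0.05):
        # nu roughly bounds the fraction of training points treated as outliers.
        self.clf = OneClassSVM(nu=nu, kernel="rbf", gamma="scale")

    def fit(self, X):
        # Fitting: learn the boundary of the "normal" region from the training set.
        self.clf.fit(X)
        return self

    def apply(self, x_row):
        # Applying: degenerate (discard) if the sample is classified as an outlier.
        if self.clf.predict([x_row])[0] == -1:
            raise DegenerateSample("sample classified as an outlier")
        return x_row
\end{verbatim}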
\subsection{Treating missing data}
Since most models cannot handle missing data, it is crucial to deal with it in the data
preprocessing.
There are four main strategies to deal with missing data:
\begin{itemize}
\itemsep0em
\item Remove the observations (rows) with missing data;
\item Remove the variables (columns) with missing data;
\item Just impute the missing data;
\item Use an indicator variable to mark the missing data and impute it.
\end{itemize}
Removing rows or columns is commonly used when the amount of missing data is small
compared to the total number of rows or columns. However, be aware that removing rows
``on demand'' can artificially change the data distribution, especially when the data is
not missing at random. Row removal suffers from the same problem as any filtering operation
(degeneration) in the preprocessing step; the developer must specify a default behavior
for the model when a row is discarded in production. See \cref{tab:row-removal-missing}.
\begin{tablebox}[label=tab:row-removal-missing]{Task of filtering rows based on missing data.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Row removal based on missing data}} \\
\midrule
% \textbf{Requirements} &
% A dataset with missing data. \\
\textbf{Goal} &
Remove the observations with missing data in any (or some) variables. \\
\textbf{Fitting} &
None. Variables to look for missing data are declared beforehand. \\
\textbf{Adjustment} &
Training set is adjusted sample by sample, independently, removing
degenerated samples. \\
\textbf{Applying} &
Preprocessor degenerates over the rows with missing data in the specified variables.
\\
\bottomrule
\end{tabular}
\end{tablebox}
In the case of column removal, the preprocessor simply learns, during fitting, which
columns have missing data and drops them. Beware that valuable information might be lost
when removing columns for all the samples. See \cref{tab:col-drop-missing}.
\begin{tablebox}[label=tab:col-drop-missing]{Task of dropping columns based on missing data.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Column removal based on missing data}} \\
\midrule
% \textbf{Requirements} &
% A dataset with missing data. \\
\textbf{Goal} &
Remove the variables with missing data. \\
\textbf{Fitting} &
All variables with missing data in the training set are marked to be removed. \\
\textbf{Adjustment} &
Columns marked are dropped from the training set. \\
\textbf{Applying} &
Preprocessor drops the columns chosen during fitting. \\
\bottomrule
\end{tabular}
\end{tablebox}
Imputing the missing data is usually done by replacing the missing values with some
statistic of the available values in the column, such as the mean, the median, or the
mode\footnote{More sophisticated methods can be used, such as the k-nearest neighbors
algorithm; see \fullcite{Troyanskaya2001}.}. This is a simple and
effective strategy, but it can introduce bias in the data, especially when the number of
samples with missing data is large. See \cref{tab:imputation}.
\begin{tablebox}[label=tab:imputation]{Task of imputing missing data.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Imputation of missing data}} \\
\midrule
% \textbf{Requirements} &
% A dataset with missing data. \\
\textbf{Goal} &
Replace the missing data with a statistic of the available values. \\
\textbf{Fitting} &
The statistic is calculated from the available data in the training set. \\
\textbf{Adjustment} &
Training set is adjusted sample by sample, independently. \\
\textbf{Applying} &
Preprocessor replaces the missing values with the chosen statistic. If an indicator
variable is required, it is created and filled with the logical value:
missing or not missing. \\
\bottomrule
\end{tabular}
\end{tablebox}
Imputation alone is not suitable when one is not sure whether the data is missing
because of a systematic error or phenomenon. A model can learn the effect of the
underlying reason for the missingness on the predictive task.
In that case, creating an indicator variable is a good strategy. This is done by creating
a new column that contains a logical value indicating whether the data is missing or
not\footnote{Some kind of imputation is still needed, but we expect the model to deal
better with it since it can decide using both the indicator and the original variable.}.
Many times, the indicator variable is already present in the data. For instance, consider
a dataset that contains information about pregnancy, say the number of days since
the last pregnancy. This information will certainly be missing if the sex is male
or the number of children is zero. In this case, no new indicator variable is needed.
See \cref{tab:imputation}.
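As a sketch of imputation with an indicator variable, scikit-learn's \texttt{SimpleImputer} covers both variants (assuming scikit-learn is available); the numbers below are made up.
\begin{verbatim}
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 7.0], [np.nan, 5.0], [3.0, np.nan]])

# Fitting: the median of each column is computed from the available values.
# add_indicator=True appends one logical column per variable with missing data.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_adjusted = imputer.fit_transform(X_train)     # adjustment of the training set

# Applying: new samples are imputed with the stored medians, sample by sample.
X_new = imputer.transform(np.array([[np.nan, 4.0]]))
\end{verbatim}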
\section{Data sampling}
Once data is cleaned, the next step is (typically) to sample the data. Sampling is the
process of selecting a random subset of the data or creating variations of the original
training set.
There are three main tasks that sample the data: random sampling, scope filtering, and class
balancing.
\subsection{Random sampling}
\label{sub:random-sampling}
Some machine learning methods are computationally expensive, and a smaller dataset might be
enough to solve the problem. Random sampling is simply done by selecting a random subset
of the training data with a user-defined size.
However, note that the preprocessor for this task \emph{must never do anything with the
new data} (or the test set we discuss in \cref{chap:planning}). See
\cref{tab:random-sampling}.
\begin{tablebox}[label=tab:random-sampling]{Task of random sampling.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Random sampling}} \\
\midrule
% \textbf{Requirements} &
% A dataset with the scope of the phenomenon. \\
\textbf{Goal} &
Select a random subset of the training data. \\
\textbf{Fitting} &
None. User must declare the size of the sample. \\
\textbf{Adjustment} &
Rows of the training set are randomly chosen. \\
\textbf{Applying} &
Pass-through: preprocessor does nothing with the new data. \\
\bottomrule
\end{tabular}
\end{tablebox}
\subsection{Scope filtering}
Scope filtering is the process of reducing the scope of the phenomenon we want to model.
Like the filtering operation in the data handling pipeline (consult \cref{sub:filtering}),
the data scientist chooses a set of predefined rules to filter the data.
Unlike outlier detection, we assume that the rule is fixed and known beforehand. The
preprocessor degenerates over the samples that do not satisfy the rule. A summary of the
task is presented in \cref{tab:scope-filtering}.
\begin{tablebox}[label=tab:scope-filtering]{Task of filtering the scope of the data.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Scope filtering}} \\
\midrule
% \textbf{Requirements} &
% A dataset with the scope of the phenomenon. \\
\textbf{Goal} &
Remove the observations that do not satisfy a predefined rule. \\
\textbf{Fitting} &
None. User must declare the rule. \\
\textbf{Adjustment} &
Training set is adjusted sample by sample, independently, removing degenerated
samples. \\
\textbf{Applying} &
Preprocessor degenerates over the samples that do not satisfy the rule. \\
\bottomrule
\end{tabular}
\end{tablebox}
An interesting variation is model trees\footfullcite{Freek2015}. They are shallow
decision trees that are used to split the data. At each leaf, a different model is
trained with the data that satisfies the rules leading to that leaf. This is a good
strategy when the phenomenon is complex and can be divided into simpler subproblems.
In this case, the preprocessor does not degenerate over the samples, but rather the
preprocessing chain branches into different models (and potentially other preprocessing
steps).
\subsection{Class balancing}
Some data classification methods are heavily affected by the number of observations in each
class. This is especially true for methods that learn the class priors directly from the
data, like the naïve Bayes classifier.
Two strategies are often used to balance the classes: oversampling and undersampling. The
former is done by creating synthetic observations of the minority class. The latter is
done by removing observations of the majority class.
Undersampling is usually done by removing observations of the majority class at random
(similarly to random sampling, \cref{sub:random-sampling}). For oversampling, the
most common method is resampling\footnote{Sometimes called bootstrapping.}, which draws
observations of the minority class at random with replacement. A drawback of this method
is that it produces repeated observations that carry no new information.
In any case, the preprocessor for this task \emph{must never do anything with the new
data} (or the test set we discuss in \cref{chap:planning}). See
\cref{tab:class-balancing}.
\begin{tablebox}[label=tab:class-balancing]{Task of class balancing.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Class balancing}} \\
\midrule
% \textbf{Requirements} &
% A dataset with unbalanced classes. \\
\textbf{Goal} &
Balance the number of observations in each class. \\
\textbf{Fitting} &
None. User must declare the number of samples in each class. \\
\textbf{Adjustment} &
Rows of the training set are randomly chosen. \\
\textbf{Applying} &
Pass-through: preprocessor does nothing with the new data. \\
\bottomrule
\end{tabular}
\end{tablebox}
More advanced sampling methods exist. For instance, the SMOTE
algorithm\footfullcite{chawla2002smote} creates synthetic observations of the minority
class without repeating the same observations.
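A sketch of both strategies follows: random over- and undersampling can be done with scikit-learn's \texttt{resample} utility, while SMOTE is provided by the separate imbalanced-learn package (an assumption; it is not part of scikit-learn). Remember that both act only on the training set.
\begin{verbatim}
from sklearn.utils import resample

def oversample_minority(X_min, y_min, n_samples, seed=0):
    # Resampling (bootstrapping): draw minority-class rows with replacement
    # until the requested number of samples is reached.
    return resample(X_min, y_min, replace=True, n_samples=n_samples,
                    random_state=seed)

# Alternative (assumes the imbalanced-learn package is installed):
# from imblearn.over_sampling import SMOTE
# X_balanced, y_balanced = SMOTE().fit_resample(X_train, y_train)
\end{verbatim}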
\section{Data transformation}
Another important preprocessing task is data transformation. This is the process of
adjusting the types of the data and the choice of variables to make them suitable for
modeling.
At this point, the data format is acceptable, i.e., each observation is in the correct
observational unit, there are no missing values, and the sample is representative of the
phenomenon of interest. Now, we can perform a series of operations to make the
columns' types and values suitable for modeling. The reason for this is that most
machine learning methods require the input variables to follow some restrictions. For
instance, some methods require the input variables to be real numbers, others require the
input variables to be in a specific range, etc.
\subsection{Type conversion}
Type conversion is the process of changing the type of the values in the columns. We do
so to make the input variables compatible with the machine learning methods we will use.
The most common type conversion is the conversion from categorical to numerical values.
Ideally, the possible values of a categorical variable are known beforehand.
For instance, given the values $x \in \{a, b, c\}$ in a column, there are two main ways to
convert them to numerical values: label encoding and one-hot encoding. If there is a
natural order $a < b < c$, label encoding is usually sufficient. Otherwise, one-hot encoding
can be used.
Label encoding is the process of replacing the values $x \in \{a, b, c\}$ with the values
$x' \in \{1, 2, 3\}$, where $x' = 1$ if $x = a$, $x' = 2$ if $x = b$, and $x' = 3$ if
$x = c$. Other numerical values can be assigned depending on the specific problem.
One-hot encoding is the process of creating a new column for each possible value
of the categorical variable. The new column is filled with the logical value $1$ if the
value is present and $0$ otherwise.
However, in the case of one-hot encoding, the number of categories might be too large or might not be
known beforehand. So, the preprocessing step must identify the unique values in the
column and create the new columns accordingly. It is common to group the less frequent
values into a single column, called the \emph{other} column. See \cref{tab:one-hot}.
\begin{tablebox}[label=tab:one-hot]{One-hot encoding preprocessing task.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{One-hot encoding}} \\
\midrule
% \textbf{Requirements} &
% A dataset with categorical variables. \\
\textbf{Goal} &
Create a new column for each possible value of the categorical variable. \\
\textbf{Fitting} &
Store the unique values of the categorical variable. If appropriate, indicate
the special category \emph{other}. \\
\textbf{Adjustment} &
Training set is adjusted sample by sample, independently. \\
\textbf{Applying} &
Preprocessor creates a new column for each possible value of the categorical
variable. The new column is filled with the logical value $1$ if the old value
matches the new column and $0$ otherwise. If the value is new or among the less
frequent values, it is assigned to the \emph{other} column. \\
\bottomrule
\end{tabular}
\end{tablebox}
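The sketch below uses scikit-learn's \texttt{OneHotEncoder}; the \texttt{min\_frequency} and \texttt{handle\_unknown} options, which implement the \emph{other} column for rare and unseen categories, assume scikit-learn 1.2 or later.
\begin{verbatim}
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X_train = np.array([["a"], ["b"], ["b"], ["c"]])

# Fitting: store the unique categories seen in the training column.
# min_frequency groups rare categories into an "infrequent" (other) column;
# handle_unknown sends unseen values to that same column instead of failing.
encoder = OneHotEncoder(min_frequency=2, handle_unknown="infrequent_if_exist",
                        sparse_output=False)
encoder.fit(X_train)

# Applying: each value becomes a row of 0/1 indicator columns.
print(encoder.transform(np.array([["b"], ["z"]])))
\end{verbatim}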
The other direction is also common: converting numerical values to categorical values.
This is usually done by binning the numerical variable, either by frequency or by range.
In both cases, the user declares the number of bins. Binning by frequency is done by
finding the percentiles of the values and creating the bins accordingly. Binning by
range is done by dividing the range of the values into equal parts, given the minimum and
maximum values. See \cref{tab:binning}.
\begin{tablebox}[label=tab:binning]{Binning numerical values preprocessing task.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Binning numerical values}} \\
\midrule
% \textbf{Requirements} &
% A dataset with numerical variables. \\
\textbf{Goal} &
Create a new categorical column from a numerical one. \\
\textbf{Fitting} &
Store the range of each bin. \\
\textbf{Adjustment} &
Training set is adjusted sample by sample, independently. \\
\textbf{Applying} &
Preprocessor converts each numerical value to a categorical value by checking
which bin the value falls into. \\
\bottomrule
\end{tabular}
\end{tablebox}
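A minimal sketch with pandas: \texttt{qcut} bins by frequency (percentiles) and \texttt{cut} bins by range; the bin edges found on the training data are stored and reused for new values.
\begin{verbatim}
import pandas as pd

x_train = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0, 20.0])

# Fitting: compute and store the bin edges from the training data.
_, freq_edges = pd.qcut(x_train, q=3, retbins=True)      # binning by frequency
_, range_edges = pd.cut(x_train, bins=3, retbins=True)    # binning by range

# Applying: assign new values to the stored bins.
x_new = pd.Series([2.5, 15.0])
print(pd.cut(x_new, bins=freq_edges, include_lowest=True))
\end{verbatim}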
Another common task, although it receives less attention, is the conversion of dates (or
other interval variables) to numerical values. Interval variables, like dates, have
almost no information in their absolute values. However, the difference between two
dates can be very informative. For example, the difference between the date of birth and
the date of the last purchase gives the age of the customer at the time of that purchase.
\subsection{Normalization}
Normalization is the process of scaling the values in the columns. This is usually done to
keep data within a specific range or to make different variables comparable. For instance,
some machine
learning methods require the input variables to be in the range $[0, 1]$.
The most common normalization methods are standardization and rescaling. The former is done
by subtracting the mean and dividing by the standard deviation of the values in the column.
The latter is performed so that the values are in a specific range, usually $[0, 1]$ or $[-1, 1]$.
Standardization works well when the values in the column are normally distributed.
It not only keeps the values in an expected range but also makes the data distribution
comparable with other variables. Given that $\mu$ is the mean and $\sigma$ is the
standard deviation of the values in the column, the standardization is done by
\begin{equation}
\label{eq:standardization}
x' = \frac{x - \mu}{\sigma}\text{.}
\end{equation}
See \cref{tab:standardization}.
\begin{tablebox}[label=tab:standardization]{Standardization preprocessing task.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Standardization}} \\
\midrule
% \textbf{Requirements} &
% A dataset with numerical variables. \\
\textbf{Goal} &
Scale the values in a column. \\
\textbf{Fitting} &
Store the statistics of the variable: the mean and the standard deviation. \\
\textbf{Adjustment} &
Training set is adjusted sample by sample, independently. \\
\textbf{Applying} &
Preprocessor scales the values according to \cref{eq:standardization}. \\
\bottomrule
\end{tabular}
\end{tablebox}
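A minimal sketch with scikit-learn's \texttt{StandardScaler}, which stores $\mu$ and $\sigma$ during fitting and applies \cref{eq:standardization} to new data.
\begin{verbatim}
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])

scaler = StandardScaler().fit(X_train)    # fitting: stores the mean and std
X_adjusted = scaler.transform(X_train)    # adjustment of the training set
X_new = scaler.transform([[10.0]])        # applying: x' = (x - mu) / sigma
\end{verbatim}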
In the case of rescaling, during production, the preprocessor usually clamps\footnote{The
operation $\clamp(x; a, b)$ where $a$ and $b$ are the lower and upper bounds,
respectively, is defined as $\max(a, \min(b, x))$.} the values after rescaling. This is
done to avoid the model making predictions that are out of the range of the training
data. Given that we want to rescale the values in the column to the range $[a, b]$, and
that $x_\text{min}$ and $x_\text{max}$ are the minimum and maximum values in the column,
the rescaling is done by
\begin{equation}
\label{eq:rescaling}
x' = a + \big(b - a\big) \, \frac{x - x_\text{min}}{x_\text{max} - x_\text{min}}\text{.}
\end{equation}
See \cref{tab:rescaling}.
\begin{tablebox}[label=tab:rescaling]{Rescaling preprocessing task.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Rescaling}} \\
\midrule
% \textbf{Requirements} &
% A dataset with numerical variables. \\
\textbf{Goal} &
Rescale the values in a column. \\
\textbf{Fitting} &
Store the appropriate statistics of the variable: the minimum and the maximum
values. \\
\textbf{Adjustment} &
Training set is adjusted sample by sample, independently. \\
\textbf{Applying} &
Preprocessor scales the values according to \cref{eq:rescaling}. \\
\bottomrule
\end{tabular}
\end{tablebox}
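The sketch below uses scikit-learn's \texttt{MinMaxScaler}; the \texttt{clip=True} option, which clamps transformed values to the target range in production, assumes scikit-learn 0.24 or later.
\begin{verbatim}
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[10.0], [20.0], [30.0]])

# Fitting: store x_min and x_max; clip=True clamps out-of-range values in
# production so that inputs stay within the range seen during training.
scaler = MinMaxScaler(feature_range=(0, 1), clip=True).fit(X_train)

print(scaler.transform([[25.0], [50.0]]))   # 50.0 is clamped to 1.0
\end{verbatim}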
Related to normalization is the log transformation, which applies the logarithm to the
values in the column. This is usually done to make the data distribution more symmetric
around the mean or to reduce the effect of outliers.
% TODO: figure with power law distribution and log transformation
\subsection{Dimensionality reduction}
Dimensionality reduction is the process of reducing the number of variables in the data.
It can identify irrelevant variables and reduce the complexity of the model (since there
are fewer variables to deal with).
% TODO: talk about curse of dimensionality in chapter stl,
% maybe even showing how the linear regression problem must be over-specified
% The so-called \emph{curse of dimensionality} is a
% common problem in machine
% learning, where the number of variables is much larger than the number of observations.
There are two main types of dimensionality reduction algorithms: feature selection and
feature extraction. The former selects a subset of the existing variables that leads
to the best models. The latter creates new variables that are combinations
of the original ones.
% Feature selection can be performed before modeling (filter), together with the model
% search (wrapper), or as a part of the model itself (embedded).
One example of feature selection is ranking the variables by their mutual information with
the target variable and selecting the top $k$ variables. Mutual information is a measure
of the amount of information that one variable gives about another. So, it is expected
that variables with high mutual information with the target variable are more important
for the model.
Feature extraction uses either linear methods, such as \gls{pca}, or non-linear methods,
such as autoencoders. These methods are able to compress the information in the training
data into a smaller number of variables. Thus, the model can learn the solution in
a lower-dimensional space. A drawback of this approach is that the new variables are
hard to interpret, since they are combinations of the original variables.
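The sketch below illustrates both approaches with scikit-learn: selecting the top $k$ variables by mutual information and extracting components with \gls{pca}; the Iris dataset is used only as a placeholder.
\begin{verbatim}
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Feature selection: keep the k variables with the highest mutual
# information with the target variable.
selector = SelectKBest(score_func=mutual_info_classif, k=2).fit(X, y)
X_selected = selector.transform(X)

# Feature extraction: compress the data into 2 linear combinations of
# the original variables (principal components).
pca = PCA(n_components=2).fit(X)
X_extracted = pca.transform(X)
\end{verbatim}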
\subsection{Data enhancement}
The ``opposite'' of dimensionality reduction is data enhancement. This is the process of
bringing external information into the dataset to complement the existing data. For
example, imagine that the tidy data has a column with the zip code of the
customers. We can use this information to join (in this case, always a left join) a
dataset with social and economic information about the region of each zip code.
The preprocessor, then, stores the external dataset and the column used to join the data.
During production, it enhances any new observation with the external information. See
\cref{tab:data-enhancement}.
\begin{tablebox}[label=tab:data-enhancement]{Data enhancement preprocessing task.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Data enhancement}} \\
\midrule
% \textbf{Requirements} &
% A dataset with a column to join. \\
\textbf{Goal} &
Enhance the dataset with external information. \\
\textbf{Fitting} &
Store the external dataset and the column to join. \\
\textbf{Adjustment} &
Training set is left joined with the external dataset. Because of the properties of
the left join, the new dataset has the same number of rows as the original dataset,
and it is equivalent to enhancing each row independently. \\
\textbf{Applying} &
Preprocessor enhances the new data with the external information. \\
\bottomrule
\end{tabular}
\end{tablebox}
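A minimal sketch with pandas, where \texttt{zip\_info} stands for a hypothetical external table with socioeconomic information per zip code.
\begin{verbatim}
import pandas as pd

# Hypothetical external dataset stored by the preprocessor during fitting.
zip_info = pd.DataFrame({"zip": ["01000", "02000"],
                         "avg_income": [52000.0, 61000.0]})

def enhance(rows: pd.DataFrame) -> pd.DataFrame:
    # Applying: the left join keeps exactly one row per incoming observation
    # (zip codes are unique keys here); unknown zip codes get missing values.
    return rows.merge(zip_info, on="zip", how="left")

new_data = pd.DataFrame({"zip": ["02000", "09999"], "age": [34, 51]})
print(enhance(new_data))
\end{verbatim}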
\subsection{Comments on unstructured data}
Any unstructured data can be transformed into structured data, and we can see this
transformation as a data preprocessing task. Techniques like bag of words, word
embeddings, and signal (or image) processing can be seen as preprocessing techniques that
transform unstructured data into structured data, which is suitable for modeling.
Also, modern machine learning methods, like \glspl{cnn}, are models that learn both
the preprocessing and the predictive task at the same time. This is done by using
convolutional layers that learn features of the data. In digital signal processing, this
is called feature extraction. The difference is that, there, the convolution filters are
handcrafted, while in \glspl{cnn} they are learned from the data.
The study of unstructured data is a vast field and is out of the scope of this book. I
recommend \textcite{Jurafsky2008}\footnote{\fullcite{Jurafsky2008}. A new edition is
under preparation and is available for free: \fullcite{Jurafsky2024}.} for a complete
introduction to Natural Language Processing and
\textcite{Szeliski2022}\footfullcite{Szeliski2022} for a comprehensive introduction to
Computer Vision.
% vim: spell spelllang=en