\chapter{Data preprocessing}
\label{chap:preprocess}
\glsresetall
\chapterprecishere{I find your lack of faith disturbing.
\par\raggedleft--- \textup{Darth Vader}, Star Wars: Episode IV -- A New Hope (1977)}
In this chapter, we discuss data preprocessing, which is the process of adjusting the
data to make it suitable for a particular learning machine or, at the least, to ease the
learning process.
Similarly to data handling, data preprocessing is done by applying a series of operations
to the data. However, some of the parameters of the operations are not fixed but rather
are fitted from a data sample. In the context of inductive learning, that sample is the
training set.
The operations depend on the chosen learning method. So, when planning the
solution in our project, we must consider the preprocessing tasks that are necessary to
make the data suitable for the chosen methods.
I present the most common data preprocessing tasks in three categories: data cleaning,
data sampling, and data transformation. For each task, I discuss the behavior of the data
preprocessing techniques in terms of fitting, adjustment of the training set, and
application of the preprocessor in production.
Finally, I discuss the importance of the default behavior of the model when the
preprocessing chain degenerates over a sample, i.e., when the preprocessor decides that it
has no strategy to adjust the data to make it suitable for the model.
\begin{mainbox}{Chapter remarks}
\boxsubtitle{Contents}
\startcontents[chapters]
\printcontents[chapters]{}{1}{}
\vspace{1em}
\boxsubtitle{Context}
\begin{itemize}
\itemsep0em
\item Tidy data is not necessarily suitable for modeling.
\item Parameters of the preprocessor are fitted rather than being fixed.
\end{itemize}
\boxsubtitle{Objectives}
\begin{itemize}
\itemsep0em
\item Understand the main data preprocessing tasks and techniques.
\item Learn the behavior of the preprocessing chain in terms of fitting, adjustment,
and application.
\end{itemize}
\boxsubtitle{Takeaways}
\begin{itemize}
\itemsep0em
\item Each learning method requires specific data preprocessing tasks.
\item Fitting the preprocessor is crucial to avoid leakage.
\item Default behavior of the model when the preprocessing chain degenerates must be
specified.
\end{itemize}
\end{mainbox}
{}
\clearpage
\section{Introduction}
In \cref{chap:data,chap:handling}, we discussed data semantics and the tools to
handle data. They provide the grounds for preparing the data as described in the
data sprint tasks in \cref{sub:workflow}. However, the focus there is to guarantee that the
data is tidy and in the observational unit of interest, not to prepare it for modeling.
As a result, although data might be appropriate for the learning tasks we described in
\cref{chap:slt} --- in the sense that we know what the feature vectors and the target
variable are ---, they might not be suitable for the machine learning methods we will use.
One simple example is the perceptron (\cref{sub:perceptron}), which assumes that all
input variables are real numbers. If the data contains categorical variables, we must
convert them to numerical variables before applying the perceptron.
For this reason, the solution sprint tasks in \cref{sub:workflow} include not only the
learning tasks but also the \emph{data preprocessing} tasks, which are dependent on the
chosen machine learning methods.
\begin{defbox}{Data preprocessing}{preprocessing}
The process of adjusting the data to make it suitable for a particular learning machine
or, at the least, to ease the learning process.
\end{defbox}
This is done by applying a series of operations to the data, like in data handling. The
difference here is that some of the parameters of the operations are not fixed; rather, they
are fitted from a data sample. Once fitted, the operations can be applied to
new data, sample by sample.
As a result, a data preprocessing technique acts in three steps:
\begin{enumerate}
\itemsep0em
\item \textbf{Fitting}: The parameters of the operation are adjusted to the training
data (which has already been integrated and tidied, represents well the phenomenon of
interest, and each sample is in the correct observational unit);
\item \textbf{Adjustment}: The training data is adjusted according to the fitted
parameters, possibly changing the sample size and distribution;
\item \textbf{Applying}: The operation is applied to new data, sample by sample.
\end{enumerate}
Understanding these steps and correctly defining the behavior of each of them is crucial
to avoid \gls{leakage} and to guarantee that the model will behave as expected in
production.
\subsection{Formal definition}
\label{sub:formal-preprocessing}
Let $T = (K, H, c)$ be a table that represents the data in the desired observational unit
--- as defined in \cref{sec:formal-structured-data}. In this chapter, without loss of
generality --- as the keys are not used in the modeling process ---, we can consider $K =
\{1, 2, \dots\}$ such that $\rowcard[i] = 0$ if, and only if, $i > n$. That means that
every row $r \in \{1, \dots, n\}$ is present in the table.
A data preprocessing strategy $F$ is a function that takes a table $T = (K, H, c)$ and
returns an adjusted table $T' = (K', H', c')$ and a fitted \emph{preprocessor} $f(z; \phi)
\equiv f_\phi(z)$ such that $$z \in \bigtimes_{h\, \in\, H} \domainof{h} \cup \{?\}$$ and $\phi$ are
the fitted parameters of the operation. Similarly, $z' = f_\phi(z)$, called the
preprocessed tuple, satisfies $$z' \in \bigtimes_{h'\, \in\, H'} \domainof{h'} \cup
\{?\}\text{.}$$ Note that we make no restrictions on the number of rows in the adjusted
table, i.e., preprocessing techniques can change the number of rows in the training table.
In practice, strategy $F$ is a chain of dependent preprocessing operations $F_1$, \dots,
$F_m$ such that, given $T = T^{(0)}$, each operation $F_i$ is applied to the table
$T^{(i-1)}$ to obtain $T^{(i)}$ and the fitted preprocessor $f_{\phi_i}$. Thus, $T' =
T^{(m)}$ and $$f(z; \phi = \{\phi_1, \dots, \phi_m\}) = \left(f_{\phi_m} \circ \dots \circ
f_{\phi_1}\right)(z)\text{,}$$ where $\circ$ is the composition operator, so that $f_{\phi_1}$ is applied first. I say that
they are dependent since none of the operations can be applied to the table without the
previous ones.
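To make the fitting, adjustment, and application steps concrete, the following sketch (in Python, an illustration only) shows one possible interface for a chain of dependent preprocessors; the \texttt{UnitConversion} and \texttt{Standardization} steps named in the final comment are hypothetical.
\begin{verbatim}
# A minimal sketch of a preprocessing chain: each step is fitted on the
# table produced by the previous step and later applied tuple by tuple.
class Chain:
    def __init__(self, steps):
        self.steps = steps           # list of preprocessing steps F_1, ..., F_m

    def fit_adjust(self, table):
        # Fitting + adjustment: each step is fitted on T^(i-1) and returns T^(i).
        for step in self.steps:
            table = step.fit_adjust(table)
        return table                 # T' = T^(m), used to train the model

    def apply(self, z):
        # Applying: the fitted steps are applied in the order they were fitted.
        for step in self.steps:
            z = step.apply(z)
        return z

# Hypothetical usage: chain = Chain([UnitConversion(...), Standardization()])
\end{verbatim}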
\subsection{Degeneration}
The objective of the fitted preprocessor is to adjust the data to make it suitable for the
model. However, sometimes it cannot achieve this goal for a particular input $z$. This
can happen for many reasons, such as unexpected values, information ``too incomplete'' to
make a prediction, etc.
Formally, we say that the preprocessor $f_\phi$ degenerates over tuple $z$ if it outputs
$z' = f_\phi(z)$ such that $z' = (?, \dots, ?)$. In practice, that means that the
preprocessor decided that it has no strategy to adjust the data to make it suitable for
the model. For the sake of simplicity, if any step $f_{\phi_i}$ degenerates over
tuple $z^{(i)}$, the whole preprocessing chain degenerates\footnote{Usually, this is
implemented as an exception or similar programming mechanism.} over $z = z^{(0)}$.
Consequently, in the implementation of the solution, the developer must choose a default
behavior for the model when the preprocessing chain degenerates over a tuple. It can
be as simple as returning a default value or as complex as redirecting the tuple to a
different pair of preprocessor and model. Sometimes, the developer can choose to
integrate this as an error or warning in the user application.
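As a rough sketch of how this can be implemented, degeneration can be signalled with an exception and caught at prediction time; the \texttt{DegenerateSample} exception and the \texttt{DEFAULT\_PREDICTION} value below are hypothetical names used only for illustration.
\begin{verbatim}
# Degeneration signalled as an exception, with a developer-chosen default.
class DegenerateSample(Exception):
    """Raised when a preprocessing step has no strategy for the input tuple."""

DEFAULT_PREDICTION = 0.0   # assumption: fallback value chosen by the developer

def predict(model, preprocessor, z):
    try:
        z_prime = preprocessor.apply(z)   # may raise DegenerateSample
    except DegenerateSample:
        # Default behavior: return a fixed value; alternatives include routing
        # z to another preprocessor/model pair or raising a warning/error.
        return DEFAULT_PREDICTION
    return model.predict(z_prime)
\end{verbatim}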
\subsection{Data preprocessing tasks}
The most common data preprocessing tasks can be divided into three categories:
\begin{itemize}
\itemsep0em
\item Data cleaning;
\item Data sampling; and
\item Data transformation.
\end{itemize}
In the next sections, I address some of the most common data preprocessing tasks
in each of these categories. I present them in the order they are usually applied in the
preprocessing, but note that the order is not fixed and can be changed according to the
needs of the problem.
\section{Data cleaning}
Data cleaning is the process of removing errors and inconsistencies from the data. This is
usually done to make the data more reliable for training and to avoid bias in the learning
process. Such errors and inconsistencies tend to ``confuse'' the learning machines
and can lead to poorly performing models.
Data cleaning also includes dealing with missing information, which most machine
learning methods cannot cope with. Solutions range from the simple removal of the
observations with missing data to the creation of new information to encode the missing data.
\subsection{Treating inconsistent data}
% TODO: move this somewhere when we talk about data handling and/or tidying
% Sometimes, during data collection, information is recorded using special codes. For
% instance, the value 9999 might be used to indicate that the data is missing. Such codes
% must be replaced with more appropriate values before modeling. If a single variable
% encodes more than one concept, new variables must be created to represent each concept.
There are a few, but important, tasks to be done during data preprocessing regarding
invalid and inconsistent data --- note that we assume that most of the issues with
the semantics of the data have been solved in the data handling phase. Especially in
production, the developer must be aware of the behavior of the model when it faces
information that is not supposed to be present in the data.
One of the tasks is to ensure that physical quantities are expressed in standard units. One must
check whether all columns that store physical quantities have the same unit of
measurement. If not, one must convert the values to the same unit. A summary of this
preprocessing task is presented in \cref{tab:unit-conversion}.
\begin{tablebox}[label=tab:unit-conversion]{Unit conversion preprocessing task.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Unit conversion}} \\
\midrule
% \textbf{Requirements} &
% A variable with the physical quantity and a variable with the unit of measurement. \\
\textbf{Goal} &
Convert physical quantities into the same unit of measurement. \\
\textbf{Fitting} &
None. User must declare the units to be used and, if appropriate, the conversion
factors. \\
\textbf{Adjustment} &
Training set is adjusted sample by sample, independently. \\
\textbf{Applying} &
Preprocessor converts the numerical values and drops the unit of measurement column. \\
\bottomrule
\end{tabular}
\end{tablebox}
Moreover, if one knows that a variable must fall within a specific range of values, one can check
whether the values are within this range. If not, one can replace the out-of-range values with
missing data or with the closest valid value. Alternatively, one can discard the
observation based on that criterion. Consult \cref{tab:range-check} for a summary of this
operation.
\begin{tablebox}[label=tab:range-check]{Range check preprocessing task.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Range check}} \\
\midrule
% \textbf{Requirements} &
% A numerical variable. \\
\textbf{Goal} &
Check whether the values are within the expected range. \\
\textbf{Fitting} &
None. User must declare the valid range of values. \\
\textbf{Adjustment} &
Training set is adjusted sample by sample, independently. If appropriate,
degenerated samples are removed. \\
\textbf{Applying} &
Preprocessor checks whether the value $x$ of a variable is within the range $[a,
b]$. If not, it replaces $x$ with: (a) missing value $?$, (b) the closest valid
value $\max(a, \min(b, x))$, or (c) degenerates (discards the observation). \\
\bottomrule
\end{tabular}
\end{tablebox}
Another common source of inconsistency is that the same category might be
represented by different strings. This is usually fixed by creating a dictionary that maps
the different names to a single one, by standardizing lower or upper case, by removing
special characters, or by applying more advanced fuzzy matching techniques --- see
\cref{tab:text-standardization}.
\begin{tablebox}[label=tab:text-standardization]{Category standardization preprocessing task.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Category standardization}} \\
\midrule
% \textbf{Requirements} &
% A categorical variable. \\
\textbf{Goal} &
Create a dictionary and/or function to map different names to a single one. \\
\textbf{Fitting} &
None. User must declare the mapping. \\
\textbf{Adjustment} &
Training set is adjusted sample by sample, independently. \\
\textbf{Applying} &
Preprocessor replaces the categorical variable $x$ with the mapped
value $f(x)$ that implements case standardization, special character removal, and/or
dictionary fuzzy matching. \\
\bottomrule
\end{tabular}
\end{tablebox}
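A minimal sketch of such a standardization, assuming a user-declared dictionary combined with lower-casing and special-character removal, could look like the following (the category names are made up).
\begin{verbatim}
import re

# User-declared mapping from known variants to a canonical category name.
CANONICAL = {"new york": "New York", "ny": "New York", "nyc": "New York"}

def standardize_category(x):
    # Lower-case and strip special characters before looking up the dictionary.
    key = re.sub(r"[^a-z0-9 ]", "", x.lower()).strip()
    # Unknown values are kept as-is (or could be sent to a fuzzy matcher).
    return CANONICAL.get(key, key)
\end{verbatim}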
Note that the parameters of these techniques are not fitted from the data, but rather are fixed
by the problem definition. As a result, these tasks could be performed in the data handling phase.
The reason we put them here is that new data in production usually comes with the
same issues. Having the fixes programmed into the preprocessor makes it easier to
guarantee that the model will behave as expected in production.
\subsection{Outlier detection}
Outliers are observations that are significantly different from the other observations.
They can be caused by errors or by the presence of different phenomena mixed in the data
collection process. In both cases, it is important to deal with outliers before modeling.
The standard way to deal with outliers is to remove them from the dataset. Assuming that
the errors or the out-of-distribution data appear randomly and rarely, this is a good
strategy.
Another approach is dealing with each variable independently. This way, one can replace
the outlier value with missing data. There are many ways to detect outlier values, but
the simplest one is probably a heuristic based on the \gls{iqr}.
Let $Q_1$ and $Q_3$ be the first and the third quartiles of the values in a variable,
respectively. The \gls{iqr} is defined as $Q_3 - Q_1$. The values that are less than
$Q_1 - 1.5\, \text{IQR}$ or greater than $Q_3 + 1.5\, \text{IQR}$ are considered outliers.
See \cref{tab:iqr-outlier}.
\begin{tablebox}[label=tab:iqr-outlier]{Outlier detection using the interquartile range.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Outlier detection using the IQR}} \\
\midrule
% \textbf{Requirements} &
% A numerical variable. \\
\textbf{Goal} &
Detect outliers using the IQR. \\
\textbf{Fitting} &
Store the values of $Q_1$ and $Q_3$ for each variable. \\
\textbf{Adjustment} &
Training set is adjusted sample by sample, independently. \\
\textbf{Applying} &
Preprocessor replaces the outlier values with missing data. \\
\bottomrule
\end{tabular}
\end{tablebox}
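A minimal sketch of this heuristic, assuming the variable is stored as a pandas \texttt{Series}, follows; the quartiles are stored during fitting and reused when applying the preprocessor.
\begin{verbatim}
import numpy as np
import pandas as pd

class IQROutlierToMissing:
    """Replace values outside [Q1 - 1.5 IQR, Q3 + 1.5 IQR] with missing data."""

    def fit(self, x: pd.Series):
        # Fitting: store the quartiles (and bounds) of the training values.
        q1, q3 = x.quantile(0.25), x.quantile(0.75)
        iqr = q3 - q1
        self.lower, self.upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        return self

    def apply(self, value):
        # Applying: outlier values become missing (NaN); others pass through.
        if value < self.lower or value > self.upper:
            return np.nan
        return value
\end{verbatim}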
More sophisticated methods can be used to detect samples that are outliers, such as
the definition of an outlier in DBSCAN\footfullcite{Ester1996}. However, this is not
enough to fit the parameters of the preprocessor. The reason is that descriptive methods
like DBSCAN --- in this case, a clustering method --- do not generalize to new data.
I suggest using methods like One-Class SVM\footfullcite{Scholkopf2001} to fit the
parameters of the preprocessor that detects outliers. Thus, any new data point can
be classified as an outlier or not.
As with any filtering operation in the pipeline, the developer must specify a default behavior
for the model when an outlier sample is detected in production. See
\cref{tab:outlier-removal}.
\begin{tablebox}[label=tab:outlier-removal]{Task of filtering outliers.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Outlier removal}} \\
\midrule
% \textbf{Requirements} &
% A dataset with outliers. \\
\textbf{Goal} &
Remove the observations that are outliers. \\
\textbf{Fitting} &
Parameters of the outlier classifier. \\
\textbf{Adjustment} &
Training set is adjusted sample by sample, independently, removing
degenerated samples. \\
\textbf{Applying} &
Preprocessor degenerates if the sample is classified as an outlier; otherwise, it
does nothing. \\
\bottomrule
\end{tabular}
\end{tablebox}
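The sketch below implements this task with scikit-learn's \texttt{OneClassSVM}; the hyperparameters are arbitrary, and the \texttt{DegenerateSample} exception is the hypothetical mechanism sketched earlier in this chapter.
\begin{verbatim}
from sklearn.svm import OneClassSVM

class OutlierFilter:
    def __init__(self, nu=0.05):
        # nu roughly bounds the fraction of training points treated as outliers.
        self.clf = OneClassSVM(nu=nu, kernel="rbf", gamma="scale")

    def fit(self, X):
        # Fitting: learn the boundary of the "normal" region from the training set.
        self.clf.fit(X)
        return self

    def apply(self, x_row):
        # Applying: degenerate (discard) if the sample is classified as an outlier.
        if self.clf.predict([x_row])[0] == -1:
            raise DegenerateSample("sample classified as an outlier")
        return x_row
\end{verbatim}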
\subsection{Treating missing data}
Since most models cannot handle missing data, it is crucial to deal with it in the data
preprocessing.
There are four main strategies to deal with missing data:
\begin{itemize}
\itemsep0em
\item Remove the observations (rows) with missing data;
\item Remove the variables (columns) with missing data;
\item Just impute the missing data;
\item Use an indicator variable to mark the missing data and impute it.
\end{itemize}
Removing rows or columns is commonly used when the amount of missing data is small
compared to the total number of rows or columns. However, be aware that removing rows
``on demand'' can artificially change the data distribution, especially when the data is
not missing at random. Row removal suffers from the same problem as any filtering operation
(degeneration) in the preprocessing step; the developer must specify a default behavior
for the model when a row is discarded in production. See \cref{tab:row-removal-missing}.
\begin{tablebox}[label=tab:row-removal-missing]{Task of filtering rows based on missing data.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Row removal based on missing data}} \\
\midrule
% \textbf{Requirements} &
% A dataset with missing data. \\
\textbf{Goal} &
Remove the observations with missing data in any (or some) variables. \\
\textbf{Fitting} &
None. Variables to look for missing data are declared beforehand. \\
\textbf{Adjustment} &
Training set is adjusted sample by sample, independently, removing
degenerated samples. \\
\textbf{Applying} &
Preprocessor degenerates over the rows with missing data in the specified variables.
\\
\bottomrule
\end{tabular}
\end{tablebox}
In the case of column removal, the preprocessor simply learns, during fitting, which
columns have missing data and drops them. Beware that valuable information might be lost
when removing columns for all the samples. See \cref{tab:col-drop-missing}.
\begin{tablebox}[label=tab:col-drop-missing]{Task of dropping columns based on missing data.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Column removal based on missing data}} \\
\midrule
% \textbf{Requirements} &
% A dataset with missing data. \\
\textbf{Goal} &
Remove the variables with missing data. \\
\textbf{Fitting} &
All variables with missing data in the training set are marked to be removed. \\
\textbf{Adjustment} &
Columns marked are dropped from the training set. \\
\textbf{Applying} &
Preprocessor drops the columns chosen during fitting. \\
\bottomrule
\end{tabular}
\end{tablebox}
Imputing the missing data is usually done by replacing the missing values with some
statistic of the available values in the column, such as the mean, the median, or the
mode\footnote{More sophisticated methods can be used, such as the k-nearest neighbors
algorithm; see \fullcite{Troyanskaya2001}.}. This is a simple and
effective strategy, but it can introduce bias in the data, especially when the number of
samples with missing data is large. See \cref{tab:imputation}.
\begin{tablebox}[label=tab:imputation]{Task of imputing missing data.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Imputation of missing data}} \\
\midrule
% \textbf{Requirements} &
% A dataset with missing data. \\
\textbf{Goal} &
Replace the missing data with a statistic of the available values. \\
\textbf{Fitting} &
The statistic is calculated from the available data in the training set. \\
\textbf{Adjustment} &
Training set is adjusted sample by sample, independently. \\
\textbf{Applying} &
Preprocessor replaces the missing values with the chosen statistic. If an indicator
variable is required, it is created and filled with the logical value:
missing or not missing. \\
\bottomrule
\end{tabular}
\end{tablebox}
Imputation alone is not suitable when one is not sure whether the data is missing
because of a systematic error or phenomenon. A model can learn the effect of the
underlying reason for the missingness on the predictive task.
In that case, creating an indicator variable is a good strategy. This is done by creating
a new column that contains a logical value indicating whether the data is missing or
not\footnote{Some kind of imputation is still needed, but we expect the model to deal
better with it since it can decide using both the indicator and the original variable.}.
Many times, the indicator variable is already present in the data. For instance, consider
a dataset that contains information about pregnancy, say the number of days since
the last pregnancy. This information will certainly be missing if the sex is male
or the number of children is zero. In this case, no new indicator variable is needed.
See \cref{tab:imputation}.
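As a sketch of imputation with an indicator variable, scikit-learn's \texttt{SimpleImputer} covers both variants (assuming scikit-learn is available); the numbers below are made up.
\begin{verbatim}
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 7.0], [np.nan, 5.0], [3.0, np.nan]])

# Fitting: the median of each column is computed from the available values.
# add_indicator=True appends one logical column per variable with missing data.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_adjusted = imputer.fit_transform(X_train)     # adjustment of the training set

# Applying: new samples are imputed with the stored medians, sample by sample.
X_new = imputer.transform(np.array([[np.nan, 4.0]]))
\end{verbatim}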
\section{Data sampling}
Once data is cleaned, the next step is (typically) to sample the data. Sampling is the
process of selecting a random subset of the data or creating variations of the original
training set.
There are three main tasks that sample the data: random sampling, scope filtering, and class
balancing.
\subsection{Random sampling}
\label{sub:random-sampling}
Some machine learning methods are computationally expensive, and a smaller dataset might be
enough to solve the problem. Random sampling is simply done by selecting a random subset
of the training data with a user-defined size.
However, note that the preprocessor for this task \emph{must never do anything with the
new data} (or the test set we discuss in \cref{chap:planning}). See
\cref{tab:random-sampling}.
\begin{tablebox}[label=tab:random-sampling]{Task of random sampling.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Random sampling}} \\
\midrule
% \textbf{Requirements} &
% A dataset with the scope of the phenomenon. \\
\textbf{Goal} &
Select a random subset of the training data. \\
\textbf{Fitting} &
None. User must declare the size of the sample. \\
\textbf{Adjustment} &
Rows of the training set are randomly chosen. \\
\textbf{Applying} &
Pass-through: preprocessor does nothing with the new data. \\
\bottomrule
\end{tabular}
\end{tablebox}
\subsection{Scope filtering}
Scope filtering is the process of reducing the scope of the phenomenon we want to model.
Like the filtering operation in the data handling pipeline (consult \cref{sub:filtering}),
the data scientist chooses a set of predefined rules to filter the data.
Unlike outlier detection, we assume that the rule is fixed and known beforehand. The
preprocessor degenerates over the samples that do not satisfy the rule. A summary of the
task is presented in \cref{tab:scope-filtering}.
\begin{tablebox}[label=tab:scope-filtering]{Task of filtering the scope of the data.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Scope filtering}} \\
\midrule
% \textbf{Requirements} &
% A dataset with the scope of the phenomenon. \\
\textbf{Goal} &
Remove the observations that do not satisfy a predefined rule. \\
\textbf{Fitting} &
None. User must declare the rule. \\
\textbf{Adjustment} &
Training set is adjusted sample by sample, independently, removing degenerated
samples. \\
\textbf{Applying} &
Preprocessor degenerates over the samples that do not satisfy the rule. \\
\bottomrule
\end{tabular}
\end{tablebox}
An interesting variation is model trees\footfullcite{Freek2015}. They are shallow
decision trees that are used to split the data. At each leaf, a different model is
trained with the data that satisfies the rules leading to that leaf. This is a good
strategy when the phenomenon is complex and can be divided into simpler subproblems.
In this case, the preprocessor does not degenerate over the samples, but rather the
preprocessing chain branches into different models (and potentially other preprocessing
steps).
\subsection{Class balancing}
Some data classification methods are heavily affected by the number of observations in each
class. This is especially true for methods that learn the class priors directly from the
data, like the naïve Bayes classifier.
Two strategies are often used to balance the classes: oversampling and undersampling. The
former is done by creating synthetic observations of the minority class. The latter is
done by removing observations of the majority class.
Undersampling is usually done by removing observations of the majority class at random
(similarly to random sampling, \cref{sub:random-sampling}). For oversampling, the
most common method is resampling\footnote{Sometimes called bootstrapping.}, which draws
observations of the minority class at random with replacement. A drawback of this method
is that it produces repeated observations that carry no new information.
In any case, the preprocessor for this task \emph{must never do anything with the new
data} (or the test set we discuss in \cref{chap:planning}). See
\cref{tab:class-balancing}.
\begin{tablebox}[label=tab:class-balancing]{Task of class balancing.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Class balancing}} \\
\midrule
% \textbf{Requirements} &
% A dataset with unbalanced classes. \\
\textbf{Goal} &
Balance the number of observations in each class. \\
\textbf{Fitting} &
None. User must declare the number of samples in each class. \\
\textbf{Adjustment} &
Rows of the training set are randomly chosen. \\
\textbf{Applying} &
Pass-through: preprocessor does nothing with the new data. \\
\bottomrule
\end{tabular}
\end{tablebox}
More advanced sampling methods exist. For instance, the SMOTE
algorithm\footfullcite{chawla2002smote} creates synthetic observations of the minority
class without repeating the same observations.
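A sketch of both strategies follows: random over- and undersampling can be done with scikit-learn's \texttt{resample} utility, while SMOTE is provided by the separate imbalanced-learn package (an assumption; it is not part of scikit-learn). Remember that both act only on the training set.
\begin{verbatim}
from sklearn.utils import resample

def oversample_minority(X_min, y_min, n_samples, seed=0):
    # Resampling (bootstrapping): draw minority-class rows with replacement
    # until the requested number of samples is reached.
    return resample(X_min, y_min, replace=True, n_samples=n_samples,
                    random_state=seed)

# Alternative (assumes the imbalanced-learn package is installed):
# from imblearn.over_sampling import SMOTE
# X_balanced, y_balanced = SMOTE().fit_resample(X_train, y_train)
\end{verbatim}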
\section{Data transformation}
Another important preprocessing task is data transformation. This is the process of
adjusting the types of the data and the choice of variables to make them suitable for
modeling.
At this point, the data format is acceptable, i.e., each observation is in the correct
observational unit, there are no missing values, and the sample is representative of the
phenomenon of interest. Now, we can perform a series of operations to make the
columns' types and values suitable for modeling. The reason for this is that most
machine learning methods require the input variables to follow some restrictions. For
instance, some methods require the input variables to be real numbers, others require the
input variables to be in a specific range, etc.
\subsection{Type conversion}
Type conversion is the process of changing the type of the values in the columns. We do
so to make the input variables compatible with the machine learning methods we will use.
The most common type conversion is the conversion from categorical to numerical values.
Ideally, the possible values of a categorical variable are known beforehand.
For instance, given the values $x \in \{a, b, c\}$ in a column, there are two main ways to
convert them to numerical values: label encoding and one-hot encoding. If there is a
natural order $a < b < c$, label encoding is usually sufficient. Otherwise, one-hot encoding
can be used.
Label encoding is the process of replacing the values $x \in \{a, b, c\}$ with the values
$x' \in \{1, 2, 3\}$, where $x' = 1$ if $x = a$, $x' = 2$ if $x = b$, and $x' = 3$ if
$x = c$. Other numerical values can be assigned depending on the specific problem.
One-hot encoding is the process of creating a new column for each possible value
of the categorical variable. The new column is filled with the logical value $1$ if the
value is present and $0$ otherwise.
However, in the case of one-hot encoding, the number of categories might be too large or might not be
known beforehand. So, the preprocessing step must identify the unique values in the
column and create the new columns accordingly. It is common to group the less frequent
values into a single column, called the \emph{other} column. See \cref{tab:one-hot}.
\begin{tablebox}[label=tab:one-hot]{One-hot encoding preprocessing task.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{One-hot encoding}} \\
\midrule
% \textbf{Requirements} &
% A dataset with categorical variables. \\
\textbf{Goal} &
Create a new column for each possible value of the categorical variable. \\
\textbf{Fitting} &
Store the unique values of the categorical variable. If appropriate, indicate
the special category \emph{other}. \\
\textbf{Adjustment} &
Training set is adjusted sample by sample, independently. \\
\textbf{Applying} &
Preprocessor creates a new column for each possible value of the categorical
variable. The new column is filled with the logical value $1$ if the old value
matches the new column and $0$ otherwise. If the value is new or among the less
frequent values, it is assigned to the \emph{other} column. \\
\bottomrule
\end{tabular}
\end{tablebox}
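The sketch below uses scikit-learn's \texttt{OneHotEncoder}; the \texttt{min\_frequency} and \texttt{handle\_unknown} options, which implement the \emph{other} column for rare and unseen categories, assume scikit-learn 1.2 or later.
\begin{verbatim}
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X_train = np.array([["a"], ["b"], ["b"], ["c"]])

# Fitting: store the unique categories seen in the training column.
# min_frequency groups rare categories into an "infrequent" (other) column;
# handle_unknown sends unseen values to that same column instead of failing.
encoder = OneHotEncoder(min_frequency=2, handle_unknown="infrequent_if_exist",
                        sparse_output=False)
encoder.fit(X_train)

# Applying: each value becomes a row of 0/1 indicator columns.
print(encoder.transform(np.array([["b"], ["z"]])))
\end{verbatim}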
The other direction is also common: converting numerical values to categorical values.
This is usually done by binning the numerical variable, either by frequency or by range.
In both cases, the user declares the number of bins. Binning by frequency is done by
finding the percentiles of the values and creating the bins accordingly. Binning by
range is done by dividing the range of the values into equal parts, given the minimum and
maximum values. See \cref{tab:binning}.
\begin{tablebox}[label=tab:binning]{Binning numerical values preprocessing task.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Binning numerical values}} \\
\midrule
% \textbf{Requirements} &
% A dataset with numerical variables. \\
\textbf{Goal} &
Create a new categorical column from a numerical one. \\
\textbf{Fitting} &
Store the range of each bin. \\
\textbf{Adjustment} &
Training set is adjusted sample by sample, independently. \\
\textbf{Applying} &
Preprocessor converts each numerical value to a categorical value by checking
which bin the value falls into. \\
\bottomrule
\end{tabular}
\end{tablebox}
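A minimal sketch with pandas: \texttt{qcut} bins by frequency (percentiles) and \texttt{cut} bins by range; the bin edges found on the training data are stored and reused for new values.
\begin{verbatim}
import pandas as pd

x_train = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0, 20.0])

# Fitting: compute and store the bin edges from the training data.
_, freq_edges = pd.qcut(x_train, q=3, retbins=True)      # binning by frequency
_, range_edges = pd.cut(x_train, bins=3, retbins=True)    # binning by range

# Applying: assign new values to the stored bins.
x_new = pd.Series([2.5, 15.0])
print(pd.cut(x_new, bins=freq_edges, include_lowest=True))
\end{verbatim}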
Another common task, although it receives less attention, is the conversion of dates (or
other interval variables) to numerical values. Interval variables, like dates, have
almost no information in their absolute values. However, the difference between two
dates can be very informative. For example, the difference between the date of birth and
the date of the last purchase gives the age of the customer at the time of that purchase.
\subsection{Normalization}
Normalization is the process of scaling the values in the columns. This is usually done to
keep data within a specific range or to make different variables comparable. For instance,
some machine
learning methods require the input variables to be in the range $[0, 1]$.
The most common normalization methods are standardization and rescaling. The former is done
by subtracting the mean and dividing by the standard deviation of the values in the column.
The latter is performed so that the values are in a specific range, usually $[0, 1]$ or $[-1, 1]$.
Standardization works well when the values in the column are normally distributed.
It not only keeps the values in an expected range but also makes the data distribution
comparable with other variables. Given that $\mu$ is the mean and $\sigma$ is the
standard deviation of the values in the column, the standardization is done by
\begin{equation}
\label{eq:standardization}
x' = \frac{x - \mu}{\sigma}\text{.}
\end{equation}
See \cref{tab:standardization}.
\begin{tablebox}[label=tab:standardization]{Standardization preprocessing task.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Standardization}} \\
\midrule
% \textbf{Requirements} &
% A dataset with numerical variables. \\
\textbf{Goal} &
Scale the values in a column. \\
\textbf{Fitting} &
Store the statistics of the variable: the mean and the standard deviation. \\
\textbf{Adjustment} &
Training set is adjusted sample by sample, independently. \\
\textbf{Applying} &
Preprocessor scales the values according to \cref{eq:standardization}. \\
\bottomrule
\end{tabular}
\end{tablebox}
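A minimal sketch with scikit-learn's \texttt{StandardScaler}, which stores $\mu$ and $\sigma$ during fitting and applies \cref{eq:standardization} to new data.
\begin{verbatim}
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])

scaler = StandardScaler().fit(X_train)    # fitting: stores the mean and std
X_adjusted = scaler.transform(X_train)    # adjustment of the training set
X_new = scaler.transform([[10.0]])        # applying: x' = (x - mu) / sigma
\end{verbatim}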
In the case of rescaling, during production, the preprocessor usually clamps\footnote{The
operation $\clamp(x; a, b)$ where $a$ and $b$ are the lower and upper bounds,
respectively, is defined as $\max(a, \min(b, x))$.} the values after rescaling. This is
done to avoid the model making predictions that are out of the range of the training
data. Given that we want to rescale the values in the column to the range $[a, b]$, and
that $x_\text{min}$ and $x_\text{max}$ are the minimum and maximum values in the column,
the rescaling is done by
\begin{equation}
\label{eq:rescaling}
x' = a + \big(b - a\big) \, \frac{x - x_\text{min}}{x_\text{max} - x_\text{min}}\text{.}
\end{equation}
See \cref{tab:rescaling}.
\begin{tablebox}[label=tab:rescaling]{Rescaling preprocessing task.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Rescaling}} \\
\midrule
% \textbf{Requirements} &
% A dataset with numerical variables. \\
\textbf{Goal} &
Rescale the values in a column. \\
\textbf{Fitting} &
Store the appropriate statistics of the variable: the minimum and the maximum
values. \\
\textbf{Adjustment} &
Training set is adjusted sample by sample, independently. \\
\textbf{Applying} &
Preprocessor scales the values according to \cref{eq:rescaling}. \\
\bottomrule
\end{tabular}
\end{tablebox}
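The sketch below uses scikit-learn's \texttt{MinMaxScaler}; the \texttt{clip=True} option, which clamps transformed values to the target range in production, assumes scikit-learn 0.24 or later.
\begin{verbatim}
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[10.0], [20.0], [30.0]])

# Fitting: store x_min and x_max; clip=True clamps out-of-range values in
# production so that inputs stay within the range seen during training.
scaler = MinMaxScaler(feature_range=(0, 1), clip=True).fit(X_train)

print(scaler.transform([[25.0], [50.0]]))   # 50.0 is clamped to 1.0
\end{verbatim}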
Related to normalization is the log transformation, which applies the logarithm to the
values in the column. This is usually done to make the data distribution more symmetric
around the mean or to reduce the effect of outliers.
% TODO: figure with power law distribution and log transformation
\subsection{Dimensionality reduction}
Dimensionality reduction is the process of reducing the number of variables in the data.
It can identify irrelevant variables and reduce the complexity of the model (since there
are fewer variables to deal with).
% TODO: talk about curse of dimensionality in chapter stl,
% maybe even showing how the linear regression problem must be over-specified
% The so-called \emph{curse of dimensionality} is a
% common problem in machine
% learning, where the number of variables is much larger than the number of observations.
There are two main types of dimensionality reduction algorithms: feature selection and
feature extraction. The former selects a subset of the existing variables that leads
to the best models. The latter creates new variables that are combinations
of the original ones.
% Feature selection can be performed before modeling (filter), together with the model
% search (wrapper), or as a part of the model itself (embedded).
One example of feature selection is ranking the variables by their mutual information with
the target variable and selecting the top $k$ variables. Mutual information is a measure
of the amount of information that one variable gives about another. So, it is expected
that variables with high mutual information with the target variable are more important
for the model.
Feature extraction uses either linear methods, such as \gls{pca}, or non-linear methods,
such as autoencoders. These methods are able to compress the information in the training
data into a smaller number of variables. Thus, the model can learn the solution in
a lower-dimensional space. A drawback of this approach is that the new variables are
hard to interpret, since they are combinations of the original variables.
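The sketch below illustrates both approaches with scikit-learn: selecting the top $k$ variables by mutual information and extracting components with \gls{pca}; the Iris dataset is used only as a placeholder.
\begin{verbatim}
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Feature selection: keep the k variables with the highest mutual
# information with the target variable.
selector = SelectKBest(score_func=mutual_info_classif, k=2).fit(X, y)
X_selected = selector.transform(X)

# Feature extraction: compress the data into 2 linear combinations of
# the original variables (principal components).
pca = PCA(n_components=2).fit(X)
X_extracted = pca.transform(X)
\end{verbatim}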
\subsection{Data enhancement}
The ``opposite'' of dimensionality reduction is data enhancement. This is the process of
bringing external information into the dataset to complement the existing data. For
example, imagine that the tidy data has a column with the zip code of the
customers. We can use this information to join (in this case, always a left join) a
dataset with social and economic information about the region of each zip code.
The preprocessor, then, stores the external dataset and the column used to join the data.
During production, it enhances any new observation with the external information. See
\cref{tab:data-enhancement}.
\begin{tablebox}[label=tab:data-enhancement]{Data enhancement preprocessing task.}
\centering
\rowcolors{2}{black!10!white}{}
\begin{tabular}{lp{6cm}}
\toprule
\multicolumn{2}{c}{\textbf{Data enhancement}} \\
\midrule
% \textbf{Requirements} &
% A dataset with a column to join. \\
\textbf{Goal} &
Enhance the dataset with external information. \\
\textbf{Fitting} &
Store the external dataset and the column to join. \\
\textbf{Adjustment} &
Training set is left joined with the external dataset. Because of the properties of
the left join, the new dataset has the same number of rows as the original dataset,
and it is equivalent to enhancing each row independently. \\
\textbf{Applying} &
Preprocessor enhances the new data with the external information. \\
\bottomrule
\end{tabular}
\end{tablebox}
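A minimal sketch with pandas, where \texttt{zip\_info} stands for a hypothetical external table with socioeconomic information per zip code.
\begin{verbatim}
import pandas as pd

# Hypothetical external dataset stored by the preprocessor during fitting.
zip_info = pd.DataFrame({"zip": ["01000", "02000"],
                         "avg_income": [52000.0, 61000.0]})

def enhance(rows: pd.DataFrame) -> pd.DataFrame:
    # Applying: the left join keeps exactly one row per incoming observation
    # (zip codes are unique keys here); unknown zip codes get missing values.
    return rows.merge(zip_info, on="zip", how="left")

new_data = pd.DataFrame({"zip": ["02000", "09999"], "age": [34, 51]})
print(enhance(new_data))
\end{verbatim}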
\subsection{Comments on unstructured data}
Any unstructured data can be transformed into structured data, and we can see this
transformation as a data preprocessing task. Techniques like bag of words, word
embeddings, and signal (or image) processing can be seen as preprocessing techniques that
transform unstructured data into structured data, which is suitable for modeling.
Also, modern machine learning methods, like \glspl{cnn}, are models that learn both
the preprocessing and the predictive task at the same time. This is done by using
convolutional layers that learn features of the data. In digital signal processing, this
is called feature extraction. The difference is that, there, the convolution filters are
handcrafted, while in \glspl{cnn} they are learned from the data.
The study of unstructured data is a vast field and is out of the scope of this book. I
recommend \textcite{Jurafsky2008}\footnote{\fullcite{Jurafsky2008}. A new edition is
under preparation and is available for free: \fullcite{Jurafsky2024}.} for a complete
introduction to Natural Language Processing and
\textcite{Szeliski2022}\footfullcite{Szeliski2022} for a comprehensive introduction to
Computer Vision.
% vim: spell spelllang=en