Reading CSV Data
Nejc Ilenic edited this page Mar 28, 2019 · 6 revisions
doddle-model expects input CSV files in a specific format. Each file must start with two header lines: the first contains the feature names and the second contains the feature types. Two feature types are currently supported: numerical features (denoted with n) and categorical features (denoted with c). Categorical features are encoded to numerical values with a label encoder while the file is loaded. Both feature types support missing values (denoted as NA in the example below).
An example of a file:
sepal_length,sepal_width,petal_length,petal_width,label
n,n,n,n,c
NA,3.5,1.4,0.2,Iris Setosa
4.9,3.0,1.4,0.2,Iris Setosa
4.7,NA,1.3,0.2,Iris Setosa
5.1,3.1,1.5,0.2,Iris Setosa
5.0,3.6,1.4,0.2,NA
...
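To make the format concrete, here is a minimal sketch of how the two header lines and the label encoding could be interpreted using only the Scala standard library. It is not doddle-model's implementation: the names CsvFormatSketch, Parsed and parse are hypothetical, and two behaviours are assumptions made for illustration, namely that missing values become Double.NaN and that categorical codes are assigned in order of first appearance.

```scala
// Hypothetical sketch of the CSV format described above; not doddle-model's code.
object CsvFormatSketch {

  final case class Parsed(
      names: Vector[String],                  // first header line
      types: Vector[String],                  // second header line: "n" or "c"
      rows: Vector[Vector[Double]],           // data with categorical values encoded
      encoders: Map[Int, Map[String, Double]] // column index -> (label -> code)
  )

  def parse(lines: Seq[String], naString: String = "NA"): Parsed = {
    require(lines.length >= 2, "two header lines are required")
    val names = lines(0).split(",", -1).map(_.trim).toVector
    val types = lines(1).split(",", -1).map(_.trim).toVector
    require(names.length == types.length, "header lines must have equal length")
    require(types.forall(t => t == "n" || t == "c"), "feature types must be n or c")

    val cells: Vector[Vector[String]] =
      lines.drop(2).map(line => line.split(",", -1).map(_.trim).toVector).toVector

    // One label encoder per categorical column. Assumption: codes follow the
    // order in which distinct non-missing labels first appear.
    val encoders: Map[Int, Map[String, Double]] =
      types.indices
        .filter(j => types(j) == "c")
        .map { j =>
          val labels = cells.map(row => row(j)).filter(_ != naString).distinct
          j -> labels.zipWithIndex.map { case (label, code) => label -> code.toDouble }.toMap
        }
        .toMap

    val rows = cells.map { row =>
      row.zipWithIndex.map {
        case (value, _) if value == naString => Double.NaN // assumption: NaN marks missing
        case (value, j) if types(j) == "c"   => encoders(j)(value)
        case (value, _)                      => value.toDouble
      }
    }

    Parsed(names, types, rows, encoders)
  }
}
```

Calling CsvFormatSketch.parse on the header lines and data rows of the example above (without the trailing "..." line) yields five rows of five values each, with the missing cells stored as NaN.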
The file can then be read with the following function:
import java.io.File

val (data, featureIndex) = loadCsvDataset(new File("/path/to/local/data/dataset.csv"), naString = "NA")
// 'label' is the last column, drop it from data and feature index
val (x, y) = (data(::, 0 to -2), data(::, -1))
// x.cols equals the column position of 'label' in the original data
val fixedFeatureIndex = featureIndex.drop(x.cols)
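The slices in the snippet above read as "every column except the last" and "the last column". The same split can be sketched on plain Scala collections. SplitLabelSketch and splitLastColumn are hypothetical names, and the vectors are stand-ins for the loaded data and feature index, not doddle-model types.

```scala
// Hypothetical stand-in for the feature/label split; not doddle-model's API.
object SplitLabelSketch {

  // Returns (features, labels, remaining column names), treating the last
  // column as the label.
  def splitLastColumn(
      rows: Vector[Vector[Double]],
      names: Vector[String]
  ): (Vector[Vector[Double]], Vector[Double], Vector[String]) = {
    require(names.nonEmpty, "at least one column is required")
    require(rows.forall(_.length == names.length), "every row must match the header")
    val x = rows.map(_.init) // all columns but the last
    val y = rows.map(_.last) // the last column only
    (x, y, names.init)       // drop the label name as well
  }
}
```

Keeping the column names in step with the data is the point of the final line of the original snippet: once the label column is removed from the data, its entry has to be removed from the feature index too.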
A fully working code example can be found here.