
Reading CSV Data

Nejc Ilenic edited this page Mar 28, 2019 · 6 revisions

doddle-model expects input CSV files in a specific format. Each file must start with two header lines: the first contains the feature names and the second contains the feature types. Two feature types are currently supported: numerical features (denoted with n) and categorical features (denoted with c). Categorical features are encoded to numerical values with a label encoder while the file is loaded. Both feature types support missing values (denoted as NA in the example below).

An example of a file:

sepal_length,sepal_width,petal_length,petal_width,label
n,n,n,n,c
NA,3.5,1.4,0.2,Iris Setosa
4.9,3.0,1.4,0.2,Iris Setosa
4.7,NA,1.3,0.2,Iris Setosa
5.1,3.1,1.5,0.2,Iris Setosa
5.0,3.6,1.4,0.2,NA
...
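
To make the format concrete, here is a minimal sketch in plain Scala of how the two header lines and the NA marker could be interpreted. It is not doddle-model's actual loader: `CsvFormatSketch`, `parseHeader` and `encodeColumn` are hypothetical names, and representing missing values as NaN and numbering categories in order of first appearance are assumptions made for illustration.

```scala
// Illustrative only: all names here are hypothetical, not part of doddle-model.
object CsvFormatSketch {
  // Feature types mirroring the 'n' and 'c' markers of the second header line.
  sealed trait FeatureType
  case object Numerical extends FeatureType
  case object Categorical extends FeatureType

  /** Pairs each feature name (first header line) with its type (second header line). */
  def parseHeader(namesLine: String, typesLine: String): Vector[(String, FeatureType)] = {
    val names = namesLine.split(",").map(_.trim).toVector
    val types = typesLine.split(",").map(_.trim).toVector
    require(names.length == types.length, "both header lines must have the same number of columns")
    names.zip(types).map {
      case (name, "n") => (name, Numerical)
      case (name, "c") => (name, Categorical)
      case (name, other) =>
        throw new IllegalArgumentException(s"unknown type '$other' for feature '$name'")
    }
  }

  /** Encodes one column as doubles. Missing values become NaN (an assumption);
    * categorical values are numbered in order of first appearance (also an assumption). */
  def encodeColumn(values: Seq[String], featureType: FeatureType, naString: String = "NA"): Vector[Double] =
    featureType match {
      case Numerical =>
        values.map(v => if (v == naString) Double.NaN else v.toDouble).toVector
      case Categorical =>
        // Build the label-to-code mapping from the non-missing values only.
        val codes = values.filter(_ != naString).distinct.zipWithIndex.toMap
        values.map(v => codes.get(v).map(_.toDouble).getOrElse(Double.NaN)).toVector
    }
}
```

Under these assumptions, `CsvFormatSketch.parseHeader("sepal_length,label", "n,c")` marks sepal_length as numerical and label as categorical. Encoding the values "Iris Setosa", "Iris Setosa", "NA" as a categorical column gives code 0 for the first two entries and NaN for the missing one.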

The file can then be read with the following function:

import java.io.File

// naString is the marker that denotes missing values in the file
val (data, featureIndex) = loadCsvDataset(new File("/path/to/local/data/dataset.csv"), naString = "NA")

// 'label' is the last column (index x.cols); split it from the features and drop it from the feature index
val (x, y) = (data(::, 0 to -2), data(::, -1))
val fixedFeatureIndex = featureIndex.drop(x.cols)

A fully working code example can be found here.
