import preproc "github.com/rom1mouret/ml-essentials/preprocessing"
- FloatImputer
- Scaler
- HashEncoder
- OneHotEncoder
- AutoPreprocessor, a processor that combines the 4 components above.
Preprocessors follow these design principles:
For example, FloatImputer will use every float column. If you want to train FloatImputer on a subset of float columns, use ColumnView as follows:
imputer := preproc.NewFloatImputer(preproc.FloatImputerOptions{Policy: Mean})
imputer.Fit(df.ColumnView("height", "age"))
Once the imputer is trained on a subset of columns, such as "height" and "age", any other columns present when performing the transformation are left alone:
imputer.Fit(df.ColumnView("height", "age"))
imputer.TransformInplace(df.ColumnView("height", "age", "weight"))
In the example above, "weight" will be ignored.
See preprocessing/interfaces.go
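// serialization (processor is a fitted preprocessor; the variable is not
// named preproc, to avoid shadowing the package alias)
serialized, err = json.Marshal(processor)
// deserialization
processor = &preproc.AutoPreprocessor{}
err = json.Unmarshal(serialized, processor)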
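For readers who want to see the full round trip with error handling, here is a self-contained sketch using only `encoding/json`. The `fittedScaler` type and its fields are hypothetical and exist only for illustration; the actual serialized layout of the library's preprocessors may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// fittedScaler is a hypothetical type used only for illustration.
// Exported fields are what encoding/json writes out and reads back.
type fittedScaler struct {
	Columns []string  `json:"columns"`
	Means   []float64 `json:"means"`
	Stds    []float64 `json:"stds"`
}

// roundTrip serializes a fitted scaler and restores it into a fresh value.
func roundTrip(in *fittedScaler) (*fittedScaler, error) {
	serialized, err := json.Marshal(in)
	if err != nil {
		return nil, fmt.Errorf("marshal: %w", err)
	}
	out := &fittedScaler{}
	if err := json.Unmarshal(serialized, out); err != nil {
		return nil, fmt.Errorf("unmarshal: %w", err)
	}
	return out, nil
}

func main() {
	original := &fittedScaler{
		Columns: []string{"height", "age"},
		Means:   []float64{1.7, 35},
		Stds:    []float64{0.1, 12},
	}
	restored, err := roundTrip(original)
	if err != nil {
		panic(err)
	}
	fmt.Println(restored.Columns, restored.Means, restored.Stds)
}
```

Checking both errors matters in practice: a silently failed `Unmarshal` would leave an unfitted preprocessor that still looks usable.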
To one-hot encode strings, first run a HashEncoder to transform strings into integers. Then call OneHotEncoder to transform the integer categories into boolean columns.
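The two-step pipeline can be sketched without the library as follows. The helper names `hashString` and `oneHot` are hypothetical, and the choice of FNV-1a from the standard library is an assumption; the hash actually used by HashEncoder may be different.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// hashString maps a string to an integer category.
// FNV-1a is an assumption here; the library may use a different hash.
func hashString(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s)) // Write on a hash.Hash never returns an error
	return h.Sum64()
}

// oneHot turns integer categories into one boolean column per distinct category.
// Columns are ordered by category value so the layout is deterministic.
func oneHot(categories []uint64) (keys []uint64, columns [][]bool) {
	seen := make(map[uint64]bool)
	for _, c := range categories {
		if !seen[c] {
			seen[c] = true
			keys = append(keys, c)
		}
	}
	sort.Slice(keys, func(i, j int) bool { return keys[i] < keys[j] })

	columns = make([][]bool, len(keys))
	for k, key := range keys {
		columns[k] = make([]bool, len(categories))
		for row, c := range categories {
			columns[k][row] = c == key
		}
	}
	return keys, columns
}

func main() {
	colors := []string{"red", "green", "red", "blue"}

	// Step 1: strings -> integers.
	hashed := make([]uint64, len(colors))
	for i, s := range colors {
		hashed[i] = hashString(s)
	}

	// Step 2: integers -> boolean columns.
	keys, columns := oneHot(hashed)
	for k, key := range keys {
		fmt.Println(key, columns[k])
	}
}
```

Note that the hashing step keeps one integer per distinct string. The number of output columns is therefore the number of distinct categories, not a fixed, smaller dimension.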
Later, we may implement an OrdinalEncoder as an alternative to HashEncoder, but the chance of a hash collision is extremely low on 64-bit systems, so I recommend sticking to HashEncoder on such systems.
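As a rough sanity check on the collision claim, the standard birthday-bound approximation can be applied. This assumes, which is not stated above, that the hash behaves like a uniform draw over $2^{b}$ values, where $b$ is the hash width in bits and $n$ is the number of distinct categories in a column:

```latex
P(\text{collision}) \;\approx\; 1 - \exp\!\left(-\frac{n(n-1)}{2^{\,b+1}}\right)
\;\approx\; \frac{n(n-1)}{2^{\,b+1}}
\qquad \text{for } n \ll 2^{b/2}.

% Example with b = 64 and n = 10^6 distinct categories:
P \;\approx\; \frac{10^{12}}{2^{65}} \;\approx\; \frac{10^{12}}{3.7 \times 10^{19}} \;\approx\; 2.7 \times 10^{-8}.
```

Under the same assumption with $b = 32$, the estimate for only $n = 10^{4}$ categories is already about $10^{8} / 2^{33} \approx 1.2\%$, which illustrates why the recommendation is tied to 64-bit systems.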
To avoid any confusion, let me clarify that HashEncoder does not vectorize categories via feature hashing. Vectorizing is the job of OneHotEncoder, and HashEncoder does not project categories onto a lower-dimensional space.