Release 0.4.0
Linfa's 0.4.0 release introduces four new algorithms, improves the documentation of the ICA and K-means implementations, adds more benchmarks to K-means, and updates to ndarray version 0.14.
New algorithms
The Partial Least Squares Regression model family is added in this release (thanks to @relf). It projects the observed as well as the predicted variables to a latent space and maximizes the correlation between them. For problems with a large number of targets or with collinear predictors it often performs better than standard regression. For more information, look into the documentation of linfa-pls.
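As a first impression of the API, the following sketch fits a two-component PLS model to the diabetes dataset and predicts from the latent representation. It assumes the PlsRegression::params entry point; check the linfa-pls documentation for the exact builder methods.

```rust
use linfa::prelude::*;
use linfa_pls::PlsRegression;

// load a small regression dataset shipped with linfa-datasets
let dataset = linfa_datasets::diabetes();

// project records and targets to a two-dimensional latent space
// and fit the regression there
let pls = PlsRegression::params(2).fit(&dataset)?;

// predict the targets from the latent representation
let predictions = pls.predict(&dataset);
```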
A wrapper for Barnes-Hut t-SNE is also added in this release. The t-SNE algorithm is often used for data visualization and projects data from a high-dimensional space to a similar representation in two or three dimensions. It does so by minimizing the Kullback-Leibler divergence between the high-dimensional source distribution and the low-dimensional target distribution. The Barnes-Hut approximation improves the runtime drastically while retaining the quality of the embedding. Kudos to @frjnn for providing an implementation!
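A minimal sketch of projecting the iris dataset down to two dimensions could look like this; the builder methods (embedding_size, perplexity, approx_threshold) are stated here from memory, so verify them against the linfa-tsne documentation.

```rust
use linfa::prelude::*;
use linfa_tsne::TSne;

// reduce the four-dimensional iris records to two dimensions
let ds = linfa_datasets::iris();
let embedded = TSne::embedding_size(2) // target dimensionality
    .perplexity(10.0)                  // balances local vs. global structure
    .approx_threshold(0.6)             // Barnes-Hut speed/accuracy trade-off
    .transform(ds)?;
```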
A new preprocessing crate makes working with textual data and data normalization easy (thanks to @Sauro98). It implements a count vectorizer and TF-IDF normalization for text pre-processing. Normalizations for signals include linear scaling, norm scaling and whitening with PCA/ZCA/Cholesky. An example with a Naive Bayes model achieves an 84% F1 score for predicting the categories alt.atheism, talk.religion.misc, comp.graphics and sci.space on a news dataset.
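As an illustration of the normalization side, the sketch below standardizes a dataset feature-wise; the module path and the LinearScaler::standard constructor are assumptions based on the crate layout, so consult the linfa-preprocessing documentation for the exact names.

```rust
use linfa::prelude::*;
// type and module names are assumptions; see the linfa-preprocessing docs
use linfa_preprocessing::linear_scaling::LinearScaler;

let dataset = linfa_datasets::diabetes();

// learn the per-feature offsets and scales on the data ...
let scaler = LinearScaler::standard().fit(&dataset)?;

// ... and apply the same transformation to the records
let dataset = scaler.transform(dataset);
```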
Platt scaling calibrates a real-valued classification model to probabilities over two classes. This is used for support vector classification when probabilities are required. Further, a multi-class model, which combines multiple binary models (e.g. calibrated SVM models) into a single multi-class model, is also added. These composing models are moved to the linfa/src/composing/ subfolder.
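To give an idea of the composing models, the following sketch builds a one-vs-all classifier from binary SVMs, mirroring the pattern used in the linfa examples; treat the one_vs_all helper and the MultiClassModel collector as assumptions to verify against the docs.

```rust
use linfa::composing::MultiClassModel;
use linfa::prelude::*;
use linfa_svm::Svm;

let (train, valid) = linfa_datasets::winequality().split_with_ratio(0.9);

// fit one binary SVM per class and compose them into a single model
let model = train
    .one_vs_all()?
    .into_iter()
    .map(|(label, ds)| (label, Svm::params().fit(&ds).unwrap()))
    .collect::<MultiClassModel<_, _>>();

// predict the class with the strongest binary response
let pred = model.predict(&valid);
```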
Improvements
Numerous improvements are added to the K-means implementation, thanks to @YuhanLiin. The implementation is optimized for offline training, an incremental training model is added, and KMeans++/KMeans|| initialization gives good initial cluster means for medium and large datasets.
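For the common offline case, fitting now looks as follows; this is a minimal sketch assuming the KMeans::params entry point of linfa-clustering, with the improved initialization supplying the starting centroids.

```rust
use linfa::prelude::*;
use linfa_clustering::KMeans;

let dataset = linfa_datasets::iris();

// fit three centroids; the initialization picks good starting means
let model = KMeans::params(3).fit(&dataset)?;

// assign each observation to its nearest centroid
let assignments = model.predict(dataset);
```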
We also moved to ndarray version 0.14 and introduced F::cast for simpler floating point casting. The trait signature of linfa::Fit is changed such that it always returns a Result, and error handling is added to the linfa-logistic and linfa-reduction subcrates.
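For example, F::cast makes constants in code that is generic over the float type straightforward; a minimal sketch:

```rust
use linfa::Float;

// scale a value by a constant without naming a concrete float type
fn halve<F: Float>(x: F) -> F {
    x * F::cast(0.5)
}

assert_eq!(halve(3.0_f64), 1.5);
```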
You often have to compare several model parametrizations with k-folding. For this, a new function cross_validate is added, which takes the number of folds, the model parameters and a closure for the evaluation metric. It automatically performs k-folding and averages the metric over the folds. To compare different L1 ratios of an elastic net model, you can use it in the following way:
```rust
use linfa::prelude::*;
use linfa_elasticnet::ElasticNet;

// a regression dataset shipped with linfa-datasets
let dataset = linfa_datasets::diabetes();

// L1 ratios to compare
let ratios = vec![0.1, 0.2, 0.5, 0.7, 1.0];

// create a model for each L1 ratio
let models = ratios
    .iter()
    .map(|ratio| ElasticNet::params().penalty(0.3).l1_ratio(*ratio))
    .collect::<Vec<_>>();

// get the mean r2 validation score across 5 folds for each model
let r2_values =
    dataset.cross_validate(5, &models, |prediction, truth| prediction.r2(&truth))?;

// show the mean r2 score for each parameter choice
for (ratio, r2) in ratios.iter().zip(r2_values.iter()) {
    println!("L1 ratio: {}, r2 score: {}", ratio, r2);
}
```
Other changes
- fix for border points in the DBSCAN implementation
- improved documentation of the ICA subcrate
- prevent an overflowing code example on the website