
Adds support for standardising regression values in LibSVM #113

Merged: 6 commits, Mar 3, 2021

Conversation

Craigacp (Member) commented Feb 8, 2021

Description

Adds an option to LibSVMRegressionTrainer which standardises the regression values (i.e. ensures they are zero mean and unit variance) before training a model. It then applies the inverse transformation to the predictions to map them back into the correct output space. It also adds a non-linear regression data generator for testing, adds a couple of related methods to org.tribuo.util.Util, and tidies up some related comments.
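A minimal sketch of the standardise/inverse-transform round trip described above. The class and method names are illustrative, not Tribuo's actual API, and it follows the convention discussed later in this PR of subtracting the mean and then dividing by the variance:

```java
import java.util.Arrays;

// Hypothetical sketch: standardise regression values before training,
// then invert the transformation on the predictions afterwards.
public class StandardiseSketch {

    // Standardise: subtract the mean, then divide the result by the variance.
    public static double[] standardise(double[] y, double mean, double variance) {
        double[] out = new double[y.length];
        for (int i = 0; i < y.length; i++) {
            out[i] = (y[i] - mean) / variance;
        }
        return out;
    }

    // Inverse transform: multiply by the variance, then add the mean back.
    public static double[] unstandardise(double[] z, double mean, double variance) {
        double[] out = new double[z.length];
        for (int i = 0; i < z.length; i++) {
            out[i] = z[i] * variance + mean;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] y = {1.0, 2.0, 3.0};
        double[] z = standardise(y, 2.0, 0.5);
        double[] back = unstandardise(z, 2.0, 0.5);
        // The round trip recovers the original values.
        System.out.println(Arrays.toString(z) + " -> " + Arrays.toString(back));
    }
}
```

The inverse transform is what lets the trained model's outputs be mapped back into the original output space at prediction time.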

The internal implementation is quite ugly as it extends svm_model to poke in the mean and variance, because the types in the base LibSVMTrainer are too restrictive. When we have a breaking version it might be worth refactoring this and LibLinearTrainer (which is similar) to make the internal types controlled by Tribuo so they can be more easily extended if necessary (or just make them Map<String,Object>).

The standardization is off by default to preserve compatibility with Tribuo 4.0, but the performance is pretty poor on non-linear problems with it turned off, so it might be better to turn it on by default.

Motivation

LibSVM has a real problem with non-standardized regression inputs, and performs very poorly. Using the new non-linear data generator an RBF SVM gets:

Multi-dimensional Regression Evaluation
RMSE = {Y=24.412388852859543}
Mean Absolute Error = {Y=11.827119549115682}
R^2 = {Y=0.7729975961406879}
explained variance = {Y=0.773032506305136}

and with standardization it gets:

Multi-dimensional Regression Evaluation
RMSE = {Y=1.214990405958973}
Mean Absolute Error = {Y=0.9604027608577747}
R^2 = {Y=0.9994377161686859}
explained variance = {Y=0.9994384243963165}

Fixes #73. I tested the other regressors and none of them exhibit the same issue on non-standardized data. If there are future reports of a similar issue in the other regressors we can roll out a similar solution.

eelstretching (Member) left a comment:

Looks good, mostly javadoc fixes.

@@ -1040,4 +1055,36 @@ public static String formatDuration(long startMillis, long stopMillis) {
return diffIndicesList.stream().mapToInt(Integer::intValue).toArray();
}

/**
* Standardizes the input, i.e. divides it by the variance and subtracts the mean.
eelstretching (Member) commented:

The i.e. comment here is weird. Should say something like "a value is standardized by subtracting the mean and dividing the result by the variance". The way it's stated sounds like we divide first and then subtract, which is incorrect (assuming the code is correct, which I think it is.)

standardizeInPlace javadoc has the same problem.

I guess I'm curious why these methods are static when we have class members for doing the accumulation? Are they used elsewhere? If they are and staticness makes sense for these, then there should be non-static ones that use the mean and variance that we've accumulated on an input array.
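The ordering point in the comment above can be shown with a tiny, self-contained demonstration (hypothetical code, not from this PR): dividing first and then subtracting gives a different answer from subtracting first and then dividing.

```java
// Demonstrates why the javadoc wording matters: the two operation orders
// are not equivalent.
public class OrderDemo {

    // The correct order described in the review: subtract, then divide.
    public static double subtractThenDivide(double x, double mean, double variance) {
        return (x - mean) / variance;
    }

    // The order the original javadoc wording suggests: divide, then subtract.
    public static double divideThenSubtract(double x, double mean, double variance) {
        return x / variance - mean;
    }

    public static void main(String[] args) {
        // With x = 10, mean = 4, variance = 2 the two orderings disagree:
        System.out.println(subtractThenDivide(10.0, 4.0, 2.0)); // 3.0
        System.out.println(divideThenSubtract(10.0, 4.0, 2.0)); // 1.0
    }
}
```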

Craigacp (Member, Author) replied:
I added class members to the MeanVarianceAccumulator which perform standardization by calling into Util.standardize. The MeanVarianceAccumulator didn't end up being used by the standardization code in LibSVMRegressionTrainer, but my plan is to migrate all the other places where I do this over to it over time, so I put it into this PR along with its test.
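For context, an accumulator of this kind is usually implemented as an online (single-pass) mean/variance computation. The sketch below uses Welford's algorithm; it is illustrative only and is not Tribuo's actual MeanVarianceAccumulator implementation:

```java
// Hypothetical sketch of an online mean/variance accumulator in the spirit
// of the MeanVarianceAccumulator discussed above, using Welford's algorithm.
public class MeanVarianceSketch {
    private long count = 0;
    private double mean = 0.0;
    private double sumSq = 0.0; // running sum of squared deviations from the mean

    public void observe(double value) {
        count++;
        double delta = value - mean;
        mean += delta / count;
        sumSq += delta * (value - mean);
    }

    public double getMean() {
        return mean;
    }

    // Sample variance; returns 0 for fewer than two observations.
    public double getVariance() {
        return count > 1 ? sumSq / (count - 1) : 0.0;
    }

    public static void main(String[] args) {
        MeanVarianceSketch acc = new MeanVarianceSketch();
        for (double v : new double[]{1, 2, 3, 4, 5}) {
            acc.observe(v);
        }
        System.out.println(acc.getMean() + " " + acc.getVariance()); // 3.0 2.5
    }
}
```

An online formulation avoids a second pass over the data and is numerically more stable than accumulating a raw sum of squares.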


/**
* Generates a single dimensional output drawn from
* N(w_0*x_0 - w_1*x_1 + w_2*x_1*x_0 + w_3*x_1*x_1*x_1 + intercept,variance).
eelstretching (Member) commented:
Pretty sure the - w_1*x_1 should be + in the javadoc here, to match the code below? This error appears elsewhere in the javadoc.

Craigacp (Member, Author) replied:
I fixed the javadoc & comment. When I realised I'd made the weights parameterisable the sign didn't matter anymore, and I forgot to update the docs.
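For concreteness, a hypothetical sketch of the corrected generator formula (with the + sign, as noted in the review). Names and parameter values are illustrative, not the actual generator in this PR:

```java
import java.util.Random;

// Sketch of a non-linear regression data generator: draws a single output
// from N(w_0*x_0 + w_1*x_1 + w_2*x_1*x_0 + w_3*x_1*x_1*x_1 + intercept, variance).
public class NonlinearGeneratorSketch {

    public static double sample(Random rng, double[] w, double x0, double x1,
                                double intercept, double variance) {
        double mean = w[0] * x0
                    + w[1] * x1
                    + w[2] * x1 * x0
                    + w[3] * x1 * x1 * x1
                    + intercept;
        // Scale a standard Gaussian draw by the standard deviation.
        return mean + rng.nextGaussian() * Math.sqrt(variance);
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        double[] w = {1.0, 0.5, 0.25, 0.1};
        // With variance == 0 the sample is just the deterministic mean.
        System.out.println(sample(rng, w, 2.0, 3.0, 1.0, 0.0));
    }
}
```

The cubic x_1 term is what makes the target genuinely non-linear in the features, which is what exposes the poor performance of an unstandardised RBF SVM in the motivation section above.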

eelstretching (Member) left a comment:
Looks good!

@Craigacp Craigacp merged commit b6c8f6f into main Mar 3, 2021
@Craigacp Craigacp deleted the regression-rescaling branch March 3, 2021 19:10

Successfully merging this pull request may close these issues.

Scaling/Rescaling needed for regression outputs
2 participants