
Adds support for standardising regression values in LibSVM #113

Merged: 6 commits, Mar 3, 2021

Conversation

Craigacp (Member) commented Feb 8, 2021

Description

Adds an option to LibSVMRegressionTrainer which standardises the regression values (i.e. ensures they are zero mean and unit variance) before training a model. It then applies the inverse transformation to the predictions to map them back into the correct output space. It also adds a non-linear regression data generator for testing, adds a couple of related methods to org.tribuo.util.Util, and tidies up some related comments.
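A minimal sketch of the standardise/inverse-transform round trip described above. The class and method names are illustrative, not Tribuo's actual API, and it follows the convention discussed later in this PR of subtracting the mean and then dividing by the variance:

```java
import java.util.Arrays;

// Hypothetical sketch: standardise regression values before training,
// then invert the transformation on the predictions afterwards.
public class StandardiseSketch {

    // Standardise: subtract the mean, then divide the result by the variance.
    public static double[] standardise(double[] y, double mean, double variance) {
        double[] out = new double[y.length];
        for (int i = 0; i < y.length; i++) {
            out[i] = (y[i] - mean) / variance;
        }
        return out;
    }

    // Inverse transform: multiply by the variance, then add the mean back.
    public static double[] unstandardise(double[] z, double mean, double variance) {
        double[] out = new double[z.length];
        for (int i = 0; i < z.length; i++) {
            out[i] = z[i] * variance + mean;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] y = {1.0, 2.0, 3.0};
        double[] z = standardise(y, 2.0, 0.5);
        double[] back = unstandardise(z, 2.0, 0.5);
        // The round trip recovers the original values.
        System.out.println(Arrays.toString(z) + " -> " + Arrays.toString(back));
    }
}
```

The inverse transform is what lets the trained model's outputs be mapped back into the original output space at prediction time.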

The internal implementation is quite ugly as it extends svm_model to poke in the mean and variance, because the types in the base LibSVMTrainer are too restrictive. When we have a breaking version it might be worth refactoring this and LibLinearTrainer (which is similar) to make the internal types controlled by Tribuo so they can be more easily extended if necessary (or just make them Map<String,Object>).

The standardization is off by default to preserve compatibility with Tribuo 4.0, but the performance is pretty poor on non-linear problems with it turned off, so it might be better to turn it on by default.

Motivation

LibSVM has a real problem with non-standardized regression inputs, and performs very poorly. Using the new non-linear data generator an RBF SVM gets:

Multi-dimensional Regression Evaluation
RMSE = {Y=24.412388852859543}
Mean Absolute Error = {Y=11.827119549115682}
R^2 = {Y=0.7729975961406879}
explained variance = {Y=0.773032506305136}

and with standardization it gets:

Multi-dimensional Regression Evaluation
RMSE = {Y=1.214990405958973}
Mean Absolute Error = {Y=0.9604027608577747}
R^2 = {Y=0.9994377161686859}
explained variance = {Y=0.9994384243963165}

Fixes #73. I tested the other regressors and none of them exhibit the same issue on non-standardized data. If there are future reports of a similar issue in the other regressors we can roll out a similar solution.

eelstretching (Member) left a comment:

Looks good, mostly javadoc fixes.

@@ -1040,4 +1055,36 @@ public static String formatDuration(long startMillis, long stopMillis) {
return diffIndicesList.stream().mapToInt(Integer::intValue).toArray();
}

/**
* Standardizes the input, i.e. divides it by the variance and subtracts the mean.
eelstretching (Member) commented:

The i.e. comment here is weird. Should say something like "a value is standardized by subtracting the mean and dividing the result by the variance". The way it's stated sounds like we divide first and then subtract, which is incorrect (assuming the code is correct, which I think it is.)

standardizeInPlace javadoc has the same problem.

I guess I'm curious why these methods are static when we have class members for doing the accumulation? Are they used elsewhere? If they are and staticness makes sense for these, then there should be non-static ones that use the mean and variance that we've accumulated on an input array.
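The ordering point in the comment above can be shown with a tiny, self-contained demonstration (hypothetical code, not from this PR): dividing first and then subtracting gives a different answer from subtracting first and then dividing.

```java
// Demonstrates why the javadoc wording matters: the two operation orders
// are not equivalent.
public class OrderDemo {

    // The correct order described in the review: subtract, then divide.
    public static double subtractThenDivide(double x, double mean, double variance) {
        return (x - mean) / variance;
    }

    // The order the original javadoc wording suggests: divide, then subtract.
    public static double divideThenSubtract(double x, double mean, double variance) {
        return x / variance - mean;
    }

    public static void main(String[] args) {
        // With x = 10, mean = 4, variance = 2 the two orderings disagree:
        System.out.println(subtractThenDivide(10.0, 4.0, 2.0)); // 3.0
        System.out.println(divideThenSubtract(10.0, 4.0, 2.0)); // 1.0
    }
}
```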

Craigacp (Member, Author) replied:
I added class members to the MeanVarianceAccumulator which perform standardization by calling into Util.standardize. The MeanVarianceAccumulator didn't end up being used by the standardization code in LibSVMRegressionTrainer, but my plan is to migrate all the other places where I do this over to it over time, so I put it into this PR along with its test.
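For context, an accumulator of this kind is usually implemented as an online (single-pass) mean/variance computation. The sketch below uses Welford's algorithm; it is illustrative only and is not Tribuo's actual MeanVarianceAccumulator implementation:

```java
// Hypothetical sketch of an online mean/variance accumulator in the spirit
// of the MeanVarianceAccumulator discussed above, using Welford's algorithm.
public class MeanVarianceSketch {
    private long count = 0;
    private double mean = 0.0;
    private double sumSq = 0.0; // running sum of squared deviations from the mean

    public void observe(double value) {
        count++;
        double delta = value - mean;
        mean += delta / count;
        sumSq += delta * (value - mean);
    }

    public double getMean() {
        return mean;
    }

    // Sample variance; returns 0 for fewer than two observations.
    public double getVariance() {
        return count > 1 ? sumSq / (count - 1) : 0.0;
    }

    public static void main(String[] args) {
        MeanVarianceSketch acc = new MeanVarianceSketch();
        for (double v : new double[]{1, 2, 3, 4, 5}) {
            acc.observe(v);
        }
        System.out.println(acc.getMean() + " " + acc.getVariance()); // 3.0 2.5
    }
}
```

An online formulation avoids a second pass over the data and is numerically more stable than accumulating a raw sum of squares.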


/**
* Generates a single dimensional output drawn from
* N(w_0*x_0 - w_1*x_1 + w_2*x_1*x_0 + w_3*x_1*x_1*x_1 + intercept,variance).
eelstretching (Member) commented:
Pretty sure the - w_1*x_1 should be + in the javadoc here, to match the code below? This error appears elsewhere in the javadoc.

Craigacp (Member, Author) replied:
I fixed the javadoc & comment. When I realised I'd made the weights parameterisable the sign didn't matter anymore, and I forgot to update the docs.
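For concreteness, a hypothetical sketch of the corrected generator formula (with the + sign, as noted in the review). Names and parameter values are illustrative, not the actual generator in this PR:

```java
import java.util.Random;

// Sketch of a non-linear regression data generator: draws a single output
// from N(w_0*x_0 + w_1*x_1 + w_2*x_1*x_0 + w_3*x_1*x_1*x_1 + intercept, variance).
public class NonlinearGeneratorSketch {

    public static double sample(Random rng, double[] w, double x0, double x1,
                                double intercept, double variance) {
        double mean = w[0] * x0
                    + w[1] * x1
                    + w[2] * x1 * x0
                    + w[3] * x1 * x1 * x1
                    + intercept;
        // Scale a standard Gaussian draw by the standard deviation.
        return mean + rng.nextGaussian() * Math.sqrt(variance);
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        double[] w = {1.0, 0.5, 0.25, 0.1};
        // With variance == 0 the sample is just the deterministic mean.
        System.out.println(sample(rng, w, 2.0, 3.0, 1.0, 0.0));
    }
}
```

The cubic x_1 term is what makes the target genuinely non-linear in the features, which is what exposes the poor performance of an unstandardised RBF SVM in the motivation section above.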

eelstretching (Member) left a comment:
Looks good!

@Craigacp Craigacp merged commit b6c8f6f into main Mar 3, 2021
@Craigacp Craigacp deleted the regression-rescaling branch March 3, 2021 19:10

Successfully merging this pull request may close these issues.

Scaling/Rescaling needed for regression outputs
2 participants