Adds support for standardising regression values in LibSVM #113
Conversation
…ment in CompletelyConfigurableTrainTest.
Looks good, mostly javadoc fixes.
Core/src/main/java/org/tribuo/util/MeanVarianceAccumulator.java
@@ -1040,4 +1055,36 @@ public static String formatDuration(long startMillis, long stopMillis) {
        return diffIndicesList.stream().mapToInt(Integer::intValue).toArray();
    }

    /**
     * Standardizes the input, i.e. divides it by the variance and subtracts the mean.
The i.e. comment here is weird. Should say something like "a value is standardized by subtracting the mean and dividing the result by the variance". The way it's stated sounds like we divide first and then subtract, which is incorrect (assuming the code is correct, which I think it is.)
standardizeInPlace javadoc has the same problem.
I guess I'm curious why these methods are static when we have class members for doing the accumulation? Are they used elsewhere? If they are and staticness makes sense for these, then there should be non-static ones that use the mean and variance that we've accumulated on an input array.
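To make the intended order of operations concrete, here is a minimal sketch of a subtract-then-divide standardize, using the variance as the divisor per the javadoc above. The class and method names are illustrative, not Tribuo's actual signatures:

```java
import java.util.Arrays;

// Illustrative sketch only: a value is standardized by subtracting the mean
// and then dividing the result by the variance (the order the reviewer describes).
public final class StandardizeSketch {

    // In-place variant, analogous to the standardizeInPlace mentioned above.
    public static void standardizeInPlace(double[] input, double mean, double variance) {
        for (int i = 0; i < input.length; i++) {
            input[i] = (input[i] - mean) / variance;
        }
    }

    // Copying variant that leaves the input untouched.
    public static double[] standardize(double[] input, double mean, double variance) {
        double[] output = Arrays.copyOf(input, input.length);
        standardizeInPlace(output, mean, variance);
        return output;
    }

    public static void main(String[] args) {
        double[] v = standardize(new double[]{1.0, 3.0, 5.0}, 3.0, 2.0);
        System.out.println(Arrays.toString(v)); // prints [-1.0, 0.0, 1.0]
    }
}
```

Note that dividing first and then subtracting the mean, as the javadoc wording suggests, would give a different (wrong) result whenever the mean is non-zero.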
I added class members to the MeanVarianceAccumulator which perform standardization by calling into Util.standardize. The MeanVarianceAccumulator didn't end up being used by the standardization code in LibSVMRegressionTrainer, but my plan is to migrate all the other places where I do this over to this code over time, so I put it into this PR along with its test.
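For readers unfamiliar with accumulator classes like this, here is a hedged sketch of how an online mean/variance accumulator can work, using Welford's algorithm. The names are illustrative and may not match Tribuo's MeanVarianceAccumulator API:

```java
// Sketch of an online mean/variance accumulator (Welford's algorithm).
// Illustrative only; Tribuo's MeanVarianceAccumulator API may differ.
public final class MeanVarSketch {
    private long count = 0;
    private double mean = 0.0;
    private double m2 = 0.0; // running sum of squared deviations from the mean

    // Incorporate one observation into the running statistics.
    public void observe(double value) {
        count++;
        double delta = value - mean;
        mean += delta / count;
        m2 += delta * (value - mean);
    }

    public double getMean() {
        return mean;
    }

    // Sample variance (n - 1 denominator); 0 until two values are seen.
    public double getVariance() {
        return count > 1 ? m2 / (count - 1) : 0.0;
    }

    public static void main(String[] args) {
        MeanVarSketch acc = new MeanVarSketch();
        acc.observe(1.0);
        acc.observe(3.0);
        acc.observe(5.0);
        System.out.println(acc.getMean() + " " + acc.getVariance()); // prints 3.0 4.0
    }
}
```

The advantage of the online form is that the mean and variance are available after a single pass over the data, which is why a trainer can accumulate them while reading examples.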
Regression/Core/src/main/java/org/tribuo/regression/example/GaussianDataSource.java
    /**
     * Generates a single dimensional output drawn from
     * N(w_0*x_0 - w_1*x_1 + w_2*x_1*x_0 + w_3*x_1*x_1*x_1 + intercept,variance).
Pretty sure the - w_1*x_1 should be + in the javadoc here, to match the code below? This error appears elsewhere in the javadoc.
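As a reference point, here is a hypothetical sketch of drawing one sample from the distribution the javadoc describes, with the weights fully parameterised so any sign lives in the weight vector. The class name and signature are illustrative, not GaussianDataSource's actual API:

```java
import java.util.Random;

// Hypothetical sketch of the nonlinear Gaussian generator described in the javadoc:
// y ~ N(w0*x0 + w1*x1 + w2*x1*x0 + w3*x1^3 + intercept, variance).
public final class NonlinearGaussianSketch {
    public static double sample(double[] w, double intercept, double variance,
                                double x0, double x1, Random rng) {
        // Mean is a nonlinear function of the two input features.
        double mean = w[0] * x0 + w[1] * x1 + w[2] * x1 * x0
                    + w[3] * x1 * x1 * x1 + intercept;
        // Add Gaussian noise with the requested variance.
        return mean + rng.nextGaussian() * Math.sqrt(variance);
    }
}
```

With variance set to 0 the sample is exactly the mean, which makes the formula easy to check against the javadoc.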
I fixed the javadoc & comment. When I realised I'd made the weights parameterisable the sign didn't matter anymore, and I forgot to update the docs.
Looks good!
Description
Adds an option to LibSVMRegressionTrainer which standardises the input (i.e. ensures it has mean zero and variance one) before training a model. It then applies the inverse transformation to the predictions to map them back into the correct output space. It also adds a non-linear regression data generator for testing, adds a couple of related methods in org.tribuo.util.Util, and tidies up some related comments.

The internal implementation is quite ugly as it extends svm_model to poke in the mean and variance, because the types in the base LibSVMTrainer are too restrictive. When we have a breaking version it might be worth refactoring this and LibLinearTrainer (which is similar) to make the internal types controlled by Tribuo so they can be more easily extended if necessary (or just make them Map<String,Object>).

The standardization is off by default to preserve compatibility with Tribuo 4.0, but the performance is pretty poor on non-linear problems with it turned off, so it might be better to turn it on by default.
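The standardize/inverse-transform round trip described above can be sketched as follows. This is a minimal illustration, not the trainer's actual implementation, and it follows this PR's convention of dividing by the variance:

```java
// Sketch of the round trip: standardize outputs before training, then map the
// model's predictions back into the original output space with the inverse
// transform. Illustrative only; not LibSVMRegressionTrainer's actual code.
public final class OutputRescaleSketch {

    // Forward transform applied to regression targets before training.
    public static double standardize(double y, double mean, double variance) {
        return (y - mean) / variance;
    }

    // Inverse transform applied to predictions to recover the output space.
    public static double unstandardize(double yStd, double mean, double variance) {
        return yStd * variance + mean;
    }

    public static void main(String[] args) {
        double mean = 10.0;
        double variance = 4.0;
        double y = 18.0;
        double roundTrip = unstandardize(standardize(y, mean, variance), mean, variance);
        System.out.println(roundTrip); // prints 18.0
    }
}
```

Because the inverse is applied at prediction time, callers see outputs on the original scale and never need to know the model was trained on standardized values.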
Motivation
LibSVM has a real problem with non-standardized regression inputs, and performs very poorly. Using the new non-linear data generator an RBF SVM gets:
and with standardization it gets:
Fixes #73. I tested the other regressors and none of them exhibit the same issue on non-standardized data. If there are future reports of a similar issue in the other regressors we can roll out a similar solution.