-
Notifications
You must be signed in to change notification settings - Fork 177
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes multidimensional regression #177
Conversation
… it synchronized and resetting the RNG everywhere.
…nal models. Adding a test to SGD linear.
…hen deserializing old models that have been fixed.
Regression/Core/src/main/java/org/tribuo/regression/ImmutableRegressionInfo.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
most of the code is updated references to the dimension ids using the updated mapping, a bunch of new/updated unit tests to make sure it all works correctly, new readObject methods, and a rewrite of ImmutableRegressionInfo
some smaller changes include: fixing LinearRegressionType return values for isClassification and isRegression, changing use of LinkedList to ArrayDeque
* Fixing RegressionInfo so the ids are assigned in lexicographic order. * Fixing the id assignment issue in LibLinearRegressionTrainer. * Actually fix the issue in LibLinearRegressionTrainer. * Fix a bug where LibLinearRegressionType reported itself as classification. * Fixing LibSVMRegressionTrainer. * Fix XGBoost. * Fix a bug in standardized multidimensional LibSVM regressions. * Working on tests for LibSVM regression. * Adding mapping methods to the regression info. * Trying a fix for LibLinear. * Fixing a concurrency and reproducibility issue in liblinear by making it synchronized and resetting the RNG everywhere. * Tidying up the liblinear tests. * Fixing TensorFlow. * Fixing regression trees. * Updating XGBoost fix. * Fix for liblinear so models deserialize correctly. * Adding an example config file for CART regression trees. * Fixing LibSVM deserialization. * Fixing ElasticNetCDTrainer as it also emitted corrupted multidimensional models. Adding a test to SGD linear. * Fixing trees. * Adding an id test to the regression ensembles. * Fixing TensorFlow again. * Fixing the regression SGD test so it is reproducible. * Fix XGBoost so it re-orders things on deserialization. * Fix for XGBoost and SLM so they don't re-order the dimensions twice when deserializing old models that have been fixed. PR Text: Fixes `ImmutableRegressionInfo` so it generates the regression ids in lexicographic order. It then fixes all the models so they correctly map their dimensions through the `ImmutableRegressionInfo` so even if it's not in lexicographic order then the models produce the correct output. Finally it adds deserialization hooks to `LibLinearRegressionModel`, `LibSVMRegressionModel`, `XGBoostModel` and `SparseLinearModel`s trained using `ElasticNetCDTrainer` so that older models are rewritten on deserialization to fix the issue. Once those models have been loaded by a version of Tribuo with this fix they will work correctly in older versions (though if you then save them again in the older version they will become irretrievably corrupt as it'll lose the flag that tracks if the model has been fixed). Separately this PR also fixes an issue where multidimensional LibSVM models are corrupted when using the standardization flag. Only the first dimension was stored correctly, the others are copies of that dimension. Unfortunately tree based models (`TreeModel<Regresssor>`, `IndependentRegressionTreeModel` and any ensembles created using them) are more deeply corrupted and a fix would require completely rewriting the tree structure. We recommend users retrain those models in a newer version of Tribuo which contains this fix. Additionally TensorFlow based multidimensional regression models are likely impacted by this issue and should be retrained. One further additional change in this PR is that `XGBoostModel` now correctly reports the top features on a per regression dimension basis rather than aggregating them into a single list as it did previously. This was useful for debugging the various fixes to XGBoost. Multidimensional regression models in v4.0 and v4.1.0 have an incorrect mapping between ids and regression dimension names. The root cause is two-fold, due to an implicit contract which was not enforced (which is that the regression id numbers are generated using a lexicographic sort of the regression dimension names). The first cause is that when building an `ImmutableRegressionInfo` the labels were accidentally put into a non-sorted Map, and then iterated over in that map causing the ids to be hash order dependent, rather than lexicographically sorted. This shouldn't have been enough to break Tribuo, as all id mappings should go through the `ImmutableRegressionInfo` anyway. The second cause is that they didn't and several trainers/models assumed that the regression values were stored in id order in the `Regressor` object itself. The LinearSGDModel and SparseLinearModels (aside from ones generated using the ElasticNetCDTrainer) are correct in older versions of Tribuo. LibLinear, LibSVM and XGBoost models produce the correct output, but are corrupted internally so store the models in the incorrect locations. Tree models are entirely corrupted and should be retrained from scratch. Note single dimensional regression models are unaffected by this bug as it only occurs when there are two dimensions which could be stored out of order, and we expect those are the vast majority of trained regression models.
Description
Fixes
ImmutableRegressionInfo
so it generates the regression ids in lexicographic order. It then fixes all the models so they correctly map their dimensions through theImmutableRegressionInfo
so even if it's not in lexicographic order then the models produce the correct output. Finally it adds deserialization hooks toLibLinearRegressionModel
,LibSVMRegressionModel
,XGBoostModel
andSparseLinearModel
s trained usingElasticNetCDTrainer
so that older models are rewritten on deserialization to fix the issue. Once those models have been loaded by a version of Tribuo with this fix they will work correctly in older versions (though if you then save them again in the older version they will become irretrievably corrupt as it'll lose the flag that tracks if the model has been fixed).Separately this PR also fixes an issue where multidimensional LibSVM models are corrupted when using the standardization flag. Only the first dimension was stored correctly, the others are copies of that dimension.
Unfortunately tree based models (
TreeModel<Regresssor>
,IndependentRegressionTreeModel
and any ensembles created using them) are more deeply corrupted and a fix would require completely rewriting the tree structure. We recommend users retrain those models in a newer version of Tribuo which contains this fix. Additionally TensorFlow based multidimensional regression models are likely impacted by this issue and should be retrained.One further additional change in this PR is that
XGBoostModel
now correctly reports the top features on a per regression dimension basis rather than aggregating them into a single list as it did previously. This was useful for debugging the various fixes to XGBoost.Motivation
Multidimensional regression models in v4.0 and v4.1.0 have an incorrect mapping between ids and regression dimension names. The root cause is two-fold, due to an implicit contract which was not enforced (which is that the regression id numbers are generated using a lexicographic sort of the regression dimension names). The first cause is that when building an
ImmutableRegressionInfo
the labels were accidentally put into a non-sorted Map, and then iterated over in that map causing the ids to be hash order dependent, rather than lexicographically sorted. This shouldn't have been enough to break Tribuo, as all id mappings should go through theImmutableRegressionInfo
anyway. The second cause is that they didn't and several trainers/models assumed that the regression values were stored in id order in theRegressor
object itself.The LinearSGDModel and SparseLinearModels (aside from ones generated using the ElasticNetCDTrainer) are correct in older versions of Tribuo. LibLinear, LibSVM and XGBoost models produce the correct output, but are corrupted internally so store the models in the incorrect locations. Tree models are entirely corrupted and should be retrained from scratch.
Note single dimensional regression models are unaffected by this bug as it only occurs when there are two dimensions which could be stored out of order, and we expect those are the vast majority of trained regression models.