Fixes multidimensional regression #177

Craigacp · 2021-09-24T15:28:04Z

Description

Fixes ImmutableRegressionInfo so it generates the regression ids in lexicographic order. It then fixes all the models so they map their dimensions through the ImmutableRegressionInfo, so the models produce the correct output even if the ids are not in lexicographic order. Finally it adds deserialization hooks to LibLinearRegressionModel, LibSVMRegressionModel, XGBoostModel and SparseLinearModels trained using ElasticNetCDTrainer so that older models are rewritten on deserialization to fix the issue. Once those models have been loaded by a version of Tribuo with this fix they will work correctly in older versions. However, if you then save them again in the older version they will become irretrievably corrupt, as the older version loses the flag that tracks whether the model has been fixed.
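As a rough illustration of the rewrite-on-deserialization idea, here is a minimal sketch using standard Java serialization. It is not Tribuo's code: the class `RewriteOnLoadSketch`, its `weights`, `mapping` and `fixed` fields, and the `roundTrip` helper are all hypothetical, and the real models may store and reorder their state differently. It only shows why a persisted flag is needed so that the reordering is applied once and not repeated on later loads.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical model used only to illustrate the hook; not a Tribuo class. */
public final class RewriteOnLoadSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    // One weight per regression dimension, indexed by storage position.
    private double[] weights;
    // mapping[i] is the position where the value stored at i belongs.
    private final int[] mapping;
    // A field missing from an older stream is read back as false by default.
    private boolean fixed;

    public RewriteOnLoadSketch(double[] weights, int[] mapping, boolean fixed) {
        this.weights = weights.clone();
        this.mapping = mapping.clone();
        this.fixed = fixed;
    }

    public double[] getWeights() {
        return weights.clone();
    }

    public boolean isFixed() {
        return fixed;
    }

    /** Serialization hook: rewrites the stored order once, then records that it did so. */
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        if (!fixed) {
            double[] reordered = new double[weights.length];
            for (int i = 0; i < weights.length; i++) {
                reordered[mapping[i]] = weights[i];
            }
            weights = reordered;
            fixed = true;
        }
    }

    /** Serializes and deserializes a model in memory, wrapping checked exceptions. */
    public static RewriteOnLoadSketch roundTrip(RewriteOnLoadSketch model) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(model);
            }
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (RewriteOnLoadSketch) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("In-memory round trip failed", e);
        }
    }
}
```

In this sketch, a model constructed with `fixed` set to false stands in for an older model. The first round trip reorders it and sets the flag, and a second round trip leaves the weights untouched. If the flag were dropped when saving, the next load would apply the permutation again, which is the corruption scenario described above.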

Separately, this PR also fixes an issue where multidimensional LibSVM models are corrupted when using the standardization flag. Only the first dimension was stored correctly; the others were copies of that dimension.

Unfortunately, tree-based models (TreeModel<Regressor>, IndependentRegressionTreeModel and any ensembles created using them) are more deeply corrupted, and a fix would require completely rewriting the tree structure. We recommend users retrain those models in a newer version of Tribuo which contains this fix. Additionally, TensorFlow-based multidimensional regression models are likely impacted by this issue and should be retrained.

One further change in this PR is that XGBoostModel now correctly reports the top features on a per regression dimension basis, rather than aggregating them into a single list as it did previously. This was useful for debugging the various fixes to XGBoost.

Motivation

Multidimensional regression models in v4.0 and v4.1.0 have an incorrect mapping between ids and regression dimension names. The root cause is two-fold and stems from an implicit contract that was not enforced: the regression id numbers are supposed to be generated using a lexicographic sort of the regression dimension names. The first cause is that when building an ImmutableRegressionInfo the labels were accidentally put into a non-sorted Map and then iterated over in that map's order, making the ids depend on hash order rather than lexicographic order. This shouldn't have been enough to break Tribuo, as all id mappings should go through the ImmutableRegressionInfo anyway. The second cause is that they didn't: several trainers/models assumed that the regression values were stored in id order in the Regressor object itself.
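The two causes can be illustrated with a small self-contained sketch. It does not use Tribuo's classes: `RegressionIdSketch`, `assignIds` and `valuesInIdOrder` are hypothetical names chosen for this example. The first method assigns ids in a map's iteration order, so a `HashMap` gives hash-dependent ids while a `TreeMap` gives lexicographic ones. The second method looks each dimension's id up by name instead of assuming the values already arrive in id order, which is the kind of mapping the description says the models should have been using.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: these are hypothetical helpers, not Tribuo's API. */
public final class RegressionIdSketch {

    /** Assigns ids 0..n-1 in the iteration order of the supplied map's keys. */
    public static Map<String, Integer> assignIds(Map<String, ?> dimensions) {
        Map<String, Integer> ids = new LinkedHashMap<>();
        int counter = 0;
        for (String name : dimensions.keySet()) {
            ids.put(name, counter++);
        }
        return ids;
    }

    /**
     * Places each named value at the index given by its id, instead of
     * assuming the values already arrive in id order.
     */
    public static double[] valuesInIdOrder(String[] names, double[] values, Map<String, Integer> ids) {
        double[] output = new double[values.length];
        for (int i = 0; i < names.length; i++) {
            output[ids.get(names[i])] = values[i];
        }
        return output;
    }

    public static void main(String[] args) {
        Map<String, Double> hashed = new HashMap<>();
        Map<String, Double> sorted = new TreeMap<>();
        for (String name : new String[]{"dim-10", "dim-2", "alpha", "zeta"}) {
            hashed.put(name, 0.0);
            sorted.put(name, 0.0);
        }
        // HashMap iteration order depends on hash codes, so these ids may
        // not follow the lexicographic order of the names.
        System.out.println("hash order ids:   " + assignIds(hashed));
        // TreeMap iterates String keys in their natural (lexicographic) order.
        System.out.println("sorted order ids: " + assignIds(sorted));
    }
}
```

Either cause alone is survivable in this sketch: with sorted ids, code that assumes id order happens to work, and with name-based lookup, unsorted ids still yield the right output. The bug needs both unsorted ids and an assumption of id order.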

The LinearSGDModel and SparseLinearModels (aside from ones generated using the ElasticNetCDTrainer) are correct in older versions of Tribuo. LibLinear, LibSVM and XGBoost models produce the correct output but are corrupted internally, as they store the models in the incorrect locations. Tree models are entirely corrupted and should be retrained from scratch.

Note that single dimensional regression models are unaffected by this bug, as it only occurs when there are two or more dimensions which could be stored out of order. We expect single dimensional models to be the vast majority of trained regression models.


Regression/Core/src/main/java/org/tribuo/regression/ImmutableRegressionInfo.java

pogren

Most of the code is updated references to the dimension ids using the updated mapping, a bunch of new/updated unit tests to make sure it all works correctly, new readObject methods, and a rewrite of ImmutableRegressionInfo.
Some smaller changes include fixing the LibLinearRegressionType return values for isClassification and isRegression, and changing uses of LinkedList to ArrayDeque.


Craigacp added 25 commits September 24, 2021 11:14

Fixing RegressionInfo so the ids are assigned in lexicographic order.

c18a3ed

Fixing the id assignment issue in LibLinearRegressionTrainer.

96ac9e9

Actually fix the issue in LibLinearRegressionTrainer.

6956b2f

Fix a bug where LibLinearRegressionType reported itself as classification.

c7037ea

Fixing LibSVMRegressionTrainer.

c6ca6b3

Fix XGBoost.

9f49a19

Fix a bug in standardized multidimensional LibSVM regressions.

f0f2a41

Working on tests for LibSVM regression.

faa6b16

Adding mapping methods to the regression info.

4ceff12

Trying a fix for LibLinear.

bd58560

Fixing a concurrency and reproducibility issue in liblinear by making it synchronized and resetting the RNG everywhere.

6d4f601

Tidying up the liblinear tests.

ebec67f

Fixing TensorFlow.

c38fded

Fixing regression trees.

e370a95

Updating XGBoost fix.

2a4b946

Fix for liblinear so models deserialize correctly.

82fb7b3

Adding an example config file for CART regression trees.

531cc57

Fixing LibSVM deserialization.

a722a32

Fixing ElasticNetCDTrainer as it also emitted corrupted multidimensional models. Adding a test to SGD linear.

644f3c7

Fixing trees.

048f0bd

Adding an id test to the regression ensembles.

5027fab

Fixing TensorFlow again.

9fb8066

Fixing the regression SGD test so it is reproducible.

aa711e4

Fix XGBoost so it re-orders things on deserialization.

52515d6

Fix for XGBoost and SLM so they don't re-order the dimensions twice when deserializing old models that have been fixed.

45e9e2c

Craigacp added the Oracle employee (This PR is from an Oracle employee) and squash-commits (Squash the commits when merging this PR) labels Sep 27, 2021

pogren reviewed Oct 1, 2021


pogren approved these changes Oct 1, 2021


Craigacp merged commit 823beb4 into main Oct 1, 2021

Craigacp deleted the regression-id-fix branch October 1, 2021 21:44
