Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes multidimensional regression #177

Merged
merged 25 commits into from
Oct 1, 2021
Merged

Fixes multidimensional regression #177

merged 25 commits into from
Oct 1, 2021

Conversation

Craigacp
Copy link
Member

@Craigacp Craigacp commented Sep 24, 2021

Description

Fixes ImmutableRegressionInfo so it generates the regression ids in lexicographic order. It then fixes all the models so they correctly map their dimensions through the ImmutableRegressionInfo so even if it's not in lexicographic order then the models produce the correct output. Finally it adds deserialization hooks to LibLinearRegressionModel, LibSVMRegressionModel, XGBoostModel and SparseLinearModels trained using ElasticNetCDTrainer so that older models are rewritten on deserialization to fix the issue. Once those models have been loaded by a version of Tribuo with this fix they will work correctly in older versions (though if you then save them again in the older version they will become irretrievably corrupt as it'll lose the flag that tracks if the model has been fixed).

Separately this PR also fixes an issue where multidimensional LibSVM models are corrupted when using the standardization flag. Only the first dimension was stored correctly, the others are copies of that dimension.

Unfortunately tree based models (TreeModel<Regresssor>, IndependentRegressionTreeModel and any ensembles created using them) are more deeply corrupted and a fix would require completely rewriting the tree structure. We recommend users retrain those models in a newer version of Tribuo which contains this fix. Additionally TensorFlow based multidimensional regression models are likely impacted by this issue and should be retrained.

One further additional change in this PR is that XGBoostModel now correctly reports the top features on a per regression dimension basis rather than aggregating them into a single list as it did previously. This was useful for debugging the various fixes to XGBoost.

Motivation

Multidimensional regression models in v4.0 and v4.1.0 have an incorrect mapping between ids and regression dimension names. The root cause is two-fold, due to an implicit contract which was not enforced (which is that the regression id numbers are generated using a lexicographic sort of the regression dimension names). The first cause is that when building an ImmutableRegressionInfo the labels were accidentally put into a non-sorted Map, and then iterated over in that map causing the ids to be hash order dependent, rather than lexicographically sorted. This shouldn't have been enough to break Tribuo, as all id mappings should go through the ImmutableRegressionInfo anyway. The second cause is that they didn't and several trainers/models assumed that the regression values were stored in id order in the Regressor object itself.

The LinearSGDModel and SparseLinearModels (aside from ones generated using the ElasticNetCDTrainer) are correct in older versions of Tribuo. LibLinear, LibSVM and XGBoost models produce the correct output, but are corrupted internally so store the models in the incorrect locations. Tree models are entirely corrupted and should be retrained from scratch.

Note single dimensional regression models are unaffected by this bug as it only occurs when there are two dimensions which could be stored out of order, and we expect those are the vast majority of trained regression models.

… it synchronized and resetting the RNG everywhere.
…hen deserializing old models that have been fixed.
@Craigacp Craigacp added Oracle employee This PR is from an Oracle employee squash-commits Squash the commits when merging this PR labels Sep 27, 2021
Copy link
Member

@pogren pogren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

most of the code is updated references to the dimension ids using the updated mapping, a bunch of new/updated unit tests to make sure it all works correctly, new readObject methods, and a rewrite of ImmutableRegressionInfo
some smaller changes include: fixing LinearRegressionType return values for isClassification and isRegression, changing use of LinkedList to ArrayDeque

@Craigacp Craigacp merged commit 823beb4 into main Oct 1, 2021
@Craigacp Craigacp deleted the regression-id-fix branch October 1, 2021 21:44
Craigacp added a commit that referenced this pull request Oct 1, 2021
* Fixing RegressionInfo so the ids are assigned in lexicographic order.

* Fixing the id assignment issue in LibLinearRegressionTrainer.

* Actually fix the issue in LibLinearRegressionTrainer.

* Fix a bug where LibLinearRegressionType reported itself as classification.

* Fixing LibSVMRegressionTrainer.

* Fix XGBoost.

* Fix a bug in standardized multidimensional LibSVM regressions.

* Working on tests for LibSVM regression.

* Adding mapping methods to the regression info.

* Trying a fix for LibLinear.

* Fixing a concurrency and reproducibility issue in liblinear by making it synchronized and resetting the RNG everywhere.

* Tidying up the liblinear tests.

* Fixing TensorFlow.

* Fixing regression trees.

* Updating XGBoost fix.

* Fix for liblinear so models deserialize correctly.

* Adding an example config file for CART regression trees.

* Fixing LibSVM deserialization.

* Fixing ElasticNetCDTrainer as it also emitted corrupted multidimensional models. Adding a test to SGD linear.

* Fixing trees.

* Adding an id test to the regression ensembles.

* Fixing TensorFlow again.

* Fixing the regression SGD test so it is reproducible.

* Fix XGBoost so it re-orders things on deserialization.

* Fix for XGBoost and SLM so they don't re-order the dimensions twice when deserializing old models that have been fixed.

PR Text:
Fixes `ImmutableRegressionInfo` so it generates the regression ids in lexicographic order. It then fixes all the models so they correctly map their dimensions through the `ImmutableRegressionInfo` so even if it's not in lexicographic order then the models produce the correct output. Finally it adds deserialization hooks to `LibLinearRegressionModel`, `LibSVMRegressionModel`, `XGBoostModel` and `SparseLinearModel`s trained using `ElasticNetCDTrainer` so that older models are rewritten on deserialization to fix the issue. Once those models have been loaded by a version of Tribuo with this fix they will work correctly in older versions (though if you then save them again in the older version they will become irretrievably corrupt as it'll lose the flag that tracks if the model has been fixed).

Separately this PR also fixes an issue where multidimensional LibSVM models are corrupted when using the standardization flag. Only the first dimension was stored correctly, the others are copies of that dimension.

Unfortunately tree based models (`TreeModel<Regresssor>`, `IndependentRegressionTreeModel` and any ensembles created using them) are more deeply corrupted and a fix would require completely rewriting the tree structure. We recommend users retrain those models in a newer version of Tribuo which contains this fix. Additionally TensorFlow based multidimensional regression models are likely impacted by this issue and should be retrained.

One further additional change in this PR is that `XGBoostModel` now correctly reports the top features on a per regression dimension basis rather than aggregating them into a single list as it did previously. This was useful for debugging the various fixes to XGBoost.

Multidimensional regression models in v4.0 and v4.1.0 have an incorrect mapping between ids and regression dimension names. The root cause is two-fold, due to an implicit contract which was not enforced (which is that the regression id numbers are generated using a lexicographic sort of the regression dimension names). The first cause is that when building an `ImmutableRegressionInfo` the labels were accidentally put into a non-sorted Map, and then iterated over in that map causing the ids to be hash order dependent, rather than lexicographically sorted. This shouldn't have been enough to break Tribuo, as all id mappings should go through the `ImmutableRegressionInfo` anyway. The second cause is that they didn't and several trainers/models assumed that the regression values were stored in id order in the `Regressor` object itself.

The LinearSGDModel and SparseLinearModels (aside from ones generated using the ElasticNetCDTrainer) are correct in older versions of Tribuo. LibLinear, LibSVM and XGBoost models produce the correct output, but are corrupted internally so store the models in the incorrect locations. Tree models are entirely corrupted and should be retrained from scratch.

Note single dimensional regression models are unaffected by this bug as it only occurs when there are two dimensions which could be stored out of order, and we expect those are the vast majority of trained regression models.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Oracle employee This PR is from an Oracle employee squash-commits Squash the commits when merging this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants