Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Addition of a Reproducibility Framework #185

Merged
merged 42 commits into from
Oct 27, 2021
Merged

Conversation

jwons
Copy link
Contributor

@jwons jwons commented Oct 19, 2021

Description

Adding a reproducibility framework that uses Tribuo's self-describing models and OLCUT to re-train a given model. It also exposes a configuration manager allowing users to modify the configuration before re-training. This framework also can create a diff between two model provenance objects, for the purpose of comparing a model to its reproduced version.

The framework is a new module within Tribuo that is targeted at Java 16. The module contains a new class, ReproUtil.
A ReproUtil must be instantiated with either a Model or ModelProvenance object. Once instantiated, users can access the OLCUT configuration manager using a getter method. Two methods on the ReproUtil object will re-train a model.
Either reproduceFromProvennace or reproduceFromModel. The reproduceFromModel will only execute if the user passed a Model object upon initial instantiation. Both methods will retrain a model from the configuration stored within a the model's provenance; however, reproducing from a model will additionally compare metadata not stored in the provenance (feature and output domains).

Motivation

This is a new feature that allows automatic reproducibility for any Tribuo-trained model. This framework helps increase the reproducibility of data science conducted using Tribuo, and can act as a basis for an experimental framework since the framework control all aspects of a model's configuration.

jwons added 28 commits June 24, 2021 14:41
…m prov with csv, from prov with csv that used traintestsplitterm from a model, and from prov where dataPath is changed
…overrideConfigurableProperties, and removed org.tribuo.* import.
… to set the invocationCount within the sync block. Allows for deterministic BaggingTraining even in parallel.
…ucibility

Fixing the merge conflict caused by a import in KMeansTest
Merge upstream datasource, ONNX, and test changes to reproducibility framework
…tsplitter since splitter is not configurable
…d specialized code for changing some component configurations in repro framework now that ConfigurationManager has a getter, using set operations on provenance objects keys to be more robust to different prov objects, add objects to the diff that are NOT in the interesection of two objects, add reproducibility test for CSVLoader, add test to do a prov diff where one model was trained with a transformTrainer
…ationCount implementations where the rng had to be split. Added tests in Reproducibility for Bagging Trainer. Rolled back OLCUT version requirement so it can be merged in from main branch.
…t changes for reproducibility and changes to LibSVMTrainer and documentation
Merging the conflicts from pulling from oracle/main
…in the prov are now set, added some future work comments and other documentation, and switched the diff to apply 'original' and 'reproduced' to only primitive values
…pendentMultiLabelTrainer based on Adam and Jack's bugfix
@Craigacp Craigacp self-assigned this Oct 20, 2021
@Craigacp Craigacp added the OCA signed This PR is from a person/organisation who has signed the OCA label Oct 20, 2021
@Craigacp Craigacp added the squash-commits Squash the commits when merging this PR label Oct 20, 2021
Copy link
Member

@Craigacp Craigacp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's clean up the imports, exceptions and variable names. I have some more detailed comments about the ReproUtil class which we should follow up in the threads.

Core/src/main/java/org/tribuo/Trainer.java Outdated Show resolved Hide resolved
Core/src/main/java/org/tribuo/Trainer.java Show resolved Hide resolved
MutableDataset datasetFromCSV = new MutableDataset<Label>(csvSource);

LogisticRegressionTrainer trainer = new LogisticRegressionTrainer();
LinearSGDModel model = (LinearSGDModel) trainer.train(datasetFromCSV);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we should just fix LogisticRegressionTrainer so LinearSGDTrainer returns a LinearSGDModel.

@Craigacp
Copy link
Member

So it looks like the tests directly pass on Java 16 which is good, and I'll poke at the CI a little to figure that out.

Copy link
Member

@Craigacp Craigacp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few more small things.

val_diff.put(NEW, ((PrimitiveProvenance<?>) provMapB.get(key)).getValue().toString());
report.set(key, val_diff);
ObjectNode valDiff = mapper.createObjectNode();
valDiff.put(OLD, ((PrimitiveProvenance<?>) provMapA.get(key)).getValue().toString());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pattern match? provenanceA instanceof PrimitiveProvenance<?> primProvA then use primProvA to make the valDiff.put shorter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now pattern matches

Copy link
Member

@Craigacp Craigacp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small thing on the exception signatures. I'll work on the CI today then we can get that fix in and merge.

jwons added 3 commits October 26, 2021 14:25
…lude runtime exceptions. DiffProvenance string now written by ObjectMapper
Copy link
Member

@Craigacp Craigacp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two small things, then I think we're done. I'll tidy up the javadoc once this has merged, and I'll work on the CI tomorrow morning.

@@ -371,14 +435,25 @@ public ConfigurationManager getConfigurationManager(){

return newModel;
}

public record FeatureDiff (Set originalFeatures, Set reproducedFeatures){}
public record OutputDiff (Set originalOutput, Set reproducedOutput){}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These sets need a generic type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are the right types:

public record FeatureDiff (Set<String> originalFeatures, Set<String> reproducedFeatures){}
public record OutputDiff<T extends Output>(Set<T> originalOutput, Set<T> reproducedOutput){}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<T extends Output<T>>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@Craigacp Craigacp marked this pull request as ready for review October 27, 2021 21:13
Copy link
Member

@Craigacp Craigacp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work Joe.

@Craigacp Craigacp merged commit 57c4c1e into oracle:main Oct 27, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
OCA signed This PR is from a person/organisation who has signed the OCA squash-commits Squash the commits when merging this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants