-
Notifications
You must be signed in to change notification settings - Fork 177
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Addition of a Reproducibility Framework #185
Conversation
…elProvenance or Model object
…en so data paths can be changed
…m prov with csv, from prov with csv that used traintestsplitterm from a model, and from prov where dataPath is changed
…overrideConfigurableProperties, and removed org.tribuo.* import.
… to set the invocationCount within the sync block. Allows for deterministic BaggingTraining even in parallel.
…cts stored as lists
…ucibility Fixing the merge conflict caused by a import in KMeansTest
Merge upstream datasource, ONNX, and test changes to reproducibility framework
…dels when creating reproducibility object
…ve warnings related to poor use of generics
…tsplitter since splitter is not configurable
…d specialized code for changing some component configurations in repro framework now that ConfigurationManager has a getter, using set operations on provenance objects keys to be more robust to different prov objects, add objects to the diff that are NOT in the interesection of two objects, add reproducibility test for CSVLoader, add test to do a prov diff where one model was trained with a transformTrainer
…ationCount implementations where the rng had to be split. Added tests in Reproducibility for Bagging Trainer. Rolled back OLCUT version requirement so it can be merged in from main branch.
…t changes for reproducibility and changes to LibSVMTrainer and documentation
Merging the conflicts from pulling from oracle/main
…in the prov are now set, added some future work comments and other documentation, and switched the diff to apply 'original' and 'reproduced' to only primitive values
… not be reproduced
…pendentMultiLabelTrainer based on Adam and Jack's bugfix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's clean up the imports, exceptions and variable names. I have some more detailed comments about the ReproUtil
class which we should follow up in the threads.
...sification/Core/src/main/java/org/tribuo/classification/baseline/DummyClassifierTrainer.java
Outdated
Show resolved
Hide resolved
Classification/DecisionTree/src/test/java/org/tribuo/classification/dtree/TestCART.java
Outdated
Show resolved
Hide resolved
Reproducibility/src/main/test/java/org.tribuo.reproducibility/ReproUtilTest.java
Outdated
Show resolved
Hide resolved
Reproducibility/src/main/test/java/org.tribuo.reproducibility/ReproUtilTest.java
Outdated
Show resolved
Hide resolved
Reproducibility/src/main/test/java/org.tribuo.reproducibility/ReproUtilTest.java
Outdated
Show resolved
Hide resolved
MutableDataset datasetFromCSV = new MutableDataset<Label>(csvSource); | ||
|
||
LogisticRegressionTrainer trainer = new LogisticRegressionTrainer(); | ||
LinearSGDModel model = (LinearSGDModel) trainer.train(datasetFromCSV); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we should just fix LogisticRegressionTrainer
so LinearSGDTrainer
returns a LinearSGDModel
.
Reproducibility/src/main/test/java/org.tribuo.reproducibility/ReproUtilTest.java
Outdated
Show resolved
Hide resolved
So it looks like the tests directly pass on Java 16 which is good, and I'll poke at the CI a little to figure that out. |
…nused imports, and more.
…static block, can add pretty printing options now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few more small things.
Reproducibility/src/main/java/org/tribuo/reproducibility/ReproUtil.java
Outdated
Show resolved
Hide resolved
Reproducibility/src/main/java/org/tribuo/reproducibility/ReproUtil.java
Outdated
Show resolved
Hide resolved
Reproducibility/src/main/java/org/tribuo/reproducibility/ReproUtil.java
Outdated
Show resolved
Hide resolved
val_diff.put(NEW, ((PrimitiveProvenance<?>) provMapB.get(key)).getValue().toString()); | ||
report.set(key, val_diff); | ||
ObjectNode valDiff = mapper.createObjectNode(); | ||
valDiff.put(OLD, ((PrimitiveProvenance<?>) provMapA.get(key)).getValue().toString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pattern match? provenanceA instanceof PrimitiveProvenance<?> primProvA
then use primProvA
to make the valDiff.put
shorter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now pattern matches
Reproducibility/src/main/test/java/org.tribuo.reproducibility/ReproUtilTest.java
Show resolved
Hide resolved
Reproducibility/src/main/test/java/org.tribuo.reproducibility/ReproUtilTest.java
Outdated
Show resolved
Hide resolved
… licenses in repro module, cleaned up exceptions, fixed imports, and other small code review changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One small thing on the exception signatures. I'll work on the CI today then we can get that fix in and merge.
Reproducibility/src/main/java/org/tribuo/reproducibility/ReproUtil.java
Outdated
Show resolved
Hide resolved
…lude runtime exceptions. DiffProvenance string now written by ObjectMapper
…ure diff, output diff, and provenance diff
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two small things, then I think we're done. I'll tidy up the javadoc once this has merged, and I'll work on the CI tomorrow morning.
Reproducibility/src/main/java/org/tribuo/reproducibility/ReproUtil.java
Outdated
Show resolved
Hide resolved
@@ -371,14 +435,25 @@ public ConfigurationManager getConfigurationManager(){ | |||
|
|||
return newModel; | |||
} | |||
|
|||
public record FeatureDiff (Set originalFeatures, Set reproducedFeatures){} | |||
public record OutputDiff (Set originalOutput, Set reproducedOutput){} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These sets need a generic type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these are the right types:
public record FeatureDiff (Set<String> originalFeatures, Set<String> reproducedFeatures){}
public record OutputDiff<T extends Output>(Set<T> originalOutput, Set<T> reproducedOutput){}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<T extends Output<T>>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work Joe.
Description
Adding a reproducibility framework that uses Tribuo's self-describing models and OLCUT to re-train a given model. It also exposes a configuration manager allowing users to modify the configuration before re-training. This framework also can create a diff between two model provenance objects, for the purpose of comparing a model to its reproduced version.
The framework is a new module within Tribuo that is targeted at Java 16. The module contains a new class,
ReproUtil
.A
ReproUtil
must be instantiated with either aModel
orModelProvenance
object. Once instantiated, users can access the OLCUT configuration manager using a getter method. Two methods on theReproUtil
object will re-train a model.Either
reproduceFromProvennace
orreproduceFromModel
. ThereproduceFromModel
will only execute if the user passed aModel
object upon initial instantiation. Both methods will retrain a model from the configuration stored within a the model's provenance; however, reproducing from a model will additionally compare metadata not stored in the provenance (feature and output domains).Motivation
This is a new feature that allows automatic reproducibility for any Tribuo-trained model. This framework helps increase the reproducibility of data science conducted using Tribuo, and can act as a basis for an experimental framework since the framework control all aspects of a model's configuration.