-
Notifications
You must be signed in to change notification settings - Fork 177
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Removes Apache Commons Math #241
Conversation
…rates ClusteringDataGenerator and GaussianClusterDataSource over to use it.
… dense la package.
…actoring it's now a lot faster.
…yFactorization tests, and other small bits of tidying.
…arseMatrix. Adding solve and inverse methods to EigenDecomposition along with tests.
@@ -184,4 +185,34 @@ public static <T extends Output<T>> void testSequenceModelSerialization(Sequence | |||
Assertions.fail("Failed to deserialize sequence model class " + model.getClass().toString(), ex); | |||
} | |||
} | |||
|
|||
public static boolean topFeaturesEqual(Map<String, List<Pair<String,Double>>> first, Map<String, List<Pair<String,Double>>> second, double tolerance) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method adds 30+ lines of code but is never used. Probably harmless given that its in test code - but still a bit suspicious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, I frequently use that kind of method when I'm comparing implementations, but I kept taking it back out again and then having to remember how to re-implement it (or fake it up using watches in the debugger). So this time I just committed it. I agree it's unused most of the time which is annoying.
* @param second The second vector. | ||
* @return The expected mutual information under a hypergeometric distribution. | ||
*/ | ||
public static <T> double expectedMI(List<T> first, List<T> second) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method was formerly a private method in ClusteringMetrics and is now a public method here - seems like it deserves a unit test. Three options seems reasonable - generate some test data using a method from a similar library (suggestions welcome!), generate some regression data with this method and test on that (tautological, I suppose, but at least you will know if the behavior changes), or work out some simple examples by hand. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I can work out some examples to test.
…over to record style accessors.
fixes discrepancy in expectedMI due to incomplete iteration in loop
* @param value The observed value. | ||
* @return The cumulative probability of the observed value. | ||
*/ | ||
private static double computeChiSquaredProbability(int degreesOfFreedom, double value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI - this method does not work for odd values of degreesOfFreedom
.
from scipy.stats import chi2
print(chi2.cdf(3.84,3))
maybe that's why this method is private....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
works fine for odd values, btw.
@Test
void testComputeChiSquaredProbability() throws Exception {
// assertEquals(0.9499564787512949, InformationTheory.computeChiSquaredProbability(1, 3.84), 1e-14);
assertEquals(0.8533930378696499, InformationTheory.computeChiSquaredProbability(2, 3.84), 1e-14);
// assertEquals(0.7207323828813903, InformationTheory.computeChiSquaredProbability(3, 3.84), 1e-14);
assertEquals(0.5719076705793775, InformationTheory.computeChiSquaredProbability(4, 3.84), 1e-14);
assertEquals(0.9995692574594243, InformationTheory.computeChiSquaredProbability(2, 15.5), 1e-14);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I need it to compute the GTest but don't want to expose it out as I've not checked that it's valid for other use cases. I'd like to add a fuller stats package to Tribuo at some point, but in the meantime these functions will just appear as private things that we need for various hypothesis tests that we plan to add.
eigendecomposition
fixes discrepancy in expectedMI due to incomplete iteration in loop
left in NaN check though so tests would still pass.
Commons math removal pvo review
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- ClusteringMetrics is a straightforward refactor that uses the same expectedMI method as before except now it is a public method in InformationTheory - added test for adjustedMI which helped uncover a discrepancy in expectedMI which was updated.
- InformationTheory added test for mi and entropy
- ClusteringDataGenerator and GaussianClusterDataSource are straightforward refactors assuming the new MultivariateNormalDistribution is a good/equivalent substitution for the ACM impl.
- SparseLinearModel just changes a couple parameter names and adds a bit of javadoc.
- DenseVector I added a unit test for meanVariance() and the new reduce method that takes a BiFunction parameter.
- Gamma - added unit test for Gamma.gamma using output from scipy and various examples
- DenseMatrix - added additional unit tests for lu factorization, cholensky, and eigendecomp, setColumn, selectColumn,
- requested change to LUFactorization.l and .u to lower and upper
- DenseSparseMatrix - added simple unit tests for createIdentity, createDiagonal, getColumn
- Matrix interface looks fine.
- The trainers all look like they have been straightforwardly refactored to use the new apis and the tests demonstrate that you can kick them into motion and they do something reasonable.
Description
Add replacement functionality for the things Tribuo used from Apache Commons Math 3.6.1 & migrate all Tribuo uses over to that functionality.
Replaced functionality:
As a result there has been a slight rewrite of the clustering data generators, some methods in the information theory package, and a more drastic rewrite of the sparse linear models code which extensively used ACM's linear algebra package.
The various user facing methods produce approximately the same answers as before, though there are small differences at the level of numerical precision due to different algorithms or the ordering of floating point operations. The sparse linear models code is now much faster as it properly caches the matrix inverse used in several parts of the algorithm.
Note applications which transitively pulled in Apache Commons Math via Tribuo will now need to add an explicit dependency on it.
Motivation
Migrating away from commons math will make our release process simpler, and migrating to our own linear algebra everywhere means when we vectorise it everything in Tribuo will benefit. Also the work to migrate the sparse linear models to Tribuo's linear algebra made it easy to spot that we inverted the same matrix three times, and now we only do it once so it is much faster.