Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removes Apache Commons Math #241

Merged
merged 51 commits into from
Jul 28, 2022
Merged

Conversation

Craigacp
Copy link
Member

Description

Add replacement functionality for the things Tribuo used from Apache Commons Math 3.6.1 & migrate all Tribuo uses over to that functionality.

Replaced functionality:

  • Gamma function used to compute normalized mutual information & chi squared CDF. The gamma function is ported to Java from fdlibm.
  • Cholesky factorization used when sampling from multivariate normal.
  • LU factorization used when building sparse linear models.
  • Eigenvalue decomposition used when sampling from multivariate normal (to preserve compatibility with Apache Commons Math's multivariate normal sampler).
  • Multivariate Gaussian distribution (at the moment only supports sampling, but we will look at adding pdf computation later).

As a result there has been a slight rewrite of the clustering data generators, some methods in the information theory package, and a more drastic rewrite of the sparse linear models code which extensively used ACM's linear algebra package.

The various user facing methods produce approximately the same answers as before, though there are small differences at the level of numerical precision due to different algorithms or the ordering of floating point operations. The sparse linear models code is now much faster as it properly caches the matrix inverse used in several parts of the algorithm.

Note applications which transitively pulled in Apache Commons Math via Tribuo will now need to add an explicit dependency on it.

Motivation

Migrating away from commons math will make our release process simpler, and migrating to our own linear algebra everywhere means when we vectorise it everything in Tribuo will benefit. Also the work to migrate the sparse linear models to Tribuo's linear algebra made it easy to spot that we inverted the same matrix three times, and now we only do it once so it is much faster.

@Craigacp Craigacp added the Oracle employee This PR is from an Oracle employee label Jun 13, 2022
@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Jun 13, 2022
@@ -184,4 +185,34 @@ public static <T extends Output<T>> void testSequenceModelSerialization(Sequence
Assertions.fail("Failed to deserialize sequence model class " + model.getClass().toString(), ex);
}
}

public static boolean topFeaturesEqual(Map<String, List<Pair<String,Double>>> first, Map<String, List<Pair<String,Double>>> second, double tolerance) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method adds 30+ lines of code but is never used. Probably harmless given that its in test code - but still a bit suspicious.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, I frequently use that kind of method when I'm comparing implementations, but I kept taking it back out again and then having to remember how to re-implement it (or fake it up using watches in the debugger). So this time I just committed it. I agree it's unused most of the time which is annoying.

* @param second The second vector.
* @return The expected mutual information under a hypergeometric distribution.
*/
public static <T> double expectedMI(List<T> first, List<T> second) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method was formerly a private method in ClusteringMetrics and is now a public method here - seems like it deserves a unit test. Three options seems reasonable - generate some test data using a method from a similar library (suggestions welcome!), generate some regression data with this method and test on that (tautological, I suppose, but at least you will know if the behavior changes), or work out some simple examples by hand. Thoughts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can work out some examples to test.

* @param value The observed value.
* @return The cumulative probability of the observed value.
*/
private static double computeChiSquaredProbability(int degreesOfFreedom, double value) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI - this method does not work for odd values of degreesOfFreedom.

from scipy.stats import chi2
print(chi2.cdf(3.84,3))

maybe that's why this method is private....

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

works fine for odd values, btw.

@Test
void testComputeChiSquaredProbability() throws Exception {
// assertEquals(0.9499564787512949, InformationTheory.computeChiSquaredProbability(1, 3.84), 1e-14);
assertEquals(0.8533930378696499, InformationTheory.computeChiSquaredProbability(2, 3.84), 1e-14);
// assertEquals(0.7207323828813903, InformationTheory.computeChiSquaredProbability(3, 3.84), 1e-14);
assertEquals(0.5719076705793775, InformationTheory.computeChiSquaredProbability(4, 3.84), 1e-14);
assertEquals(0.9995692574594243, InformationTheory.computeChiSquaredProbability(2, 15.5), 1e-14);
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I need it to compute the GTest but don't want to expose it out as I've not checked that it's valid for other use cases. I'd like to add a fuller stats package to Tribuo at some point, but in the meantime these functions will just appear as private things that we need for various hypothesis tests that we plan to add.

Copy link
Member

@pogren pogren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • ClusteringMetrics is a straightforward refactor that uses the same expectedMI method as before except now it is a public method in InformationTheory - added test for adjustedMI which helped uncover a discrepancy in expectedMI which was updated.
  • InformationTheory added test for mi and entropy
  • ClusteringDataGenerator and GaussianClusterDataSource are straightforward refactors assuming the new MultivariateNormalDistribution is a good/equivalent substitution for the ACM impl.
  • SparseLinearModel just changes a couple parameter names and adds a bit of javadoc.
  • DenseVector I added a unit test for meanVariance() and the new reduce method that takes a BiFunction parameter.
  • Gamma - added unit test for Gamma.gamma using output from scipy and various examples
  • DenseMatrix - added additional unit tests for lu factorization, cholensky, and eigendecomp, setColumn, selectColumn,
    • requested change to LUFactorization.l and .u to lower and upper
  • DenseSparseMatrix - added simple unit tests for createIdentity, createDiagonal, getColumn
  • Matrix interface looks fine.
  • The trainers all look like they have been straightforwardly refactored to use the new apis and the tests demonstrate that you can kick them into motion and they do something reasonable.

@Craigacp Craigacp merged commit faf9179 into oracle:main Jul 28, 2022
@Craigacp Craigacp deleted the commons-math-removal branch July 28, 2022 22:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
OCA Verified All contributors have signed the Oracle Contributor Agreement. Oracle employee This PR is from an Oracle employee
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants