Removes Apache Commons Math #241

Craigacp · 2022-06-13T16:29:30Z

Description

Adds replacements for the functionality Tribuo used from Apache Commons Math 3.6.1 and migrates all Tribuo uses over to them.

Replaced functionality:

Gamma function used to compute normalized mutual information & chi squared CDF. The gamma function is ported to Java from fdlibm.
Cholesky factorization used when sampling from multivariate normal.
LU factorization used when building sparse linear models.
Eigenvalue decomposition used when sampling from multivariate normal (to preserve compatibility with Apache Commons Math's multivariate normal sampler).
Multivariate Gaussian distribution (at the moment only supports sampling, but we will look at adding pdf computation later).
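
To illustrate how a Cholesky factor is used for multivariate normal sampling, here is a minimal standalone sketch. It is not Tribuo's implementation: the class and method names (CholeskySamplingSketch, cholesky, sample) are hypothetical and it works on plain double arrays rather than DenseMatrix. The idea is that if L L^T equals the covariance, then mean + L z has that covariance when z is a vector of independent standard normals.

```java
import java.util.Random;

/** Illustrative only; names and array-based API are assumptions, not Tribuo's classes. */
final class CholeskySamplingSketch {

    /** Returns the lower-triangular L with L * L^T = a, for a symmetric positive definite a. */
    static double[][] cholesky(double[][] a) {
        int n = a.length;
        double[][] l = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j <= i; j++) {
                double sum = a[i][j];
                for (int k = 0; k < j; k++) {
                    sum -= l[i][k] * l[j][k];
                }
                if (i == j) {
                    if (sum <= 0.0) {
                        throw new IllegalArgumentException("Matrix is not positive definite");
                    }
                    l[i][i] = Math.sqrt(sum);
                } else {
                    l[i][j] = sum / l[j][j];
                }
            }
        }
        return l;
    }

    /** Draws x = mean + L z, where z holds independent standard normal draws. */
    static double[] sample(double[] mean, double[][] l, Random rng) {
        int n = mean.length;
        double[] z = new double[n];
        for (int i = 0; i < n; i++) {
            z[i] = rng.nextGaussian();
        }
        double[] x = new double[n];
        for (int i = 0; i < n; i++) {
            double value = mean[i];
            // L is lower triangular, so only columns 0..i contribute to row i.
            for (int k = 0; k <= i; k++) {
                value += l[i][k] * z[k];
            }
            x[i] = value;
        }
        return x;
    }

    public static void main(String[] args) {
        double[][] covariance = {{4.0, 2.0}, {2.0, 3.0}};
        double[][] l = cholesky(covariance);
        double[] x = sample(new double[]{1.0, -1.0}, l, new Random(42L));
        System.out.println(x[0] + ", " + x[1]);
    }
}
```

The eigenvalue decomposition path mentioned above plays the same role as L here, with a different square root of the covariance, which is why the two samplers produce different draws from the same seed.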

As a result there has been a slight rewrite of the clustering data generators, some methods in the information theory package, and a more drastic rewrite of the sparse linear models code which extensively used ACM's linear algebra package.

The various user-facing methods produce approximately the same answers as before, though there are small differences at the level of numerical precision due to different algorithms or the ordering of floating-point operations. The sparse linear models code is now much faster as it properly caches the matrix inverse used in several parts of the algorithm.

Note: applications that transitively pulled in Apache Commons Math via Tribuo will now need to add an explicit dependency on it.

Motivation

Migrating away from commons math will make our release process simpler, and migrating to our own linear algebra everywhere means when we vectorise it everything in Tribuo will benefit. Also the work to migrate the sparse linear models to Tribuo's linear algebra made it easy to spot that we inverted the same matrix three times, and now we only do it once so it is much faster.
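
As a rough illustration of the "factor once, reuse many times" point, the sketch below builds an LU factorization with partial pivoting a single time and then solves against several right-hand sides. It is a standalone example under assumed names (LuReuseSketch, solve) on plain arrays, not the code in Tribuo's la package or the sparse linear model trainer.

```java
/** Illustrative only; the class name and array-based API are assumptions. */
final class LuReuseSketch {
    private final double[][] lu; // unit lower triangle L and upper triangle U stored together
    private final int[] perm;    // row i of the factorization is row perm[i] of the input

    /** Factorizes a square matrix once; the result is reused by every call to solve. */
    LuReuseSketch(double[][] a) {
        int n = a.length;
        lu = new double[n][];
        perm = new int[n];
        for (int i = 0; i < n; i++) {
            lu[i] = a[i].clone();
            perm[i] = i;
        }
        for (int col = 0; col < n; col++) {
            // Partial pivoting: pick the largest magnitude entry in this column.
            int pivot = col;
            for (int r = col + 1; r < n; r++) {
                if (Math.abs(lu[r][col]) > Math.abs(lu[pivot][col])) {
                    pivot = r;
                }
            }
            if (lu[pivot][col] == 0.0) {
                throw new IllegalArgumentException("Matrix is singular");
            }
            double[] tmpRow = lu[col]; lu[col] = lu[pivot]; lu[pivot] = tmpRow;
            int tmpIdx = perm[col]; perm[col] = perm[pivot]; perm[pivot] = tmpIdx;
            for (int r = col + 1; r < n; r++) {
                lu[r][col] /= lu[col][col];
                for (int c = col + 1; c < n; c++) {
                    lu[r][c] -= lu[r][col] * lu[col][c];
                }
            }
        }
    }

    /** Solves A x = b using the cached factors: O(n^2) per call instead of O(n^3). */
    double[] solve(double[] b) {
        int n = lu.length;
        double[] x = new double[n];
        // Forward substitution: L y = P b.
        for (int i = 0; i < n; i++) {
            double sum = b[perm[i]];
            for (int k = 0; k < i; k++) {
                sum -= lu[i][k] * x[k];
            }
            x[i] = sum;
        }
        // Back substitution: U x = y.
        for (int i = n - 1; i >= 0; i--) {
            double sum = x[i];
            for (int k = i + 1; k < n; k++) {
                sum -= lu[i][k] * x[k];
            }
            x[i] = sum / lu[i][i];
        }
        return x;
    }

    public static void main(String[] args) {
        LuReuseSketch factors = new LuReuseSketch(new double[][]{{2.0, 1.0}, {1.0, 3.0}});
        // One factorization, several solves.
        double[] x1 = factors.solve(new double[]{3.0, 5.0});
        double[] x2 = factors.solve(new double[]{1.0, 0.0});
        System.out.println(x1[0] + ", " + x1[1] + " | " + x2[0] + ", " + x2[1]);
    }
}
```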

pogren · 2022-07-22T15:24:46Z

Core/src/test/java/org/tribuo/test/Helpers.java

@@ -184,4 +185,34 @@ public static <T extends Output<T>> void testSequenceModelSerialization(Sequence
            Assertions.fail("Failed to deserialize sequence model class " + model.getClass().toString(), ex);
        }
    }
+
+    public static boolean topFeaturesEqual(Map<String, List<Pair<String,Double>>> first, Map<String, List<Pair<String,Double>>> second, double tolerance)  {


This method adds 30+ lines of code but is never used. Probably harmless given that it's in test code - but still a bit suspicious.

Ahh, I frequently use that kind of method when I'm comparing implementations, but I kept taking it back out again and then having to remember how to re-implement it (or fake it up using watches in the debugger). So this time I just committed it. I agree it's unused most of the time which is annoying.
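
For readers who have not seen the helper, a tolerance-based comparison of top feature maps might look roughly like the sketch below. This is not the body committed in Helpers.java: it swaps the Pair type from the signature above for java.util.Map.Entry so the example is self-contained, it assumes both lists are in the same order, and it needs Java 9+ for Map.of, List.of and Map.entry.

```java
import java.util.List;
import java.util.Map;

/** Illustrative only; uses Map.Entry in place of the Pair type in the real signature. */
final class TopFeaturesSketch {

    /** True if both maps have the same keys, the same feature names in order, and weights within tolerance. */
    static boolean topFeaturesEqual(Map<String, List<Map.Entry<String, Double>>> first,
                                    Map<String, List<Map.Entry<String, Double>>> second,
                                    double tolerance) {
        if (!first.keySet().equals(second.keySet())) {
            return false;
        }
        for (Map.Entry<String, List<Map.Entry<String, Double>>> e : first.entrySet()) {
            List<Map.Entry<String, Double>> a = e.getValue();
            List<Map.Entry<String, Double>> b = second.get(e.getKey());
            if (a.size() != b.size()) {
                return false;
            }
            for (int i = 0; i < a.size(); i++) {
                // Feature names must match exactly; weights only up to the tolerance.
                if (!a.get(i).getKey().equals(b.get(i).getKey())) {
                    return false;
                }
                if (Math.abs(a.get(i).getValue() - b.get(i).getValue()) > tolerance) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, List<Map.Entry<String, Double>>> oldModel =
                Map.of("DIM-0", List.of(Map.entry("f1", 1.0), Map.entry("f2", 0.5)));
        Map<String, List<Map.Entry<String, Double>>> newModel =
                Map.of("DIM-0", List.of(Map.entry("f1", 1.0 + 1e-9), Map.entry("f2", 0.5)));
        System.out.println(topFeaturesEqual(oldModel, newModel, 1e-6));
    }
}
```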

pogren · 2022-07-22T15:27:41Z

Util/InformationTheory/src/main/java/org/tribuo/util/infotheory/InformationTheory.java

+     * @param second The second vector.
+     * @return The expected mutual information under a hypergeometric distribution.
+     */
+    public static <T> double expectedMI(List<T> first, List<T> second) {


This method was formerly a private method in ClusteringMetrics and is now a public method here - seems like it deserves a unit test. Three options seem reasonable - generate some test data using a method from a similar library (suggestions welcome!), generate some regression data with this method and test on that (tautological, I suppose, but at least you will know if the behavior changes), or work out some simple examples by hand. Thoughts?

Sure, I can work out some examples to test.
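
As a starting point for the "work out simple examples by hand" option, the sketch below computes expected mutual information using the standard hypergeometric formulation (natural log) and includes a case that can be checked on paper. It is a standalone illustration with an assumed class name (ExpectedMiSketch), not the method in InformationTheory. With first = second = [0, 1], both possible pairings are perfect matches, so the expected MI is ln 2; if either labelling has a single cluster, it is 0.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only; a textbook formulation, not Tribuo's InformationTheory.expectedMI. */
final class ExpectedMiSketch {

    /** Expected MI (in nats) of two labelings under random permutation with fixed cluster sizes. */
    static <T> double expectedMI(List<T> first, List<T> second) {
        if (first.size() != second.size()) {
            throw new IllegalArgumentException("Lists must be the same length");
        }
        int n = first.size();
        int[] a = counts(first);
        int[] b = counts(second);
        // logFact[i] = ln(i!), so the hypergeometric weights can be computed in log space.
        double[] logFact = new double[n + 1];
        for (int i = 1; i <= n; i++) {
            logFact[i] = logFact[i - 1] + Math.log(i);
        }
        double emi = 0.0;
        for (int ai : a) {
            for (int bj : b) {
                int lower = Math.max(1, ai + bj - n);
                int upper = Math.min(ai, bj);
                // Both bounds are inclusive; skipping the last value silently changes the answer.
                for (int nij = lower; nij <= upper; nij++) {
                    double miTerm = ((double) nij / n)
                            * Math.log(((double) n * nij) / ((double) ai * bj));
                    double logProb = logFact[ai] + logFact[bj] + logFact[n - ai] + logFact[n - bj]
                            - logFact[n] - logFact[nij] - logFact[ai - nij]
                            - logFact[bj - nij] - logFact[n - ai - bj + nij];
                    emi += miTerm * Math.exp(logProb);
                }
            }
        }
        return emi;
    }

    /** Counts how many times each distinct label occurs. */
    private static <T> int[] counts(List<T> labels) {
        Map<T, Integer> map = new HashMap<>();
        for (T label : labels) {
            map.merge(label, 1, Integer::sum);
        }
        int[] out = new int[map.size()];
        int i = 0;
        for (int c : map.values()) {
            out[i++] = c;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expectedMI(Arrays.asList(0, 1), Arrays.asList(0, 1)));
    }
}
```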

pogren · 2022-07-28T17:06:28Z

Util/InformationTheory/src/main/java/org/tribuo/util/infotheory/InformationTheory.java

+     * @param value The observed value.
+     * @return The cumulative probability of the observed value.
+     */
+    private static double computeChiSquaredProbability(int degreesOfFreedom, double value) {


FYI - this method does not work for odd values of degreesOfFreedom.

from scipy.stats import chi2
print(chi2.cdf(3.84,3))

maybe that's why this method is private....

scipy's chi2.cdf works fine for odd values, btw.

@Test
void testComputeChiSquaredProbability() throws Exception {
    // The odd degrees-of-freedom cases (1 and 3) are commented out because they currently fail.
    // assertEquals(0.9499564787512949, InformationTheory.computeChiSquaredProbability(1, 3.84), 1e-14);
    assertEquals(0.8533930378696499, InformationTheory.computeChiSquaredProbability(2, 3.84), 1e-14);
    // assertEquals(0.7207323828813903, InformationTheory.computeChiSquaredProbability(3, 3.84), 1e-14);
    assertEquals(0.5719076705793775, InformationTheory.computeChiSquaredProbability(4, 3.84), 1e-14);
    assertEquals(0.9995692574594243, InformationTheory.computeChiSquaredProbability(2, 15.5), 1e-14);
}

Yeah, I need it to compute the GTest but don't want to expose it as I've not checked that it's valid for other use cases. I'd like to add a fuller stats package to Tribuo at some point, but in the meantime these functions will just appear as private things that we need for various hypothesis tests that we plan to add.
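
For reference, one way to get a chi-squared CDF that handles both odd and even degrees of freedom is the series expansion of the regularized lower incomplete gamma function, P(k/2, x/2). The sketch below is a standalone illustration under an assumed class name (ChiSquaredSketch), not the private method in InformationTheory, and it is only intended for moderate inputs: for very large values the series terms overflow. It reproduces the scipy values quoted in the test above to about 1e-9.

```java
/** Illustrative only; a series-based chi-squared CDF, not Tribuo's computeChiSquaredProbability. */
final class ChiSquaredSketch {

    /** Returns P(X <= value) for X ~ chi-squared with the given degrees of freedom. */
    static double cdf(int degreesOfFreedom, double value) {
        if (degreesOfFreedom < 1) {
            throw new IllegalArgumentException("degreesOfFreedom must be positive");
        }
        if (value <= 0.0) {
            return 0.0;
        }
        double a = degreesOfFreedom / 2.0;
        double x = value / 2.0;

        // ln Gamma(a + 1) via the recurrence Gamma(z + 1) = z * Gamma(z),
        // starting from Gamma(1) = 1 (even k) or Gamma(1/2) = sqrt(pi) (odd k).
        boolean even = degreesOfFreedom % 2 == 0;
        double logGamma = even ? 0.0 : 0.5 * Math.log(Math.PI);
        for (double z = even ? 1.0 : 0.5; z <= a; z += 1.0) {
            logGamma += Math.log(z);
        }

        // Series: P(a, x) = x^a e^{-x} / Gamma(a + 1) * sum_{n >= 0} x^n / ((a + 1)...(a + n)).
        double term = 1.0;
        double sum = 1.0;
        for (int n = 1; n < 10_000; n++) {
            term *= x / (a + n);
            sum += term;
            if (term < sum * 1e-17) {
                break;
            }
        }
        return Math.min(1.0, Math.exp(a * Math.log(x) - x - logGamma) * sum);
    }

    public static void main(String[] args) {
        // Odd and even degrees of freedom at the same observed value.
        System.out.println(cdf(1, 3.84));
        System.out.println(cdf(2, 3.84));
        System.out.println(cdf(3, 3.84));
    }
}
```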

pogren

ClusteringMetrics is a straightforward refactor that uses the same expectedMI method as before except now it is a public method in InformationTheory - added test for adjustedMI which helped uncover a discrepancy in expectedMI which was updated.
InformationTheory added test for mi and entropy
ClusteringDataGenerator and GaussianClusterDataSource are straightforward refactors assuming the new MultivariateNormalDistribution is a good/equivalent substitution for the ACM impl.
SparseLinearModel just changes a couple parameter names and adds a bit of javadoc.
DenseVector I added a unit test for meanVariance() and the new reduce method that takes a BiFunction parameter.
Gamma - added unit test for Gamma.gamma using output from scipy and various examples
DenseMatrix - added additional unit tests for LU factorization, Cholesky factorization, eigendecomposition, setColumn, and selectColumn
- requested renaming LUFactorization.l and .u to lower and upper
DenseSparseMatrix - added simple unit tests for createIdentity, createDiagonal, getColumn
Matrix interface looks fine.
The trainers all look like they have been straightforwardly refactored to use the new apis and the tests demonstrate that you can kick them into motion and they do something reasonable.

Craigacp added 17 commits May 11, 2022 09:52

Adding Cholesky factorization & Multivariate normal for sampling. Mig…

b3531d3

…rates ClusteringDataGenerator and GaussianClusterDataSource over to use it.

Roughing out LU factorization.

954c820

Implementing LU factorization and solver methods.

7b2257c

Adding a test helper for comparing top feature maps and tidied up the…

ae7689b

… dense la package.

Removing commons math from Regression/SLM. Due to some additional ref…

5026ab7

…actoring it's now a lot faster.

Stubbing out eigen decomposition.

56dce9a

Initial eigen decomposition, still buggy.

323e83c

Fixing the eigen decomposition.

60a13e5

Commons math begone!

43e24e2

Adding CholeskyFactorization.solve implementations, improving Cholesk…

accae2e

…yFactorization tests, and other small bits of tidying.

Promoting DenseMatrix.getColumn to Matrix, implementing it on DenseSp…

af83956

…arseMatrix. Adding solve and inverse methods to EigenDecomposition along with tests.

Renaming org.tribuo.math.rng to org.tribuo.math.distributions

b6d359e

Improving the docs for the la factorization methods.

8ba663b

Removing commons math from the third party licenses.

d2be6a2

Cleanups.

1a36117

Switching to Arrays.fill to zero part of a matrix.

799e7ee

Removing a dependent load.

68f1eb1

Craigacp added the "Oracle employee" label Jun 13, 2022

oracle-contributor-agreement bot added the "OCA Verified" label Jun 13, 2022

pogren reviewed Jul 22, 2022

Math/src/main/java/org/tribuo/math/la/DenseMatrix.java (outdated, resolved)

Craigacp commented Jul 22, 2022

Math/src/main/java/org/tribuo/math/la/DenseMatrix.java (resolved)

Craigacp commented Jul 22, 2022

Math/src/main/java/org/tribuo/math/la/DenseMatrix.java (resolved)

Craigacp commented Jul 22, 2022

Math/src/main/java/org/tribuo/math/la/DenseMatrix.java (outdated, resolved)

Craigacp and others added 5 commits July 22, 2022 23:13

Adding an interface for factorizations, migrating the factorizations …

c04eda5

…over to record style accessors.

initial commit of GammaTest

556134f

added testReductionBiFunction and testMeanVariance to DenseVectorTest

3cd120f

added some addition tests/assertions for cholesky and lu factorization.

af33f87

fixes issue with test matrix for cholensky factorization.

63ee5dd

pogren added 7 commits July 27, 2022 09:36

cleaning up, standardizing, filling out factorization/decomp tests

6d6dc3f

added test for createIdentity and createDiagonal

ed6aa12

added simple test for DenseSparseMatrix.getColumn

9eb0033

ClusteringMetrics.adjustedMI produces same values as sklearn

b7c0abb

fixes discrepancy in expectedMI due to incomplete iteration in loop

add delta to unit test for mi

f8b2966

comments demonstrating generating test values in numpy/scypy/sklearn

9595871

added comment showing how to generate test

10818c1

pogren reviewed Jul 28, 2022

pogren and others added 19 commits July 28, 2022 11:54

initial commit of GammaTest

742e304

added testReductionBiFunction and testMeanVariance to DenseVectorTest

afc81a6

added some addition tests/assertions for cholesky and lu factorization.

6a6e464

fixes issue with test matrix for cholensky factorization.

25c7146

adding some python code to help generate a unit test for

d1c6f9c

eigendecomposition

added printMatrixPythonFriendly to DenseMatrix

5f77d74

added unit testing for eigendecomposition, setColumn, and selectColumns

4c76b61

cleaning up, standardizing, filling out factorization/decomp tests

5a1bd8a

added test for createIdentity and createDiagonal

89addb3

added simple test for DenseSparseMatrix.getColumn

c93cdd3

ClusteringMetrics.adjustedMI produces same values as sklearn

b0e472b

fixes discrepancy in expectedMI due to incomplete iteration in loop

add delta to unit test for mi

70e51f9

comments demonstrating generating test values in numpy/scypy/sklearn

d91004d

added comment showing how to generate test

c228de1

fixes compile errors

6fabb6f

resolves merge conflict

4d2aa61

reverted adjustedMI to using 'min' approach for calculating denominator

7108a8b

left in NaN check though so tests would still pass.

Merge pull request #2 from pogren/commons-math-removal-pvo-review

560384f

Commons math removal pvo review

Fixing licensing information.

717ff81

pogren approved these changes Jul 28, 2022

Craigacp merged commit faf9179 into oracle:main Jul 28, 2022

Craigacp deleted the commons-math-removal branch July 28, 2022 22:39
