PERF: Rely on BIOM for upstream data manipulation #149
SourceTracker2 regrettably relied on a `DataFrame` transformation of a `biom.Table` early in its processing, leading to substantial resource requirements stemming from the resulting dense matrix.

This pull request changes SourceTracker2 to use `biom.Table` for upstream processing, deferring the dense transformation until the last possible point. On a large dataset, we observed a 2.33x reduction in runtime and a 5.59x reduction in memory used.

The results are qualitatively identical to "vanilla" SourceTracker2 with respect to the predicted environments.
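For readers unfamiliar with the pattern, here is a minimal sketch of "stay sparse upstream, densify last". It is not the code in this pull request. It uses a `scipy.sparse` matrix as a stand-in for the sparse matrix that backs a `biom.Table`, and the data, the `filter_sparse` helper, and the `min_depth` threshold are all hypothetical.

```python
import numpy as np
from scipy import sparse

# Hypothetical feature-by-sample count data with roughly 1% of cells nonzero.
rng = np.random.default_rng(0)
n_features, n_samples, nnz = 2000, 200, 4000
rows = rng.integers(0, n_features, size=nnz)
cols = rng.integers(0, n_samples, size=nnz)
vals = rng.integers(1, 100, size=nnz).astype(float)
counts = sparse.csr_matrix((vals, (rows, cols)), shape=(n_features, n_samples))


def filter_sparse(matrix, min_depth):
    """Drop shallow samples and empty features without densifying."""
    depths = np.asarray(matrix.sum(axis=0)).ravel()
    keep_samples = np.flatnonzero(depths >= min_depth)
    filtered = matrix[:, keep_samples]
    totals = np.asarray(filtered.sum(axis=1)).ravel()
    keep_features = np.flatnonzero(totals > 0)
    return filtered[keep_features, :]


# All upstream manipulation happens on the sparse representation.
filtered = filter_sparse(counts, min_depth=500)

# The dense conversion is deferred to the last possible point.
dense = filtered.toarray()

sparse_bytes = (filtered.data.nbytes
                + filtered.indices.nbytes
                + filtered.indptr.nbytes)
print("sparse bytes:", sparse_bytes, "dense bytes:", dense.nbytes)
```

The memory gap between the two representations grows with the fraction of zero cells, which is why an early dense conversion is costly for typical feature tables.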
Note that three tests are failing on SourceTracker2 master. Two are almost certainly related to changes in NumPy's random number generator, and I suspect they are sensitive to the seed. The third failure was an actual bug that was fixed while adjusting the unit tests.
I'm uncertain whether it is technically correct to make the inner loop sparse. The use of `alpha1` necessitates a dense matrix, as it adds a small prior to all features, even the zeroed ones. If that is not necessary, then the inner loop can be made sparse, which would yield further gains. Note, though, that the unit tests assume `alpha1` is applied everywhere, so this would require modifying some of the unit tests.

cc @wdwvt1 @rob-knight
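To make the `alpha1` question concrete, here is a small illustration of why a prior added to every cell forces a dense matrix. It is only a sketch on a hypothetical 3x3 count matrix with a made-up `alpha1` value, not SourceTracker2's actual inner loop, and it makes no claim about whether the sparse variant is statistically valid.

```python
import numpy as np
from scipy import sparse

alpha1 = 0.001  # hypothetical prior value, for illustration only

# Hypothetical counts: 3 stored values out of 9 cells.
counts = sparse.csr_matrix(np.array([[5.0, 0.0, 0.0],
                                     [0.0, 0.0, 2.0],
                                     [0.0, 3.0, 0.0]]))

# Prior applied everywhere: every cell becomes nonzero, so the result is dense.
dense_prior = counts.toarray() + alpha1

# Prior applied only to stored entries: sparsity is preserved,
# but the zero cells no longer receive the prior.
sparse_prior = counts.copy()
sparse_prior.data += alpha1

print("nonzero cells with prior everywhere:", np.count_nonzero(dense_prior))
print("stored cells with prior on nonzeros only:", sparse_prior.nnz)
```

The two results differ in exactly the cells that were zero, which is why tests that assume `alpha1` is applied everywhere would need to change if the inner loop were made sparse.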