Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: Rely on BIOM for upstream data manipulation #149

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

wasade
Copy link

@wasade wasade commented Jan 11, 2023

SourceTracker2 regrettably relied on a DataFrame transformation of a biom.Table early in its processing leading to substantial resource requirements stemming from the resulting dense matrix.

This pull request fixes SourceTracker2 to use biom.Table for upstream processing and deferring until the last point for a dense transformation. On a large data, we observed a 2.33x reduction in runtime and a 5.59x reduction in memory used.

The results are qualitatively identical to "vanilla" SourceTracker2 relative to the predicted environments.

Note that three tests are failing in SourceTracker2 master, two of which are almost certainly related to changes in NumPy's random number generator and I suspect are sensitive to the seed. The third test was an actual bug that was fixed while adjusting unit tests

I'm uncertain whether it is technically correct to make the inner loop sparse. The use of alpha1 necessitates a dense matrix as it adds a small prior to all features even the zero'd ones. If that is not necessary, then the inner loop can be made sparse which will yield further gains. Note that the unit tests assume alpha1 is applied everywhere though so it would require modifying some of the unit tests.

cc @wdwvt1 @rob-knight

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants