Major overhaul of internals #51
@wdwvt1, ok if I get to this next week? I won't have time tomorrow.
@gregcaporaso - yeah, absolutely.
prediction and leave-one-out (LOO) classification. These calls are
``_gibbs`` and ``_gibbs_loo``.
* The per-sink feature assignments are recorded for every run and written to
the output directory. They are named as X.contingency.txt where X is the
`X.full_results.txt` would match Dan's original nomenclature from his most recent release, and is a bit more descriptive.
I agree that this would be better.
I am happy to change this, but let me give my rationale: `X.full_results.txt` doesn't suggest what the object is. The full output is a contingency table of sources X features.
That's true. We've been moving away from "contingency table" toward "feature table". What if it was `X.table.txt`? If you do this, you should indicate that these are the same as the `X.full_results.txt` files from SourceTracker 1.
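For concreteness, here is a hypothetical sketch of the naming scheme under discussion: one sources x features table per sink, written as `<sink>.table.txt` (the SourceTracker 1 equivalent being `<sink>.full_results.txt`). All names below are assumptions for illustration, not the PR's actual code.

```python
import os

def write_full_table(sink_id, table, output_dir):
    # `table` is a sources x features pandas DataFrame for a single sink.
    # Written as "<sink>.table.txt", equivalent to SourceTracker 1's
    # "<sink>.full_results.txt".
    table.to_csv(os.path.join(output_dir, '%s.table.txt' % sink_id), sep='\t')
```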
@wdwvt1 I took a pass, but we need @gregcaporaso for the main review.
Sorry guys, I'll review on Monday at the latest.
In brief, this script requires a feature X sample contingency table
(traditionally an OTU table) and sample metadata (traditionally a mapping file).
Any feature table that can be read by the `biom` package will be acceptable
Specifying a version for biom here is important (since different versions could read different files). I recommend changing this to: "...that can be read by `biom-format >= 2.1.4, < 2.2.0` will be ..."
Anything above 2.1.4 should work. Changed to reflect this.
Note that backward compatibility won't be guaranteed for versions >= 2.2.0, so be sure to include that as an upper boundary.
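For illustration, a minimal sketch of how such a pin might be expressed in a `setup.py`; the surrounding arguments are placeholders, not this project's actual packaging file.

```python
# Hypothetical packaging sketch: pinning biom-format to the agreed range.
from setuptools import setup

setup(
    name='sourcetracker',           # placeholder project metadata
    version='2.0.0.dev0',           # placeholder
    install_requires=['biom-format >= 2.1.4, < 2.2.0'],
)
```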
@gregcaporaso @johnchase @lkursell If you want to take another pass that'd be great. This commit removes the intermediate file writing - everything is stored in memory now. This is much easier to handle in a lot of cases, but in some of my large simulations I have been seeing segmentation faults. Can you guys please use this branch in your analyses and tell me if you are getting them?
Simulations with a large data set revealed a heinous memory leak: 5 minutes of run time with a large enough table would end up requesting 80 GB of memory. The source of the memory leak appeared to be a cyclic reference in the … To resolve these problems I eliminated the …

Benchmarks with this PR show that you can now do an ST2 run of 100 sinks, 350 sources, 8000 features, 10 restarts, 10 draws/restart, and 100 burnins in 30 minutes when using 6 cores, without exceeding 1 GB of total memory use.

@gregcaporaso can you please review? @johnchase and @lkursell - your review would also be appreciated if you have time.
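As a side note, a generic way to hunt for cyclic references like the one suspected here is Python's `gc` debug mode. This is a minimal sketch using only the standard library, not code from the PR:

```python
# Cycle-hunting sketch: keep otherwise-unreachable objects around so they
# can be inspected after a collection pass.
import gc

gc.set_debug(gc.DEBUG_SAVEALL)  # unreachable objects land in gc.garbage

# ... run a small Gibbs job here ...

gc.collect()
for obj in gc.garbage[:10]:     # inspect a few of the uncollected objects
    print(type(obj), repr(obj)[:80])
```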
>>> print(ts)
np.array([0, 0, 0, 0, 3, 3, 3, 3, 3, 4, 4])
'''
taxon_sequence = np.zeros(int(sink_data.sum()), dtype=np.int32)
The docstring indicates `sink_data` is int, so is the cast necessary? Why `np.int32`? Note, given the additional cast in the for loop, it may make sense to just do `sink_data = sink_data.astype(int)` as the first thing in the method.
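For concreteness, a hedged sketch of what that refactor could look like; the function name and exact semantics are inferred from the docstring example above, not copied from the PR:

```python
import numpy as np

def generate_taxon_sequence(sink_data):
    '''Expand per-feature counts into a flat sequence of feature indices.'''
    sink_data = sink_data.astype(int)  # single up-front cast, as suggested
    # Repeat each feature index by its count, e.g. counts [4, 0, 0, 5, 2]
    # expand to [0, 0, 0, 0, 3, 3, 3, 3, 3, 4, 4], matching the docstring.
    return np.repeat(np.arange(sink_data.size), sink_data).astype(np.int32)
```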
👍
@wasade - thanks for taking a look! The … You are absolutely right about the cythonization. The vast majority of code runtime is consumed in the inner loop (primarily in updating the …)
Ah, makes sense with int32. It shouldn't be a problem to keep. Do you have a profile to share? I'd be happy to help with some cythonizing.
@@ -157,6 +174,10 @@ rarefaction depth of 1500**
**Calculate the proportion of each source in each sink, using ipyparallel to run in parallel with 5 jobs**
`sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example6/ --jobs 5`

**Calculate the proportion of each source in each sink, using ipyparallel to run in parallel with 5 jobs. Write the full output.**
`sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example6/ --jobs 5`
I think this command is missing `--full_output` (it's the same as the command above). Maybe for a future PR, but could we be more descriptive than `--full_output`? It'd be nice if this parameter described what the output was (and I think that's more important than matching what ST1 did).
Also, can you make this `-o example7/`? This makes it easier to run the commands back-to-back (which I do as part of testing, but which we'll ultimately want to automate in the future).
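Taken together with the missing flag noted above, the corrected example would presumably read:

`sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example7/ --jobs 5 --full_output`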
Also, I just noticed a typo in example 5 - the text should say depth of 2500. I know that wasn't part of this PR, but could you fix that when you make these other changes?
Fixed these three. I added more description as part of a significant update of the README.md.
@wdwvt1 I spent some time on this today and it's looking reasonable. I didn't spend as much time as I did on my first pass, but if there are specific areas that you want me to look at in more detail I can do that. Just let me know what those are. I'd like to look through this a little more tomorrow - in particular I want to spend more time reviewing the test code. Have you gone through the previous comments to make sure that you've addressed everything? It's hard to tell since so much changed since my last review. Also, have you confirmed that all new methods and new functionality now have tests? I think it might be worth starting to compute test coverage on this repo. I started experimenting with this in another pull request.
Thanks - I'll fix the documentation, typos, and camel case. Have fixed previous comments except for the mapping file loading. I'm mainly interested in a review of the changes to the internal code.
Also yes to addition of tests. Coverage testing sounds good, though I think if I had to choose between …
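For reference, a generic way to compute coverage locally (not necessarily how it ended up being wired into this repo) is:

`coverage run -m unittest discover && coverage report`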
Can you link me to the implementation that you'd like me to compare against? I'm not very familiar with that code. I can do that in addition to looking at the tests.
…tprint and improve the candidate API calls. The internal containers for data are now pandas DataFrames, significantly reducing parsing/indexing code. The README.md has been updated to reflect the new API access.
Thanks @wdwvt1! As discussed on our call today, we can deal with the remaining items that were brought up here through the new issues that we created.
This PR overhauls the ST2 code to do several important things:

* The two main API calls are now `_gibbs` and `_gibbs_loo`.
* The `_gibbs_loo` method is the API for leave-one-out calls, and it can now be run in parallel in the same manner that `_gibbs` can.
* The internal containers for data are now `pandas` DataFrames - much of the code dealing with e.g. nonoverlapping metadata and feature sample sets has been moved to simple `pandas` operations (see the sketch after this list).

There are a variety of bug fixes as well (no more NaNs, checks for fractional count tables, etc.)
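As an illustration of the kind of `pandas` operation the last bullet refers to, here is a minimal hedged sketch; the variable and function names are assumptions, not the PR's actual code:

```python
import pandas as pd

def align_samples(feature_table, metadata):
    '''Drop samples that do not appear in both the table and the metadata.'''
    # feature_table: samples x features DataFrame
    # metadata: samples x metadata-fields DataFrame
    shared = feature_table.index.intersection(metadata.index)
    return feature_table.loc[shared], metadata.loc[shared]
```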
The test examples work and the test code passes, but this is a significant update, so if y'all could give a thorough review that would be much appreciated. @lkursell @johnchase @gregcaporaso