
Major overhaul of internals #51

Merged: 1 commit merged into caporaso-lab:master on Aug 25, 2016

Conversation

@wdwvt1 (Contributor) commented Jul 15, 2016

This PR overhauls the ST2 code to do several important things:

  1. Adds full table output - the actual contingency table that each draw from the Gibbs sampler generates is now returned by the API functions `_gibbs` and `_gibbs_loo`.
  2. Adds the `_gibbs_loo` method - this is the API for leave-one-out calls, which can now be run in parallel in the same manner as `_gibbs`.
  3. Updates internals to use pandas - much of the code dealing with, e.g., nonoverlapping metadata and feature sample sets has been replaced with simple pandas operations.

There are a variety of bug fixes as well (no more NaNs, checks for fractional count tables, etc.).
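For reviewers' orientation, here is a minimal usage sketch. The import path, exact signatures, and return-value unpacking are illustrative assumptions; only `_gibbs` and `_gibbs_loo` themselves are named in this PR.

```python
import numpy as np
import pandas as pd

# Assumed import path -- the real module layout may differ.
from sourcetracker import _gibbs, _gibbs_loo

rng = np.random.RandomState(0)
features = ['f%d' % i for i in range(5)]
sources = pd.DataFrame(rng.randint(0, 10, size=(3, 5)),
                       index=['src1', 'src2', 'src3'], columns=features)
sinks = pd.DataFrame(rng.randint(0, 10, size=(2, 5)),
                     index=['sink1', 'sink2'], columns=features)

# Standard run: mixing proportions, their standard deviations, and the
# new full per-sink feature-assignment tables (illustrative unpacking).
mixing_proportions, proportion_stds, full_tables = _gibbs(sources, sinks)

# Leave-one-out run: each source is held out in turn and treated as a sink.
loo_proportions, loo_stds, loo_tables = _gibbs_loo(sources)
```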

The test examples work and the test code passes, but this is a significant update, so if y'all could give a thorough review that would be much appreciated. @lkursell @johnchase @gregcaporaso

@gregcaporaso (Member) commented:

@wdwvt1, ok if I get to this next week? I won't have time tomorrow.

@wdwvt1 (Contributor, Author) commented Jul 15, 2016

@gregcaporaso - yeah, absolutely.

Review thread on README.md:

> prediction and leave-one-out (LOO) classification. These calls are
> ``_gibbs`` and ``_gibbs_loo``.
> * The per-sink feature assignments are recorded for every run and written to
> the output directory. They are named as X.contingency.txt where X is the
Collaborator:

X.full_results.txt would match Dan's original nomenclature from his most recent release, and is a bit more descriptive.

Member:

I agree that this would be better.

Contributor (Author):

I am happy to change this, but let me give my rationale for the current name.

X.full_results.txt doesn't suggest what the object is. The full output is a contingency table of sources X features.

@gregcaporaso (Member) commented Aug 1, 2016:
That's true. We've been moving away from "contingency table" toward "feature table". What if it was X.table.txt? If you do this, you should indicate that these are the same as the X.full_results.txt files from SourceTracker 1.

@lkursell (Collaborator) commented:
@wdwvt1 I took a pass, but we need @gregcaporaso for the main review

@gregcaporaso (Member) commented:
Sorry guys, I'll review on Monday at the latest.


Review thread on README.md:

> In brief, this script requires a feature X sample contingency table
> (traditionally an OTU table) and sample metadata (traditionally a mapping file).
> Any feature table that can be read by the `biom` package will be acceptable
Member:

Specifying a version for biom here is important (since different versions could read different files). I recommend changing this to:

> that can be read by biom-format >= 2.1.4, < 2.2.0 will be ...

Contributor (Author):

Anything above 2.1.4 should work. Changed to reflect this.

Member:

Note that backward compatibility won't be guaranteed >= 2.2.0, so be sure to include that as an upper boundary.
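For concreteness, a sketch of how those bounds might look in the project's setup.py (assuming setuptools; the actual dependency list and metadata may differ):

```python
# Hypothetical excerpt from setup.py -- shown only to illustrate the
# recommended version bounds, not the project's actual dependency list.
from setuptools import setup, find_packages

setup(
    name='sourcetracker',
    packages=find_packages(),
    install_requires=[
        # >= 2.1.4 per the comment above; < 2.2.0 because backward
        # compatibility is not guaranteed at or beyond 2.2.0.
        'biom-format >= 2.1.4, < 2.2.0',
    ],
)
```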

@wdwvt1 (Contributor, Author) commented Aug 11, 2016

@gregcaporaso @johnchase @lkursell - I have addressed all the comments and made a couple of improvements to type checking, etc.

If you want to take another pass that'd be great.

This commit removes the intermediate file writing - everything is stored in memory now. This is much easier to handle in a lot of cases, but in some of my large simulations I have been seeing segmentation faults. Can you guys please use this branch in your analyses and tell me if you are getting them?

@wdwvt1 (Contributor, Author) commented Aug 18, 2016

Simulations with a large data set revealed a heinous memory leak: 5 minutes of run time with a large enough table would end up requesting 80 GB of memory. The source of the leak appeared to be a cyclic reference in the Sampler class. In addition, storing the full table as output during gibbs_sampler runs was causing significant memory allocation problems.

To resolve these problems I eliminated the Sampler class and rewrote gibbs_sampler so that it produces only the vectors needed for calculating the proportions and the full output (but does not calculate them itself).

Benchmarks with this PR show that an ST2 run of 100 sinks, 350 sources, 8000 features, 10 restarts, 10 draws/restart, and 100 burn-ins completes in 30 minutes on 6 cores, without exceeding 1 GB of total memory use.
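To make the restructuring concrete, a minimal sketch (emphatically not the real ST2 code, which performs actual Gibbs updates over the p_tv * p_v weights): the sampler yields small per-draw vectors one at a time, and a separate helper derives the proportions, so no long-lived object holds the full table or a reference cycle.

```python
import numpy as np

def gibbs_sampler(sink, sources, draws=10, rng=None):
    """Yield one per-draw source-assignment count vector at a time."""
    rng = rng or np.random.default_rng()
    n_sources = sources.shape[0]
    for _ in range(draws):
        # Placeholder for the real Gibbs updates over p_tv * p_v.
        yield rng.multinomial(int(sink.sum()), np.ones(n_sources) / n_sources)

def proportions_from_draws(draws):
    """Average per-draw source counts into mixing proportions."""
    totals = np.sum(list(draws), axis=0, dtype=float)
    return totals / totals.sum()

# Tiny illustrative run: one sink over four (dummy) sources.
sink = np.array([5, 0, 3])
sources = np.ones((4, 3))
print(proportions_from_draws(gibbs_sampler(sink, sources)))
```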

@gregcaporaso can you please review? @johnchase and @lkursell - your review would also be appreciated if you have time.
@nscott1 - this PR will resolve the unreasonable runtime problems you were seeing.

This PR now resolves issues #55, #58, #13, #21, #52, and #41.

Review thread on the taxon sequence code:

> >>> print(ts)
> np.array([0, 0, 0, 0, 3, 3, 3, 3, 3, 4, 4])
> '''
> taxon_sequence = np.zeros(int(sink_data.sum()), dtype=np.int32)
@wasade:

The docstring indicates sink_data is int, so is the cast necessary?

Why np.int32?

Note: given the additional cast in the for loop, it may make sense to just do

sink_data = sink_data.astype(int)

...as the first thing in the method.

Member:

👍
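Putting the suggestion together, a sketch of the refactor (the enclosing function name and loop body are assumptions reconstructed from the docstring example, not the PR's exact code):

```python
import numpy as np

def generate_taxon_sequence(sink_data):  # hypothetical function name
    # Cast once up front, per the suggestion, so no per-element casts
    # are needed later.
    sink_data = sink_data.astype(int)
    taxon_sequence = np.zeros(sink_data.sum(), dtype=np.int32)
    # Expand each feature index by its count, e.g. [4, 0, 0, 5, 2] ->
    # [0, 0, 0, 0, 3, 3, 3, 3, 3, 4, 4] as in the docstring example.
    pos = 0
    for i, count in enumerate(sink_data):
        taxon_sequence[pos:pos + count] = i
        pos += count
    return taxon_sequence
```

The same expansion is also available as a single vectorized call: np.repeat(np.arange(len(sink_data)), sink_data).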

@wdwvt1 (Contributor, Author) commented Aug 21, 2016

@wasade - thanks for taking a look! The np.int32 calls were because I was worried about memory usage. The memory usage was not related to them at all, but I ended up leaving them in. I am happy to remove them if you think it's poor practice/unnecessary. The only failure case I could see is if someone had counts greater than an np.int32 can handle. In that case, ST2 is just not an appropriate tool - there would never be enough burn-ins for that to be reasonably well assigned. Perhaps that bears mentioning in the documentation/paper as we write it.

You are absolutely right about the cythonization. The vast majority of code runtime is consumed in the inner loop (primarily in updating p_tv * p_v and in drawing a value from that distribution with np.random.choice). Your email about sampling from a discrete distribution might lead to an improvement here as well.
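For reference, a sketch of the kind of discrete-distribution speedup being alluded to (illustrative only; the real inner loop operates on the joint p_tv * p_v weights): an inverse-CDF lookup on the raw cumulative weights avoids the per-draw normalization that np.random.choice requires.

```python
import numpy as np

def draw_with_choice(weights, rng):
    # Baseline: np.random.choice needs probabilities that sum to 1,
    # so the weights are re-normalized on every draw.
    return rng.choice(len(weights), p=weights / weights.sum())

def draw_with_searchsorted(weights, rng):
    # Equivalent draw via inverse-CDF: binary search on the raw
    # cumulative weights, no normalization needed.
    cdf = np.cumsum(weights)
    return int(np.searchsorted(cdf, rng.random() * cdf[-1]))

rng = np.random.default_rng(42)
weights = rng.random(8000)  # e.g. one unnormalized weight per feature
assert draw_with_searchsorted(weights, rng) in range(8000)
```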

@wasade commented Aug 21, 2016

Ah, makes sense with int32. It shouldn't be a problem to keep.

Do you have a profile to share? I'd be happy to help with some cythonizing if you want to go that route.


Review thread on README.md (@@ -157,6 +174,10 @@ rarefaction depth of 1500**):

> **Calculate the proportion of each source in each sink, using ipyparallel to run in parallel with 5 jobs**
> `sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example6/ --jobs 5`
>
> **Calculate the proportion of each source in each sink, using ipyparallel to run in parallel with 5 jobs. Write the full output.**
> `sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example6/ --jobs 5`
Member:

I think this command is missing --full_output (it's the same as the command above).

Maybe for a future PR, but could we be more descriptive than --full_output? It'd be nice if this parameter described what the output was (and I think that's more important than matching what ST1 did).

Member:

Also, can you make this -o example7? This makes it easier to run the commands back-to-back (which I do as part of testing, but which we'll ultimately want to automate in the future).

Member:

Also, I just noticed a typo in example 5 - the text should say depth of 2500. I know that wasn't part of this PR, but could you fix that when you make these other changes?

Contributor (Author):

Fixed these three. I added more description as part of a significant update of the README.md.
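For reference, the corrected full-output example presumably ends up along these lines (assembled from the three requested fixes; the exact flag spelling is whatever the CLI defines): `sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example7/ --jobs 5 --full_output`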

@gregcaporaso (Member) commented:

@wdwvt1 I spent some time on this today and it's looking reasonable. I didn't spend as much time as I did on my first pass, but if there are specific areas that you want me to look at in more detail I can do that. Just let me know what those are. I'd like to look through this a little more tomorrow - in particular I want to spend more time reviewing the test code.

Have you gone through the previous comments to make sure that you've addressed everything? It's hard to tell since so much changed since my last review.

Also, have you confirmed that all new methods and new functionality now have tests? I think it might be worth starting to compute test coverage on this repo. I started experimenting with this in another pull request.

@wdwvt1 (Contributor, Author) commented Aug 23, 2016

Thanks - I'll fix the documentation, typos, and camel case.

I have fixed the previous comments except for the mapping file loading thing/issue you opened. I haven't seen loading a mapping file give an error, and the test code in skbio doesn't have an example of a mapping file that triggers the errors the function is supposed to catch (that I saw). Is there an example of a mapping file that fails? The linked code also doesn't make sense to me - is it catching a case with leading comments?

I'm mainly interested in a review of the changes to the internal code, specifically the elimination of the Sampler class and the removal of the proportion calculation from gibbs_sampler.


@wdwvt1 (Contributor, Author) commented Aug 23, 2016

Also, yes to the addition of tests.

Coverage testing sounds good, though if I had to choose between that and more eyes on the internals of the algorithm (a comparison of Dan's sourcetracker internals to our gibbs_sampler), I'd choose the latter.


@gregcaporaso (Member) commented Aug 23, 2016

> more eyes on the internals of the algorithm

Can you link me to the implementation that you'd like me to compare against? I'm not very familiar with that code. I can do that in addition to looking at the tests.

@coveralls commented:

Coverage increased (+6.09%) to 87.912% when pulling e551825 on wdwvt1:pd_internals into 3927d14 on biota:master.

Commit (message truncated): …tprint and improve the candidate API calls. The internal containers for data are now pandas DataFrames, significantly reducing parsing/indexing code. The README.md has been updated to reflect the new API access.
@coveralls commented:

Coverage increased (+6.09%) to 87.912% when pulling 07f67e4 on wdwvt1:pd_internals into 3927d14 on biota:master.

@gregcaporaso (Member) commented:

Thanks @wdwvt1! As discussed on our call today, we can deal with the remaining items that were brought up here through the new issues that we created.

@gregcaporaso merged commit 08112ca into caporaso-lab:master on Aug 25, 2016.