
Major overhaul of internals #51

Merged: 1 commit merged into caporaso-lab:master on Aug 25, 2016

Conversation

@wdwvt1 (Contributor) commented Jul 15, 2016

This PR overhauls the ST2 code to do several important things:

  1. Adds full table output - the actual contingency table that each draw from the Gibbs sampler generates is now returned by the API functions `_gibbs` and `_gibbs_loo`.
  2. Adds the `_gibbs_loo` method - this is the API for leave-one-out calls, which can now be run in parallel in the same manner as `_gibbs`.
  3. Updates internals to use pandas - much of the code dealing with, e.g., nonoverlapping metadata and feature sample sets has been replaced with simple pandas operations.

There are a variety of bug fixes as well (no more NaNs, checks for fractional count tables, etc.).
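For reviewers' orientation, here is a minimal usage sketch. The import path, exact signatures, and return-value unpacking are illustrative assumptions; only `_gibbs` and `_gibbs_loo` themselves are named in this PR.

```python
import numpy as np
import pandas as pd

# Assumed import path -- the real module layout may differ.
from sourcetracker import _gibbs, _gibbs_loo

rng = np.random.RandomState(0)
features = ['f%d' % i for i in range(5)]
sources = pd.DataFrame(rng.randint(0, 10, size=(3, 5)),
                       index=['src1', 'src2', 'src3'], columns=features)
sinks = pd.DataFrame(rng.randint(0, 10, size=(2, 5)),
                     index=['sink1', 'sink2'], columns=features)

# Standard run: mixing proportions, their standard deviations, and the
# new full per-sink feature-assignment tables (illustrative unpacking).
mixing_proportions, proportion_stds, full_tables = _gibbs(sources, sinks)

# Leave-one-out run: each source is held out in turn and treated as a sink.
loo_proportions, loo_stds, loo_tables = _gibbs_loo(sources)
```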

The test examples work and the test code passes, but this is a significant update, so if y'all could give a thorough review that would be much appreciated. @lkursell @johnchase @gregcaporaso

@gregcaporaso (Member) commented:

@wdwvt1, ok if I get to this next week? I won't have time tomorrow.

@wdwvt1 (Contributor, Author) commented Jul 15, 2016

@gregcaporaso - yeah, absolutely.

Review thread on README.md:

> prediction and leave-one-out (LOO) classification. These calls are
> ``_gibbs`` and ``_gibbs_loo``.
> * The per-sink feature assignments are recorded for every run and written to
> the output directory. They are named as X.contingency.txt where X is the
Collaborator:

X.full_results.txt would match Dan's original nomenclature from his most recent release, and is a bit more descriptive.

Member:

I agree that this would be better.

Contributor (Author):

I am happy to change this, but let me give my rationale for the current name.

X.full_results.txt doesn't suggest what the object is. The full output is a contingency table of sources X features.

@gregcaporaso (Member) commented Aug 1, 2016:
That's true. We've been moving away from "contingency table" toward "feature table". What if it was X.table.txt? If you do this, you should indicate that these are the same as the X.full_results.txt files from SourceTracker 1.

@lkursell (Collaborator) commented:
@wdwvt1 I took a pass, but we need @gregcaporaso for the main review

@gregcaporaso (Member) commented:
Sorry guys, I'll review on Monday at the latest.


Review thread on README.md:

> In brief, this script requires a feature X sample contingency table
> (traditionally an OTU table) and sample metadata (traditionally a mapping file).
> Any feature table that can be read by the `biom` package will be acceptable
Member:

Specifying a version for biom here is important (since different versions could read different files). I recommend changing this to:

> that can be read by biom-format >= 2.1.4, < 2.2.0 will be ...

Contributor (Author):

Anything above 2.1.4 should work. Changed to reflect this.

Member:

Note that backward compatibility won't be guaranteed >= 2.2.0, so be sure to include that as an upper boundary.
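For concreteness, a sketch of how those bounds might look in the project's setup.py (assuming setuptools; the actual dependency list and metadata may differ):

```python
# Hypothetical excerpt from setup.py -- shown only to illustrate the
# recommended version bounds, not the project's actual dependency list.
from setuptools import setup, find_packages

setup(
    name='sourcetracker',
    packages=find_packages(),
    install_requires=[
        # >= 2.1.4 per the comment above; < 2.2.0 because backward
        # compatibility is not guaranteed at or beyond 2.2.0.
        'biom-format >= 2.1.4, < 2.2.0',
    ],
)
```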

@wdwvt1 (Contributor, Author) commented Aug 11, 2016

@gregcaporaso @johnchase @lkursell - I have addressed all the comments and made a couple of improvements to type checking, etc.

If you want to take another pass that'd be great.

This commit removes the intermediate file writing - everything is stored in memory now. This is much easier to handle in a lot of cases, but in some of my large simulations I have been seeing segmentation faults. Can you guys please use this branch in your analyses and tell me if you are getting them?

@wdwvt1 (Contributor, Author) commented Aug 18, 2016

Simulations with a large data set revealed a heinous memory leak: 5 minutes of run time with a large enough table would end up requesting 80 GB of memory. The source of the leak appeared to be a cyclic reference in the Sampler class. In addition, storing the full table as output during gibbs_sampler runs was causing significant memory allocation problems.

To resolve these problems I eliminated the Sampler class and rewrote gibbs_sampler so that it produces only the vectors needed for calculating the proportions and the full output (but does not calculate them itself).

Benchmarks with this PR show that an ST2 run of 100 sinks, 350 sources, 8000 features, 10 restarts, 10 draws/restart, and 100 burn-ins completes in 30 minutes on 6 cores, without exceeding 1 GB of total memory use.
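To make the restructuring concrete, a minimal sketch (emphatically not the real ST2 code, which performs actual Gibbs updates over the p_tv * p_v weights): the sampler yields small per-draw vectors one at a time, and a separate helper derives the proportions, so no long-lived object holds the full table or a reference cycle.

```python
import numpy as np

def gibbs_sampler(sink, sources, draws=10, rng=None):
    """Yield one per-draw source-assignment count vector at a time."""
    rng = rng or np.random.default_rng()
    n_sources = sources.shape[0]
    for _ in range(draws):
        # Placeholder for the real Gibbs updates over p_tv * p_v.
        yield rng.multinomial(int(sink.sum()), np.ones(n_sources) / n_sources)

def proportions_from_draws(draws):
    """Average per-draw source counts into mixing proportions."""
    totals = np.sum(list(draws), axis=0, dtype=float)
    return totals / totals.sum()

# Tiny illustrative run: one sink over four (dummy) sources.
sink = np.array([5, 0, 3])
sources = np.ones((4, 3))
print(proportions_from_draws(gibbs_sampler(sink, sources)))
```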

@gregcaporaso can you please review? @johnchase and @lkursell - your review would also be appreciated if you have time.
@nscott1 - this PR will resolve the unreasonable runtime problems you were seeing.

This PR now resolves issues #55, #58, #13, #21, #52, and #41.

Review thread on the taxon sequence code:

> >>> print(ts)
> np.array([0, 0, 0, 0, 3, 3, 3, 3, 3, 4, 4])
> '''
> taxon_sequence = np.zeros(int(sink_data.sum()), dtype=np.int32)
@wasade:

The docstring indicates sink_data is int, so is the cast necessary?

Why np.int32?

Note: given the additional cast in the for loop, it may make sense to just do

sink_data = sink_data.astype(int)

...as the first thing in the method.

Member:

👍
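Putting the suggestion together, a sketch of the refactor (the enclosing function name and loop body are assumptions reconstructed from the docstring example, not the PR's exact code):

```python
import numpy as np

def generate_taxon_sequence(sink_data):  # hypothetical function name
    # Cast once up front, per the suggestion, so no per-element casts
    # are needed later.
    sink_data = sink_data.astype(int)
    taxon_sequence = np.zeros(sink_data.sum(), dtype=np.int32)
    # Expand each feature index by its count, e.g. [4, 0, 0, 5, 2] ->
    # [0, 0, 0, 0, 3, 3, 3, 3, 3, 4, 4] as in the docstring example.
    pos = 0
    for i, count in enumerate(sink_data):
        taxon_sequence[pos:pos + count] = i
        pos += count
    return taxon_sequence
```

The same expansion is also available as a single vectorized call: np.repeat(np.arange(len(sink_data)), sink_data).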

@wdwvt1 (Contributor, Author) commented Aug 21, 2016

@wasade - thanks for taking a look! The np.int32 calls were because I was worried about memory usage. The memory usage was not related to them at all, but I ended up leaving them in. I am happy to remove them if you think it's poor practice/unnecessary. The only failure case I could see is if someone had counts greater than an np.int32 can handle. In that case, ST2 is just not an appropriate tool - there would never be enough burn-ins for that to be reasonably well assigned. Perhaps that bears mentioning in the documentation/paper as we write it.

You are absolutely right about the cythonization. The vast majority of code runtime is consumed in the inner loop (primarily in updating p_tv * p_v and in drawing a value from that distribution with np.random.choice). Your email about sampling from a discrete distribution might lead to an improvement here as well.
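For reference, a sketch of the kind of discrete-distribution speedup being alluded to (illustrative only; the real inner loop operates on the joint p_tv * p_v weights): an inverse-CDF lookup on the raw cumulative weights avoids the per-draw normalization that np.random.choice requires.

```python
import numpy as np

def draw_with_choice(weights, rng):
    # Baseline: np.random.choice needs probabilities that sum to 1,
    # so the weights are re-normalized on every draw.
    return rng.choice(len(weights), p=weights / weights.sum())

def draw_with_searchsorted(weights, rng):
    # Equivalent draw via inverse-CDF: binary search on the raw
    # cumulative weights, no normalization needed.
    cdf = np.cumsum(weights)
    return int(np.searchsorted(cdf, rng.random() * cdf[-1]))

rng = np.random.default_rng(42)
weights = rng.random(8000)  # e.g. one unnormalized weight per feature
assert draw_with_searchsorted(weights, rng) in range(8000)
```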

@wasade commented Aug 21, 2016

Ah, makes sense with int32. It shouldn't be a problem to keep.

Do you have a profile to share? I'd be happy to help with some cythonizing if you want to go that route.


Review thread on README.md (@@ -157,6 +174,10 @@ rarefaction depth of 1500**):

> **Calculate the proportion of each source in each sink, using ipyparallel to run in parallel with 5 jobs**
> `sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example6/ --jobs 5`
>
> **Calculate the proportion of each source in each sink, using ipyparallel to run in parallel with 5 jobs. Write the full output.**
> `sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example6/ --jobs 5`
Member:

I think this command is missing --full_output (it's the same as the command above).

Maybe for a future PR, but could we be more descriptive than --full_output? It'd be nice if this parameter described what the output was (and I think that's more important than matching what ST1 did).

Member:

Also, can you make this -o example7? This makes it easier to run the commands back-to-back (which I do as part of testing, but which we'll ultimately want to automate in the future).

Member:

Also, I just noticed a typo in example 5 - the text should say depth of 2500. I know that wasn't part of this PR, but could you fix that when you make these other changes?

Contributor (Author):

Fixed these three. I added more description as part of a significant update of the README.md.
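For reference, the corrected full-output example presumably ends up along these lines (assembled from the three requested fixes; the exact flag spelling is whatever the CLI defines): `sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example7/ --jobs 5 --full_output`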

@gregcaporaso (Member) commented:

@wdwvt1 I spent some time on this today and it's looking reasonable. I didn't spend as much time as I did on my first pass, but if there are specific areas that you want me to look at in more detail I can do that. Just let me know what those are. I'd like to look through this a little more tomorrow - in particular I want to spend more time reviewing the test code.

Have you gone through the previous comments to make sure that you've addressed everything? It's hard to tell since so much changed since my last review.

Also, have you confirmed that all new methods and new functionality now have tests? I think it might be worth starting to compute test coverage on this repo. I started experimenting with this in another pull request.

@wdwvt1 (Contributor, Author) commented Aug 23, 2016

Thanks - I'll fix the documentation, typos, and camel case.

I have fixed the previous comments except for the mapping file loading thing/issue you opened. I haven't seen loading a mapping file give an error, and the test code in skbio doesn't have an example of a mapping file that triggers the errors the function is supposed to catch (that I saw). Is there an example of a mapping file that fails? The linked code also doesn't make sense to me - is it catching a case with leading comments?

I'm mainly interested in a review of the changes to the internal code, specifically the elimination of the Sampler class and the removal of the proportion calculation from gibbs_sampler.


@wdwvt1 (Contributor, Author) commented Aug 23, 2016

Also, yes to the addition of tests.

Coverage testing sounds good, though if I had to choose between that and more eyes on the internals of the algorithm (a comparison of Dan's sourcetracker internals to our gibbs_sampler), I'd choose the latter.


@gregcaporaso (Member) commented Aug 23, 2016

> more eyes on the internals of the algorithm

Can you link me to the implementation that you'd like me to compare against? I'm not very familiar with that code. I can do that in addition to looking at the tests.

@coveralls commented:

Coverage increased (+6.09%) to 87.912% when pulling e551825 on wdwvt1:pd_internals into 3927d14 on biota:master.

Commit (message truncated): …tprint and improve the candidate API calls. The internal containers for data are now pandas DataFrames, significantly reducing parsing/indexing code. The README.md has been updated to reflect the new API access.
@coveralls commented:

Coverage increased (+6.09%) to 87.912% when pulling 07f67e4 on wdwvt1:pd_internals into 3927d14 on biota:master.

@gregcaporaso (Member) commented:

Thanks @wdwvt1! As discussed on our call today, we can deal with the remaining items that were brought up here through the new issues that we created.

@gregcaporaso merged commit 08112ca into caporaso-lab:master on Aug 25, 2016.