100 most wanted list #23

Open
gregcaporaso opened this issue Aug 6, 2012 · 31 comments
@gregcaporaso
Contributor

Identify the OTUs that are abundant across many environment types and distant from sequences in Greengenes/NCBI. We'll have to develop a sorting scheme for this, but it would be a way to provide a list of the "most wanted" OTUs: the high-abundance cosmopolitan organisms that are not well characterized.

@ghost ghost assigned jairideout Aug 6, 2012
@jairideout
Member

Greg and I discussed this and decided on a sorting scheme. The most wanted list will only include "new" OTUs (i.e. ones that were created de novo, not from greengenes).

Sorting priorities:

  1. Sort by the number of environments the OTU is found in.
  2. Sort by the total count across all environments.
  3. Sort by % dissimilarity to greengenes.
  4. Sort by % dissimilarity to NCBI nr database.
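A minimal sketch of the four sorting priorities above as a single multi-key sort. The `OtuSummary` class, its field names, and the example values are hypothetical, and the sketch assumes every key is sorted largest-first, which the list does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class OtuSummary:
    # Hypothetical record; field names are illustrative only.
    otu_id: str
    num_environments: int      # environments in which the OTU was observed
    total_count: int           # summed count across all environments
    gg_dissimilarity: float    # % dissimilarity to the closest Greengenes sequence
    ncbi_dissimilarity: float  # % dissimilarity to the closest NCBI sequence


def sort_most_wanted(otus):
    """Sort by the four priorities in order, largest value first for each."""
    return sorted(
        otus,
        key=lambda o: (o.num_environments, o.total_count,
                       o.gg_dissimilarity, o.ncbi_dissimilarity),
        reverse=True,
    )


# Made-up example values, not real EMP data.
otus = [
    OtuSummary("New.0", 3, 900, 5.0, 4.0),
    OtuSummary("New.1", 5, 120, 22.0, 10.0),
    OtuSummary("New.2", 5, 120, 25.0, 3.0),
    OtuSummary("New.3", 5, 400, 1.0, 1.0),
]
ranked_ids = [o.otu_id for o in sort_most_wanted(otus)]
```

Because the tuple is compared left to right, the dissimilarity columns only break ties between OTUs that share the same environment count and total count.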

Output should include a tab-separated table containing the sorted most wanted OTU IDs, sequence, greengenes assigned taxonomy, and NCBI closest sequence link.

Additional output should be an HTML table (for easy integration into the EMP website) that contains the information above plus a piechart showing the abundance of the OTU in each environment.
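The tab-separated output described above could be written roughly as follows. This is only a sketch: the column header names, the dictionary keys, and the placeholder link are assumptions, not the format the actual script produces.

```python
import csv
import io


def write_most_wanted_tsv(rows, out):
    """Write a header line plus one tab-separated line per OTU to `out`."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # Hypothetical header names for the four requested columns.
    writer.writerow(["#OTU ID", "Sequence", "Greengenes taxonomy",
                     "NCBI closest sequence link"])
    for row in rows:
        writer.writerow([row["otu_id"], row["sequence"],
                         row["gg_taxonomy"], row["ncbi_link"]])


# Made-up example row; the link is a placeholder, not a real NCBI URL.
rows = [{
    "otu_id": "New.1",
    "sequence": "ACGT",
    "gg_taxonomy": "k__Bacteria",
    "ncbi_link": "https://example.org/hit1",
}]
buf = io.StringIO()
write_most_wanted_tsv(rows, buf)
tsv_text = buf.getvalue()
```

Rows are expected to arrive already sorted, so the writer preserves whatever order it is given.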

@jairideout
Member

It looks like this approach probably won't work because the top N results will just be abundant OTUs found in many environments that will (most likely) be very similar to either gg or the nt database. The alternative is to first sort by % dissimilarity to the databases, but this could be really expensive (i.e. take a long time to complete, time we don't have).

A filter-based approach (as opposed to multiple levels of sorting) will probably work better. Greg previously did something similar and it seemed to work okay (though some of the parameters may need to be played with to get a good list).

  1. Filter to only include novel OTUs.
  2. Filter to only include high abundance OTUs, within a specified range (i.e. 100 < OTU count < 500).
  3. Filter to only include OTUs that are in at least N environments/sample types.
  4. Filter to only include OTUs that are at least some percent dissimilar from Greengenes (e.g. 20% dissimilar).
  5. BLAST the rest against nt and sort by % dissimilarity.
  6. Pick the top N from those.

We'll see how this works...
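As a rough illustration of filter steps 2 and 3 above, the sketch below works on per-environment counts for OTUs that are assumed to have already passed the novelty filter (step 1). The function name, the input layout, and the example counts are hypothetical; the thresholds are passed in because the list leaves them open.

```python
def filter_candidates(env_counts, min_count, max_count, min_environments):
    """Keep OTUs whose total count lies strictly between min_count and
    max_count and that occur in at least min_environments environments.

    env_counts maps OTU ID -> {environment name: count in that environment}.
    """
    kept = []
    for otu_id, counts in env_counts.items():
        total = sum(counts.values())
        # An OTU is "in" an environment if it has a nonzero count there.
        num_envs = sum(1 for c in counts.values() if c > 0)
        if min_count < total < max_count and num_envs >= min_environments:
            kept.append(otu_id)
    return kept


# Made-up counts for three novel OTUs.
env_counts = {
    "New.0": {"soil": 300, "ocean": 0, "gut": 0},      # only one environment
    "New.1": {"soil": 50, "ocean": 60, "gut": 40},     # total 150, three environments
    "New.2": {"soil": 400, "ocean": 300, "gut": 200},  # total 900, above the range
}
kept = filter_candidates(env_counts, min_count=100, max_count=500,
                         min_environments=2)
```

Running the cheap count-based filters first keeps the list that has to be compared against Greengenes and BLASTed against nt as small as possible.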

@rob-knight

We could just look at the ones that were new clusters (i.e. don't have gg ids because they failed ref picking), right?


@jairideout
Member

Yes, that will be the first step in the process, but I think we'll need to do additional filtering (steps 2-5) to get a good list, because many of these novel OTUs might be very similar to either gg seqs or nt seqs.

@gregcaporaso
Contributor Author

@meganap, would you be able to help @jrrideout with some css magic to make the html table that he's putting together for this look a little nicer?

@meganap
Contributor

meganap commented Aug 14, 2012

sure no prob

@jairideout
Member

@meganap awesome, thanks! I'm finishing up some changes tonight and will have the table in the repo sometime tomorrow. Will let you know when it is ready.

@gregcaporaso
Contributor Author

Once @meganap takes a crack at it, it'd be best to include her css in the
html generation code for future runs.

@jairideout
Member

@meganap, the table is in the repo now under isme14/most_wanted_otus/most_wanted_otus.html. To view it, open it up in a web browser (I've tried out Chrome and Firefox) and it should find all of the other files it needs (they are all under that same directory).

I tried to keep styling to a minimum. The table has the id 'most_wanted_otus_table' and each of the subtables for the piechart legends has the class 'most_wanted_otus_legend'. If there's anything else I can do from my end to make this HTML easier to style, please let me know.

I think the goal was to add this table to one of the EMP webpages. Thus, I'm not sure if we should directly add the CSS to the table-generating code as @gregcaporaso suggested, because it may be better to just use the EMP CSS stylesheets that are already in use on the website. You may need to get in touch with @douginator2000 to get access to those if you don't have them already. If we go this route, the table-generating code will be able to create generic tables which can then be styled according to whatever website scheme it might be dropped into (thinking of additional uses for this table besides the EMP website).

Thanks again for your help with this, and please let me know if you come across any issues.

@gregcaporaso, this most wanted table does not include OTU tables 1288, 933, and 550 because they were too big to filter on an m2.4xlarge EC2 instance. You mentioned offline that there might be a way to get access to a node with more memory (>69GB). Do you still want to go this route, or just use the table that we have?

@jairideout
Member

@meganap, I forgot to mention that the second column in the HTML table needs to keep its contents formatted as-is (I'm using pre tags currently, maybe there is a better way to do this though). We just need to keep it formatted with fixed-width font and have those linebreaks respected.
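For reference, here is a minimal sketch of how a generic, unstyled table fragment with these hooks might be generated. Only the id 'most_wanted_otus_table', the class 'most_wanted_otus_legend', and the use of pre tags come from the comments above; the function name, the column layout, and the assumption that the second column holds the sequence are hypothetical, and the piechart images are left out.

```python
from html import escape


def format_table_fragment(rows):
    """Return an HTML <table> fragment with no <html>/<head>/<body> wrapper,
    so it can be dropped into an existing page and styled by that page's CSS.

    Each row is (otu_id, second_column_text, legend), where legend is a list
    of (environment, percent) pairs.
    """
    lines = ['<table id="most_wanted_otus_table">']
    for otu_id, second_column_text, legend in rows:
        legend_rows = "".join(
            "<tr><td>%s</td><td>%.1f%%</td></tr>" % (escape(env), pct)
            for env, pct in legend
        )
        lines.append(
            # <pre> keeps the fixed-width font and respects line breaks.
            "<tr><td>%s</td><td><pre>%s</pre></td>"
            '<td><table class="most_wanted_otus_legend">%s</table></td></tr>'
            % (escape(otu_id), escape(second_column_text), legend_rows)
        )
    lines.append("</table>")
    return "\n".join(lines)


# Made-up example row.
fragment = format_table_fragment(
    [("New.1", "ACGTACGT\nACGT", [("soil", 75.0), ("ocean", 25.0)])]
)
```

A CSS rule such as `white-space: pre; font-family: monospace;` on the cell would be one alternative to the pre tags, at the cost of depending on the host page's stylesheet.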

@meganap
Contributor

meganap commented Aug 14, 2012

@jrrideout cool, I'll take a crack at this tomorrow

@gregcaporaso
Contributor Author

Thanks guys!

> @gregcaporaso, this most wanted table does not include OTU tables 1288, 933,
> and 550 because they were too big to filter on an m2.4xlarge EC2 instance.

I think we just have to go with this for right now, but for the paper we'll get this running on a system with more memory.

@gregcaporaso
Contributor Author

@douginator2000, when this is ready could you add another collapsible section on the EMP login page (same place as the summary statistics, etc)?

@meganap
Contributor

meganap commented Aug 17, 2012

@gregcaporaso @jrrideout Sorry I didn't get a chance to work on this yet since I was working on figures for other isme stuff, but is there still time for this?

@jairideout
Member

I think there is (though I'm not 100% sure what the deadline was for this). I'm available to help however I can from my end of things.

@rob-knight

Yes still useful, deadline sunday


@meganap
Contributor

meganap commented Aug 17, 2012

hey @jrrideout I noticed that there aren't any html headers for the file and that it just starts off with divs. Is there a reason for this? Adding css styling is only possible if we have html headers.

@jairideout
Member

@meganap, @gregcaporaso requested that I only output the HTML table so that it could be easily dropped into a webpage. Please feel free to modify/add to the HTML as needed to style it (this table will ultimately need to be added to the EMP login page).

@meganap
Contributor

meganap commented Aug 17, 2012

@jrrideout I've edited the script that writes the html so it writes some stuff in a different way, can you send me the full command you used to run that script (like where the test files are?) so that I can rerun it?

@jairideout
Member

@meganap I'll have to rerun it because it requires the entire nt database, and everything is already set up for this in an EC2 instance. Can you please update the accompanying unit tests and check in your changes? Once they're in, I'll rerun it and commit the latest results to the repo. It won't take long to run.

@jairideout
Member

@meganap The changes are in; please let me know if you run into any issues.

@jairideout
Member

@douginator2000 this is all ready to go. All relevant files are under isme14/most_wanted_otus/. The only file that you can exclude from there is 'analysis_notes.txt'. Thanks!

@meganap thanks for your help in spicing up the table- it looks really good!

@gregcaporaso
Contributor Author

Hey guys,
This is awesome, thanks! Doug, could you get this accessible via the EMP
site?

In the meantime I posted here to make it easier for everyone else to see:
https://dl.dropbox.com/u/2868868/most_wanted_otus/most_wanted_otus.html

One thing we'll want to do is include the number of samples for each of the
metadata categories in addition to the percentage, but I think that can
wait. (Thanks for the suggestion Daniel!)

Greg

@rob-knight

Yes, this is spectacular -- thanks for putting it together! Could we get a tree showing where in the phylogeny the 100 most wanted are?


@gilbertjack

Am I right to think that the criteria for this are those that @jrrideout came up with:

  1. Filter to only include novel OTUs.
  2. Filter to only include high abundance OTUs, within a specified range (i.e. 100 < OTU count < 500).
  3. Filter to only include OTUs that are in at least N environments/sample types.
  4. Filter to only include OTUs that are at least some percent dissimilar from Greengenes (e.g. 20% dissimilar).
  5. BLAST the rest against nt and sort by % dissimilarity.

@gregcaporaso
Contributor Author

Yes, that's right. @jrrideout, correct us if we're wrong here.

@gilbertjack

ok but what were the N's for these two filters:
3) Filter to only include OTUs that are in at least N environments/sample types.

4) Filter to only include OTUs that are at least some percent dissimilar from Greengenes (e.g. 20% dissimilar).

@jairideout
Member

@gilbertjack The steps 1-5 listed above are what I used. Here are the parameters I ended up using:

  1. filtered against gg 97, keeping only the novel OTUs
  2. abundance: 100 < OTU count < 500
  3. at least 4 environments
  4. included only OTUs that were at least 20% dissimilar (according to uclust) from gg 97
  5. only included OTUs that were 97% similar or less compared to the NCBI nt database (according to blastall)

So we only ended up with 45 OTUs that were left over after all of that filtering. Please let me know if you have any additional questions regarding how this list was generated.
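To make the thresholds concrete, the sketch below encodes parameters 2-5 as constants and checks a single OTU against them. The constant and function names are hypothetical, the inputs are assumed to be percent similarities already computed by uclust and BLAST, and dissimilarity is taken to be 100 minus percent similarity.

```python
# Thresholds as reported in the comment above; the names are hypothetical.
MIN_COUNT = 100
MAX_COUNT = 500
MIN_ENVIRONMENTS = 4
MIN_GG_DISSIMILARITY = 20.0  # percent, relative to gg 97
MAX_NT_SIMILARITY = 97.0     # percent, relative to the NCBI nt database


def passes_all_filters(total_count, num_environments,
                       gg_similarity, nt_similarity):
    """Return True if a novel OTU passes filters 2-5.

    Similarities are percentages; dissimilarity is assumed to be
    100 - similarity.
    """
    return (
        MIN_COUNT < total_count < MAX_COUNT
        and num_environments >= MIN_ENVIRONMENTS
        and (100.0 - gg_similarity) >= MIN_GG_DISSIMILARITY
        and nt_similarity <= MAX_NT_SIMILARITY
    )


def rank_by_nt_dissimilarity(candidates):
    """Sort (otu_id, nt_similarity) pairs so the most dissimilar come first."""
    return sorted(candidates, key=lambda pair: pair[1])
```

For example, an OTU with a total count of 250 in 4 environments, 78% similar to gg 97 and 95% similar to nt, passes every check, while the same OTU found in only 3 environments does not.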

@gregcaporaso @rob-knight I think these feature requests sound great, though I will not have time to work on them to meet the deadline today.

@gregcaporaso
Contributor Author

Thanks a lot!

@gilbertjack

AWESOME, thanks

@cuttlefishh
Collaborator

@rob-knight said: EMP most wanted and picrust definitely valuable this time around (i.e. are there “most wanted” that are in environments with “interesting” parameters?).
