Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fetching only n genomes with genbank_get_genomes_by_taxon.py #94

Open
sbridel opened this issue Sep 11, 2017 · 5 comments
Open

Fetching only n genomes with genbank_get_genomes_by_taxon.py #94

sbridel opened this issue Sep 11, 2017 · 5 comments
Assignees
Labels
enhancement something we'd like pyani to do that it doesn't already interface issues related to how the user tells pyani to do something
Milestone

Comments

@sbridel
Copy link

sbridel commented Sep 11, 2017

Summary:

Basically, there is sometimes 100+ genomes fetched from a taxonomic ID (even at species lvl)
genbank_get_genomes_by_taxon.py

Is it possible to add an option like:
genbank_get_genomes_by_taxon.py -max 5 to fetch only 5 genomes.

And if you enter the taxonomic ID, it will be nice to have the possibility to fetch only 1 or 2 genomes or each species within the genus and exclude / include Genus sp. (unidentified species within the genus)

Simply it's only suggestions.

Pyani is still awesome

@widdowquinn
Copy link
Owner

Currently, there is an option --subsample which will randomly choose n genomes from within the supplied set of genomes in the input directory. That will do part of what you want - reduce the number of genomes being used in the analysis.

I'm reluctant to implement exactly what you ask for, as there are more practical issues with following up apparent problems (e.g. misidentified taxa) when downloading random subsets, than when subsampling from a local directory. Also, the issue of misnaming/misclassification makes it difficult to be sure that the subgroup one downloads is representative - more so when few sequences are downloaded - and especially makes it risky to assume that the assigned nomenclature reflects (or does not) the other organisms with that nomenclature, or the clade you want to download.

If I were in the position that I wanted to download only four or five randomly-chosen genomes for a species, I'd do it manually - which is likely as quick as running the script ;)

@widdowquinn widdowquinn added the wontfix we accept there is an issue but we cannot or will not address it label Oct 9, 2017
@jvollme
Copy link

jvollme commented Oct 18, 2017

I would like to mention here that sometimes I like to use genbank_get_genomes_by_taxon.py to gather reference genomes for larger taxonomic groups (sometimes up to phylum-level). Especially in cases where clinically relevant species are involved, that can really bloat up the downloaded dataset (can be several thousand genomes, even if there only a few hundred species) due to multiple published genomes of basically identical strains (and sometimes even different versions of the same genome)

In those cases, I usually gather a list of type strains for each species, and automatically filter the result to remove all non-type-strains from the downloaded-dataset, at least for all species of which the type strain was sequenced. In cases of species where only non-type strains have been sequenced (also happens quite often), I select the strain with the largest genome that contains protein annotations with the least number of contigs.

Assuming these infos might also be available directly in entrez (type strain, genome size, number of contigs, number of CDS), maybe this could be incorporated as an option in the script? To me this seems more generally applicable than random subsampling, as this approach favours the most complete representative for each species, and would reduce the amount of unwanted downloads.

@widdowquinn
Copy link
Owner

I like the idea of restricting downloads only to those indicated as representative genomes. These should be well-defined in the database and reproducible, and the choice controllable by a single switch (e.g. --reponly). I will have to look at the Entrez/EUtils docs to see what's needed, but it sounds feasible and useful.

I'm less keen on the ad hoc rules for selecting a pseudo-type strain, as even from the short criteria set you mention there's more than one reasonable interpretation (e.g. should we prefer fewer contigs or slightly longer genome?). Restricting to genomes --with-cds, --min-size, and --max-contigsis pretty feasible, but some of the niceties seem like they could be a little troublesome: some submitted sequences might have many short contigs that other submissions discard, and we might only care about contigs over a minimum length (depending on how/when sequenced, etc.) - I'm interested in implementing those options, but they'll likely come in v0.3 (currently classify branch).

A note of caution with type strains: some of the existing NCBI classifications under a single species-level taxID look - to me - to be clearly quite different species. Also, for one species we care a lot about here, the nominated type strain is about the least representative genome available so far, out of around 60 available! ;)

@widdowquinn widdowquinn added enhancement something we'd like pyani to do that it doesn't already and removed wontfix we accept there is an issue but we cannot or will not address it labels Oct 18, 2017
@widdowquinn widdowquinn self-assigned this Oct 18, 2017
@jvollme
Copy link

jvollme commented Oct 20, 2017

Hi, yes I also found that the type strains are not necessarily actually representative for a species. Often they were just the first or most easy to Isolate. It's just that, when you limit a set of reference organisms because you only need (or can only handle) about one representative per species, it's often easiest to argue that you simply selected the type strains.
But instead restricting it to the genomes selected at "representative genome" for any species by NCBI seems better (I guess they probably have some quality criteria involved in that selection also).

I agree that selecting "the one" representative for each species is tricky and highly dependent on what your specific goal is. I'd be especially happy about the "--with-cds" and "--max-contigs" options though, because that would help to exclude non-annotated and extremely fragmented draft genomes.

@widdowquinn widdowquinn added this to the 0.3.1 milestone May 29, 2020
@widdowquinn widdowquinn added the interface issues related to how the user tells pyani to do something label May 29, 2020
@widdowquinn
Copy link
Owner

Specific interface changes - a checklist

  • add --taxid-file (or similar) to use a predefined set of taxIDs; the format of the associated file should be discussed before implementation
  • add --type-only option to pyani download so that only Type species (as recorded at NCBI) are downloaded
  • add --with-cds option to pyani download so that only annotated genomes are downloaded
  • add --max-contigs option to pyani download so that only genomes with more than a specified number of contigs are downloaded; related question - is it possible to get enough information such that a minimum contig length can be defined when making this count of contig numbers?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement something we'd like pyani to do that it doesn't already interface issues related to how the user tells pyani to do something
Projects
None yet
Development

No branches or pull requests

3 participants