-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fetching only n genomes with genbank_get_genomes_by_taxon.py #94
Comments
Currently, there is an option I'm reluctant to implement exactly what you ask for, as there are more practical issues with following up apparent problems (e.g. misidentified taxa) when downloading random subsets, than when subsampling from a local directory. Also, the issue of misnaming/misclassification makes it difficult to be sure that the subgroup one downloads is representative - more so when few sequences are downloaded - and especially makes it risky to assume that the assigned nomenclature reflects (or does not) the other organisms with that nomenclature, or the clade you want to download. If I were in the position that I wanted to download only four or five randomly-chosen genomes for a species, I'd do it manually - which is likely as quick as running the script ;) |
I would like to mention here that sometimes I like to use genbank_get_genomes_by_taxon.py to gather reference genomes for larger taxonomic groups (sometimes up to phylum-level). Especially in cases where clinically relevant species are involved, that can really bloat up the downloaded dataset (can be several thousand genomes, even if there only a few hundred species) due to multiple published genomes of basically identical strains (and sometimes even different versions of the same genome) In those cases, I usually gather a list of type strains for each species, and automatically filter the result to remove all non-type-strains from the downloaded-dataset, at least for all species of which the type strain was sequenced. In cases of species where only non-type strains have been sequenced (also happens quite often), I select the strain with the largest genome that contains protein annotations with the least number of contigs. Assuming these infos might also be available directly in entrez (type strain, genome size, number of contigs, number of CDS), maybe this could be incorporated as an option in the script? To me this seems more generally applicable than random subsampling, as this approach favours the most complete representative for each species, and would reduce the amount of unwanted downloads. |
I like the idea of restricting downloads only to those indicated as representative genomes. These should be well-defined in the database and reproducible, and the choice controllable by a single switch (e.g. I'm less keen on the ad hoc rules for selecting a pseudo-type strain, as even from the short criteria set you mention there's more than one reasonable interpretation (e.g. should we prefer fewer contigs or slightly longer genome?). Restricting to genomes A note of caution with type strains: some of the existing NCBI classifications under a single species-level taxID look - to me - to be clearly quite different species. Also, for one species we care a lot about here, the nominated type strain is about the least representative genome available so far, out of around 60 available! ;) |
Hi, yes I also found that the type strains are not necessarily actually representative for a species. Often they were just the first or most easy to Isolate. It's just that, when you limit a set of reference organisms because you only need (or can only handle) about one representative per species, it's often easiest to argue that you simply selected the type strains. I agree that selecting "the one" representative for each species is tricky and highly dependent on what your specific goal is. I'd be especially happy about the "--with-cds" and "--max-contigs" options though, because that would help to exclude non-annotated and extremely fragmented draft genomes. |
Specific interface changes - a checklist
|
Summary:
Basically, there is sometimes 100+ genomes fetched from a taxonomic ID (even at species lvl)
genbank_get_genomes_by_taxon.py
Is it possible to add an option like:
genbank_get_genomes_by_taxon.py -max 5 to fetch only 5 genomes.
And if you enter the taxonomic ID, it will be nice to have the possibility to fetch only 1 or 2 genomes or each species within the genus and exclude / include Genus sp. (unidentified species within the genus)
Simply it's only suggestions.
Pyani is still awesome
The text was updated successfully, but these errors were encountered: