PoDP RefSeq accession is used as antiSMASH accession #76
Comments
Thanks for bringing this up @CunliangGeng! I have invited Marnix Medema to the NPLinker repo as well, as he is more versed in this area than I am, but I will do my best below.
Re 3. Could you elaborate a little on what you mean here? I think that in the future it would be good to further improve the PoDP scheme and allow users to point to a (pre-run) antismashDB result. Is that what you mean by "antiSMASH accession"? I don't understand what that is: antiSMASH runs on sequenced genomes and produces predicted gene clusters for each genomic sample.
Re 1. We are aware that the current scheme is not perfect and that sometimes genomes cannot be found. In a next version of the PoDP scheme, it could be worth checking whether changing the endpoint results in a higher download percentage for most of the projects. It is hard to say whether this will be the case based on our experience so far.
Re 2. Let's discuss this in a meeting, as things can get confusing rather quickly. For example, for the project you are linking to, the genomic data is actually stored at JGI, i.e., https://img.jgi.doe.gov/cgi-bin/m/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=2515154093 - and when searching for the data at NCBI, you can find a suppressed record with the GCF code: https://www.ncbi.nlm.nih.gov/assembly/GCF_000515175.1/ - and there are others as well: https://www.ncbi.nlm.nih.gov/assembly/organism/168697/latest/
Hi Justin, I updated the issue description to make it clearer and to answer your questions. Let's discuss this issue in the next meeting.
Thanks for the clarifications. Let's discuss further in our next meeting, including to what extent we can update the PoDP in the current project. Some lower-level adjustments would be to improve/extend the user instructions :-)
Reported the issues to PoDP: iomega/paired-data-form#249.
There are a few correlated issues:
1. `RefSeq_accession` is searched on the endpoint https://www.ncbi.nlm.nih.gov/nuccore/ (this is the endpoint of GenBank), but RefSeq actually has its own endpoint: https://www.ncbi.nlm.nih.gov/refseq
TODO:
2. The `RefSeq_accession` section is sometimes filled with an Assembly accession. For example, in the project shown below, `GCF_000515175.1` is actually an Assembly accession, not a RefSeq accession. RefSeq and Assembly accessions have different prefixes: Assembly accessions start with `GCF` or `GCA`.
TODO: `Assembly` section for Assembly data; `RefSeq_accession` section
3. When NPLinker tries to download antiSMASH data, it extracts the accession from the PoDP `RefSeq_accession` section and then uses that accession to query the antiSMASH database for pre-run data. But as discussed in question 2, `RefSeq_accession` is the wrong place for an Assembly accession, and the PoDP schema should add another section for `Assembly_accession`.
Question
TODO:
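The prefix difference between the two accession types can be checked mechanically. Below is a minimal Python sketch, not NPLinker's actual code: the function names, the regular expressions, and the flat record layout with a `RefSeq_accession` key are all assumptions made for illustration. The RefSeq pattern in particular is only a rough heuristic (two uppercase letters followed by an underscore, such as `NC_` or `NZ_`).

```python
import re

# Assembly accessions: GCA_ (GenBank) or GCF_ (RefSeq) + 9 digits + optional version.
ASSEMBLY_RE = re.compile(r"^GC[AF]_\d{9}(\.\d+)?$")
# Rough heuristic (assumption) for RefSeq sequence records, e.g. NC_..., NZ_...
REFSEQ_RE = re.compile(r"^[A-Z]{2}_[A-Z0-9]+(\.\d+)?$")


def classify_accession(accession: str) -> str:
    """Return 'assembly', 'refseq', or 'unknown' based on the accession format."""
    acc = accession.strip()
    # Check Assembly first: GCF_/GCA_ would also match the looser RefSeq heuristic.
    if ASSEMBLY_RE.match(acc):
        return "assembly"
    if REFSEQ_RE.match(acc):
        return "refseq"
    return "unknown"


def misfiled_assembly_accessions(genomes):
    """List values stored under 'RefSeq_accession' that look like Assembly accessions.

    `genomes` is a hypothetical list of flat dicts; the real PoDP JSON may nest
    this field differently.
    """
    return [
        g["RefSeq_accession"]
        for g in genomes
        if classify_accession(g.get("RefSeq_accession", "")) == "assembly"
    ]
```

A check like this could run before querying the antiSMASH database, so that an Assembly accession entered under `RefSeq_accession` is at least flagged rather than passed along silently.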