PoDP RefSeq accession is used as antiSMASH accession #76

Closed
6 tasks
CunliangGeng opened this issue Aug 9, 2022 · 7 comments

@CunliangGeng
Member

CunliangGeng commented Aug 9, 2022

There are a few related issues:

  1. The PoDP schema shows that RefSeq_accession is searched against the endpoint https://www.ncbi.nlm.nih.gov/nuccore/ (this is the endpoint of GenBank), but RefSeq actually has its own endpoint, https://www.ncbi.nlm.nih.gov/refseq.

TODO:

  • the RefSeq endpoint in the schema should be updated

  2. PoDP users fill the RefSeq_accession section with Assembly accessions. For example, in the project shown below, GCF_000515175.1 is actually an Assembly accession, not a RefSeq accession.
    image

RefSeq and Assembly accessions have different prefixes:

  • RefSeq accession prefix list
  • Assembly accessions have two prefixes: GCF and GCA
    • The assembly accession starts with a three-letter prefix, GCA for GenBank assemblies and GCF for RefSeq assemblies. This is followed by an underscore and 9 digits.

TODO:

  • PoDP should validate the prefix to make sure the entered accession is valid
  • PoDP should add an Assembly section for Assembly data
  • PoDP should ask users to correct RefSeq_accession sections that were filled in incorrectly
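
The prefix validation proposed above could look roughly like the sketch below. It is not PoDP code: the function name `classify_accession` is hypothetical, the assembly pattern follows the format quoted above plus an optional version suffix such as `.1`, and the RefSeq prefix set is only an illustrative subset (the linked prefix list is authoritative).

```python
import re

# Assembly accessions: GCA (GenBank) or GCF (RefSeq), an underscore, 9 digits,
# and optionally a version suffix such as ".1".
ASSEMBLY_RE = re.compile(r"^(GCA|GCF)_\d{9}(\.\d+)?$")

# Illustrative subset of RefSeq sequence accession prefixes (assumption:
# not exhaustive; consult the NCBI prefix list for the full set).
REFSEQ_PREFIXES = {
    "AC", "NC", "NG", "NT", "NW", "NZ", "NM", "NR",
    "XM", "XR", "AP", "NP", "YP", "XP", "WP",
}


def classify_accession(accession: str) -> str:
    """Return "assembly", "refseq", or "unknown" based on the prefix."""
    acc = accession.strip()
    if ASSEMBLY_RE.match(acc):
        return "assembly"
    prefix, sep, rest = acc.partition("_")
    if sep and rest and prefix in REFSEQ_PREFIXES:
        return "refseq"
    return "unknown"


print(classify_accession("GCF_000515175.1"))  # assembly
print(classify_accession("NC_000913.3"))      # refseq
```

A form could reject or flag a RefSeq_accession entry whenever this check returns anything other than "refseq".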

  3. It looks like the antiSMASH database uses the Assembly accession as its own accession, e.g. antiSMASH GCF_000814765.1 and the corresponding Assembly data

When NPLinker downloads antiSMASH data, it extracts the accession from the PoDP RefSeq_accession section and then uses that accession to query the antiSMASH database for pre-run data. But as discussed in point 2, RefSeq_accession is the wrong place for an Assembly accession, and the PoDP schema should add a separate section for Assembly_accession.

Question

  • Does the antiSMASH database always use the Assembly accession as its own accession?

TODO:

  • When PoDP is updated for Assembly accessions, NPLinker should also be updated to extract the accession from the corrected PoDP section when downloading antiSMASH data
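
As a rough sketch of that NPLinker-side change (not actual NPLinker code): the function `antismash_lookup_accession` and the flat `genome_ids` dict are hypothetical, and `Assembly_accession` is the section proposed in this issue, which does not exist in the current schema. The idea is to prefer the new section and fall back to RefSeq_accession only when its value looks like an Assembly accession.

```python
import re
from typing import Optional

# GCA/GCF, underscore, 9 digits, optional version suffix such as ".1".
ASSEMBLY_RE = re.compile(r"^(GCA|GCF)_\d{9}(\.\d+)?$")


def antismash_lookup_accession(genome_ids: dict) -> Optional[str]:
    """Pick the accession to use when querying the antiSMASH database.

    `genome_ids` is a hypothetical flat mapping of PoDP section names to
    values; the real PoDP JSON layout may differ.
    """
    # Prefer the proposed Assembly_accession section, then fall back to
    # RefSeq_accession for projects where users entered a GCF/GCA value there.
    for field in ("Assembly_accession", "RefSeq_accession"):
        value = (genome_ids.get(field) or "").strip()
        if ASSEMBLY_RE.match(value):
            return value
    return None  # no usable Assembly accession found


print(antismash_lookup_accession({"RefSeq_accession": "GCF_000515175.1"}))
```

The fallback keeps existing projects with misplaced Assembly accessions working until their entries are corrected.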
@CunliangGeng added this to the Enable scalable milestone Aug 9, 2022
@justinjjvanderhooft

Thanks for bringing this up @CunliangGeng! I have invited Marnix Medema to the NPLinker repo as well, as he is more versed in this area than I am, but I will do my best below.

@justinjjvanderhooft

Re 3. Could you elaborate a little on what you mean here? I think in the future it would be good to further improve the PoDP schema and allow users to point to a (pre-run) antiSMASH DB result. Is that what you mean by "antiSMASH accession"? I don't understand what that is: antiSMASH runs on sequenced genomes and produces predicted gene clusters for each genomic sample.

@justinjjvanderhooft

Re 1. We are aware that the current schema is not perfect and that sometimes genomes cannot be found. In the next version of the PoDP schema, it could be worth checking whether changing the endpoint results in a higher download percentage for most of the projects. It is hard to say whether this will be the case based on our experience so far.

@justinjjvanderhooft

justinjjvanderhooft commented Aug 12, 2022

Re 2. Let's discuss this in a meeting as things can get rather confusing rather quickly.... For example, for the project you are linking to, the genomic data is actually stored at JGI: i.e., https://img.jgi.doe.gov/cgi-bin/m/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=2515154093 - and when searching for the data at NCBI, you can find a suppressed record with the GCF code: https://www.ncbi.nlm.nih.gov/assembly/GCF_000515175.1/ - and there are others as well: https://www.ncbi.nlm.nih.gov/assembly/organism/168697/latest/

@CunliangGeng
Member Author

Hi Justin, I updated the issue description to make it clearer and to answer your questions. Let's discuss this issue in the next meeting.

@justinjjvanderhooft

Thanks for the clarifications. Let's discuss further in our next meeting, including to what extent we can update the PoDP in the current project. Some lower-level adjustments would be to improve/extend the user instructions :-)

@CunliangGeng
Member Author

Reported the issues to PoDP: iomega/paired-data-form#249.
Closing this issue then.

@CunliangGeng closed this as not planned Aug 17, 2022