GISAID all-sequences fasta should be directly usable by nextstrain/ncov #53
Comments
Raised as an idea after issue #52 and the thought that more such error issues may come our way from those using the file download. |
Thanks @brianpardy , I've linked to this on our internal Nextstrain convos so we can consider :) |
Hi @emmahodcroft, I did go ahead and create a simple script that works on my local install using the current gisaid_cov2020_sequences.fasta file. I committed it to my fork with brianpardy@b401051. It uses only cat, sed, awk, and grep. Call as: |
It is certainly the wrong way to do it but I also added a Snakefile rule called 'gisaid' that will run this script to create sequences.fasta from gisaid_cov2020_sequences.fasta. I don't know enough to change the Snakefile to replace the download rule with the gisaid rule, but calling "snakemake gisaid" on my copy will generate sequences.fasta, and "snakemake -f gisaid" will regenerate it when a new download from gisaid is placed in data/. |
Thanks @brianpardy ! We are looking into some possible solutions here. We are going to try and make this work better, but we'll need to iron out some details on how to organise that :) |
That sounds good to me, @emmahodcroft; if I helped spur some thought, I'm happy. I did make one more change to my script for the embedded-spaces item: I noticed the "Hong Kong" sequences were in metadata.tsv with the space removed, not converted to an underscore, so they could not be matched. I fixed that. I'll also add some error checking for calls that don't name the files on the command line. If the team elects to use this, great; if not, I still appreciate having the issue considered. |
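The name mismatch described in this comment can be illustrated with a minimal Python sketch. It is not the code from the script (which uses sed and awk); the function names and the example strain names are hypothetical, and it assumes, as reported above, that metadata.tsv stores the strain with the space removed rather than replaced by an underscore.

```python
def normalize_strain_name(name: str) -> str:
    """Drop embedded spaces, e.g. 'Hong Kong/x/2020' -> 'HongKong/x/2020'.

    Assumption: the metadata uses the space-free form, as reported in the thread.
    """
    return name.replace(" ", "")


def unmatched_strains(fasta_names, metadata_strains):
    """Return normalized FASTA names that have no counterpart in the metadata."""
    known = set(metadata_strains)
    return [n for n in map(normalize_strain_name, fasta_names) if n not in known]
```

Running `unmatched_strains` over the FASTA header names and the strain column of metadata.tsv would list any sequences that still cannot be matched after normalization.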
As @jameshadfield mentioned this issue in #57 I should add that as of right now the script I offered does not work perfectly with the current all sequences file from GISAID: the three new Hong Kong sequences EPI_ISL_412028, EPI_ISL_412029, and EPI_ISL_412030 have duplicate strain names to earlier submissions EPI_ISL_408975, EPI_ISL_409020, and EPI_ISL_409024. The awk statement in my normalize_gisaid_fasta script keeps only the first instance of a duplicate strain name and discards all additional instances. When run, my script will currently only keep the earlier, partial Spike glycoprotein sequences and will discard the newer, complete genomes. For the moment I am manually removing those three partial sequences from the GISAID download before running my script. I wanted to keep the script simple and obvious but it could probably be extended to keep the longest sequence found instead of the first, at the expense of readability and complexity. |
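The "keep the longest sequence instead of the first" extension mentioned above could look roughly like the following Python sketch. It is an illustration only, not the awk logic in the script; `read_fasta` and `keep_longest` are hypothetical helper names, and ties are resolved in favor of the earlier record as an arbitrary choice.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def keep_longest(records):
    """Map each name to its longest sequence; on a tie the earlier record wins."""
    best = {}
    for name, seq in records:
        if name not in best or len(seq) > len(best[name]):
            best[name] = seq
    return best
```

With this approach a later, complete genome replaces an earlier partial sequence that shares its strain name, at the cost of holding one sequence per name in memory.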
Sorry about those extra commits showing up on the issue log; I'm learning how to deal with branching properly so I can submit a pull request, and I was not expecting that quite yet. Please ignore the first one. I updated my script to resolve this issue. I added a third command-line parameter for minimum length that defaults to 15000. I am calling it from my Snakefile using params.min_length and it is working fine. This resolves, for now, the problem of normalize_gisaid_fasta.sh keeping the first-appearing, shorter sequence instead of the later-appearing, complete sequence when sequence names collide. I set this up on a clean branch on my fork that should merge cleanly if the team accepts the pull request I am about to submit for commit brianpardy@d3c90c7. No offense taken if unwanted. |
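To show why a minimum-length threshold resolves the collision, here is a minimal Python sketch under stated assumptions: the 15000 default comes from the comment above, while the function names, the `>=` comparison, and the example records are hypothetical and not taken from normalize_gisaid_fasta.sh.

```python
DEFAULT_MIN_LENGTH = 15000  # default value named in the comment above


def filter_by_length(records, min_length=DEFAULT_MIN_LENGTH):
    """Keep only (name, sequence) pairs with at least min_length characters.

    Assumption: sequences exactly at the threshold are kept.
    """
    return [(name, seq) for name, seq in records if len(seq) >= min_length]


def first_per_name(records):
    """Keep the first record seen for each name and drop later duplicates."""
    seen, kept = set(), []
    for name, seq in records:
        if name not in seen:
            seen.add(name)
            kept.append((name, seq))
    return kept
```

Filtering before de-duplicating means a short partial sequence is gone by the time the first-wins rule runs, so the complete genome is the only candidate left for that name.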
Hi @brianpardy , thank you for the work! Yes, these are the same issues we are running into on our end. We're still trying to figure out the best way to deal with this both for public users and for our own internal builds (which need to be aligned between all of us who update Nextstrain, etc, so are a bit more complicated). We're all a bit short on time at the moment unfortunately, so progress is slow - sorry! |
It may sound silly but I have to ask: where is the "all-sequences download button for SARS-CoV-2 sequences" in GISAID? I could not find it... |
On GISAID, in the EpiCoV tab - bottom right 'Download' button, under the table. |
You will need a GISAID account to do this. |
I registered. I can see each entry with a "download metadata" and a "download fasta" button. But I could not find a button to download all of them. |
You need to be on the main 'browse' screen that lists all of the deposited sequences, not the individual-sample screen that contains the 'download metadata' and 'download fasta' buttons. The button is just labeled "Download" with an icon on it, to the right of the screen paging tools. |
Thank you very much! Do you mean the Excel table? I can download an Excel table with all the entries by clicking "Download Acknowledgement Table for all submissions here". But I still cannot get a fasta file... |
It looks like the download button is not appearing for you at all. Are you able to scroll your screen to the right? The page I see includes a download button as shown below. |
No, I don't have that button. Thank you both very much for replying! Now I have to try something else. |
@xzhuo : Same issue for me, gisaid removed the download button. Did you figure out an alternative solution? |
Not yet. A crawler? |
Why not add the fasta of SARS, MERS, and the others of the family? When is it a new virus? |
@pedroelbanquero sure, it can be done at some point. For now I've added parsing of the samples' metadata to the scraper; it should yield some interesting information |
I've been having trouble getting @wwydmanski's scraper to work (it errors at the end of the first page because the DOM seems to have changed). @melkebir's scraper does work. Both were tested on a MacBook running macOS 10.14.6. |
@rvosa maybe it's OS dependent? I've tested it only on windows 10 |
Add normalize_gisaid_fasta.sh for issue #53
I'll leave this issue open for discussion. If you successfully download
and then just proceed with I don't understand what's going on with some people being able to see a "Download" button and others not able to. I'd suggest to continue to contact GISAID support about this. |
Thank you for the merge, @trvrb! As another followup to this, users running local nextstrain/ncov instances based on the normalized GISAID fasta download may notice inconsistencies in their local results vs those on the nextstrain.org site. Occasionally sequences released on GISAID are later withdrawn or set as non-public, at which point they no longer appear in the gisaid_cov2020_sequences.fasta file provided by GISAID. Nextstrain itself appears to be using an independent archive that does not always immediately reflect the removal of sequences from GISAID (though it has in the past). For example, the current GISAID download lacks many of the Guangdong sequences from March 9th currently visible on nextstrain.org/ncov. |
It looks like GISAID made some correction after contacting them via E-Mail. Now I see the Download button again. 😃 |
Is it possible to submit this validator script to GISAID to improve the data quality on their side? |
I cannot find the download-all button as of 2020/3/25; maybe they just removed it? The only way I can check the sequences is by browsing, but that means downloading one record at a time. Does anyone else have the same problem? |
The sequence availability issue is something that is problematic beyond nextstrain per se. Perhaps it makes sense if someone from the nextstrain core kept an eye on the activities towards data sharing that are being developed by the participants of the covid-19 biohackathon. |
ZeweiSong, you need to email them. They should enable it for you then. Not at all clear why this is the case -- super frustrating in fact. But this is what happened for me. (also don't expect them to email you back, just check again 24 hours later and see if the button appears). |
Seconding @TrentBrick and @trvrb messages -- best way forward is to contact GISAID and request access. Please do not use scrapers -- with increasing number of sequences and number of interested users this would essentially amount to a denial of service. |
Not sure why my comment was removed - calling the Javascript function that triggers the download should be just as costly as using the download button itself. You would still need access to the page in the first place. |
I'm gearing up to formulate a request for data access and sharing on behalf of the biohackathon (there's a special covid-19 edition starting soon). I asked GISAID on twitter but I don't think they're very active there. I've had some interaction via their issue tracker so I'll next try in that way. Would it make sense to ask on behalf of (or with reference to) the nextstrain user community at the same time? Please let me know if I should do that. The general idea is not to nag or complain. I'm sure they're very busy right now. Also, I imagine they are simply under existing agreements with data submitters that they have to comply with. However, maybe there are other ways in which they can meet their obligations and still accomplish data access with less friction. That will probably involve both technical implementation and social busywork. It seems to me that there are many people willing and able to help with both of these right now. Something structural needs to improve that we mustn't try to address with screen scrapers and javascript backdoors. More and more researchers want to do good work with these data. It is part of GISAID's stated mission to enable that. We ought to work together to make that possible in an open and collaborative way. |
@rvosa maybe they (GISAID) think that the data will be used in a malicious way? Because if not, then maybe there is insufficient funding and poor technical excellence to avoid DoS. E.g. setup memcache. |
Is anyone else having trouble accessing GISAID right now? It was hard for me to create an account, but now that I have one the ncov tab just doesn't load. I'm not sure why it's so hard to obtain the sequences. Makes analyzing the data so much harder compared to the ones deposited in genbank. |
No trouble creating an account, but download requests keep throwing errors. |
Add detailed steps for how to obtain and normalize GISAID fasta, as resolved in nextstrain#53 and nextstrain#59 .
@palatos it's been up and down for me. Just keep re-trying. Fortunately the actual fasta download is pretty small and quick once you get in. |
@brianpardy I can't see the button for downloading all genome sequences. How can I fix this? |
@xzhuo Are you able to download all the genome sequences of ncov? |
@woson2020, as suggested above you'll need to request download access from GISAID. After you log in, you can send a message through the Contact page. They generally grant access within 1-2 days. |
I have already sent a message. Thank you for your help. |
I requested access yesterday and they enabled the download feature overnight. Just send them a polite note on their contact form describing why you need to access all the data. In my case I need it to validate a diagnostic against the latest isolates. When you get the data you will need to do some basic cleanup. Some of the header lines have spaces in the ID e.g. 'Hong Kong' rather than Hong_Kong. Some sequences have control characters and there are a few other anomalies. I'll post my cleanup script later if I get the chance - it's just basic regexp stuff. |
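The kind of cleanup described above might be sketched in Python as follows. This is not the script from the gist; `clean_line`, `CONTROL_CHARS`, and the `space_replacement` parameter are hypothetical names, and the exact set of characters treated as control characters is an assumption.

```python
import re

# Assumption: strip ASCII control characters except tab and newline.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b-\x1f\x7f]")


def clean_line(line: str, space_replacement: str = "_") -> str:
    """Clean one FASTA line: drop line endings and control characters,
    and replace spaces inside header IDs (e.g. 'Hong Kong' -> 'Hong_Kong')."""
    line = CONTROL_CHARS.sub("", line.rstrip("\r\n"))
    if line.startswith(">"):
        line = ">" + line[1:].strip().replace(" ", space_replacement)
    return line
```

Passing `space_replacement=""` instead yields the space-free form (`HongKong`), which an earlier comment in this thread reports as the form used in metadata.tsv.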
Here is my cleanup script as a gist https://gist.github.com/craic/790e57e3ea140797d66a9dccaaa098a2 |
Thank you very much for your reply. Last week I wrote a Python script using the selenium package (web automation that simulates clicks) to download all the fasta files from GISAID.
|
Hi all, just a reminder that scrapers are harmful to the functioning of GISAID. We and they ask kindly that you do not use them. |
@emmahodcroft I can fix the problem with downloads and data quality if I understand what's going on on GISAID side, but I can not reach them. Do you have any contact with them? |
No, I'm afraid my own lines of contact are through the same portals that are available publicly. |
I don't understand why USA/WA1/2020 is used to determine haplotype status in the script annotate-haplotype-status.py. |
I can no longer see the "Download Acknowledgement Table here" link, which was there until the last couple of days. The site is also frequently throwing internal errors. I messaged them through the contact form, but the issue is still not resolved. Please help me with this. |
Can you unsubscribe me from your mailing list?
|
Click this --> https://github.com/notifications/unsubscribe-auth/AAAGNA3EGRM5CXNOPF6PAYDROKUPRANCNFSM4KX3P2OA Buy me a coffee if it works. ,) |
Closing this issue. |
@trvrb that's some good news. I miss the "data journalism" that could reveal the full story. :) |
Filing this as an issue as suggested by @emmahodcroft:
GISAID provides an all-sequences download button for SARS-CoV-2 sequences. The provided file is not directly usable as a sequences.fasta file in nextstrain/ncov because of several issues in the GISAID file:
I suggest a new bash script in ncov/scripts/ that would optionally normalize the GISAID all-sequences download file so that users can use it directly without a need to manually remove duplicates or edit sequence names or maintain their own automated pipeline to generate data/sequences.fasta.
I also suggest automating this in the Snakefile if possible, but I'm not sure how. If no data/sequences.fasta file exists, but data/gisaid_cov2020_sequences.fasta exists, run scripts/normalize-gisaid-fasta.sh before the rest of the pipeline.
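The condition proposed here can be written down as a small Python check. This is only a sketch of the decision logic, not a Snakefile rule; the function name `needs_normalization` is hypothetical, and the two file names are the ones given in this issue.

```python
from pathlib import Path


def needs_normalization(data_dir="data"):
    """Return True when sequences.fasta is missing but the GISAID download exists."""
    data = Path(data_dir)
    return (not (data / "sequences.fasta").exists()
            and (data / "gisaid_cov2020_sequences.fasta").exists())
```

A pipeline could call such a check before its first step and run the normalization script only when it returns True.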
I can likely have a base script written to do the normalization by this evening if there is interest.
Thank you for considering this idea.