GISAID all-sequences fasta should be directly usable by nextstrain/ncov #53

brianpardy · 2020-02-19T15:50:02Z

Filing this as an issue as suggested by @emmahodcroft:

GISAID provides an all-sequences download button for SARS-CoV-2 sequences. The provided file is not directly usable as a sequences.fasta file in nextstrain/ncov because of several issues in the GISAID file:

1. There is at least one duplicate sequence name (Italy/INMI1/2020) that causes errors in augur filter, and often other duplicates exist before they are renamed on GISAID
2. There are several sequences with "Hong Kong" in their names that cause errors in augur filter due to the sequence name being truncated at whitespace
3. The sequence names are appended with the EPI_ISL identifier and a datestamp, which are not stripped when loaded and cause mismatches with sequence names in metadata.tsv
4. The sequence names are prepended with 'BetaCoV' or 'BetaCov' which is not stripped when loaded and causes mismatches with metadata.tsv

I suggest a new bash script in ncov/scripts/ that would optionally normalize the GISAID all-sequences download file so that users can use it directly without a need to manually remove duplicates or edit sequence names or maintain their own automated pipeline to generate data/sequences.fasta.
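
To make the name problems listed above concrete, here is a minimal Python sketch of the header cleanup, not the proposed bash script itself. It assumes the EPI_ISL identifier and datestamp are separated from the strain name by `|`, that the prefix is followed by a `/`, and that embedded spaces should be removed rather than replaced (a later comment in this thread reports that metadata.tsv stores the names with the space removed). The function name and the sample header are hypothetical.

```python
import re

# Assumption: the prefix appears as 'BetaCoV/' or 'BetaCov/' at the start of the name.
_PREFIX = re.compile(r"^BetaCo[Vv]/?")


def normalize_name(header: str) -> str:
    """Turn a GISAID-style FASTA header into a bare strain name."""
    name = header.lstrip(">").strip()
    # Assumption: '|' separates the strain name from the EPI_ISL id and datestamp.
    name = name.split("|", 1)[0]
    # Drop the 'BetaCoV' / 'BetaCov' prefix.
    name = _PREFIX.sub("", name)
    # Remove embedded spaces so the name is not truncated at whitespace.
    return name.replace(" ", "")


if __name__ == "__main__":
    # Hypothetical header, for illustration only.
    print(normalize_name(">BetaCoV/Hong Kong/sample1/2020|EPI_ISL_000000|2020-01-01"))
```

Duplicate names are a separate problem and are not handled by this function.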

I also suggest automating this in the Snakefile if possible, but I'm not sure how. If no data/sequences.fasta file exists, but data/gisaid_cov2020_sequences.fasta exists, run scripts/normalize-gisaid-fasta.sh before the rest of the pipeline.
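
The condition described above can be sketched in plain Python. This is only an illustration of the decision rule, not a Snakefile rule; the function name and the `data_dir` parameter are hypothetical, and the file names are the ones used in this issue.

```python
from pathlib import Path


def needs_normalization(data_dir: str = "data") -> bool:
    """Return True when the raw GISAID download exists but sequences.fasta does not."""
    data = Path(data_dir)
    has_sequences = (data / "sequences.fasta").exists()
    has_gisaid = (data / "gisaid_cov2020_sequences.fasta").exists()
    return has_gisaid and not has_sequences


if __name__ == "__main__":
    if needs_normalization():
        print("Run the normalization script before the rest of the pipeline.")
    else:
        print("Nothing to normalize.")
```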

I can likely have a base script written to do the normalization by this evening if there is interest.

Thank you for considering this idea.

emmahodcroft · 2020-02-19T15:52:15Z

Raised as an idea after issue #52 and the thought that more such error issues may come our way from those using the file download.

emmahodcroft · 2020-02-19T15:55:58Z

Thanks @brianpardy , I've linked to this on our internal Nextstrain convos so we can consider :)

brianpardy · 2020-02-19T17:17:04Z

Hi @emmahodcroft, I did go ahead and create a simple script that works on my local install using the current gisaid_cov2020_sequences.fasta file. I committed it to my fork with brianpardy@b401051

It uses only cat, sed, awk, and grep. Call as:
scripts/normalize_gisaid_fasta.sh data/gisaid_cov2020_sequences.fasta data/sequences.fasta

brianpardy · 2020-02-19T18:24:45Z

It is certainly the wrong way to do it but I also added a Snakefile rule called 'gisaid' that will run this script to create sequences.fasta from gisaid_cov2020_sequences.fasta. I don't know enough to change the Snakefile to replace the download rule with the gisaid rule, but calling "snakemake gisaid" on my copy will generate sequences.fasta, and "snakemake -f gisaid" will regenerate it when a new download from gisaid is placed in data/.

emmahodcroft · 2020-02-20T11:19:29Z

Thanks @brianpardy ! We are looking into some possible solutions here. We are going to try and make this work better, but we'll need to iron out some details on how to organise that :)

brianpardy · 2020-02-20T13:51:34Z

That sounds good to me, @emmahodcroft. If I helped spur some thought, I'm happy. I did make one more change to my script for the embedded-spaces item: I noticed that the "Hong Kong" sequences appear in metadata.tsv with the space removed, not converted to an underscore, so they could not be matched. I fixed that. I'll also add some error checking for calls that do not name the files on the command line. If the team elects to use this, great; if not, I still appreciate having the issue considered.

brianpardy · 2020-02-25T01:50:57Z

As @jameshadfield mentioned this issue in #57, I should add that as of right now the script I offered does not work perfectly with the current all-sequences file from GISAID: the three new Hong Kong sequences EPI_ISL_412028, EPI_ISL_412029, and EPI_ISL_412030 have the same strain names as the earlier submissions EPI_ISL_408975, EPI_ISL_409020, and EPI_ISL_409024. The awk statement in my normalize_gisaid_fasta script keeps only the first instance of a duplicate strain name and discards all later instances. As a result, my script currently keeps the earlier, partial Spike glycoprotein sequences and discards the newer, complete genomes. For the moment I am manually removing those three partial sequences from the GISAID download before running my script.

I wanted to keep the script simple and obvious, but it could probably be extended to keep the longest sequence found instead of the first, at the expense of readability and simplicity.
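
For illustration, the "keep the longest" idea could look like the following Python sketch. It is not the awk logic from the script; `read_fasta` and `keep_longest` are hypothetical helper names, and the sketch assumes the names have already been normalized so that duplicates compare equal.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line.strip())
    if name is not None:
        yield name, "".join(chunks)


def keep_longest(records):
    """Keep one sequence per name, preferring the longest one seen."""
    best = {}
    for name, seq in records:
        if name not in best or len(seq) > len(best[name]):
            best[name] = seq
    return best


if __name__ == "__main__":
    demo = [">A\n", "ACGT\n", ">B\n", "AC\n", ">A\n", "ACGT\n", "ACGT\n"]
    for name, seq in keep_longest(read_fasta(demo)).items():
        print(name, len(seq))
```

Ties keep the first record seen, and names stay in order of first appearance because Python dictionaries preserve insertion order.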

brianpardy · 2020-02-25T13:50:08Z

Sorry about those extra commits showing up on the issue log, I'm learning how to deal with branching properly so I can submit a pull request and I was not expecting that quite yet. Please ignore the first one.

I updated my script to resolve this issue. I added a 3rd commandline parameter for minimum length that defaults to 15000. I am calling it from my Snakefile using params.min_length and it is working fine. This resolves, for now, the problem of normalize_gisaid_fasta.sh keeping the first appearing, shorter sequence, instead of the later appearing, complete sequence, when sequence names collide.
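
One way a minimum-length parameter can resolve the collision is sketched below in Python. This is an assumption about the approach, not a transcription of the script: records shorter than the threshold are skipped first, and then the first remaining record per name is kept. Only the default of 15000 comes from the comment above; the function name is hypothetical.

```python
def dedupe_with_min_length(records, min_length=15000):
    """Keep the first record per name among those at least min_length long."""
    kept = {}
    for name, seq in records:
        if len(seq) < min_length:
            # Partial sequences (for example a single gene) are dropped here,
            # so they can no longer shadow a later complete genome.
            continue
        kept.setdefault(name, seq)
    return kept


if __name__ == "__main__":
    # Hypothetical records: a short partial sequence followed by a full-length one.
    demo = [("HongKong/x/2020", "A" * 3000), ("HongKong/x/2020", "A" * 29000)]
    for name, seq in dedupe_with_min_length(demo).items():
        print(name, len(seq))
```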

I set this up on a clean branch on my fork that should merge cleanly if the team accepts the pull request I am about to submit for commit brianpardy@d3c90c7

No offense taken if unwanted.

emmahodcroft · 2020-02-25T17:15:18Z

Hi @brianpardy , thank you for the work! Yes, these are the same issues we are running into on our end. We're still trying to figure out the best way to deal with this both for public users and for our own internal builds (which need to be aligned between all of us who update Nextstrain, etc, so are a bit more complicated). We're all a bit short on time at the moment unfortunately, so progress is slow - sorry!

xzhuo · 2020-03-05T16:51:26Z

It may sound silly but I have to ask: where is the "all-sequences download button for SARS-CoV-2 sequences" in GISAID? I could not find it...

emmahodcroft · 2020-03-05T16:52:38Z

On GISAID, in the EpiCoV tab - bottom right 'Download' button, under the table.

emmahodcroft · 2020-03-05T16:52:54Z

You will need a GISAID account to do this.

xzhuo · 2020-03-05T16:55:26Z

I registered. I can see each entry with a "download metadata" and a "download fasta" button. But I could not find a button to download all of them.

brianpardy · 2020-03-05T16:58:42Z

You need to be on the main 'browse' screen that lists all of the deposited sequences, not the individual-sample screen that contains the 'download metadata' and 'download fasta' buttons. The button is just labeled "Download" with an icon on it, to the right of the screen paging tools.

xzhuo · 2020-03-05T17:06:48Z

Thank you very much! do you mean the excel table? I can download an excel table with all the entries by clicking "Download Acknowledgement Table for all submissions here". But I still cannot get a fasta file...

brianpardy · 2020-03-05T17:11:51Z

It looks like the download button is not appearing for you at all. Are you able to scroll your screen to the right? The page I see includes a download button as shown below.

xzhuo · 2020-03-05T17:16:39Z

No, I don't have that button. Thank you both very much for replying! Now I have to try something else.

melkebir · 2020-03-11T20:23:17Z

@xzhuo : Same issue for me, gisaid removed the download button. Did you figure out an alternative solution?

xzhuo · 2020-03-12T04:12:37Z

Not yet. A crawler?

pedroelbanquero · 2020-03-14T09:13:43Z

Why not also add the FASTA files for SARS, MERS, and the other members of the family? When does it count as a new virus?

wwydmanski · 2020-03-14T12:45:18Z

@pedroelbanquero sure, it can be done at some point. For now I've added parsing of the sample metadata to the scraper; it should yield some interesting information.

rvosa · 2020-03-14T12:59:04Z

I've been having trouble getting @wwydmanski's scraper to work (it errors at the end of the first page because the DOM seems to have changed). @melkebir's scraper does work. Both were run on a MacBook with macOS 10.14.6.

wwydmanski · 2020-03-14T13:07:04Z

@rvosa maybe it's OS dependent? I've tested it only on Windows 10.

trvrb · 2020-03-15T01:31:00Z

I'll leave this issue open for discussion. If you successfully download gisaid_cov2020_sequences.fasta from GISAID then the merged #59 should make preparation of sequences.fasta straightforward. You can run

./scripts/normalize_gisaid_fasta.sh data/gisaid_cov2020_sequences.fasta data/sequences.fasta

and then just proceed with snakemake -p or nextstrain build. We've done additional curation on top of GISAID's but this is all visible in the metadata.tsv file.

I don't understand what's going on with some people being able to see a "Download" button and others not able to. I'd suggest to continue to contact GISAID support about this.

brianpardy · 2020-03-15T14:59:54Z

Thank you for the merge, @trvrb!

As another followup to this, users running local nextstrain/ncov instances based on the normalized GISAID fasta download may notice inconsistencies in their local results vs those on the nextstrain.org site. Occasionally sequences released on GISAID are later withdrawn or set as non-public, at which point they no longer appear in the gisaid_cov2020_sequences.fasta file provided by GISAID. Nextstrain itself appears to be using an independent archive that does not always immediately reflect the removal of sequences from GISAID (though it has in the past).
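
A quick way to see such differences locally is to compare the names in the normalized FASTA with the strain names in metadata.tsv. The following Python sketch assumes metadata.tsv is tab-separated with a `strain` column; the function names are hypothetical.

```python
import csv


def fasta_names(lines):
    """Collect sequence names from the header lines of a FASTA file."""
    return {line[1:].strip() for line in lines if line.startswith(">")}


def metadata_strains(lines, column="strain"):
    """Collect strain names from a tab-separated metadata table."""
    # Assumption: the table has a header row containing a 'strain' column.
    reader = csv.DictReader(lines, delimiter="\t")
    return {row[column] for row in reader}


def compare(fasta_lines, metadata_lines):
    """Return (names only in metadata, names only in FASTA), each sorted."""
    in_fasta = fasta_names(fasta_lines)
    in_meta = metadata_strains(metadata_lines)
    return sorted(in_meta - in_fasta), sorted(in_fasta - in_meta)


if __name__ == "__main__":
    with open("data/sequences.fasta") as fasta, open("data/metadata.tsv", newline="") as meta:
        only_meta, only_fasta = compare(fasta, meta)
    print(len(only_meta), "strains in metadata.tsv but not in sequences.fasta")
    print(len(only_fasta), "strains in sequences.fasta but not in metadata.tsv")
```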

For example, the current GISAID download lacks many of the Guangdong sequences from March 9th currently visible on nextstrain.org/ncov.

tolot27 · 2020-03-15T17:53:02Z

> I don't understand what's going on with some people being able to see a "Download" button and others not able to. I'd suggest to continue to contact GISAID support about this.

It looks like GISAID made some correction after contacting them via E-Mail. Now I see the Download button again. 😃

abitrolly · 2020-03-16T13:14:32Z

Is it possible to submit this validator script to GISAID to improve the data quality on their side?

ZeweiSong · 2020-03-25T10:19:20Z

I cannot find the download-all button on 2020/3/25; maybe they just removed it? The only way I can check the sequences is by browsing, but that means downloading one record at a time.

Does anyone else have the same problem?

rvosa · 2020-03-25T16:28:12Z

The sequence availability issue is something that is problematic beyond nextstrain per se. Perhaps it makes sense if someone from the nextstrain core kept an eye on the activities towards data sharing that are being developed by the participants of the covid-19 biohackathon.

TrentBrick · 2020-03-25T19:51:25Z

ZeweiSong, you need to email them. They should enable it for you then. Not at all clear why this is the case -- super frustrating in fact. But this is what happened for me. (also don't expect them to email you back, just check again 24 hours later and see if the button appears).

melkebir · 2020-03-26T14:30:49Z

Seconding @TrentBrick's and @trvrb's messages: the best way forward is to contact GISAID and request access. Please do not use scrapers. With the increasing number of sequences and of interested users, this would essentially amount to a denial of service.

victorlin · 2020-03-26T14:35:43Z

Not sure why my comment was removed - calling the Javascript function that triggers the download should be just as costly as using the download button itself. You would still need access to the page in the first place.

rvosa · 2020-03-26T21:12:17Z

I'm gearing up to formulate a request for data access and sharing on behalf of the biohackathon (there's a special covid-19 edition starting soon). I asked GISAID on twitter but I don't think they're very active there. I've had some interaction via their issue tracker so I'll next try in that way.

Would it make sense to ask on behalf of (or with reference to) the nextstrain user community at the same time? Please let me know if I should do that.

The general idea is not to nag or complain. I'm sure they're very busy right now. Also, I imagine they are simply under existing agreements with data submitters that they have to comply with. However, maybe there are other ways in which they can meet their obligations and still accomplish data access with less friction. That will probably involve both technical implementation and social busywork. It seems to me that there are many people willing and able to help with both of these right now.

Something structural needs to improve that we mustn't try to address with screen scrapers and javascript backdoors. More and more researchers want to do good work with these data. It is part of GISAID's stated mission to enable that. We ought to work together to make that possible in an open and collaborative way.

abitrolly · 2020-03-27T05:38:33Z

@rvosa maybe they (GISAID) think that the data will be used in a malicious way? Because if not, then maybe there is insufficient funding and poor technical excellence to avoid DoS. E.g. setup memcache.

palatos · 2020-03-27T18:21:19Z

Is anyone else having trouble accessing GISAID right now? It was hard for me to create an account, but now that I have one the ncov tab just doesn't load. I'm not sure why it's so hard to obtain the sequences. Makes analyzing the data so much harder compared to the ones deposited in genbank.

vscooper · 2020-03-30T02:16:56Z

No trouble creating an account, but download requests keep throwing errors.

oneillkza · 2020-03-31T21:46:33Z

@palatos it's been up and down for me. Just keep re-trying. Fortunately the actual fasta download is pretty small and quick once you get in.

woson2020 · 2020-04-04T07:20:53Z

@brianpardy I can't see the button for downloading all genome sequences. How can I find it?

woson2020 · 2020-04-04T07:33:27Z

@xzhuo Are you able to download all genome sequences of ncov?

canholyavkin · 2020-04-04T08:11:33Z

> @brianpardy I can't see the button for all genome sequence download,how can I figure it out?

@woson2020, as suggested above, you'll need to request download access from GISAID. After you log in, you can send a message through the Contact page. They generally grant access within 1-2 days.

woson2020 · 2020-04-04T08:21:29Z

@CAC

I have already sent a message. Thank you for your help.

craic · 2020-04-11T15:36:13Z

I requested access yesterday and they enabled the download feature overnight. Just send them a polite note on their contact form describing why you need to access all the data. In my case I need it to validate a diagnostic against the latest isolates.

When you get the data you will need to do some basic cleanup. Some of the header lines have spaces in the ID e.g. 'Hong Kong' rather than Hong_Kong. Some sequences have control characters and there are a few other anomalies. I'll post my cleanup script later if I get the chance - it's just basic regexp stuff.
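
As a rough illustration of the kind of cleanup described above (this is not the script from the gist), the following Python sketch strips control characters from every line and replaces spaces in header lines with underscores. The underscore follows the 'Hong_Kong' example in this comment; earlier in the thread metadata.tsv was reported to store such names with the space removed, so use whichever convention matches your metadata. The function name is hypothetical.

```python
import re

# ASCII control characters except tab and newline (carriage return is included).
_CONTROL = re.compile(r"[\x00-\x08\x0b-\x1f\x7f]")


def clean_line(line: str) -> str:
    """Remove control characters; in header lines, replace spaces with '_'."""
    line = _CONTROL.sub("", line.rstrip("\r\n"))
    if line.startswith(">"):
        line = line.replace(" ", "_")
    return line


if __name__ == "__main__":
    # Hypothetical input lines, for illustration only.
    for raw in [">Hong Kong/sample1/2020\r\n", "ACG\x00T\x07\n"]:
        print(clean_line(raw))
```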

craic · 2020-04-11T15:54:16Z

Here is my cleanup script as a gist

https://gist.github.com/craic/790e57e3ea140797d66a9dccaaa098a2

woson2020 · 2020-04-11T16:05:05Z

Thank you very much for your reply. Last week I wrote a Python script using the selenium package (web automation that simulates clicks) to download all FASTA files from GISAID.

emmahodcroft · 2020-04-11T16:09:25Z

Hi all, just a reminder that scrapers are harmful to the functioning of GISAID. We and they ask kindly that you do not use them.

abitrolly · 2020-04-12T07:13:46Z

@emmahodcroft I can fix the problem with downloads and data quality if I understand what's going on on GISAID side, but I can not reach them. Do you have any contact with them?

emmahodcroft · 2020-04-12T09:23:27Z

No, I'm afraid my own lines of contact are through the same portals that are available publicly.

animesh-workplace · 2020-04-21T10:33:34Z

I don't understand why USA/WA1/2020 was used to determine haplotype status in the script annotate-haplotype-status.py.

ps120195 · 2020-04-25T09:32:19Z

I am not able to see the "Download Acknowledgement Table ... here" link, which was there earlier, over the last couple of days. The site is also frequently throwing an internal error. I messaged them through the contact form, but the issue is still not resolved. Please help me with this.

CAC · 2020-04-26T15:47:47Z

Can you unsubscribe me from your mailing list?

abitrolly · 2020-04-27T11:22:32Z

> Can you unsubscribe me from your mailing list?

Click this --> https://github.com/notifications/unsubscribe-auth/AAAGNA3EGRM5CXNOPF6PAYDROKUPRANCNFSM4KX3P2OA

Buy me a coffee if it works. ,)

trvrb · 2021-01-21T21:00:13Z

Closing this issue. metadata.tsv and sequences.fasta are now directly available through GISAID.org.

abitrolly · 2021-01-22T07:33:54Z

@trvrb that's some good news. I miss the "data journalism" that could reveal the full story. )

brianpardy pushed a commit to brianpardy/ncov that referenced this issue Feb 25, 2020

Add normalize_gisaid_fasta.sh for issue nextstrain#53

8e896c8

brianpardy pushed a commit to brianpardy/ncov that referenced this issue Feb 25, 2020

Add normalize_gisaid_fasta.sh for issue nextstrain#53

d3c90c7

trvrb added a commit that referenced this issue Mar 15, 2020

Merge pull request #59 from brianpardy/gisaid_fasta

4ee884d

Add normalize_gisaid_fasta.sh for issue #53

oneillkza added a commit to oneillkza/ncov that referenced this issue Mar 31, 2020

Explain how to use GISAID data

132ec9a

Add detailed steps for how to obtain and normalize GISAID fasta, as resolved in nextstrain#53 and nextstrain#59 .

oneillkza mentioned this issue Mar 31, 2020

Explain how to use GISAID data #323

Closed

brianpardy mentioned this issue Apr 10, 2020

handle >hCoV/Kuwait/KU001/2020 sequence in GISAID fasta #352

Closed

trvrb closed this as completed Jan 21, 2021
