GISAID all-sequences fasta should be directly usable by nextstrain/ncov #53

Closed
brianpardy opened this issue Feb 19, 2020 · 63 comments

@brianpardy

Filing this as an issue as suggested by @emmahodcroft:

GISAID provides an all-sequences download button for SARS-CoV-2 sequences. The provided file is not directly usable as a sequences.fasta file in nextstrain/ncov because of several issues in the GISAID file:

  1. There is at least one duplicate sequence name (Italy/INMI1/2020) that causes errors in augur filter, and often other duplicates exist before they are renamed on GISAID
  2. There are several sequences with "Hong Kong" in their names that cause errors in augur filter due to the sequence name being truncated at whitespace
  3. The sequence names are appended with the EPI_ISL identifier and a datestamp, which are not stripped when loaded and cause mismatches with sequence names in metadata.tsv
  4. The sequence names are prepended with 'BetaCoV' or 'BetaCov' which is not stripped when loaded and causes mismatches with metadata.tsv

I suggest a new bash script in ncov/scripts/ that would optionally normalize the GISAID all-sequences download file so that users can use it directly without a need to manually remove duplicates or edit sequence names or maintain their own automated pipeline to generate data/sequences.fasta.
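
As an illustration of items 2 to 4 above, here is a minimal sketch of header normalization in shell and awk. It is not the script from this thread. The function name `normalize_headers`, the assumption that the EPI_ISL identifier and date follow the name after a `|` separator, and the example record name in the comment are all hypothetical. Spaces are removed outright rather than replaced, which matches how the names are later reported to appear in metadata.tsv.

```shell
# Sketch only: assumes headers shaped like
#   >BetaCoV/Hong Kong/EXAMPLE1/2020|EPI_ISL_000000|2020-02-01
# (a made-up record; the "|" separator is an assumption).
normalize_headers() {
  awk '
    /^>/ {
      name = substr($0, 2)              # drop the leading ">"
      sub(/[|].*$/, "", name)           # strip everything from the first "|" onward
      sub(/^BetaCo[Vv]\//, "", name)    # strip a "BetaCoV/" or "BetaCov/" prefix
      gsub(/ /, "", name)               # remove embedded spaces
      print ">" name
      next
    }
    { print }                           # sequence lines pass through unchanged
  '
}
```

It could be called as `normalize_headers < data/gisaid_cov2020_sequences.fasta > data/sequences.fasta`. Duplicate names (item 1) are not handled by this sketch.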

I also suggest automating this in the Snakefile if possible, but I'm not sure how. If no data/sequences.fasta file exists, but data/gisaid_cov2020_sequences.fasta exists, run scripts/normalize-gisaid-fasta.sh before the rest of the pipeline.
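
The fallback described above could be sketched as a small shell wrapper rather than a Snakefile rule. The function name `prepare_sequences` and its argument order are assumptions made for illustration; the normalizer is whatever command ends up doing the cleanup.

```shell
# Sketch only: run the normalizer when the output is missing and the raw download exists.
prepare_sequences() {
  raw=$1          # e.g. data/gisaid_cov2020_sequences.fasta
  out=$2          # e.g. data/sequences.fasta
  normalizer=$3   # command that takes <input> <output>, e.g. the proposed script
  if [ ! -e "$out" ] && [ -e "$raw" ]; then
    "$normalizer" "$raw" "$out"
  fi
}
```

An existing `data/sequences.fasta` is never overwritten, so a manually curated file stays in place.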

I can likely have a base script written to do the normalization by this evening if there is interest.

Thank you for considering this idea.

@emmahodcroft
Member

Raised as an idea after issue #52 and the thought that more such error issues may come our way from those using the file download.

@emmahodcroft
Member

Thanks @brianpardy , I've linked to this on our internal Nextstrain convos so we can consider :)

@brianpardy
Author

Hi @emmahodcroft, I did go ahead and create a simple script that works on my local install using the current gisaid_cov2020_sequences.fasta file. I committed it to my fork with brianpardy@b401051

It uses only cat, sed, awk, and grep. Call as:
scripts/normalize_gisaid_fasta.sh data/gisaid_cov2020_sequences.fasta data/sequences.fasta

@brianpardy
Author

It is certainly the wrong way to do it but I also added a Snakefile rule called 'gisaid' that will run this script to create sequences.fasta from gisaid_cov2020_sequences.fasta. I don't know enough to change the Snakefile to replace the download rule with the gisaid rule, but calling "snakemake gisaid" on my copy will generate sequences.fasta, and "snakemake -f gisaid" will regenerate it when a new download from gisaid is placed in data/.

@emmahodcroft
Member

Thanks @brianpardy ! We are looking into some possible solutions here. We are going to try and make this work better, but we'll need to iron out some details on how to organise that :)

@brianpardy
Author

That sounds good to me, @emmahodcroft. If I helped spur some thought, I'm happy. I made one more change to my script for the embedded-spaces item: the "Hong Kong" sequences appear in metadata.tsv with the space removed, not converted to an underscore, so they could not be matched. I fixed that. I'll also add some error checking for calls that do not name the files on the command line. If the team elects to use this, great; if not, I still appreciate having the issue considered.
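
A mismatch like the "Hong Kong" one can be spotted by listing FASTA record names that do not appear in the metadata. The sketch below assumes the strain name is the first tab-separated column of metadata.tsv and that the file has a header row; the function name `unmatched_names` is hypothetical.

```shell
# Sketch only: print FASTA record names with no matching entry in column 1 of a metadata TSV.
unmatched_names() {
  fasta=$1
  metadata=$2
  awk -F '\t' '
    NR == FNR { if (FNR > 1) known[$1] = 1; next }   # first file: metadata, skipping its header row
    /^>/ { name = substr($0, 2); if (!(name in known)) print name }
  ' "$metadata" "$fasta"
}
```

Running `unmatched_names data/sequences.fasta data/metadata.tsv` after normalization should print nothing if every name matches.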

@brianpardy
Author

As @jameshadfield mentioned this issue in #57 I should add that as of right now the script I offered does not work perfectly with the current all sequences file from GISAID: the three new Hong Kong sequences EPI_ISL_412028, EPI_ISL_412029, and EPI_ISL_412030 have duplicate strain names to earlier submissions EPI_ISL_408975, EPI_ISL_409020, and EPI_ISL_409024. The awk statement in my normalize_gisaid_fasta script keeps only the first instance of a duplicate strain name and discards all additional instances. When run, my script will currently only keep the earlier, partial Spike glycoprotein sequences and will discard the newer, complete genomes. For the moment I am manually removing those three partial sequences from the GISAID download before running my script.

I wanted to keep the script simple and obvious, but it could probably be extended to keep the longest sequence found instead of the first, at the expense of readability and simplicity.
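
For illustration, keeping the longest record per name might look like the awk sketch below. It is an assumption about how such an extension could work, not the code in the script discussed here; the function name `keep_longest` is hypothetical. It holds the best sequence per name in memory, writes each sequence on a single line, and keeps the first record when lengths tie.

```shell
# Sketch only: among records sharing a name, keep the one with the longest sequence.
keep_longest() {
  awk '
    function store() {
      if (name == "") return                       # nothing read yet
      if (!(name in best_len)) order[++n] = name   # remember first-seen order
      if (!(name in best_len) || length(seq) > best_len[name]) {
        best_len[name] = length(seq)
        best_seq[name] = seq
      }
    }
    /^>/ { store(); name = substr($0, 2); seq = ""; next }
    { seq = seq $0 }                               # join wrapped sequence lines
    END {
      store()
      for (i = 1; i <= n; i++) print ">" order[i] "\n" best_seq[order[i]]
    }
  '
}
```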

brianpardy pushed a commit to brianpardy/ncov that referenced this issue Feb 25, 2020
brianpardy pushed a commit to brianpardy/ncov that referenced this issue Feb 25, 2020
@brianpardy
Author

Sorry about those extra commits showing up on the issue log, I'm learning how to deal with branching properly so I can submit a pull request and I was not expecting that quite yet. Please ignore the first one.

I updated my script to resolve this issue. I added a 3rd commandline parameter for minimum length that defaults to 15000. I am calling it from my Snakefile using params.min_length and it is working fine. This resolves, for now, the problem of normalize_gisaid_fasta.sh keeping the first appearing, shorter sequence, instead of the later appearing, complete sequence, when sequence names collide.
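
A minimum-length filter of this kind can be sketched in awk as follows. The function name `filter_min_length` is hypothetical, and treating a sequence of exactly the minimum length as acceptable is a choice made here, not something stated in the thread. If such a filter runs before duplicates are removed, the short partial record is already gone when names are compared.

```shell
# Sketch only: drop FASTA records whose sequence is shorter than a minimum length.
filter_min_length() {
  min_length=${1:-15000}   # default taken from the value mentioned above
  awk -v min="$min_length" '
    function emit() { if (header != "" && length(seq) >= min) print header "\n" seq }
    /^>/ { emit(); header = $0; seq = ""; next }
    { seq = seq $0 }       # join wrapped sequence lines before measuring
    END { emit() }
  '
}
```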

I set this up on a clean branch on my fork that should merge cleanly if the team accepts the pull request I am about to submit for commit brianpardy@d3c90c7

No offense taken if unwanted.

@emmahodcroft
Member

Hi @brianpardy , thank you for the work! Yes, these are the same issues we are running into on our end. We're still trying to figure out the best way to deal with this both for public users and for our own internal builds (which need to be aligned between all of us who update Nextstrain, etc, so are a bit more complicated). We're all a bit short on time at the moment unfortunately, so progress is slow - sorry!

@xzhuo

xzhuo commented Mar 5, 2020

It may sound silly but I have to ask: where is the "all-sequences download button for SARS-CoV-2 sequences" in GISAID? I could not find it...

@emmahodcroft
Member

On GISAID, in the EpiCoV tab - bottom right 'Download' button, under the table.

@emmahodcroft
Member

You will need a GISAID account to do this.

@xzhuo

xzhuo commented Mar 5, 2020

I registered. I can see each entry with a "download metadata" and a "download fasta" button. But I could not find a button to download all of them.

@brianpardy
Author

You need to be on the main 'browse' screen that lists all of the deposited sequences, not the individual-sample screen that contains the 'download metadata' and 'download fasta' buttons. The button is just labeled "Download" with an icon on it, to the right of the screen paging tools.

@xzhuo

xzhuo commented Mar 5, 2020

Thank you very much! Do you mean the Excel table? I can download an Excel table with all the entries by clicking "Download Acknowledgement Table for all submissions here". But I still cannot get a FASTA file...

@brianpardy
Author

brianpardy commented Mar 5, 2020

It looks like the download button is not appearing for you at all. Are you able to scroll your screen to the right? The page I see includes a download button as shown below.

@xzhuo

xzhuo commented Mar 5, 2020

No, I don't have that button. Thank you both very much for replying! Now I have to try something else.

@melkebir

@xzhuo: Same issue for me, GISAID removed the download button. Did you figure out an alternative solution?

@xzhuo

xzhuo commented Mar 12, 2020

Not yet. A crawler?

@pedroelbanquero

Why not add the FASTA files for SARS, MERS, and the other members of the family? When does it count as a new virus?

@wwydmanski

@pedroelbanquero sure, it can be done at some point. For now I've added parsing of the sample metadata to the scraper; it should yield some interesting information.

@rvosa
Contributor

rvosa commented Mar 14, 2020

I've been having trouble getting @wwydmanski's scraper to work (it errors at the end of the first page because the DOM seems to have changed). @melkebir's scraper does work. Both were run on a MacBook with macOS 10.14.6.

@wwydmanski

@rvosa maybe it's OS dependent? I've tested it only on Windows 10.

trvrb added a commit that referenced this issue Mar 15, 2020
Add normalize_gisaid_fasta.sh for issue #53
@trvrb
Member

trvrb commented Mar 15, 2020

I'll leave this issue open for discussion. If you successfully download gisaid_cov2020_sequences.fasta from GISAID, then the merged #59 should make preparation of sequences.fasta straightforward. You can run

./scripts/normalize_gisaid_fasta.sh data/gisaid_cov2020_sequences.fasta data/sequences.fasta

and then just proceed with snakemake -p or nextstrain build. We've done additional curation on top of GISAID's but this is all visible in the metadata.tsv file.

I don't understand what's going on with some people being able to see a "Download" button and others not able to. I'd suggest to continue to contact GISAID support about this.

@brianpardy
Author

Thank you for the merge, @trvrb!

As another followup to this, users running local nextstrain/ncov instances based on the normalized GISAID fasta download may notice inconsistencies in their local results vs those on the nextstrain.org site. Occasionally sequences released on GISAID are later withdrawn or set as non-public, at which point they no longer appear in the gisaid_cov2020_sequences.fasta file provided by GISAID. Nextstrain itself appears to be using an independent archive that does not always immediately reflect the removal of sequences from GISAID (though it has in the past).

For example, the current GISAID download lacks many of the Guangdong sequences from March 9th currently visible on nextstrain.org/ncov.

@tolot27
Contributor

tolot27 commented Mar 15, 2020

I don't understand what's going on with some people being able to see a "Download" button and others not able to. I'd suggest to continue to contact GISAID support about this.

It looks like GISAID made a correction after I contacted them by email. Now I see the Download button again. 😃

@abitrolly

Is it possible to submit this validator script to GISAID to improve the data quality on their side?

@ZeweiSong

As of 2020/3/25 I cannot find the download-all button. Maybe they just removed it? The only way I can check the sequences is by browsing, but that means downloading one record at a time.

Does anyone else have the same problem?

@rvosa
Contributor

rvosa commented Mar 25, 2020

The sequence availability issue is something that is problematic beyond nextstrain per se. Perhaps it makes sense if someone from the nextstrain core kept an eye on the activities towards data sharing that are being developed by the participants of the covid-19 biohackathon.

@TrentBrick

ZeweiSong, you need to email them. They should enable it for you then. Not at all clear why this is the case -- super frustrating in fact. But this is what happened for me. (also don't expect them to email you back, just check again 24 hours later and see if the button appears).

@melkebir

Seconding @TrentBrick and @trvrb messages -- best way forward is to contact GISAID and request access. Please do not use scrapers -- with increasing number of sequences and number of interested users this would essentially amount to a denial of service.

@victorlin
Member

Not sure why my comment was removed - calling the Javascript function that triggers the download should be just as costly as using the download button itself. You would still need access to the page in the first place.

@rvosa
Contributor

rvosa commented Mar 26, 2020

I'm gearing up to formulate a request for data access and sharing on behalf of the biohackathon (there's a special covid-19 edition starting soon). I asked GISAID on twitter but I don't think they're very active there. I've had some interaction via their issue tracker so I'll next try in that way.

Would it make sense to ask on behalf of (or with reference to) the nextstrain user community at the same time? Please let me know if I should do that.

The general idea is not to nag or complain. I'm sure they're very busy right now. Also, I imagine they are simply under existing agreements with data submitters that they have to comply with. However, maybe there are other ways in which they can meet their obligations and still accomplish data access with less friction. That will probably involve both technical implementation and social busywork. It seems to me that there are many people willing and able to help with both of these right now.

Something structural needs to improve that we mustn't try to address with screen scrapers and javascript backdoors. More and more researchers want to do good work with these data. It is part of GISAID's stated mission to enable that. We ought to work together to make that possible in an open and collaborative way.

@abitrolly

abitrolly commented Mar 27, 2020

@rvosa maybe they (GISAID) think that the data will be used in a malicious way? If not, then maybe they lack the funding and technical expertise to avoid DoS, e.g. by setting up memcache.

@palatos

palatos commented Mar 27, 2020

Is anyone else having trouble accessing GISAID right now? It was hard for me to create an account, but now that I have one the ncov tab just doesn't load. I'm not sure why it's so hard to obtain the sequences. Makes analyzing the data so much harder compared to the ones deposited in genbank.

@vscooper

No trouble creating an account, but download requests keep throwing errors.

oneillkza added a commit to oneillkza/ncov that referenced this issue Mar 31, 2020
Add detailed steps for how to obtain and normalize GISAID fasta, as resolved in nextstrain#53 and nextstrain#59 .
@oneillkza

@palatos it's been up and down for me. Just keep re-trying. Fortunately the actual fasta download is pretty small and quick once you get in.

@woson2020

@brianpardy I can't see the button for downloading all genome sequences. How can I get it?

@woson2020

@xzhuo Are you able to download all the genome sequences of ncov?

@canholyavkin

@brianpardy I can't see the button for downloading all genome sequences. How can I get it?

@woson2020, as suggested above, you'll need to request download access from GISAID. After you log in, you can send a message through the Contact page. They generally grant access within 1-2 days.

@woson2020

I have already sent a message. Thank you for your help.

@craic

craic commented Apr 11, 2020

I requested access yesterday and they enabled the download feature overnight. Just send them a polite note on their contact form describing why you need to access all the data. In my case I need it to validate a diagnostic against the latest isolates.

When you get the data you will need to do some basic cleanup. Some of the header lines have spaces in the ID e.g. 'Hong Kong' rather than Hong_Kong. Some sequences have control characters and there are a few other anomalies. I'll post my cleanup script later if I get the chance - it's just basic regexp stuff.
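
One piece of that cleanup, removing control characters, can be sketched with `tr`. This is not the gist linked in this thread; the function name `strip_control_chars` is hypothetical, and which characters count as unwanted is an assumption. Tabs and newlines are kept, and carriage returns are dropped along with the other control characters.

```shell
# Sketch only: delete ASCII control characters except tab (\011) and newline (\012).
strip_control_chars() {
  LC_ALL=C tr -d '\000-\010\013-\037\177'
}
```

It reads standard input and writes standard output, for example `strip_control_chars < raw.fasta > clean.fasta`, where both file names are placeholders.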

@craic

craic commented Apr 11, 2020

Here is my cleanup script as a gist

https://gist.github.com/craic/790e57e3ea140797d66a9dccaaa098a2

@woson2020

woson2020 commented Apr 11, 2020 via email

@emmahodcroft
Member

Hi all, just a reminder that scrapers are harmful to the functioning of GISAID. We and they ask kindly that you do not use them.

@abitrolly

@emmahodcroft I can fix the problem with downloads and data quality if I understand what's going on on GISAID side, but I can not reach them. Do you have any contact with them?

@emmahodcroft
Member

No, I'm afraid my own lines of contact are through the same portals that are available publicly.

@animesh-workplace

I don't understand why USA/WA1/2020 is used to determine haplotype status in the script annotate-haplotype-status.py.

@ps120195

For the last couple of days I have not been able to see the "Download Acknowledgement Table ... here" link, which was there earlier. The site is also frequently throwing internal errors. I messaged them through the contact form, but the issue is still not resolved. Please help me with this.

@CAC

CAC commented Apr 26, 2020 via email

@abitrolly

Can you unsubscribe me from your mailing list?

Click this --> https://github.com/notifications/unsubscribe-auth/AAAGNA3EGRM5CXNOPF6PAYDROKUPRANCNFSM4KX3P2OA

Buy me a coffee if it works. ,)

@trvrb
Member

trvrb commented Jan 21, 2021

Closing this issue. metadata.tsv and sequences.fasta are now directly available through GISAID.org.

@trvrb trvrb closed this as completed Jan 21, 2021
@abitrolly

@trvrb that's some good news. I miss the "data journalism" that could reveal the full story. )
