Add FASTQ_OPT_NCBI option for parsing of NCBI's SRA data. #1325

jkbonfield · 2021-08-26T08:57:09Z

This variant of FASTQ has the read name as the second field on the name line, just to be awkward.
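To make the layout concrete, here is a minimal sketch of pulling the second field out of such a name line. The helper name `second_field` and the sample line are made up for illustration; this is not htslib API, and it only handles a single space as the separator.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of htslib: given a FASTQ name line such as
 * "@RUN0001.1 original_read_name" (an invented example), return a pointer
 * to the second space-separated field, or NULL if there is only one field. */
static const char *second_field(const char *line) {
    const char *sp = strchr(line, ' ');   /* first space ends field one */
    if (!sp || sp[1] == '\0')
        return NULL;                      /* no second field present */
    return sp + 1;
}
```

Calling `second_field("@RUN0001.1 original_read_name")` yields a pointer to `original_read_name`; a line with a single field yields NULL.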

daviesrob

I'm not sure if we should be specifically calling out the NCBI here. The SRA toolkit fastq-dump program has a --origfmt option that outputs the original read names if they've been stored, so it is in theory possible to work around the problem using NCBI tools. In practice it seems that their archive doesn't store the names anyway, so neither fastq-dump nor this PR can get them back.

The ENA on the other hand does store the read names, at least for data directly submitted to them, so this option is more useful for their data. Compare for example ERR4204010 from the NCBI and ENA.

Although I'm not entirely sure what we could call the option instead. Maybe FASTQ_OPT_SECOND_NAME or FASTQ_OPT_NAME2?

daviesrob · 2021-09-03T17:18:44Z

sam.c

+
+    // Reverse the NCBI strangeness of putting the run_name.number before
+    // the read name.
+    i = 0;
+    char *name = x->name.s+1;
+    if (x->ncbi_names) {
+        char *cp = strchr(x->name.s, ' ');
+        if (cp) {
+            *cp = '@';
+            i = cp - x->name.s;
+            name = cp+1;
+        }
+    }
+


A minor point, but it might be good to move this after the x->nprefix test below. As it is, it will "fix" files with lines that don't have the correct prefix, but do start with a leading space, which may or may not be a good thing...

Also, how general do we want this to be? Should we be looking for multiple spaces or tabs?
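The ordering concern can be reproduced outside htslib with a small sketch. It copies the rewrite step from the hunk onto a plain char buffer; the function names are hypothetical, and the prefix test is assumed to simply compare the character at offset `i` against the expected prefix character.

```c
#include <string.h>

/* Stand-alone copy of the rewrite step from the hunk, working on a plain
 * buffer instead of x->name. Returns the offset of the new '@', or 0 if
 * the line contains no space. */
static size_t swap_to_second_name(char *line) {
    char *cp = strchr(line, ' ');
    if (!cp)
        return 0;
    *cp = '@';                    /* the space becomes the new record prefix */
    return (size_t)(cp - line);
}

/* Assumed shape of the later prefix test: is there an '@' at offset i? */
static int has_prefix_at(const char *line, size_t i) {
    return line[i] == '@';
}
```

With the buffer `" read1"` (leading space, no `@`), `swap_to_second_name` returns 0 and turns the buffer into `"@read1"`, so a prefix test run afterwards passes even though the input line was malformed. Running the prefix test first avoids this.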

jkbonfield · 2021-09-07

I could rename it FASTQ_OPT_SRA. I don't really like "second name", as it's long-winded and, like it or not, this is an SRA-specific format. ENA only use it because SRA use it.

Actually I'll take that back, I see what you mean. Maybe ERA fixed SRA format by still including the data rather than totally discarding it (which would be tragic as it removes the ability to dedup). I'm not really sure. SRA came first though so I mentally peg this as their error. I'll ponder.

jkbonfield · 2021-09-15T16:10:32Z

The code has been revised.

Ordering of code adjusted as suggested. It now also recognises tabs, and any run of spaces and/or tabs, as the separator (although the specific erroneous SRA format-within-format uses a single space only).
Renamed the user-visible option to "fastq_name2" and enum to FASTQ_OPT_NAME2.
Renamed the internal variable to "sra_names". They have to carry the blame somewhere :-)
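As a rough sketch of the revised behaviour described above (prefix check first, then any run of spaces or tabs as the separator), something like the following would do. This is not the merged htslib code; `sra_second_name` and its signature are invented for illustration.

```c
#include <string.h>

/* Sketch only, not the merged htslib code. Returns a pointer to the
 * rewritten record start ("@second_name"), the unchanged line if it has a
 * single field, or NULL if the line lacks the expected prefix. */
static char *sra_second_name(char *line, char prefix) {
    if (line[0] != prefix)
        return NULL;                 /* reject bad lines before rewriting */

    size_t first = strcspn(line, " \t");       /* length of first field */
    if (line[first] == '\0')
        return line;                 /* only one field: leave untouched */

    size_t gap = strspn(line + first, " \t");  /* run of spaces and tabs */
    char *second = line + first + gap;
    if (*second == '\0')
        return line;                 /* trailing whitespace only */

    second[-1] = prefix;             /* last separator becomes the prefix */
    return second - 1;
}
```

Because the prefix is checked before anything is modified, a line that starts with whitespace is rejected rather than silently "fixed".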

daviesrob · 2021-09-22T13:30:49Z

Looks OK. I've squashed & rebased, plus added a couple of simple tests to exercise the option in a new commit.

daviesrob self-assigned this Aug 31, 2021

daviesrob reviewed Sep 3, 2021

jmarshall mentioned this pull request Sep 9, 2021

after latest pull, make install fails samtools/bcftools#1573

Closed

jkbonfield and others added 2 commits September 22, 2021 12:41

Add FASTQ_OPT_NAME2 option for parsing of SRA data.

a78b287

This variant of FASTQ has the read name as the second field on the name line, just to be awkward.

Add fastq_name2 option tests

4a79f25

daviesrob force-pushed the fastq_sra branch from e240d78 to 4a79f25 on September 22, 2021 13:29

daviesrob merged commit 4a79f25 into samtools:develop Sep 22, 2021

jmarshall mentioned this pull request Sep 28, 2021

Fix hts_idx_get_n_no_coor() out-of-bounds memory access #1335

Merged
