Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add FASTQ_OPT_NCBI option for parsing of NCBI's SRA data. #1325

Merged
merged 2 commits into from
Sep 22, 2021

Conversation

jkbonfield
Copy link
Contributor

This variant of FASTQ has the read name as the second field on the name line, just to be awkward.

@daviesrob daviesrob self-assigned this Aug 31, 2021
Copy link
Member

@daviesrob daviesrob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we should be specifically calling out the NCBI here. The SRA toolkit fastq-dump program has a --origfmt option that outputs the original read names if they've been stored, so it is in theory possible to work around the problem using NCBI tools. In practice it seems that their archive doesn't store the names anyway, so neither fastq-dump nor this PR can get them back.

The ENA on the other hand does store the read names, at least for data directly submitted to them, so this option is more useful for their data. Compare for example ERR4204010 from the NCBI and ENA.

Although I'm not entirely sure what we could call the option instead. Maybe FASTQ_OPT_SECOND_NAME or FASTQ_OPT_NAME2?

sam.c Outdated
Comment on lines 3718 to 3736

// Reverse the NCBI strangeness of putting the run_name.number before
// the read name.
i = 0;
char *name = x->name.s+1;
if (x->ncbi_names) {
char *cp = strchr(x->name.s, ' ');
if (cp) {
*cp = '@';
i = cp - x->name.s;
name = cp+1;
}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A minor point, but it might be good to move this after the x->nprefix test below. As it is, it will "fix" files with lines that don't have the correct prefix, but do start with a leading space, which may or may not be a good thing...

Also, how general do we want this. Should we be looking for multiple spaces or tabs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could rename it FASTQ_OPT_SRA. I don't really like second name as it's long winded and like it or not this is an SRA specific format. ENA only use it because SRA use it.

Copy link
Contributor Author

@jkbonfield jkbonfield Sep 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I'll take that back, I see what you mean. Maybe ERA fixed SRA format by still including the data rather than totally discarding it (which would be tragic as it removes the ability to dedup). I'm not really sure. SRA came first though so I mentally peg this as their error. I'll ponder.

@jkbonfield
Copy link
Contributor Author

The code has been revised.

  • Ordering of code adjusted as suggested. It now also recognises tab and any number of space/tab. (Although the specific erroneous SRA format-within-format uses 1 space only).

  • Renamed the user-visible option to "fastq_name2" and enum to FASTQ_OPT_NAME2.

  • Renamed the internal variable to "sra_names". They have to carry the blame somewhere :-)

jkbonfield and others added 2 commits September 22, 2021 12:41
This variant of FASTQ has the read name as the second field on the
name line, just to be awkward.
@daviesrob
Copy link
Member

Looks OK. I've squashed & rebased, plus added a couple of simple tests to exercise the option in a new commit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants