-
Notifications
You must be signed in to change notification settings - Fork 442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add FASTQ_OPT_NCBI option for parsing of NCBI's SRA data. #1325
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if we should be specifically calling out the NCBI here. The SRA toolkit fastq-dump
program has a --origfmt
option that outputs the original read names if they've been stored, so it is in theory possible to work around the problem using NCBI tools. In practice it seems that their archive doesn't store the names anyway, so neither fastq-dump
nor this PR can get them back.
The ENA on the other hand does store the read names, at least for data directly submitted to them, so this option is more useful for their data. Compare for example ERR4204010 from the NCBI and ENA.
Although I'm not entirely sure what we could call the option instead. Maybe FASTQ_OPT_SECOND_NAME
or FASTQ_OPT_NAME2
?
sam.c
Outdated
|
||
// Reverse the NCBI strangeness of putting the run_name.number before | ||
// the read name. | ||
i = 0; | ||
char *name = x->name.s+1; | ||
if (x->ncbi_names) { | ||
char *cp = strchr(x->name.s, ' '); | ||
if (cp) { | ||
*cp = '@'; | ||
i = cp - x->name.s; | ||
name = cp+1; | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A minor point, but it might be good to move this after the x->nprefix
test below. As it is, it will "fix" files with lines that don't have the correct prefix, but do start with a leading space, which may or may not be a good thing...
Also, how general do we want this. Should we be looking for multiple spaces or tabs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could rename it FASTQ_OPT_SRA. I don't really like second name as it's long winded and like it or not this is an SRA specific format. ENA only use it because SRA use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I'll take that back, I see what you mean. Maybe ERA fixed SRA format by still including the data rather than totally discarding it (which would be tragic as it removes the ability to dedup). I'm not really sure. SRA came first though so I mentally peg this as their error. I'll ponder.
The code has been revised.
|
This variant of FASTQ has the read name as the second field on the name line, just to be awkward.
e240d78
to
4a79f25
Compare
Looks OK. I've squashed & rebased, plus added a couple of simple tests to exercise the option in a new commit. |
This variant of FASTQ has the read name as the second field on the name line, just to be awkward.