-
Notifications
You must be signed in to change notification settings - Fork 82
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEATURE] Alphabet: Add the quality alphabet seqan3::phred94 #2290
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2290 +/- ##
=======================================
Coverage 98.18% 98.18%
=======================================
Files 260 261 +1
Lines 10809 10820 +11
=======================================
+ Hits 10613 10624 +11
Misses 196 196
Continue to review full report at Codecov.
|
doc/tutorial/sequence_file/index.md
Outdated
@@ -186,7 +186,7 @@ which by default is set to seqan3::sequence_file_input_default_traits_dna. | |||
We thereby assume that | |||
* you want to read the sequence into a std::vector over a seqan3::dna5 alphabet, | |||
* store the sequence names/id in a std::string and | |||
* the qualities in a std::vector over the seqan3::phred42 alphabet. | |||
* the qualities in a std::vector over the seqan3::phred42 alphabet. <- da default phred94 werden soll, ändert sich hier etwas? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ja aber das wird am besten in einem separatem PR gemacht :)
* | Sanger, Illumina 1.8+ short | seqan3::phred42 | [0 .. 41] | [0 .. 41] | ['!' .. 'J'] | Phred score in [0 .. 61] | | ||
* | Sanger, Illumina 1.8+ long | seqan3::phred63 | [0 .. 62] | [0 .. 62] | ['!' .. '_'] | Phred score in [0 .. 62] | | ||
* | Solexa, Illumina [1.0; 1.8[ | seqan3::phred68legacy | [-5 .. 62] | [0 .. 67] | [';' .. '~'] | Phred score in [-5 .. 62] | | ||
* | PacBio | seqan3::phred94 | [0 .. 62] | [0 .. 93] | ['!' .. '~'] | Phred score in [0 .. 93] | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a Pacbio format. It is actually also Sanger, Illumina 1.8+
but the full range. The other two are both short in that sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should I rename long
into middle
when I rename PacBio
into Sanger, Illumina 1.8+ full range
?
@@ -25,14 +26,15 @@ using gapped_qualified_dna_phred42 = seqan3::gapped<qualified_dna_phred42>; | |||
using qualified_qualified_gapped_dna_phred42_phred42 = seqan3::qualified<qualified_gapped_dna_phred42, seqan3::phred42>; | |||
using gapped_alphabet_variant_dna_phred42 = seqan3::gapped<seqan3::alphabet_variant<seqan3::dna4, seqan3::phred42>>; | |||
|
|||
// Some haessllihckeiten-tests | |||
// Some haessllihckeiten-tests <--??? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🙈 that was frustrated-me when I worked on this 2 years ago.
The review process clearly failed here 🤣
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't we already remove this in another PR? Are there multiple occurrences?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😆
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks already really good, thanks for the advice how to review that helped a lot. I only have some small things... :)
I could not answer every question, so I would leave that to the second reviewer.
* ['!' .. '~']. It represents the PacBio standard typically HiFi reads. Beyond these the typical range of Sanger, | ||
* Solexa or Illumina raw reads is (0 to 41), namely seqan3::phred42, and the larger score type (0 to 62), namely |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you change the sentence to something more like: For other (more typical) phred scores see...?
The Beyond start throw me off my reading. 😅
seqan3::debug_stream << another_phred.to_phred() << "\n"; // 75 | ||
// we need to cast to (int) for human readable console output |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But we don't cast here to int? Also, if you write a complete sentence in a comment, I would use a dot in the end and capitalize the first word.
seqan3::debug_stream << static_cast<int>(phred.to_phred()) << "\n"; // 2 | ||
seqan3::debug_stream << phred.to_char() << "\n"; // '#' | ||
seqan3::debug_stream << static_cast<int>(phred.to_rank()) << "\n"; // 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is to_rank and to_phred the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, yes. But not for every phred score alphabet, see seqan3::phred68legacy
. When you have a look in the Table of the all.hpp you will recognize that the score could be negative.
@@ -25,14 +26,15 @@ using gapped_qualified_dna_phred42 = seqan3::gapped<qualified_dna_phred42>; | |||
using qualified_qualified_gapped_dna_phred42_phred42 = seqan3::qualified<qualified_gapped_dna_phred42, seqan3::phred42>; | |||
using gapped_alphabet_variant_dna_phred42 = seqan3::gapped<seqan3::alphabet_variant<seqan3::dna4, seqan3::phred42>>; | |||
|
|||
// Some haessllihckeiten-tests | |||
// Some haessllihckeiten-tests <--??? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😆
* An overview of all the score formats and their encodings can be found here: | ||
* https://en.wikipedia.org/wiki/FASTQ_format#Encoding. | ||
* https://en.wikipedia.org/wiki/FASTQ_format#Encoding. <-- letztes abrufdatum hinzufügen? (last access 01.12.2020) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think adding the last access date is good practice, so you should do it.
* | ||
* The most distributed format is the *Sanger* or <I>Illumina 1.8+</I> format. | ||
* The most distributed format is the *Sanger* or <I>Illumina 1.8+</I> format. <-- ist Sanger noch most dirstributed? Und dann nicht eher PacBio? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it still holds true.
@MitraDarja After you approve, I will rebase and remove the changes for the test/snippet/alphabet/quality/phred63.cpp and put them into a seperate PR, because you where totaly right, we dont need to cast to int so I will remove these things for the other snippets aswell. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost done, just some really small things. ;)
Can you also check why the CI fails?
// tests various combinations of alphabet_variant and alphabet_tuple | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the empty line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work :)
Only some small 💅 of some documentation phrasing and one question.
CHANGELOG.md
Outdated
* Added `seqan3::phred94`, a quality type for PacBio Phred scores of HiFi reads | ||
([\#2290](https://github.com/seqan/seqan3/pull/2290)). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Added `seqan3::phred94`, a quality type for PacBio Phred scores of HiFi reads | |
([\#2290](https://github.com/seqan/seqan3/pull/2290)). | |
* Added `seqan3::phred94`, a quality type that represents the full Phred Score range (Sanger format) | |
and is used for PacBio Phred scores of HiFi reads ([\#2290](https://github.com/seqan/seqan3/pull/2290)). |
* If you want to store PacBio HiFi reads, we recommend to use seqan3::phred93, as these reads are stored in SAM/BAM | ||
* format and thus use the full range of the phred quality scores. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* If you want to store PacBio HiFi reads, we recommend to use seqan3::phred93, as these reads are stored in SAM/BAM | |
* format and thus use the full range of the phred quality scores. | |
* If you want to store PacBio HiFi reads, we recommend to use seqan3::phred94, as these | |
* use the full range of the phred quality scores. |
Illumina is also stored in SAM/BAM so this might be confusing :)
@@ -21,7 +21,7 @@ | |||
namespace seqan3 | |||
{ | |||
|
|||
/*!\brief Quality type for traditional Sanger and modern Illumina Phred scores (full range). | |||
/*!\brief Quality type for traditional Sanger and modern Illumina Phred scores (full range). <-- ist 93 nicht full range? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/*!\brief Quality type for traditional Sanger and modern Illumina Phred scores (full range). <-- ist 93 nicht full range? | |
/*!\brief Quality type for traditional Sanger and modern Illumina Phred scores. |
I would just remove it for now and you can decide in the "Improve Docu PR" if you add something here again that is more helpful than "full range" or "middle range"
// ------------------------------------------------------------------ | ||
// phred94 | ||
// ------------------------------------------------------------------ | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// ------------------------------------------------------------------ | |
// phred94 | |
// ------------------------------------------------------------------ |
Since there is nothing else in this file we don't need to add such separations I think :)
* \details | ||
* | ||
* The phred94 quality alphabet represents the zero-based phred score range [0..93] mapped to the ASCII range | ||
* ['!' .. '~'] (Sanger, Illumina 1.8+ format). It represents the PacBio use case, typically HiFi reads. For Sanger and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* ['!' .. '~'] (Sanger, Illumina 1.8+ format). It represents the PacBio use case, typically HiFi reads. For Sanger and | |
* ['!' .. '~'] (Sanger, Illumina 1.8+ format). It is typically used for HiFi reads produced by PacBio. For Sanger and |
* Illumina phred scores of raw reads the range is (0 to 41), represented as seqan3::phred42, and the larger score type | ||
* (0 to 62), represented as seqan3::phred63. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Illumina phred scores of raw reads the range is (0 to 41), represented as seqan3::phred42, and the larger score type | |
* (0 to 62), represented as seqan3::phred63. | |
* Illumina phred scores of raw reads the range is typically (0 to 41), represented as seqan3::phred42. If you expect only slightly | |
* larger score types you can use seqan3::phred63 (0 to 62) which still has memory advantages when used with `seqan3::qualified`. |
check line length after applying the suggestion.
seqan3::phred94 another_phred{75}; | ||
seqan3::debug_stream << another_phred.to_phred() << "\n"; // 75 | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens when you construct a phred of a higher value than 93 ? e.g. 105.
Can you check what happens?
#include <seqan3/alphabet/quality/phred94.hpp> | ||
#include <seqan3/core/debug_stream.hpp> | ||
#include <seqan3/std/algorithm> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#include <seqan3/alphabet/quality/phred94.hpp> | |
#include <seqan3/core/debug_stream.hpp> | |
#include <seqan3/std/algorithm> | |
#include <seqan3/std/algorithm> | |
#include <seqan3/alphabet/quality/phred94.hpp> | |
#include <seqan3/core/debug_stream.hpp> |
Please clean the commit history if you want the commits to remain separate. Otherwise ping me to (sqaush) merge. |
Signed-off-by: Lydia Buntrock <[email protected]>
… test cases Signed-off-by: Lydia Buntrock <[email protected]>
Signed-off-by: Lydia Buntrock <[email protected]>
Signed-off-by: Lydia Buntrock <[email protected]>
Signed-off-by: Lydia Buntrock <[email protected]>
1d16c0b
to
d797403
Compare
Resolves seqan/product_backlog#267
The task 'Check if the tutorials need an update.' did not require any work, because there is no tutorial yet. I will add a tutorial when solving seqan/product_backlog#266.
For task 'Update the documentation on the quality landing page' I left three comments for the reviewer, which I will remove after discussion.
I splittet the PR in several commits and would recommend to review commit wise.