Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] Alphabet: Add the quality alphabet seqan3::phred94 #2290

Merged
merged 5 commits into from
Dec 8, 2020

Conversation

Irallia
Copy link
Contributor

@Irallia Irallia commented Dec 1, 2020

Resolves seqan/product_backlog#267

The task 'Check if the tutorials need an update.' did not require any work, because there is no tutorial yet. I will add a tutorial when solving seqan/product_backlog#266.
For task 'Update the documentation on the quality landing page' I left three comments for the reviewer, which I will remove after discussion.

I splittet the PR in several commits and would recommend to review commit wise.

@Irallia Irallia self-assigned this Dec 1, 2020
@codecov
Copy link

codecov bot commented Dec 1, 2020

Codecov Report

Merging #2290 (d797403) into master (066b704) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #2290   +/-   ##
=======================================
  Coverage   98.18%   98.18%           
=======================================
  Files         260      261    +1     
  Lines       10809    10820   +11     
=======================================
+ Hits        10613    10624   +11     
  Misses        196      196           
Impacted Files Coverage Δ
include/seqan3/alphabet/quality/phred42.hpp 100.00% <ø> (ø)
include/seqan3/alphabet/quality/phred63.hpp 100.00% <ø> (ø)
include/seqan3/alphabet/quality/phred94.hpp 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 066b704...d797403. Read the comment docs.

@Irallia Irallia requested review from a team and MitraDarja and removed request for a team December 2, 2020 08:45
@@ -186,7 +186,7 @@ which by default is set to seqan3::sequence_file_input_default_traits_dna.
We thereby assume that
* you want to read the sequence into a std::vector over a seqan3::dna5 alphabet,
* store the sequence names/id in a std::string and
* the qualities in a std::vector over the seqan3::phred42 alphabet.
* the qualities in a std::vector over the seqan3::phred42 alphabet. <- da default phred94 werden soll, ändert sich hier etwas?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ja aber das wird am besten in einem separatem PR gemacht :)

* | Sanger, Illumina 1.8+ short | seqan3::phred42 | [0 .. 41] | [0 .. 41] | ['!' .. 'J'] | Phred score in [0 .. 61] |
* | Sanger, Illumina 1.8+ long | seqan3::phred63 | [0 .. 62] | [0 .. 62] | ['!' .. '_'] | Phred score in [0 .. 62] |
* | Solexa, Illumina [1.0; 1.8[ | seqan3::phred68legacy | [-5 .. 62] | [0 .. 67] | [';' .. '~'] | Phred score in [-5 .. 62] |
* | PacBio | seqan3::phred94 | [0 .. 62] | [0 .. 93] | ['!' .. '~'] | Phred score in [0 .. 93] |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a Pacbio format. It is actually also Sanger, Illumina 1.8+ but the full range. The other two are both short in that sense

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should I rename long into middle when I rename PacBio into Sanger, Illumina 1.8+ full range?

@@ -25,14 +26,15 @@ using gapped_qualified_dna_phred42 = seqan3::gapped<qualified_dna_phred42>;
using qualified_qualified_gapped_dna_phred42_phred42 = seqan3::qualified<qualified_gapped_dna_phred42, seqan3::phred42>;
using gapped_alphabet_variant_dna_phred42 = seqan3::gapped<seqan3::alphabet_variant<seqan3::dna4, seqan3::phred42>>;

// Some haessllihckeiten-tests
// Some haessllihckeiten-tests <--???
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙈 that was frustrated-me when I worked on this 2 years ago.
The review process clearly failed here 🤣

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't we already remove this in another PR? Are there multiple occurrences?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😆

Copy link
Contributor

@MitraDarja MitraDarja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks already really good, thanks for the advice how to review that helped a lot. I only have some small things... :)
I could not answer every question, so I would leave that to the second reviewer.

Comment on lines 36 to 37
* ['!' .. '~']. It represents the PacBio standard typically HiFi reads. Beyond these the typical range of Sanger,
* Solexa or Illumina raw reads is (0 to 41), namely seqan3::phred42, and the larger score type (0 to 62), namely
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you change the sentence to something more like: For other (more typical) phred scores see...?
The Beyond start throw me off my reading. 😅

Comment on lines 13 to 14
seqan3::debug_stream << another_phred.to_phred() << "\n"; // 75
// we need to cast to (int) for human readable console output
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we don't cast here to int? Also, if you write a complete sentence in a comment, I would use a dot in the end and capitalize the first word.

Comment on lines 8 to 10
seqan3::debug_stream << static_cast<int>(phred.to_phred()) << "\n"; // 2
seqan3::debug_stream << phred.to_char() << "\n"; // '#'
seqan3::debug_stream << static_cast<int>(phred.to_rank()) << "\n"; // 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is to_rank and to_phred the same?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, yes. But not for every phred score alphabet, see seqan3::phred68legacy. When you have a look in the Table of the all.hpp you will recognize that the score could be negative.

@@ -25,14 +26,15 @@ using gapped_qualified_dna_phred42 = seqan3::gapped<qualified_dna_phred42>;
using qualified_qualified_gapped_dna_phred42_phred42 = seqan3::qualified<qualified_gapped_dna_phred42, seqan3::phred42>;
using gapped_alphabet_variant_dna_phred42 = seqan3::gapped<seqan3::alphabet_variant<seqan3::dna4, seqan3::phred42>>;

// Some haessllihckeiten-tests
// Some haessllihckeiten-tests <--???
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😆

* An overview of all the score formats and their encodings can be found here:
* https://en.wikipedia.org/wiki/FASTQ_format#Encoding.
* https://en.wikipedia.org/wiki/FASTQ_format#Encoding. <-- letztes abrufdatum hinzufügen? (last access 01.12.2020)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding the last access date is good practice, so you should do it.

*
* The most distributed format is the *Sanger* or <I>Illumina 1.8+</I> format.
* The most distributed format is the *Sanger* or <I>Illumina 1.8+</I> format. <-- ist Sanger noch most dirstributed? Und dann nicht eher PacBio?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it still holds true.

@Irallia Irallia requested a review from MitraDarja December 4, 2020 13:54
@Irallia
Copy link
Contributor Author

Irallia commented Dec 4, 2020

@MitraDarja After you approve, I will rebase and remove the changes for the test/snippet/alphabet/quality/phred63.cpp and put them into a seperate PR, because you where totaly right, we dont need to cast to int so I will remove these things for the other snippets aswell.
I made also a bigger change for the table in the all.hpp. I hope that makes everything a bit clearer. I will adjust the text in the documentation PR for seqan/product_backlog#266.
For test/unit/alphabet/composite/composite_integration_test.cpp I spoke with @smehringer and we decided not to add any tests here, because a combination with phred::94 is the same as a combination with phred::63.

Copy link
Contributor

@MitraDarja MitraDarja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost done, just some really small things. ;)
Can you also check why the CI fails?

test/snippet/alphabet/quality/phred94.cpp Show resolved Hide resolved
Comment on lines 22 to 23
// tests various combinations of alphabet_variant and alphabet_tuple

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the empty line?

Copy link
Contributor

@MitraDarja MitraDarja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. :)

@MitraDarja MitraDarja requested review from a team and smehringer and removed request for a team December 7, 2020 08:13
Copy link
Member

@smehringer smehringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work :)

Only some small 💅 of some documentation phrasing and one question.

CHANGELOG.md Outdated
Comment on lines 33 to 34
* Added `seqan3::phred94`, a quality type for PacBio Phred scores of HiFi reads
([\#2290](https://github.com/seqan/seqan3/pull/2290)).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Added `seqan3::phred94`, a quality type for PacBio Phred scores of HiFi reads
([\#2290](https://github.com/seqan/seqan3/pull/2290)).
* Added `seqan3::phred94`, a quality type that represents the full Phred Score range (Sanger format)
and is used for PacBio Phred scores of HiFi reads ([\#2290](https://github.com/seqan/seqan3/pull/2290)).

Comment on lines 71 to 72
* If you want to store PacBio HiFi reads, we recommend to use seqan3::phred93, as these reads are stored in SAM/BAM
* format and thus use the full range of the phred quality scores.
Copy link
Member

@smehringer smehringer Dec 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* If you want to store PacBio HiFi reads, we recommend to use seqan3::phred93, as these reads are stored in SAM/BAM
* format and thus use the full range of the phred quality scores.
* If you want to store PacBio HiFi reads, we recommend to use seqan3::phred94, as these
* use the full range of the phred quality scores.

Illumina is also stored in SAM/BAM so this might be confusing :)

@@ -21,7 +21,7 @@
namespace seqan3
{

/*!\brief Quality type for traditional Sanger and modern Illumina Phred scores (full range).
/*!\brief Quality type for traditional Sanger and modern Illumina Phred scores (full range). <-- ist 93 nicht full range?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/*!\brief Quality type for traditional Sanger and modern Illumina Phred scores (full range). <-- ist 93 nicht full range?
/*!\brief Quality type for traditional Sanger and modern Illumina Phred scores.

I would just remove it for now and you can decide in the "Improve Docu PR" if you add something here again that is more helpful than "full range" or "middle range"

Comment on lines 17 to 20
// ------------------------------------------------------------------
// phred94
// ------------------------------------------------------------------

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// ------------------------------------------------------------------
// phred94
// ------------------------------------------------------------------

Since there is nothing else in this file we don't need to add such separations I think :)

* \details
*
* The phred94 quality alphabet represents the zero-based phred score range [0..93] mapped to the ASCII range
* ['!' .. '~'] (Sanger, Illumina 1.8+ format). It represents the PacBio use case, typically HiFi reads. For Sanger and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* ['!' .. '~'] (Sanger, Illumina 1.8+ format). It represents the PacBio use case, typically HiFi reads. For Sanger and
* ['!' .. '~'] (Sanger, Illumina 1.8+ format). It is typically used for HiFi reads produced by PacBio. For Sanger and

Comment on lines 37 to 38
* Illumina phred scores of raw reads the range is (0 to 41), represented as seqan3::phred42, and the larger score type
* (0 to 62), represented as seqan3::phred63.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Illumina phred scores of raw reads the range is (0 to 41), represented as seqan3::phred42, and the larger score type
* (0 to 62), represented as seqan3::phred63.
* Illumina phred scores of raw reads the range is typically (0 to 41), represented as seqan3::phred42. If you expect only slightly
* larger score types you can use seqan3::phred63 (0 to 62) which still has memory advantages when used with `seqan3::qualified`.

check line length after applying the suggestion.

Comment on lines 12 to 18
seqan3::phred94 another_phred{75};
seqan3::debug_stream << another_phred.to_phred() << "\n"; // 75
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens when you construct a phred of a higher value than 93 ? e.g. 105.
Can you check what happens?

Comment on lines 1 to 3
#include <seqan3/alphabet/quality/phred94.hpp>
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/std/algorithm>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#include <seqan3/alphabet/quality/phred94.hpp>
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/std/algorithm>
#include <seqan3/std/algorithm>
#include <seqan3/alphabet/quality/phred94.hpp>
#include <seqan3/core/debug_stream.hpp>

@smehringer
Copy link
Member

Please clean the commit history if you want the commits to remain separate. Otherwise ping me to (sqaush) merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ALPHABET] Add the quality alphabet seqan3::phred94
4 participants