Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CIGAR string support to alignment IO #1192

Merged
merged 7 commits into from
Oct 9, 2019

Conversation

smehringer
Copy link
Member

@smehringer smehringer commented Jul 23, 2019

Blocked by #1194 see last commit, the alternative way to get the help page description does not cause the ICE, so I am not dependent on the ranges-v3 update.

Review commit by commit and note the following:

  • regarding commit 1: When the alingment IO was written, no cigar alphabet was present. This commit introduces the alphabet and replaces pair<char, uint32_t>.

  • regarding commit 2: This commit belongs to the refactoring of the first, but it did ONLY move the get_cigar_vector code above the get_cigar_string code to make use of it. Git does a horrible job in detecting this change (where actually there is none...). So I kept it separate to see the actual differences in the former commit.

  • regarding commit 3: We ignored hard clipping and stored soft clipping in a separate variable before because it was not needed. Now the CIGAR string should represent the exact input, with hard clipping, so the functions were adapted to return the complete cigar vector.

  • regarding commit 4: The parse_cigar/parse_binary_cigar functions were format specific and should not be free functions but member functions.

  • regarding commit 5: Support reading of CIGAR fields.

  • regarding commit 6: Support writing of CIGAR fields. Note that for BAM, more elaborate changes were needed to compute the ref_length from the CIGAR string (formerly obtained by simply querying the alignment length) because BAM writes out bin information for the BAM index that needs the ref_length information.

@smehringer smehringer force-pushed the alignmentIO_cigar_string branch 2 times, most recently from 09edc53 to 956eff3 Compare July 23, 2019 07:38
@smehringer smehringer requested a review from marehr July 23, 2019 07:39
@smehringer smehringer force-pushed the alignmentIO_cigar_string branch from 956eff3 to bc4c356 Compare July 23, 2019 11:17
@smehringer smehringer requested review from joergi-w and removed request for marehr July 24, 2019 11:06
Copy link
Member

@joergi-w joergi-w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For commits 1–4. It's mostly about documentation that was not adapted to code changes.

include/seqan3/io/alignment_file/detail.hpp Outdated Show resolved Hide resolved
include/seqan3/io/alignment_file/detail.hpp Outdated Show resolved Hide resolved
include/seqan3/io/alignment_file/detail.hpp Outdated Show resolved Hide resolved
include/seqan3/io/alignment_file/detail.hpp Outdated Show resolved Hide resolved
include/seqan3/io/alignment_file/detail.hpp Outdated Show resolved Hide resolved
include/seqan3/io/alignment_file/format_sam.hpp Outdated Show resolved Hide resolved
include/seqan3/io/alignment_file/format_bam.hpp Outdated Show resolved Hide resolved
include/seqan3/io/alignment_file/format_sam.hpp Outdated Show resolved Hide resolved
include/seqan3/io/alignment_file/format_sam.hpp Outdated Show resolved Hide resolved
include/seqan3/io/alignment_file/format_sam.hpp Outdated Show resolved Hide resolved
Copy link
Member

@joergi-w joergi-w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For commits 5-6.

doc/tutorial/alignment_file/index.md Outdated Show resolved Hide resolved
include/seqan3/io/alignment_file/input.hpp Outdated Show resolved Hide resolved
include/seqan3/io/alignment_file/detail.hpp Outdated Show resolved Hide resolved
include/seqan3/io/alignment_file/format_bam.hpp Outdated Show resolved Hide resolved
include/seqan3/io/alignment_file/format_bam.hpp Outdated Show resolved Hide resolved
include/seqan3/io/alignment_file/format_sam.hpp Outdated Show resolved Hide resolved
Copy link
Member

@joergi-w joergi-w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more curiosity...

@smehringer smehringer force-pushed the alignmentIO_cigar_string branch from bc4c356 to 3887b07 Compare July 25, 2019 13:49
@smehringer smehringer requested a review from joergi-w July 25, 2019 13:51
Copy link
Member

@joergi-w joergi-w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already really good! However, a few small issues remain:

doc/tutorial/alignment_file/index.md Outdated Show resolved Hide resolved
include/seqan3/io/alignment_file/detail.hpp Outdated Show resolved Hide resolved
include/seqan3/io/alignment_file/format_bam.hpp Outdated Show resolved Hide resolved
include/seqan3/io/alignment_file/format_sam.hpp Outdated Show resolved Hide resolved
Copy link
Member

@joergi-w joergi-w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also please check the CI results (header test + doxygen).

include/seqan3/io/alignment_file/input_format_concept.hpp Outdated Show resolved Hide resolved
@smehringer smehringer force-pushed the alignmentIO_cigar_string branch from 3887b07 to cd42682 Compare July 26, 2019 10:31
@codecov
Copy link

codecov bot commented Jul 26, 2019

Codecov Report

Merging #1192 into master will decrease coverage by 0.01%.
The diff coverage is 97.72%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #1192      +/-   ##
=========================================
- Coverage   97.61%   97.6%   -0.02%     
=========================================
  Files         222     222              
  Lines        9007    8961      -46     
=========================================
- Hits         8792    8746      -46     
  Misses        215     215
Impacted Files Coverage Δ
include/seqan3/io/alignment_file/input.hpp 100% <ø> (ø) ⬆️
include/seqan3/io/alignment_file/output.hpp 100% <ø> (ø) ⬆️
include/seqan3/io/record.hpp 100% <ø> (ø) ⬆️
include/seqan3/io/alignment_file/format_bam.hpp 95.73% <100%> (+0.33%) ⬆️
include/seqan3/io/sequence_file/format_sam.hpp 87.5% <100%> (ø) ⬆️
include/seqan3/io/alignment_file/format_sam.hpp 98.6% <95.55%> (-0.38%) ⬇️
include/seqan3/io/alignment_file/detail.hpp 96.77% <97.36%> (-0.92%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 19ab4cd...1cf7756. Read the comment docs.

Copy link
Member

@joergi-w joergi-w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, looks good!

@ppericard
Copy link

Hi,
Any news about integrating this pull request?
I am very eager to be able to get the CIGAR string from alignment files to continue my developments.
Thanks for all the work ^_^
Pierre

@smehringer
Copy link
Member Author

Hi @ppericard,

great to hear that this module is in use :)

There is an internal compiler error in the range library that has been fixed in the new range-v3 release so I am just wating for some PRs that make our library compatible to the that release. I think it will be a matter of 2-3 weeks.

I will try raise the priority of this!

@smehringer smehringer force-pushed the alignmentIO_cigar_string branch 2 times, most recently from db2f70d to 3bc5b0a Compare September 16, 2019 13:17
@smehringer smehringer requested a review from rrahn September 17, 2019 06:53
include/seqan3/io/alignment_file/detail.hpp Outdated Show resolved Hide resolved
include/seqan3/io/alignment_file/detail.hpp Outdated Show resolved Hide resolved
include/seqan3/io/alignment_file/detail.hpp Outdated Show resolved Hide resolved
include/seqan3/io/alignment_file/detail.hpp Outdated Show resolved Hide resolved
include/seqan3/io/alignment_file/detail.hpp Outdated Show resolved Hide resolved
doc/tutorial/alignment_file/index.md Outdated Show resolved Hide resolved
include/seqan3/io/alignment_file/format_bam.hpp Outdated Show resolved Hide resolved
include/seqan3/io/alignment_file/output_format_concept.hpp Outdated Show resolved Hide resolved
include/seqan3/io/sequence_file/format_sam.hpp Outdated Show resolved Hide resolved
@smehringer smehringer force-pushed the alignmentIO_cigar_string branch 2 times, most recently from 2520271 to da4e5b1 Compare September 25, 2019 10:29
@smehringer smehringer requested a review from rrahn September 25, 2019 10:29
Copy link
Contributor

@rrahn rrahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor stuff 💅

include/seqan3/argument_parser/validators.hpp Outdated Show resolved Hide resolved
include/seqan3/io/alignment_file/detail.hpp Show resolved Hide resolved
include/seqan3/io/alignment_file/format_bam.hpp Outdated Show resolved Hide resolved
include/seqan3/io/alignment_file/format_bam.hpp Outdated Show resolved Hide resolved
include/seqan3/io/alignment_file/format_bam.hpp Outdated Show resolved Hide resolved
include/seqan3/io/alignment_file/format_bam.hpp Outdated Show resolved Hide resolved
include/seqan3/io/alignment_file/format_sam.hpp Outdated Show resolved Hide resolved
include/seqan3/io/alignment_file/output.hpp Show resolved Hide resolved
@smehringer smehringer force-pushed the alignmentIO_cigar_string branch from da4e5b1 to c2c7f62 Compare September 30, 2019 14:27
@smehringer smehringer requested a review from rrahn September 30, 2019 14:46
Copy link
Contributor

@rrahn rrahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that was the final round.

include/seqan3/io/alignment_file/detail.hpp Outdated Show resolved Hide resolved
* is based on sequence at the second position of the \p alignment pair,
* namely the query sequence.
* \attention Note that CIGAR elements (respectively by their CIGAR operation) are always related to one of the
* two sequences in a pairwise alignment. In this case, the resulting cigar_vector is based on sequence
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this. What information do you like to express?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That the cigar string is a relative thing... If it says "deletion of 2 bases", it means a deletion in the query not the reference, so you need to aware which is the reference sequence and which is the query sequence.

In the case of map_aligned_values_to_cigar_op it is query_char and reference_char.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. I find the description a bit complicated. I proposed some alternative description in a comment above. What do you think?

include/seqan3/io/alignment_file/format_sam.hpp Outdated Show resolved Hide resolved
@smehringer
Copy link
Member Author

@rrahn Please mark the others as resolved 🙏

@smehringer smehringer force-pushed the alignmentIO_cigar_string branch from c2c7f62 to 0de953e Compare October 1, 2019 06:01
@smehringer smehringer requested a review from rrahn October 1, 2019 12:49
Copy link
Contributor

@rrahn rrahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only one little 💅 thing for some documentation.

*
* The following alignment reference sequence on top and the query sequence at
* the bottom.
* \attention Note that CIGAR elements (respectively by their CIGAR operation) are always related to one of the two
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* \attention Note that CIGAR elements (respectively by their CIGAR operation) are always related to one of the two
The first sequence is always considered the reference sequence while the second one is considered the query sequence. The cigar operations regarding insertions and deletions are set accordingly.

Is it this, what you want to express?
If yes, could make the same/similar note above for the other function?

* is based on sequence at the second position of the \p alignment pair,
* namely the query sequence.
* \attention Note that CIGAR elements (respectively by their CIGAR operation) are always related to one of the
* two sequences in a pairwise alignment. In this case, the resulting cigar_vector is based on sequence
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. I find the description a bit complicated. I proposed some alternative description in a comment above. What do you think?

* in a pairwise alignment. In this case, the resulting cigar_vector
* is based on sequence at the second position of the \p alignment pair,
* namely the query sequence.
*
* ### Example:
* ### Theoretical Example:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unresolved?

* in a pairwise alignment. In this case, the resulting cigar_vector
* is based on sequence at the second position of the \p alignment pair,
* namely the query sequence.
* \attention Note that CIGAR elements (respectively by their CIGAR operation) are always related to one of the two
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that would be also a candidate not to miss out if you agree on changing the description as proposed.

@smehringer smehringer force-pushed the alignmentIO_cigar_string branch 2 times, most recently from dada220 to 2062407 Compare October 7, 2019 13:47
@smehringer smehringer requested a review from rrahn October 7, 2019 13:47
Copy link
Contributor

@rrahn rrahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

documentation stuff 💅

* \tparam alignment_type Must model the seqan3::tuple_like and must have std::tuple_size 2.
* Each tuple element must be a range over values comparable to seqan3::gap.
* \param alignment The alignment, represented by a pair of aligned sequences,
* to be transformed into CIGAR_vector based on the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CIGAR_vector
cigar_vector
cigar vector

From these three I would prefer the latter since it does not name a variable really. Also in the note block

@smehringer smehringer force-pushed the alignmentIO_cigar_string branch from 2062407 to 2abd6eb Compare October 8, 2019 07:14
@smehringer smehringer requested a review from rrahn October 8, 2019 07:14
Copy link
Contributor

@rrahn rrahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am truly sorry, I just saw that there is still inconsistent uses of CIGAR string and cigar string as well. It is ok to leave it as is, but if you want to change it please keep me a heads up. Otherwise I will just merge it in a 30 minutes or so.

@smehringer smehringer force-pushed the alignmentIO_cigar_string branch 2 times, most recently from fabc4bb to 654bb1c Compare October 8, 2019 12:16
@smehringer
Copy link
Member Author

@rrahn I also updated the changelog!

@smehringer smehringer force-pushed the alignmentIO_cigar_string branch from 654bb1c to 1cf7756 Compare October 9, 2019 06:24
@smehringer
Copy link
Member Author

@rrahn I think this can be merged now. I also rebased on current master.

@smehringer
Copy link
Member Author

@rrahn ping

@rrahn rrahn merged commit 9f98962 into seqan:master Oct 9, 2019
@smehringer smehringer deleted the alignmentIO_cigar_string branch May 29, 2020 05:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants