-
Notifications
You must be signed in to change notification settings - Fork 172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
How to retrieve the primary alignment for secondary and supplementary reads #755
Comments
Secondary are supplementary are quite different. Secondary is an entirely different alignment, where as supplementary records encode chimeric alignments. The tags by themselves leave ambiguity but that is resolved by
Not completely unrealistic given that the definition of ^ The specs themselves are under-defined. I generally assume/implement:
^^ Multi-mapping was toyed with in the early 2010s but never caught on., generally speaking, the community has moved on to genome graphs (although technically these aren't mutually exclusive approaches). |
@d-cameron I took a look at what current duplicate marking software does (see below). Nonetheless, I'd like a solution that's not bespoke to solve this issue, since I would like to update both Do you have a suggestion how we can get the primary alignment information (equivalent info as in the SA tag) for both secondary and supplementary alignments? Do existing duplicate marking software output/mark secondary/supplementary alignments?
By default it outputs secondary and supplementary alignments, and the goal with the original question is to enable setting the duplicate flags on the secondary and supplementary alignments based on the primary alignments, without having to make two passes through the original file. |
I don't think Markdup takes an alternative approach to permit (mostly) single pass analysis. We need some analysis while in name-collated order, and some while in position sorted order. Given the former is typically what the aligners produce and the latter is what the users need, the other side of a
Obviously that's a general design principle, but the details (what's cached in what tag) is open ended. I think there will still be ordering problems, eg did the secondary matches arrive before the primaries, in which case it's hard to fix the secondaries which have already been streamed out, but it could be possible to record those in a separate file and then have a third pass through to fix those up. Importantly it doesn't need complex sorting steps, or at least no additional sorts. I know your goal is to avoid an additional pass, but you already have this: name collated -> custom sort -> position sort. That's one extra sort order. Andrew Whitwham (@whitwham) is the person who implemented this so he'll understand your ticket better than I. As for your suggestions, why SA and not XA? SA is explicitly supplementary and XA is BWA's extension for (I think) secondary alignments. See the table in https://bio-bwa.sourceforge.net/bwa.shtml If we wanted to adopt it then we'd need to rename the tag of course.
|
It does nothing of the sort. In default mode, supplementary and secondary reads are left untouched. With the I have long had an idea for a single pass duplicate marker which would work on name collated data. It would use much of the same code as markdup but in a different way. However it is on my "do in my copious free time" list rather than on anyone's priority list. |
The general solution to this is to process in name-collated sort order. Adding additional fields containing purely redundant information isn't something that I'd want standardised in the SAM specs - we have enough issues with data inconsistencies caused by tools not updating the existing redundant fields. The specs intentionally don't define what makes a duplicate read. If you're returning multiple different read mapping (i.e. secondary alignments exist), one could make a case that duplicate marking should consider the all the alignments and that duplicate marking based purely on the primary alignment discards information. Are two read pairs with the same primary alignment but different secondary alignments really duplicates? I know it's not a satifying response but I'm in favour of closing this as |
I'm inclined to agree with Daniel here. Samtools uses private custom tags, but that's because it's both generating and later consuming those same tags, so both parts are under its control. It doesn't need them to be official. Also I don't entirely understand how template-coordinate is fullfilling a one-pass solution. It is neither the sort order output from an aligner, nor is it the final solution required by the user (coordinate). Hence template-coordinate is a minimum of one pass through the data (more if we're spilling to disk and merging sub-sorts) just to get the data in the correct ordering. Or do you use a combined sort and mark-dup tool?
Even primary vs secondary is muddy. This is somewhere I disagree with the spec infact, but gave up trying to get a unified view as everyone had their own alternative views. Logically speaking, if we're mapping templates rather than reads, we could say "this pair of reads aligns as a consistent pair here - that is our PRIMARY alignment", but "this pair also alignes as a consistent pair over here too - these are out SECONDARY alignments". However, the specification requires the secondary pair do not point to each other in the RNEXT/PNEXT/TLEN fields, but instead point to the primaries. That's not so useful. It doesn't really even feel like a proper secondary alignment if it's only half the template that aligns... However that ship has sailed, and a substantialy number of people just disobey the spec anyway so the fields become rather unreliable at best. It gets worse as the specification implies that SUPPLEMENTARY reads aren't primary, but that also sails against common sense. A supplementary read is one where the alignment is split, so (with a small amount of wiggle room) each base appears once, in one of the alignment parts. Like an cDNA sequencing library could produce. It feels logical that the best alignment of the whole fragment produced by the sequencing instrument should be primary, not just the very start of it. (A single-ended run with a supplementary component is comparable to a read-pair, bar strand; it'd have a starting read1 part and a supplementary read 2 part. Both could form part of a primary template alignment, and both could be part of a secondary template alignment.) However again, the ship has sailed and we have to adhere to the original definitions. My views are apparently not the norm and I've come to accept that. :-) For duplicates though, I think the intention was to refer to duplicate templates caused by say amplication errors (eg PCR duplicate), or duplicate reading of the same cluster in two neighbouring X/Y locations on the plate (bad visual separation, maybe caused by an oversized cluster or an optical lens defect). In that regard, duplicate sounds like it applies to the template, so it's all records or none of them. If two templates are considered to be duplicates, then I'd expect that to be true regardless of how many secondary alignments they have. For genuine duplicates we should not necessarily expect the secondary alignments to be identical (or even the primary alignments in some cases) as there is a certain element of random placement within repeats. I think it's easier though for the specification to remain silent on it and just categorise it all as "duplicate" and let the tools decide how to do that, as ultimately how we get there doesn't matter and it'll differ per technology and per library preparation type. What matters is the implication of being a duplicate, which is that we may not wish to double count when computing statistics. |
My apologies for mis-interpreting samtools
See below for why we would need to sort twice, though we could add custom tags (both by adding to the spec or using private ones) when the reads are in name-collated order coming right out of the aligner.
For
Let's ignore how In more detail, the fgbio best practices pipeline does the following:
The original motivation of samtools/samtools#1605 is for the grouping and consensus calling above, so whatever solution, even using private SAM tags, will not only need to be added to fgbio (fulcrumgenomics/fgbio#964) but also to I am 100% ok with the solution not to be added to the SAM spec. But I do want some information about the primary alignment when looking at a supplementary record. Right now I can't do that, even with the SA/XA tag. The SA tag isn't always output in secondary alignments, and relies on a convention. The XA isn't always populated by default (too many alternative hits). Perhaps the solution is not to add anything to the spec, and work with the samtools maintainers to see if they would accept a PR with the same solution as in fulcrumgenomics/fgbio#964? That would be adding private tags (optionally?) in |
Regarding the SA tag being optional, there is nothing we can do to change that. Adding additional attributes to the specification will not change anything as by necessity they will also be optional and you cannot guarantee existing tools adopt them. It does feel like the only robust mechanism is additional tooling rather than specification. That said, I agree it would be good if commonly recognised extensions such as XA could somehow be considered as official tags (albeit renamed as XA is in the private name space). Even so, that won't really solve your problem. |
When we last discussed this on a conference call, we were in agreement that it doesn't feel like something that needs making a standard tag, and doing so wouldn't even solve the issue anyway as there is no way to force software tool authors to adopt it (as we see with things like the various hit-count tags). Hence we believe it's more of a tooling problem than a specification problem. We did feel however that a non-authoritative list of common private tag types could be useful to the tool author. Things like XA are not offical, but sufficiently common it would be good to list them. This helps others not accidentally cause a clash (although they're obviously within their rights to reuse the tag for another purpose). It could perhaps be an appendix of the SAMtags document detailing commonly used private tag codes and a best-efforts link to documentation. However that's a different issue (PRs welcomed). |
I would like to retrieve the alignment information for a given secondary or supplementary alignment .
For example, in duplicate marking, we may choose to use the primary alignments of template (read pair) to identify reads that are duplicates of each other. We then want to update the duplicate flag not only the primary alignments but also on the secondary and supplementary reads. To make duplicate marking (and grouping templates that are likely sequencing the same source molecule) more efficient (i.e. linear pass/streaming), we previously introduced the template-coordinate sub-sort (see samtools/samtools#1605). The goal is to have all alignments from the same query appear consecutively in the file (all alignments for any read in the template), and then templates sorted by outer coordinates (since then consecutive templates with the same outer coordinates may be marked as duplicates).
Using the current implementation of that sort does not yield the correct behavior for duplicate marking in some cases where we have non-primary alignments are distant from the primary.
Consider the following two templates, with three alignments each:
When sorted template-coordinate (outer ends of the template), we get:
This is because the outer coordinates for Q1 differ between the primary alignments as a pair for R1/R2, versus the primary alignment of R1 and the supplementary alignment of R2.
To overcome this issue, we would like to know the primary alignments (RNAME, POS, CIGAR, and MAPQ) to be able to get the outer coordinates for any secondary or supplementary alignment.
Solution 1: we could annotate each non-primary alignment (both secondary and supplementary alignments) with SA tag for, and then rely on the convention that the primary alignment being the first alignment in the SA tag. We could change the convention to a requirement. Nonetheless, the SA tag is described as being used for chimeric (i.e. supplementary) alignments, so the SA tag would need to be output by aligners (or post processing tools like
fgbio ZipperBams
) for secondary reads.Solution 2: add a new tag that describes the primary alignment (
PA
?) of the read. If we have read pairs, then we can usePA
tag to get the coordinates of this read's primary alignment, then for the primary alignment of its mate we could use a combination of RNEXT and PNEXT as well as theMC
andMQ
tags. For templates that have >2 reads (very uncommon, but still), this would not work, so we could consider thePA
tag have the list of primary alignments in sequencing order (R1/R2/R2/...)?I would welcome other solutions to this problem, or rethinking of this problem entirely!
The text was updated successfully, but these errors were encountered: