Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to retrieve the primary alignment for secondary and supplementary reads #755

Closed
nh13 opened this issue Jan 29, 2024 · 9 comments
Closed
Assignees
Labels

Comments

@nh13
Copy link
Member

nh13 commented Jan 29, 2024

I would like to retrieve the alignment information for a given secondary or supplementary alignment .

For example, in duplicate marking, we may choose to use the primary alignments of template (read pair) to identify reads that are duplicates of each other. We then want to update the duplicate flag not only the primary alignments but also on the secondary and supplementary reads. To make duplicate marking (and grouping templates that are likely sequencing the same source molecule) more efficient (i.e. linear pass/streaming), we previously introduced the template-coordinate sub-sort (see samtools/samtools#1605). The goal is to have all alignments from the same query appear consecutively in the file (all alignments for any read in the template), and then templates sorted by outer coordinates (since then consecutive templates with the same outer coordinates may be marked as duplicates).

Using the current implementation of that sort does not yield the correct behavior for duplicate marking in some cases where we have non-primary alignments are distant from the primary.

Consider the following two templates, with three alignments each:

Q1	65	chr2	66	60	60M91S	chr3	46372947	0	...
Q1	129	chr3	47	60	55M60H	chr2	129394066	0	...
Q1	2195	chr2	66	60	60M55S	=	129394066	-60	...
Q2	81	chr3	47	60	105S46M	=	46372947	-46	...
Q2	161	chr3	47	60	46M	=	46372947	46	...
Q2	2113	chr2	51	0	119H32M	chr3	46372947	0	...

When sorted template-coordinate (outer ends of the template), we get:

Q1	2195	chr2	66	60	60M55S	=	129394066	-60	...
Q2	2113	chr2	51	0	119H32M	chr3	46372947	0	...
Q1	65	chr2	66	60	60M91S	chr3	46372947	0	...
Q1	129	chr3	47	60	55M60H	chr2	129394066	0	...
Q2	161	chr3	47	60	46M	=	46372947	46	...
Q2	81	chr3	47	60	105S46M	=	46372947	-46	...

This is because the outer coordinates for Q1 differ between the primary alignments as a pair for R1/R2, versus the primary alignment of R1 and the supplementary alignment of R2.

To overcome this issue, we would like to know the primary alignments (RNAME, POS, CIGAR, and MAPQ) to be able to get the outer coordinates for any secondary or supplementary alignment.

Solution 1: we could annotate each non-primary alignment (both secondary and supplementary alignments) with SA tag for, and then rely on the convention that the primary alignment being the first alignment in the SA tag. We could change the convention to a requirement. Nonetheless, the SA tag is described as being used for chimeric (i.e. supplementary) alignments, so the SA tag would need to be output by aligners (or post processing tools like fgbio ZipperBams) for secondary reads.

Solution 2: add a new tag that describes the primary alignment (PA?) of the read. If we have read pairs, then we can use PA tag to get the coordinates of this read's primary alignment, then for the primary alignment of its mate we could use a combination of RNEXT and PNEXT as well as the MC and MQ tags. For templates that have >2 reads (very uncommon, but still), this would not work, so we could consider the PA tag have the list of primary alignments in sequencing order (R1/R2/R2/...)?

I would welcome other solutions to this problem, or rethinking of this problem entirely!

@d-cameron
Copy link
Contributor

Secondary are supplementary are quite different. Secondary is an entirely different alignment, where as supplementary records encode chimeric alignments. The tags by themselves leave ambiguity but that is resolved by HI^. Typically, duplicate marking is done on the primary records but, that said, secondary records are typically not written at all ^^ (notable behaviour: bwa weirdness where alternative alignments to ALT contigs are written with the supplementary alignment flag instead of the secondary alignment flag - breaks split read handling code but lh3 disagrees with the spec as-written so doesn't want to change this).

rely on the convention that the primary alignment being the first alignment in the SA tag

Not completely unrealistic given that the definition of SA stats that Conventionally, at a supplementary line, the first element points to the primary line.

^ The specs themselves are under-defined. I generally assume/implement:

  • Segment alignments form distinct non-overlapping groups (c.f. alignments form a tree of possible partial alignments)
  • The HI defines each non-overlapping group (i.e. supplementary records have the same HI value as their non-supplementary record)
  • SA includes the other segment records with the same HI
  • First SA is the non-supplementary record (could be secondary if it's not the first HI)
  • PNEXT of defines which mate to pair the alignment with (i.e. a secondary alignment can be secondary for just that read, or a secondary for the entire template)
  • Records with matching /1//2/FI & HI tags all have the same PNEXT

^^ Multi-mapping was toyed with in the early 2010s but never caught on., generally speaking, the community has moved on to genome graphs (although technically these aren't mutually exclusive approaches).

@nh13
Copy link
Member Author

nh13 commented Jan 30, 2024

@d-cameron I took a look at what current duplicate marking software does (see below). Nonetheless, I'd like a solution that's not bespoke to solve this issue, since I would like to update both fgbio and samtools sort --template-coordinate to utilize whatever recommendation comes from this discussion to make duplicate marking a one-pass-through-the-file process.

Do you have a suggestion how we can get the primary alignment information (equivalent info as in the SA tag) for both secondary and supplementary alignments?

Do existing duplicate marking software output/mark secondary/supplementary alignments?

secondary records are typically not written at all

  1. What does picard do?

picard MarkDuplicates does not discard the records, just that the duplicate flag is only set on supplementary/secondary alignments when input is coordinate sorted. From the manual.

The program can take either coordinate-sorted or query-sorted inputs, however the behavior is slightly different. When the input is coordinate-sorted, unmapped mates of mapped records and supplementary/secondary alignments are not marked as duplicates. However, when the input is query-sorted (actually query-grouped), then unmapped mates and secondary/supplementary reads are not excluded from the duplication test and can be marked as duplicate reads.

  1. What does samtools do?

samtools markdup I think discards secondary by default, and can mark supplementary with the -S flag.

  1. What does fgbio GroupReadsByUmi do?

By default it outputs secondary and supplementary alignments, and the goal with the original question is to enable setting the duplicate flags on the secondary and supplementary alignments based on the primary alignments, without having to make two passes through the original file.

@jkbonfield
Copy link
Contributor

I don't think samtools markdup discards anything - you're probably remembering the ancient rmdup. Instead it modifies flags.

Markdup takes an alternative approach to permit (mostly) single pass analysis. We need some analysis while in name-collated order, and some while in position sorted order. Given the former is typically what the aligners produce and the latter is what the users need, the other side of a samtools sort (by pos), we split markdup up into two halves.

  1. Process the name collated data and create additional auxiliary tags caching the other ends of reads.
  2. Process the position sorted data, utilising the extra aux tags, to identify duplicates.

Obviously that's a general design principle, but the details (what's cached in what tag) is open ended.

I think there will still be ordering problems, eg did the secondary matches arrive before the primaries, in which case it's hard to fix the secondaries which have already been streamed out, but it could be possible to record those in a separate file and then have a third pass through to fix those up. Importantly it doesn't need complex sorting steps, or at least no additional sorts. I know your goal is to avoid an additional pass, but you already have this: name collated -> custom sort -> position sort. That's one extra sort order.

Andrew Whitwham (@whitwham) is the person who implemented this so he'll understand your ticket better than I.

As for your suggestions, why SA and not XA? SA is explicitly supplementary and XA is BWA's extension for (I think) secondary alignments. See the table in https://bio-bwa.sourceforge.net/bwa.shtml If we wanted to adopt it then we'd need to rename the tag of course.

XA Alternative hits; format: (chr,pos,CIGAR,NM;)*

@whitwham
Copy link

  1. What does samtools do?

samtools markdup I think discards secondary by default, and can mark supplementary with the -S flag.

It does nothing of the sort. In default mode, supplementary and secondary reads are left untouched. With the -S flag they are marked as duplicate if their primaries are duplicates.

I have long had an idea for a single pass duplicate marker which would work on name collated data. It would use much of the same code as markdup but in a different way. However it is on my "do in my copious free time" list rather than on anyone's priority list.

@d-cameron
Copy link
Contributor

Do you have a suggestion how we can get the primary alignment information (equivalent info as in the SA tag) for both secondary and supplementary alignments?

The general solution to this is to process in name-collated sort order. Adding additional fields containing purely redundant information isn't something that I'd want standardised in the SAM specs - we have enough issues with data inconsistencies caused by tools not updating the existing redundant fields.

The specs intentionally don't define what makes a duplicate read. If you're returning multiple different read mapping (i.e. secondary alignments exist), one could make a case that duplicate marking should consider the all the alignments and that duplicate marking based purely on the primary alignment discards information. Are two read pairs with the same primary alignment but different secondary alignments really duplicates?

I know it's not a satifying response but I'm in favour of closing this as wontfix. samtools already uses custom tags to encode the required information so there's a precedent there. Using a custom sort order that allows single-pass duplicate marking is a good idea but the actual sort order depends on the details of the duplicate marking algorithm and given the specs are silent on that, it should remain silent on the sort order.

@jkbonfield
Copy link
Contributor

jkbonfield commented Feb 27, 2024

I'm inclined to agree with Daniel here. Samtools uses private custom tags, but that's because it's both generating and later consuming those same tags, so both parts are under its control. It doesn't need them to be official.

Also I don't entirely understand how template-coordinate is fullfilling a one-pass solution. It is neither the sort order output from an aligner, nor is it the final solution required by the user (coordinate). Hence template-coordinate is a minimum of one pass through the data (more if we're spilling to disk and merging sub-sorts) just to get the data in the correct ordering. Or do you use a combined sort and mark-dup tool?

The specs intentionally don't define what makes a duplicate read. If you're returning multiple different read mapping (i.e. secondary alignments exist), one could make a case that duplicate marking should consider the all the alignments and that duplicate marking based purely on the primary alignment discards information. Are two read pairs with the same primary alignment but different secondary alignments really duplicates?

Even primary vs secondary is muddy. This is somewhere I disagree with the spec infact, but gave up trying to get a unified view as everyone had their own alternative views.

Logically speaking, if we're mapping templates rather than reads, we could say "this pair of reads aligns as a consistent pair here - that is our PRIMARY alignment", but "this pair also alignes as a consistent pair over here too - these are out SECONDARY alignments". However, the specification requires the secondary pair do not point to each other in the RNEXT/PNEXT/TLEN fields, but instead point to the primaries. That's not so useful. It doesn't really even feel like a proper secondary alignment if it's only half the template that aligns... However that ship has sailed, and a substantialy number of people just disobey the spec anyway so the fields become rather unreliable at best.

It gets worse as the specification implies that SUPPLEMENTARY reads aren't primary, but that also sails against common sense. A supplementary read is one where the alignment is split, so (with a small amount of wiggle room) each base appears once, in one of the alignment parts. Like an cDNA sequencing library could produce. It feels logical that the best alignment of the whole fragment produced by the sequencing instrument should be primary, not just the very start of it. (A single-ended run with a supplementary component is comparable to a read-pair, bar strand; it'd have a starting read1 part and a supplementary read 2 part. Both could form part of a primary template alignment, and both could be part of a secondary template alignment.) However again, the ship has sailed and we have to adhere to the original definitions. My views are apparently not the norm and I've come to accept that. :-)

For duplicates though, I think the intention was to refer to duplicate templates caused by say amplication errors (eg PCR duplicate), or duplicate reading of the same cluster in two neighbouring X/Y locations on the plate (bad visual separation, maybe caused by an oversized cluster or an optical lens defect). In that regard, duplicate sounds like it applies to the template, so it's all records or none of them. If two templates are considered to be duplicates, then I'd expect that to be true regardless of how many secondary alignments they have. For genuine duplicates we should not necessarily expect the secondary alignments to be identical (or even the primary alignments in some cases) as there is a certain element of random placement within repeats.

I think it's easier though for the specification to remain silent on it and just categorise it all as "duplicate" and let the tools decide how to do that, as ultimately how we get there doesn't matter and it'll differ per technology and per library preparation type. What matters is the implication of being a duplicate, which is that we may not wish to double count when computing statistics.

@nh13
Copy link
Member Author

nh13 commented Mar 29, 2024

@whitwham

It does nothing of the sort.

My apologies for mis-interpreting samtools markdup, since I read places in the code that filter out supplementary/secondary, but that's just for temporary sorting.

@d-cameron

The general solution to this is to process in name-collated sort order

See below for why we would need to sort twice, though we could add custom tags (both by adding to the spec or using private ones) when the reads are in name-collated order coming right out of the aligner.

@d-cameron

samtools already uses custom tags to encode the required information so there's a precedent there.

For ms yes, which is added by fixmate, so I could add private tags for improved template-coordinate sorting in that tool?

@jkbonfield

As for your suggestions, why SA and not XA? SA is explicitly supplementary and XA is BWA's extension for (I think) secondary alignments. See the table in https://bio-bwa.sourceforge.net/bwa.shtml If we wanted to adopt it then we'd need to rename the tag of course.

Let's ignore how samtools duplicate marks, and consider the original motivation: We wanted to add duplicate marking to our fgbio GroupReadsByUmi tool. This tool expects the input to be template-coordinate sorted, which we added to samtools, otherwise it sorts internally into coordinate order. Samtools is much faster (and threaded) than the fgbio/Scala sort implementation, so want it correctly sorting there too!

In more detail, the fgbio best practices pipeline does the following:

  1. Extract UMI information and convert FASTQ to BAM
  2. Align the raw reads
  3. Group the reads to identify reads from the same source molecule. Internally, this tool will sort the reads into template-coordinate order so that reads with the same outer coordinates (and UMI) are consecutive in the BAM. This sort can be accelerated by using samtools sort --template-coordinate, especially the threaded version (thank-you samtools maintainers/developers/contributors). The output, due to the initial sorting, is still in template-coordinate order, but reads from the same source molecule (MI tag) are consecutive.
  4. The reads from the same source molecule are consensus-called, for error correction, especially when both strands of a double-strand source molecule are sequenced. Since the output of the previous step have read pairs from the same source molecule in consecutive order, no sorting needs to be done.

The original motivation of samtools/samtools#1605 is for the grouping and consensus calling above, so whatever solution, even using private SAM tags, will not only need to be added to fgbio (fulcrumgenomics/fgbio#964) but also to samtools sort.

I am 100% ok with the solution not to be added to the SAM spec. But I do want some information about the primary alignment when looking at a supplementary record. Right now I can't do that, even with the SA/XA tag. The SA tag isn't always output in secondary alignments, and relies on a convention. The XA isn't always populated by default (too many alternative hits).

Perhaps the solution is not to add anything to the spec, and work with the samtools maintainers to see if they would accept a PR with the same solution as in fulcrumgenomics/fgbio#964? That would be adding private tags (optionally?) in samtools fixmate, then update samtools sort --template-coordinate to use the private tags when present?

@jkbonfield
Copy link
Contributor

Regarding the SA tag being optional, there is nothing we can do to change that. Adding additional attributes to the specification will not change anything as by necessity they will also be optional and you cannot guarantee existing tools adopt them. It does feel like the only robust mechanism is additional tooling rather than specification.

That said, I agree it would be good if commonly recognised extensions such as XA could somehow be considered as official tags (albeit renamed as XA is in the private name space). Even so, that won't really solve your problem.

@jkbonfield
Copy link
Contributor

When we last discussed this on a conference call, we were in agreement that it doesn't feel like something that needs making a standard tag, and doing so wouldn't even solve the issue anyway as there is no way to force software tool authors to adopt it (as we see with things like the various hit-count tags). Hence we believe it's more of a tooling problem than a specification problem.

We did feel however that a non-authoritative list of common private tag types could be useful to the tool author. Things like XA are not offical, but sufficiently common it would be good to list them. This helps others not accidentally cause a clash (although they're obviously within their rights to reuse the tag for another purpose). It could perhaps be an appendix of the SAMtags document detailing commonly used private tag codes and a best-efforts link to documentation. However that's a different issue (PRs welcomed).

@github-project-automation github-project-automation bot moved this from New items to Done in GA4GH File Formats Jul 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Development

No branches or pull requests

5 participants