Commit

Change accidental Ml:Z tag to Ml:B:C,.
Also explicitly forbid probabilities summing to more than 1.0.
jkbonfield committed May 24, 2021
1 parent 7fafbdf commit 039c151
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions SAMtags.tex
@@ -607,11 +607,11 @@ \subsection{Base modifications}

If the above is rewritten in the multiple-modification form, the probabilities are interleaved in the order presented, giving ``{\tt Mm:Z:C+mh,5,12; Ml:B:C,204,26,89,130}''.
Note where several possible modifications are presented at the same site, the {\tt Ml} values represent the absolute probabilities of the modification call being correct and not the relative likelihood between the alternatives.
- These need not sum to 256 ($p \approx 1.0$) as the remainder represents the probability that none of the modification types are present.
+ These probabilities should not sum to above 1.0 ($\approx 256$ in integer encoding, allowing for some minor rounding errors), but may sum to a lower total with the remainder representing the probability that none of the listed modification types are present.
In the example used above, the 6th {\tt C} has 80\% chance of being {\tt 5mC}, 10\% chance of being {\tt 5hmC} and 10\% chance of being an unmodified {\tt C}.

{\tt Ml} values for ambiguity codes give the probability that the modification is one of the possible codes compatible with that ambiguity code.
- For example {\tt Mm:Z:C+C,10; Ml:Z:229} indicates a C call with a probability of 90\% of having some form of unspecified modification.
+ For example {\tt Mm:Z:C+C,10; Ml:B:C,229} indicates a C call with a probability of 90\% of having some form of unspecified modification.

\end{description}
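
The arithmetic behind the example in this hunk can be checked with a short Python sketch. It is illustrative only and not part of the specification or of any library: the function names are hypothetical, the mapping of integer N to the interval N/256 to (N+1)/256 is assumed (the midpoint is used here), and the `slack` allowance for "minor rounding errors" is an arbitrary choice.

```python
from typing import List, Sequence, Tuple


def ml_to_prob(value: int) -> float:
    """Map one Ml byte (0..255) to the midpoint of its probability bin."""
    if not 0 <= value <= 255:
        raise ValueError(f"Ml value out of range: {value}")
    # Assumption: value N covers the interval N/256 .. (N+1)/256.
    return (value + 0.5) / 256


def split_by_site(ml: Sequence[int], n_mods: int) -> List[Tuple[int, ...]]:
    """Group interleaved Ml values into one tuple per site."""
    if n_mods <= 0 or len(ml) % n_mods != 0:
        raise ValueError("Ml length must be a multiple of the number of mods")
    return [tuple(ml[i:i + n_mods]) for i in range(0, len(ml), n_mods)]


def within_limit(site: Sequence[int], slack: int = 2) -> bool:
    """True if the site's Ml values do not sum to noticeably more than 256.

    `slack` is a hypothetical rounding allowance, not a value from the spec.
    """
    return sum(site) <= 256 + slack


def unmodified_prob(site: Sequence[int]) -> float:
    """Probability left over for 'none of the listed modifications'."""
    return max(0.0, 1.0 - sum(ml_to_prob(v) for v in site))


# Ml:B:C,204,26,89,130 for Mm:Z:C+mh,5,12 (two sites, order m then h)
sites = split_by_site([204, 26, 89, 130], n_mods=2)
for m, h in sites:
    print(f"5mC={ml_to_prob(m):.2f} 5hmC={ml_to_prob(h):.2f} "
          f"none={unmodified_prob((m, h)):.2f} ok={within_limit((m, h))}")
```

For the first site (204, 26) this gives roughly 0.80, 0.10 and 0.10, matching the 80\%/10\%/10\% split quoted in the text; `ml_to_prob(229)` is likewise close to the 90\% quoted for the ambiguity-code example.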


4 comments on commit 039c151

@cmdcolin
Contributor

Does this imply that there will never actually be multiple modifications at a position? I am not familiar with the intricacies of the biology there.

@cmdcolin
Contributor

Or, more generally, I suppose it would just obtain a new chemical code if there were truly multiple modifications at a single site?

@jkbonfield
Contributor Author

@jkbonfield jkbonfield commented on 039c151 May 25, 2021

I hope so! Explicitly, it means that we can record multiple modification options, of which one will (hopefully!) be correct, although there can be another independent modification on the opposite strand.

I would guess that if there is a modified base type which has the combined characteristics of two different modifications, then it should be given its own code. This isn't something I know enough about to state as fact, though! However, having seen the hundreds of base modifications at ChEBI, I'm pretty confident that this is how they operate already.

Edit: I should point out that the notion of multiple mods at the same locus was developed after discussing this with ONT. Their base caller is trained on a set of known mods and basically emits probabilities for everything at each call. They can (and I think do) trim the list down somewhat for the cases where some probabilities are close to zero, but basically the model is "we have these choices and this is how the probabilities are distributed between them".

It's something we discussed in the early days of short read sequencing for A, C, G, T: rather than emit one base with a Phred score, emit all 4 with likelihoods. It can definitely improve consensus generation (I even wrote the code for it in Gap5) and variant calling. Sure, you may claim it's A with Phred score 10 (p=0.9), but if everything else in the column is a T and your remaining probability is also T (p=0.1), then it's much more likely to be a sequencing error than if the remaining probability was G (p=0.1) and T is extremely low.

I view base mods as basically the same idea, and having multiple choices available can really help consensus calling.
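
A rough way to see the effect described above is to combine per-read probabilities for one alignment column. The sketch below is only an illustration, not the Gap5 algorithm: it assumes independent reads and a uniform prior over the four bases, and all probability values and names (`column_posterior`, `t_read`, and so on) are made up for the example.

```python
import math
from typing import Dict, List

BASES = "ACGT"


def column_posterior(reads: List[Dict[str, float]]) -> Dict[str, float]:
    """Combine per-read base probabilities into a posterior for one column.

    Assumes independent reads and a uniform prior over A, C, G, T.
    """
    # Sum log-probabilities per candidate base to avoid underflow.
    log_like = {b: sum(math.log(r[b]) for r in reads) for b in BASES}
    top = max(log_like.values())
    weights = {b: math.exp(v - top) for b, v in log_like.items()}
    total = sum(weights.values())
    return {b: w / total for b, w in weights.items()}


# Three reads confidently calling T (illustrative numbers).
t_read = {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97}
# One read calling A at p=0.9, with the remainder mostly on T ...
a_then_t = {"A": 0.9, "C": 0.001, "G": 0.001, "T": 0.098}
# ... versus the remainder mostly on G and T very unlikely.
a_then_g = {"A": 0.9, "C": 0.001, "G": 0.098, "T": 0.001}

post_t = column_posterior([t_read] * 3 + [a_then_t])
post_g = column_posterior([t_read] * 3 + [a_then_g])

print("remainder on T: P(T) =", post_t["T"])
print("remainder on G: P(T) =", post_g["T"])
```

Both columns still call T under these assumptions, but the residual doubt about T is much smaller when the dissenting read's leftover probability also sits on T, which a single base plus Phred score could not express.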

@cmdcolin
Contributor

@jkbonfield thanks for the reply, that makes sense. interesting historical note too...maybe this would lead to encoding alternative bases as basemods heh
