-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Describe MM and MP methylation tags. #418
Conversation
I think this is a good start and I like the proposal. I'm not sure whether I favor AllDelta or ConDelta yet. Something I'm unclear about, and I think is worth discussing at this point, is how the upstream tools generate these tags. Right now there are two approaches to modification calling - directly by the basecaller and by a post-processing tool like nanopolish. For nanopolish the workflow would be pretty simple - it could take in a BAM with canonical bases and augment the records to contain these tags. For basecalling it is less clear, as the basecaller needs to represent the modified bases directly in its output. Should it output FASTQ with DNAMod characters, or write out these mod strings in some auxiliary file? Perhaps for modification analysis the basecaller should write out unaligned SAM/BAM with canonical bases and these mod strings. |
When implementing a method, it can help to think about known extremes, and a good example of that is RNA modifications (over 110 known). Would those be treated as separate, distinct changes (e.g [5m] vs [6m] vs [6,6m]), or multi-layered / hierarchical modifications (e.g. [m;5] vs [m;6] vs [m;6,6])? |
Actually I realise I forgot to add any form of ambiguity codes, because these weren't in the paper (Coby Vine et al.). However I think they're probably useful. Eg "we know this base is methylated, but cannot tell which form" is a realistic output for some technologies. Being forced to state all modified C are 5mC is incorrect. Would just 1 ambiguity code per canonical base type be sufficient? Edit: perhaps for a later update, but would it be useful to base-callers outputting unaligned SAM to have the ability to specify (maybe in an RG header tag?) a list of methylation codes they can possibly emit - eg what they've been trained on? There's no way to do such global meta-data in FASTQ though, but that's why it should die :-) |
Quoting @jts:
It's clear we'll need separate PRs for SAM, VCF, and hmm - FASTQ isn't represented here. (Should it be? We discussed and rejected it in the past but I'm not entirely sure that's helpful.) My own personal view is FASTQ should be auxiliary tags after the Quoting @gringer
Thanks for the discussion. I think multi-layered / hierarchical is going to get messy and it may not be clear for all possible modifications out there, so it wouldn't be my personal preference. We did think about the extremes, which is what revealed to us that the letters in the sequence just isn't going to cut it. (We'd run out of symbols unless we go Unicode, and dear god let's not go down that route!) Hence the approach to permitting ChEBI IDs instead. It's also what lead us to wanting to not replicate a long ChEBI value over and over again, hence putting it up-front once and listing the coordinates it refers to. |
Given I haven't yet included ambiguity codes in the SAM PR (#418), and they're not well described in the source document either, I'm thinking of what symbols to use. https://dnamod.hoffmanlab.org/ The biorxiv link does have z for unknown cytosine modification and w / x / y for certain classes of C mods, but it doesn't seem to cover other base types. Additionally, z means C or modified C. I think the or can be encompassed in the quality score so ideally we'd have z being any of the known modifications (not just m h, f, c, although others are far rarer I assume) but not an unmodified C, and the quality score being the likelihood of it not being modified at all. I'm not familiar enough with the science and instruments to know how useful w, x and y are. Anyone closer to the coal-face here able to elaborate? Given in SAM we're encoding this as a side channel, we can reuse bases here that exist in the canonical channel. What do people think to using A, C, G and T in MM tag as generic modified ambiguity codes, that don't state explicitly the type of modification? That's not going to work in FASTQ though, but that's not really our bailiwick for this PR. Other ambiguities are already covered by ChEBI I think. Eg mG and mA in the dnamod plot have associated ChEBI values with links to the specific versions of mG. Again I'm not familiar with the science, so don't know how useful adding single letter codes would be. My gut feeling is keep it simple for now and we can extend if we need to later on. |
There's another ambiguity not mentioned above - bases that can't be confidently called as either modified or unmodified. This happens sometimes currently when looking at 5mC - we have three possible calls - either unmodified, modified, or unsure. The above proposes looking at unmodified, known modifications and unknown modifications (but definitely modified). |
Isn't that covered by the associated phred score? Eg if you think it's 70% likely then call 5mC and set a qual of 5 (-10 * log10 (.3)). Similarly if you think it's 20% 5hmC, 50% 5mC and 30% unmodified then use "C" (modified but not specifying which) and also keep the qual at 5 indicating 70% likelihood of the base being one of the known modifications. There isn't a way though to indicate how much more likely it is with one confidence than another, unless we explicitly make multiple calls at the same location (which is possible, but I'm not sure we should permit it.) |
I have a technical issue with the strand specification in the tag. The strand is currently defined with "either plus or minus indicating the strand the modification was observed on (relative to the recorded strand of SEQ with plus meaning same orientation),". The thing is that SEQ tends to get reverse complemented after mapping and the coordinates get messed up unless the downstream tools know how to reverse the MM/MP tags (which they don't). Would it be better if the strand would be defined in the original strand of SEQ? (I checked the SAM specification and "recorded" SEQ does mean the possibly reverse complemented one.) EDIT: In relation to the tag syntax is defined as
|
SAMtags.tex
Outdated
the same manner as the primary {\sf QUAL} field; one byte per quality | ||
with ASCII value Phred score + 33. No separators should be present. | ||
|
||
For example {\tt MM:Z:C+m,5,12,3;C+h,57;} may have an associated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it be better to separate (e.g. with a semi-colon) the qualities that pertain to different modifications? in the example, the qualities would thus become 5EB;/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(perhaps with a ~
which has been used in the past to separate lists of qualities)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have precedence elsewhere then it is good to follow convention.
Reviewing what we have already:
- OX/BZ. OX is hyphen-concatenated sequence. BZ is defined to be the same length and BZ uses space to separate them (blergh, but non quality value hence chosen).
- RX/QX. As above
- CR/CY. As above
- BC/QT. As above
I've tried hard to expunge that period from my mind, but it seems we got to a consensus somewhere along the line. :-)
So possibly we should do the same here? It's not quite analogous as we have a comma separated list instead of a sequence, but the same deal applies of choosing an invalid ASCII phred character as a separator. We can't however claim the strings need to be equal length of course.
Thoughts anyone? Space at least shows up really well when reading it.
I think that @kpalin is onto something. perhaps the modifications should always be written relative to the original (non-rev comped) strand? that will mean that aligners will not have to know how to flip MM and MP, and there will be no doubt about what strand the annotation pertains to. In addition, we should mention whether all the bases need to be accounted for and what it means if there are more bases in MM than are present in SEQ. |
Good point. If I recall my intention was simply that the aligners could always emit "+" for single-stranded DNA, which is basically everything(?) at the moment, but allowing room for future developments. However you're right - it's easier for aligners that don't understand methylation data to simply pass this tag through as-is irrespective of whether they've reverse complemented the sequence, thus it needs to relate to the original DNA strand. I'm not sure how best to phrase it hough.
That would be logical interpretation, but it probably needs to be explicit.
Well that would ideally be Thanks for the comments. |
I've updated it following comments. Thanks all. A few key points:
On that last case, there is room for improvement. Eg consider a modified ONT guppy caller that was trained to distinguish 5mC and 5hmC. We don't really want to have to use two separate lists C+m and C+h with many coordinates duplicated up. Equally so we may not wish to have every coordinate present in both C+m and C+h because we ideally only want to record the calls which have a sufficiently high probability that the phred score is above 1. This does rather come down to the core of it and whether the representation is correct. Eg I could imagine completely different mechanisms where we list the codes inline instead - more MD style - but with bracketing to handle multiple choices like a regexp character class. Eg This was one of the ideas discussed in #362, but at the time of writing I hadn't seen any real data in the wild and there was minimal external engagement. So it's still open for discussion. This PR was just my guess based on lack of any external prodding. |
SAMtags.tex
Outdated
This permits modifications to be listed on either strand with the rare | ||
potential for both strands to have a modification at the same site. | ||
|
||
Note it is permitted for the coordinate list to be empty (for example {\tt MM:Z:C+m;}), which may be used as an explicit indicator that this base modification is not present. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how would a technology emit modifications on + and - at the same time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think any can right now, but there were (are?) technologies that unwound the double stranded DNA and sequenced one strand followed by the other strand. IIRC PacBio SmartBells did this at one point, or maybe still do, and ONT earlier had their "2D" reads which did this.
The point being the format has to cope with what may come along down the road and this is conceivable, albeit a rather unlikely event to need to encode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ONT has had 1D^2 (essentially from when they stopped 2D), from which the forward and reverse-complement reads from a single double-stranded sequence (as separate reads) can be combined into a single sequence. They also plan to re-introduce fully-connected 2D sequences in the future.
SAMtags.tex
Outdated
|
||
To represent several possible modifications at the same site the {\tt MP} tag can be used to indicate the probabilities of each possibility. | ||
The values used should be absolute probabilities, not relative between the alternatives. | ||
For example, a C base that has 90\% chance of being modified with 5mC being three times more likely than 5hmC will encode 5mC with 67.5\% probability ($0.9 * 0.75$)and 5hmC with 22.5\% probability ($0.9 * 0.25$). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we really want to be able to encode probabilities in the range of 10% - 90% perhaps we should not use Phred which has terrible resolution in that range (10%=10 and 90%=".45"...1?0?)
Please be explicit that the "
and &
are encoding phred 1 and 5.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the time of writing this I hadn't seen any real data.
However I thought about that once I saw how ONT handle things. They report probabilities as p*255; ie p=0.33 is 84. They have arrays of values, so if trained on C, 5mC and 5hmC then they'd have 3 score for C which jointly add up to 255 (irrespective of whether they believe the call to be C - this is to be used after combining with the base call probabilities themselves), giving the relative likelihoods of the various modified and unmodified forms.
However it also means they cannot give higher certainties than 99.6%, or around phred 24. That may well be sufficient.
The question is, do we acknowledge this is different data and is never going to be extremely accurate? Do we actually care about small fluctuations in probabilities at the low end? Eg p=0.15 vs p=0.2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops forgot to add the big about explicit " and &.
Quoting those is messy too as we have latex quotes vs ascii quotes mixed together! I'll try and find some different numbers that don't produce a quote. :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually removing that phred ASCII quote is simply not possible, which comes down to the phred range again.
If we wish to emit probabilities for more than one possibility, which I believe is how ONT Guppy may go in the future (https://twitter.com/AlbertVilella/status/1151395294879395845), then phred scores will always generate a score of 0, 1 or 2 somewhere. That's because inherently there will be one call with p < 0.5.
The Illumina answer to this was to do phred+64 and use log-odds instead of log. Ie log(p/(1-p)) instead of log(1-p). That gives 0 being p=0.5 and is symmetric so choice A if p=.99 and choice B of p=0.01 gives phred scores of +20 and -20.
However it still begs the question of whether a log scale is appropriate, or whether even knowing the other values helps. When I was doing consensus calling algorithms I always grumbled that phred isn't good enough. If you're trying to make a call and you get A with P=0.9 but everything else is C, then it really helps to know whether than remaining 0.1 probabiity on the A was evenly spread across C, G and T or predominantly a remainder for C. (I just had to go with flat distribution in lieu of anything better.)
I simply don't have enough understanding of the downstream tools to know whether emitting the score for the most likely call is correct (the phred approach), or whether we the benefits of knowing 2nd, 3rd best choices outweighs the extra storage cost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we do not know what people will want to write (and the resolution they need) should we hold-off on defining the quality tag? perhaps suggest that the qualities be written into a lower-case tag until we have a better understanding? I don't want to "waste" a tag that will be unusable in the future.....
Can ONT or other technolgies read Uracil directly? At some point we will have modifications of U...should we allow U in the modified bases as well? |
The issue is that most of the downstream processing formats lose the per-read long range connection information. So I do think that swapping out the SEQ field for the reference sequence should be supported. I personally don’t have an issue with the practice of swapping out the SEQ field with the reference sequence, but I appreciate the concern that this is a relatively uncommon practice. If there were another solution to this issue I’d be open to it, but given the issues raised here, the current implementation of this tag seems to be the best solution at the moment. Note that for Nanopore data, processing higher accuracy calls are achieved by anchoring to the reference than anchoring to the basecalls. But this is still done on a per-read level. I mentioned these issues a while back in this thread #418 (comment) |
Thanks all, it seems the consensus is to not add a flag to I took a step back to evaluate how well projecting reference calls back onto the read sequence would work on two datasets, one a few years old (NA12878 WGS consortium) and the other very recent (a sample sequenced here last week). For NA12878 84.7% of sites cleanly project onto the read (the reference C matches a read C). 8.9% of sites project to a mismatch, but would be handled by @jkbonfield's suggestion to use For the recent sample (using a recent version of guppy) the numbers are 89.3%, 6.6%, 4.1% for match, mismatch, and deletion. I think 4% loss of calls is too high so would like either a new tag, or to emit the reference as SEQ. While I could live with using reference as SEQ, it would cause downstream difficulties. Some workflows (e.g. haplotype specific methylation) would be locked into a particular order of steps (align -> phase -> call methylation, as align -> call methylation -> phase won't work). This is incompatible with live methylation calling (calling during a run) that both nanopolish and megalodon support. One solution would be to keep two sets of BAMs around but this is inefficient and would be difficult to keep in sync. I'd prefer a new set of tags like @jrobinso suggested. |
@jts I’m not sure if this would play nice with downstream tools, but would the MD field suffice to transfer the sequence variant information in a single BAM file? I’m also not sure if this solution plays nicely with a perfectly matching cigar string. Figured it might be worth bringing up though. |
Maybe I'm missing something here. How is it possible to call a modified C on a read that doesn't even have a C there (it's a deletion)? The only way I can think of if is you're essentially redoing the base-calling (in the light of all the other sequences around it), in which case you should be emitting the recalled sequence too surely. It doesn't make any sense to me to have SEQ and base modifications coming from different length reads. |
Within megalodon, a read is basecalled and then mapped to the reference but the link between the raw nanopore signal and the reference is maintained. Then for modified base calling, the canonical basecalls are discarded and modified base calls are made using the reference bases and the raw signal in the region around a reference base of interest. The idea here is that the reference should be correct. If the reference for a sample is wrong this should be fixed before performing modified base calling. This process results in more accurate modified base calls than anchoring the mod calls to the original basecalls. |
By anchoring to the reference sequence, nanopolish/megalodon turn what is a largely unconstrained task (basecalling can emit any sequence of length L over the alphabet) into a binary classification problem (is this reference C methylated or not?). As @marcus1487 notes above this improves accuracy but leads to inconsistencies wrt the basecalled read (making calls at deletions). |
I'm not questioning that it gives more accurate results, but saying that if you can correctly call a base is methylated then you can also correctly call that base exists. Ie if you want to write this data back to BAM et al again, then why not write out the polished sequence field, as well as the polished methylation calls. It makes no sense to only emit one without the other, as that leads to all the issues we're discussing here. Or putting it another way, why wouldn't you want to also correct the base calls if you have evidence it was previously called incorrect? |
There are a few reasons we wouldn't want to output the corrected sequence:
These issues are not a concern if we're happy for this bam to only be used for methylation workflows (generating a bed file with frequencies, visualization) but if the plan is to provide a lightweight way to carry the methylation information through a workflow, all the way to archival in ENA, then I'm against modifying SEQ. |
@jkbonfield I have a question about allowed characters for the modification code. The motivation is to derive a rule to distinguish between mulit-modifications and modifications that are not one of the standard code values. From the draft spec I infer that the exahaustive list of standard 1-character codes is
The only other modification string mentioned are "chebi" codes, which appear to be integers. Thus given a string like 'Cmhf' I might infer it is a multi-modification because all the individual characters are in the standard code set(4 modifications in this instance, not likely but) Or I might infer it is not a chebi code because its not an integer, but could it just be a single modification that happens to be named 'Cmhf'?. Its unclear what the rule here is. |
It's a modified base and has a similar single-character encoding as bases, so Also your description is presumably something related to a third-party format (eg fasta) and not this spec. We never mix standard IUPAC codes and the modification codes together in SAM. One is a layer on top of the other. So you wouldn't be seeing |
@jkbonfield Thanks for the response, I got it, my contrived example was a poor one. I understand its a modified base but a base can have multiple modifications. (multiple modification mode), but I suppose no more than 2? By my reading "mh" could be 2 modifications of 1 base or, some code called "mh", I was unsure of what the rules were for characters making up a ChEBI code, or for that matter if ChEBI code is the only allowed type of code. Reading it again I think that it is. Still when seeing multiple characters for the modification one has to decide is this s ChEBI code or a multi-modification, I'm using the rule that if all the characters are among the table of codes in the spec this is a multi-modification. |
I'll unceremoniously wade in here, since I think both @jts and @marcus1487 implied it but stopped short of saying it:
In essence, yes. The process that nanopolish et. al. others take is akin to variant calling. The resultant "basecall" for the read is the reference sequence with some Cs (other other base) swapped out for modified forms. Megalodon already does write out a BAM as you suggest with the SEQ field swapped out to a sequence concomitant with the methylation tag: this sequence is precisely the reference sequence. My only concern here is the knock-on effect this has for users. I can imagine users loading these files into IGV and immediately sending @jrobinso a bug report that the reads all display perfectly as the reference sequence. Perhaps this is just a bit of user education, but my gut feeling is that it would be preferable to store both the basecall and the methylation data w.r.t. the reference sequence if we can find a way. @jts already raised the idea of a flag within the MM tag, perhaps another idea is to define a separate reserved tag? |
SAMtags.tex
Outdated
In the example used above, the 6th {\tt C} has 80\% chance of being {\tt 5mC}, 10\% chance of being {\tt 5hmC} and 10\% chance of being an unmodified {\tt C}. | ||
|
||
{\tt Ml} values for ambiguity codes give the probability that the modification is one of the possible codes compatible with that ambiguity code. | ||
For example {\tt Mm:Z:C+C,10; Ml:Z:229} indicates a C call with a probability of 90\% of having some form of unspecified modification. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this Ml
tag start with Ml:Z:C,
as all the others do in the specification? Or is the C
optional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be Ml:B:C
-- https://github.com/samtools/hts-specs/pull/418/files#diff-e765c6479316309f56b636f88189cdde8c40b854c7bdcce9ee7fe87a4e76febcR596
(meaning a byte array of type C
-- uint8)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fbrennen is correct. The Z
is clearly an error as the quality values are a byte array.
A point of confusion has come up in a couple conversations recently so thought it worth mentioning here. Given the current file format it is allowed to have multiple modifications on the same base in the same read (all good here) and each of these probabilities could take values between 0 and 1 (encoded on 0-255 scale). The issue here is if these probabilities sum to a value greater than 1, what should be the interpretation? For example it seems legal to have tags such as The note on lines 608-611 of SAMtags.tex comes close to making this point, but stops short of indicating that such probabilities summing to greater than 1 are considered invalid. I think a note explicitly indicating that this is considered invalid within the format would be helpful. Note that there is a chance that floating point and rounding errors could result in implied values slightly over 1. I don't think these cases are of grave concern, but cases where the probabilities sum to values much larger than 1 would be hard to interpret for downstream tools. |
An interesting point. The intention was to describe them as summing to something <= 256 (or as you point out perhaps marginally above is OK due to rounding errors, so we need to be clear). I'll see what I can do to improve the wording, but describing it as sums of probabilities rather than sums of integer encoding would help I think. |
Just a heads up to people, we (SAM spec maintainers) plan to merge this in around 3-weeks, as a "Draft tag" (ie half upper, half lower). Given it has already been implemented in a number of tools, we don't envisage it will stay in draft form for particularly long, perhaps in the 1-2 month time frame, before becoming officially adopted. We didn't wish to jump straight to the final tag definition as it's not until it's been actively used for a while that we have confidence there are no nasty corner cases lurking. Thank you all for your patience and help in pulling this together. |
Fantastic news! =) |
Squashed, rebased, and updated the SAMtags history from June 2019 to July 2021! (Ie ready to merge.) |
These are currently in draft form, as Mm and Ml, to be migrated to MM and ML in a month or so.
This introduced the term “top strand” without really saying what it means — see #639. |
Since the quote part was raised by @kpalin in 2019, I would like to follow up with the current strategy that samtools used to deal with the strand orientation, and any potential way to reverse the MM/MP tags? What's the corresponding relationship of the MM /ML tags between these 2 reads below?
Let's say read1 is |
I don't think there is any way you can infer what modifications another read has. If it's not measured, it's assumed to be unrecorded (unless there is an MM tag explicitly stating nothing was found). However if you were to record modifications on read 2 and found them to have modifications at the same locations that read 1 did, then it would be counting the Gs starting from the right hand end. So for read1 you skipped 2 Cs, called a C, skipped 4 more, and then called the next. Ie 3rd and 8th C:
The opposite strand is calling the 1st and 6th going 5' to 3', which would be |
This provides a couple tags for encoding base modifications as a side-channel on top of the unmodified canonical sequence. It's a slightly updated implementation of the ideas presented in #362
The tag names are somewhat arbitrarily chosen tag names; feel free to suggest improvements, but chosen "M" for modification and M/P as both are free and adjacent in the sorted list.
I'm still in two minds as to whether this method is the best place to be. We could use a simpler form which is more MD style where modifications are recorded as positional deltas to any base rather than the appropriate canonical base. This is larger, but simpler to work with. How much value do we put in the storage cost vs complexity?
A pictorial example on a tiny sequence:
If we call the method in this PR ConDelta and the simpler method AllDelta then for two levels of simulated modification on long read data (3% C->m vs 30% C->m with other C/G modification types adjusted in scale too), then our sizes are as follows:
The delta sizes here are shown as +extra amount. Ie ConDelta for BAM is 1327060+25049 bytes long. (This is obviously a small test set just for experimentation.)
The /1.9% bit in +25049/1.9% is showing what percentage of the original file that represents - ie 25049/1327060 is ~0.019.
Thus by comparing AllDelta vs ConDelta total file sizes, where see the simpler method will use in total between 0.4% (3% C meth) to 3.0% (30% C meth) extra storage. Quite small overall, despite the actual chunk of data being used for methylation signal itself being considerably larger.
NB: the tests above were all done without the MP modification quality value tag, but that doesn't change size and would have minimal impact on total file sizes.
So, discuss...