-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for targeted sequencing #163
base: master
Are you sure you want to change the base?
Conversation
The bed file is an excellent idea, and indeed i think the only way to support targetted sequencing well is (IMO) to know the targets and estimate depth on each independently. I also think it makes sense to set some limit, and once coverage goes above that, subsample reads randomly per target/amplicon. Yes, it is typical to get much higher depth with amplicon seq; it is sometimes also an intention to use this to spot minor alleles. I think we do need to set some boundaries on what we are trying to do though, we can't support everything and handling everything between 10x and 10000x depth is complicated. If people want to do exploratory work with super-high depth, they can always select reads for each amplicon, subsample those, use racon to get a consensus, and then remap all of the reads to that consensus to spot minor alleles. personally, i'd vote for something like a max read depth of 100x, and then feed straight into standard mykrobe. What do you think of that? |
Are you saying to set a max read depth on a variant, or are you talking about subsampling the reads within mykrobe? |
I'm talking about subsampling |
That seems well outside of the scope of mykrobe. Besides, subsampling in this way is not trivial and I don't think we want to go there. You can't just do a random thing like rasusa because if there's uneven coverage on your targets, it's quite possible you will lose reads from one or more targets. See mbhall88/rasusa#17 |
Don't see how it's more outside scope than taking a bedfile, or coping with 100000x depth |
When we originally added "support " for targeted sequencing we were rather cavalier, hoping a rough average depth would work. After the covid work, I'm more inclined to just accept we cannot properly support targeted sequencing and just offer something limited , if at all. Proper handling of multiple (>1) minor alleles when you have huge depth, and distinguishing nanopore homopolymer noise is going to be hard. That's why I proposed subsampling. Which, yes, requires specific code to subsample per target. |
Open to suggestions/disagreement! Could have a wrapper repo that does the bedfile and subsampling? Have a collaborator wanting to do targeted sequencing and am about to help them do subsampling plus racon (ie completely avoid Mykrobe), so interested in your opinion |
We are definitely going to be needing to use this in the future so I am happy to work up a solution (I might also try drprg on this data and see what happens). So we're going to need to do some kind of alignment to be able to subsample targets right? We could do minimap2 as it has a python API? Only pitfall here is I don't know whether or not windows can deal with that? |
My suggestion is this.
|
Alright. That will be a decent amount of work. Before I do that, I might take a look at how drprg goes on the amplicon data. |
This PR will (eventually/hopefully) close #28
Added
-z
/--compress
)-T
/--targeted
). Takes a BED file of the regions targeted to aid in the read depth estimation [[Add support for targeted sequencing #28][28]]scripts/combine.py
to combine mykrobe reports into a single CSV fileThe major feature addition in this PR is the targeted option. It takes a BED file of the regions the user has amplified and only calculates the expected depth from variants that fall within these regions.
Motivation
So this implementation has been motivated by some Nanopore targeted sequencing data I have been working with lately. There are two amplification strategies used: PCR and RPA. In particular, RPA cannot amplify an entire gene, so regions of genes that have the most variants in a WHO catalogue were chosen. One consequence of this when running vanilla mykrobe is that there are lots of variant with median depth 0/1, which end up drowing out the median depth calculate. To make matters more interesting, some of these samples were sequenced for up to 3 days, so the read depth is crazy high (~100k for some genes).
For example, here is a histogram of variant median depths for one sample (subsampled to theoretical coverage 10,000x)
The median depth (expected depth) from mykrobe is 1,074. Using this targeted approach, we get the following histogram
The median depth (expected depth) from mykrobe is 4,208. Using this targeted approach, we get the following histogram.
In addition to this, comparing DST predictions, before this option, I was finding lots of poor quality indels being called (e.g. #139; see our discussion in the slack channel too). After this change, I didn't see any of those.
One last thing to consider is whether we need to change the min. proportion expected depth. This is potentially just because of our high depth, but I have come across a case where we filter out a good call because it only has 14% of the expected depth. This isn't crazy low in the context of targeted sequencing where it is possible different genes amplify differently. I'm also not sure whether it is normal for targeted sequencing to produce much higher depths than WGS (I suspect it does) and so lowering the default from 0.3 to something like 0.1 when using the targeted option might not be insane? Thoughts?