Consequences of using dada2 on NovaSeq data #791
Comments
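The workaround right now is to enforce monotonicity in the fitted error model. The error model (getErrors(errF)) is just a matrix with columns corresponding to Q=0...41 (usually), and so is easy to modify. You can just assign the value at Q=40 to all other entries in the row that are lower than it. But let me know if that's not something you are comfortable doing in R and I can put some code together. I think there might even be something on the forum before on that.
Long-term, we need to do some testing on NovaSeq data, but haven't had any to work with yet. With so few Q scores being used in NovaSeq, the loess approach we are using to share information between nearby quality scores is breaking a little bit, hence those dips you are seeing. So we should do something slightly different when binned quality scores are present. |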
Thank you, that is super helpful! I will go ahead and try it.
…--
Hannah Holland-Moritz
Doctoral Candidate, Fierer Lab
EBIO Department
University of Colorado Boulder
she/her pronouns
[email protected]
http://apipetteandanopenmind.wordpress.com/
<http://www.apipetteandanopenmind.wordpress.com>
On Fri, Jun 14, 2019 at 2:43 PM Benjamin Callahan ***@***.***> wrote:
The workaround right now is to enforce monotonicity in the fitted error
model. The error model (getErrors(errF)) is just a matrix with columns
corresponding to Q=0...41 (usually), and so is easy to modify. You can just
assign the value at Q=40 to all other entries in the row that are lower
than it. But let me know if that's not something you are comfortable doing
in R and I can put some code together. I think there might even be
something on the forum before on that.
Long-term, we need to do some testing on NovaSeq data, but haven't had any
to work with yet. With so few Q scores being used in NovaSeq, the loess
approach we are using to share information between nearby quality scores is
breaking a little bit, hence those dips you are seeing. So we should do
something slightly different when binned quality scores are present.
|
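For readers who want to see the workaround described above in code, here is a minimal base R sketch. It runs on a made-up stand-in matrix rather than a real `getErrors(errF)` result: the 16 x 41 shape, the row names, and the column names "0" to "40" are assumptions about what the error matrix looks like, and `enforce_q40_floor` is a hypothetical helper name, not a dada2 function. Note that it only applies the Q=40 floor described in the comment; it does not make each row strictly monotonic.

```r
# Toy stand-in for getErrors(errF): 16 transition rows x 41 quality columns.
# All values are invented for illustration.
nts <- c("A", "C", "G", "T")
q <- 0:40
toy_err <- matrix(rep(10^(-q / 10), each = 16), nrow = 16,
                  dimnames = list(paste0(rep(nts, each = 4), "2", rep(nts, 4)),
                                  as.character(q)))

# Introduce an artificial "dip" below the Q40 value, like the one in the plots.
toy_err[, as.character(33:38)] <- toy_err[, "40"] / 10

# Hypothetical helper: raise any entry that is lower than its row's Q40 value.
enforce_q40_floor <- function(err, q_col = "40") {
  floor_vals <- err[, q_col]   # one floor value per row
  # pmax recycles floor_vals down the columns, so row i is compared with
  # floor_vals[i]; dim and dimnames are kept from the first argument.
  pmax(err, floor_vals)
}

toy_fixed <- enforce_q40_floor(toy_err)
```

With a real error model, the same function would presumably be applied to `getErrors(errF)` and the result stored back into a copy of the model, as discussed further down the thread.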
@hhollandmoritz Thanks for posting these plots, we are seeing similar results with recent NovaSeq data. I'd just like to add that our sequencing center/instrument binned the quality scores differently than what you describe above. |
@jgrembi Interesting! An Illumina rep gave me the 2, 12, 23, and 37 bins, so I just assumed that that would be the standard across all sequencing centers. Good to know it can change. As a follow-up to my original post, we ended up modifying the simulated NovaSeq error rates as suggested above and then ran some community analyses comparing the simulated NovaSeq data to the original MiSeq data. The results were encouraging. As you might expect, there were some slight differences in the rare taxa between the two datasets, but overall the results were very similar, and this held true for both alpha and beta diversity metrics. In most cases the differences were not significant, so it seems that overall using dada2-generated ESVs from NovaSeq data is just fine (as long as you don't care about the rare taxa). |
@hhollandmoritz did the recommendation mentioned above by @benjjneb (#791 (comment)) act as a reasonable workaround for binning? EDIT: I should mention I work at the same biotech center that produced the oddly-binned NovaSeq data above for @jgrembi. We're digging into this with Illumina at the moment. |
Yes, it worked great! That's what we used to get the results that demonstrated there was little difference (except in rare taxa) between our simulated NovaSeq data and our original MiSeq data. |
@hhollandmoritz ah missed that in your reply, apologies! We'll implement a binning flag in our workflow for these instances, fingers crossed! |
For anyone following this: we (@jgrembi, myself, and our seq core) received an update from Illumina. It appears the binning changed in the NVCS v1.1 update to 2, 11, 25 and 37, but they neglected to update the relevant NovaSeq documentation. So this wouldn't just affect our core; it's worth checking the binning in your samples. This is being rectified by Illumina, but it's definitely worth noting if anyone has been using Q12 or higher as a cutoff for trimming. With relevance to dada2: we have a data set with the newer binning that has the same dip between 30-40 that @hhollandmoritz originally reported with the original binning; the recommended fix mentioned by @benjjneb seems to also alleviate this, but it needs further evaluation with known community data. |
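Since the bins can differ between instruments and software versions, one quick way to check which quality scores actually occur in a file is to tabulate them. The sketch below uses only base R and writes a tiny made-up FASTQ file so it runs on its own. It assumes uncompressed four-line FASTQ records and Phred+33 encoding; the reads and the file are invented for illustration.

```r
# Write a tiny, made-up FASTQ file so the sketch is self-contained.
fq <- tempfile(fileext = ".fastq")
writeLines(c("@read1", "ACGTACGT", "+", "FFFF::,#",
             "@read2", "TTGGCCAA", "+", "FF:FFF,,"), fq)

fq_lines <- readLines(fq)
# In four-line FASTQ records, every 4th line holds the quality string.
qual_lines <- fq_lines[seq(4, length(fq_lines), by = 4)]

# Convert characters to Phred scores (assumes a Phred+33 offset).
phred <- utf8ToInt(paste(qual_lines, collapse = "")) - 33L

# Which quality scores occur, and how often?
q_table <- table(phred)
print(q_table)
```

On real data it is probably enough to read the first few thousand records of one or two samples (for a gzipped file, `readLines(gzfile(path), n = ...)` should work) rather than the whole file.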
Speaking of... @benjjneb should this ticket stay closed or be re-opened? |
@hhollandmoritz Did you ever determine an optimal number of bases to sample in the |
We ended up just using 1e8 (our default), and incorporating the changes that @benjjneb suggested after the learnErrors step. At one point we did try running it for 1e9 but we saw little improvement. Since running it for that many bases takes about an hour on our server, we decided to revert to 1e8. The most important thing seems to be the fix recommended by @benjjneb. At the moment, though, we only have the simulated data to "verify" our settings. We don't currently have any paired NovaSeq-MiSeq samples. I think you'd need that kind of data set to really be confident about the fix. |
So, I ran the
As @hhollandmoritz indicated for her data, the results were very similar. @benjjneb Any thoughts on how to interpret this/ what might be happening? |
To add a little to this, here is a small test run using default nbases but employing the Q40 adjustment @benjjneb mentioned. |
@jgrembi What we think is happening is that for binned quality scores, there are very few observations at the intermediate consensus quality scores. This isn't shown on these plots, but one way to think of this is that there are really huge points (lots of observations) at the binned quality scores, and really small points at the intermediate scores. The loess fitting accounts for this (it is a weighted loess fit) and so heavily prioritizes the fit at the binned scores, but can act a bit weirdly in between them because the weights on fitting there are very low. In our very limited testing, this is much less of an issue than it appears, for the reason that few observed bases are at these positions with intermediate consensus quality scores, but we'd like to do great, not just good enough. The monotonicity enforcement appeared to help in very limited testing data, but with more data we could say something with enough confidence to add it to the package. |
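To make the weighting argument above concrete, here is a toy illustration with base R's `loess`. It is not dada2's internal fitting code: the error rates, the observation counts, and the choice of bins (2, 12, 23, 37, taken from the original post) are all invented. The point is only to show a weighted fit in which a few heavily observed scores dominate and the lightly observed scores in between contribute very little.

```r
set.seed(1)
q <- 0:40
bins <- c(2, 12, 23, 37)   # binned scores, as listed in the original post

toy <- data.frame(
  q = q,
  # Made-up log10 error rates: a downward trend plus noise.
  log_err = -q / 10 + rnorm(length(q), sd = 0.3),
  # Made-up observation counts: huge at the bins, tiny in between.
  n_obs = ifelse(q %in% bins, 1e4, 10)
)

# Weighted loess: points with more observations pull the curve harder.
fit <- loess(log_err ~ q, data = toy, weights = n_obs)
fitted_vals <- predict(fit)

# Quality scores after which the fitted error rate rises instead of falling.
rising_after <- q[which(diff(fitted_vals) > 0)]
print(rising_after)
```

Changing the seed or the weights moves where (and whether) such rises appear, which is in line with the behaviour between bins being poorly constrained.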
@benjjneb we do have some mock Zymo NovaSeq, but it is V3-V4; I assume you preferably want V4? |
Not at all. We are completely ambivalent about the locus sequenced. That would be a great dataset to test on. |
@benjjneb I wanted to ensure my results from above were not due to random chance, so I also estimated error rates for the reverse reads of the same run at the various So, I guess my main questions are:
Thanks! |
Sorry, but no. I don't fully understand this phenomenon in this case.
I'll put it this way: If it was a study I cared about being accurate on, and the additional computational burden was tractable for me, I would do it.
I haven't yet got any NovaSeq test datasets from samples of known composition. |
Awaiting permission from the PI on this one (it's not our data, unfortunately) |
Hi all, As for the plots, I read about the workaround described here by @benjjneb and I was wondering if you can share some R code to do this (I'm really just a beginner using R)? Thanks in advance! |
@benjjneb I didn't have much luck in getting NovaSeq Zymo data, but I did a search and there is one NovaSeq Zymo control sample in SRA in project PRJEB36316. No obvious notes on which region was analyzed; it appears to be part of a larger ~100 sample study (so it should be feasible to grab all samples). |
Here is a snippet of some code that I used when I was trying to run a comparison on MiSeq data and simulated NovaSeq data that I created by converting the fastq quality scores. In this example anything that starts with
library(dada2)
library(dplyr)
# Learn forward error rates
errF <- learnErrors(filtFs, nbases=1e8, multithread=TRUE)
NSerrF <- learnErrors(NSfiltFs, nbases=1e8, multithread=TRUE)
NSerrF_mon <- NSerrF
# assign any value lower than the Q40 probability to be the Q40 value
# (funs() is deprecated in newer dplyr versions and may produce a warning)
NSnew_errF_out <- getErrors(NSerrF_mon) %>%
  data.frame() %>%
  mutate_all(funs(case_when(. < X40 ~ X40,
                            . >= X40 ~ .))) %>% as.matrix()
rownames(NSnew_errF_out) <- rownames(getErrors(NSerrF_mon))
colnames(NSnew_errF_out) <- colnames(getErrors(NSerrF_mon))
NSerrF_mon$err_out <- NSnew_errF_out
# Learn reverse error rates
errR <- learnErrors(filtRs, nbases=1e8, multithread=TRUE)
NSerrR <- learnErrors(NSfiltRs, nbases=1e8, multithread=TRUE)
NSerrR_mon <- NSerrR
# assign any value lower than the Q40 probability to be the Q40 value
NSnew_errR_out <- getErrors(NSerrR_mon) %>%
  data.frame() %>%
  mutate_all(funs(case_when(. < X40 ~ X40,
                            . >= X40 ~ .))) %>% as.matrix()
rownames(NSnew_errR_out) <- rownames(getErrors(NSerrR_mon))
colnames(NSnew_errR_out) <- colnames(getErrors(NSerrR_mon))
NSerrR_mon$err_out <- NSnew_errR_out
#' #### Plot Error Rates
errF_plot <- plotErrors(errF, nominalQ=TRUE)
NSerrF_plot <- plotErrors(NSerrF, nominalQ=TRUE)
NSerrF_mon_plot <- plotErrors(NSerrF_mon, nominalQ=TRUE)
errF_plot
NSerrF_plot
NSerrF_mon_plot
errR_plot <- plotErrors(errR, nominalQ=TRUE)
NSerrR_plot <- plotErrors(NSerrR, nominalQ=TRUE)
NSerrR_mon_plot <- plotErrors(NSerrR_mon, nominalQ=TRUE)
errR_plot
NSerrR_plot
NSerrR_mon_plot |
Thanks a lot @hhollandmoritz for sharing! |
DADA2 only supports the Illumina HiSeq and MiSeq platforms for now, and some modifications are needed to process X Ten and NovaSeq data, is that right? @benjjneb |
@wangjiawen2013 See this comment: #791 (comment) |
@benjjneb Most of the companies here are equipped with X Ten and NovaSeq instruments because of their high throughput, speed, and lower sequencing costs, so this issue will come up more and more. I also noticed that similar problems were raised in the QIIME 2 forum (when running the qiime dada2 denoise-paired/single-paired command). |
Hi @hhollandmoritz |
Hi @JacobRPrice, I didn't explore that no. It's been a while, but I believe what ended up happening was that we found that the differences between our simulated NovaSeq data and our MiSeq data were fairly minor, and mostly in the rare part of the community (which we were less concerned about). So we just used the fix suggested above, and continued with our analysis as usual. We never did end up comparing samples that had been sequenced on both platforms (it wasn't really on the cards for the funding available for that project). Hope this helps! Hannah |
@JacobRPrice Will add that we did a simple comparison and found the same as @hhollandmoritz, though may revisit this again once we have comparable data. Having a nice defined community sample sequenced on both platforms would be best for comparison, however. |
hi, |
Hi @benjjneb, |
Here's some code that should do the trick:
|
I may be running this code incorrectly, but the errF.md object is a matrix, not the named list like the errF object, so |
@EmilyB17 Yes The hacky way to do this is to just put this matrix back into a copy of the original richer error model object that
|
Hi @benjjneb. I'm experiencing this problem with my PE 150bp iSeq data. I originally tried to run it through Qiime2, but got sent here to run it through dada2 because of the new iSeq quality scores. My data only have 3 scores: 11, 25, and 37. After reading through this thread, I'm still unclear how to get dada2 to estimate error frequencies with these new Q scores. Specifically, the two solutions proposed here involve running Is this a different issue? Or am I doing something wrong here? Note: a few samples have so few reads that they all get dropped during filtering. That's expected. Even when I re-do these steps without those low quality samples, it still fails at Here is what I've run, and the resulting error frequencies for R1:
|
@alexkrohn did you try any of the above methods (e.g. from @JacobRPrice or @hhollandmoritz )? Also, the iSeq doesn't produce a ton of data. How many reads do you get per sample? |
@cjfields I didn't because they both require running We're doing an eDNA project to identify vertebrates using a 16s fragment. Previous work has shown that we're able to detect samples with ~50k reads per sample, so that's what we've aimed for here. This run was a test with iSeq, and some samples obviously failed in the lab work/sequencing phases. I expect to get at least some results from the samples with >50k reads.
|
@alexkrohn Have you tried adjusting the nbases argument in learnErrors? You might not have enough bases to hit the default 1e+08 (a quick back of the envelope calculation based on the reads.out numbers above suggests you're close to 1e+08). |
Thanks for the tips, @jgrembi. You're right, I was calculating maxEE way wrong, and corrected that. I also now truncate R2 at 125 bp during the filtering step. Still my error persists even when Edit:
|
@alexkrohn Out of curiosity, what is the # total bases, total reads, and # samples used for learning error rates for the Fwd reads? |
@jgrembi For this round of filtering with
|
Dear @benjjneb - are there any developments on this? have you been successful in testing the approaches you suggested on mock data? if you haven't got around to it yet, we have a group of people in a project who are very keen on this and would support you in testing, if you have a data set and some pointers on what are the most promising leads. Please let me know. - Anna |
@a-h-b Have you come across this issue? I think the more recent discussion on this issue has been happening in that thread. In my group, at the moment, we're trying a mix of error rate learning modifications discussed in that thread and choosing whichever looks best. |
@hhollandmoritz cool thank you |
Hello,
We have an amplicon dataset from a NovaSeq run and are exploring how we might alter settings in the dada2 pipeline to effectively identify errors in our data. In case you are unfamiliar, NovaSeq generates up to 10 billion reads per flow cell and one of the ways Illumina deals with storing the massive amount of data generated by the NovaSeq is to simplify the error rates by binning the 40 possible quality scores into just 4 categories which vastly reduces the amount of information dada2 can work off of to infer errors in the data.
Furthermore, the error-rate conversions are as follows:
0-2 -> 2
3-14 -> 12
15-30 -> 23
31-40 -> 37
So in some cases, error is being overestimated by the conversion (e.g. a score of 30 which is labelled 23) and in other cases it is being underestimated (e.g. a score of 31 being labelled 37).
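As a rough sketch of the conversion listed above, the following base R code maps full-resolution quality scores onto the four bins and applies the mapping to a quality string, which is one way to simulate NovaSeq-style data from MiSeq reads. The function names `bin_novaseq` and `rebin_quality_string` are made up for this example, and Phred+33 encoding is assumed.

```r
# Map quality scores 0-40 onto the four bins listed above.
bin_novaseq <- function(q) {
  breaks <- c(-Inf, 2, 14, 30, 40)   # upper edges of 0-2, 3-14, 15-30, 31-40
  bin_values <- c(2L, 12L, 23L, 37L)
  # cut() uses right-closed intervals, so 2 falls in the first bin, 3 in the second.
  bin_values[cut(q, breaks = breaks, labels = FALSE)]
}

# Re-encode one FASTQ quality string with binned scores (assumes Phred+33).
rebin_quality_string <- function(qual, offset = 33L) {
  q <- utf8ToInt(qual) - offset
  intToUtf8(bin_novaseq(q) + offset)
}

bin_novaseq(c(0, 2, 3, 14, 15, 30, 31, 40))
rebin_quality_string("II5+#")
```

Scores above 40 come back as NA here, so the breaks would need extending for data that uses Q41.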
I see there as being two main places that this "binned" quality score has consequences: the quality filtering and the error-rate learning step. I'm less worried about the quality filtering as that is pretty easy to adjust the settings on, but I was wondering if you have suggestions about the ways we might alter the parameters of learnErrors to better estimate NovaSeq error rates.
The first problem we encountered was the nbases parameter. NovaSeq runs are so large that with nbases set to 1x10^8 (our usual default) only one sample was being used to judge error rates. Do you have any recommendations for the minimum number of samples that should be used as the basis for error-learning?
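A back-of-the-envelope way to see why only one sample ends up being used: the sketch below assumes, as the dada2 documentation describes for nbases, that samples are read in order until the running total of bases reaches nbases. The per-sample read counts and read lengths are hypothetical, and `samples_needed` is a made-up helper, not part of dada2.

```r
# How many samples are read before the running base total reaches nbases?
samples_needed <- function(bases_per_sample, nbases = 1e8) {
  reached <- which(cumsum(bases_per_sample) >= nbases)
  # If the target is never reached, every sample is used.
  if (length(reached) == 0) length(bases_per_sample) else reached[1]
}

# Hypothetical runs: 20 samples each, 150-bp reads.
novaseq_like <- rep(1e6 * 150, 20)   # 1,000,000 reads per sample
miseq_like   <- rep(5e4 * 150, 20)   # 50,000 reads per sample

samples_needed(novaseq_like)   # 1
samples_needed(miseq_like)     # 14
```

Under these assumptions, raising nbases (or subsampling reads per sample before error learning) is what brings more samples into the estimate.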
The second issue is the error estimation itself. When we run the learnErrors command on both our real NovaSeq data and simulated NovaSeq data (MiSeq data that we converted to have NovaSeq-style binned errors) we see a pretty characteristic error plot.
Simulated data:
Real NovaSeq data:
Pretty consistently, error plots underestimate the error frequency in certain ranges of the quality score landscape. In particular, they underestimate it in the 30-40 range (error plot models show a consistent "dip" in this region) and vastly over-estimate it in some parts of the 10-25 range. Do you have any recommendations about changes we might make to our analysis pipeline to improve the error estimation at this step?
Thanks so much!
Hannah