Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consequences of using dada2 on NovaSeq data #791

Open
hhollandmoritz opened this issue Jun 14, 2019 · 48 comments
Open

Consequences of using dada2 on NovaSeq data #791

hhollandmoritz opened this issue Jun 14, 2019 · 48 comments

Comments

@hhollandmoritz
Copy link

Hello,

We have an amplicon dataset from a NovaSeq run and are exploring how we might alter settings in the dada2 pipeline to effectively identify errors in our data. In case you are unfamiliar, NovaSeq generates up to 10 billion reads per flow cell and one of the ways Illumina deals with storing the massive amount of data generated by the NovaSeq is to simplify the error rates by binning the 40 possible quality scores into just 4 categories which vastly reduces the amount of information dada2 can work off of to infer errors in the data.

Furthermore, the error-rate conversions are as follows:
0-2 -> 2
3-14 -> 12
15-30 -> 23
31-40 -> 37

So in some cases, error is being overestimated by the conversion (e.g. a score of 30 which is labelled 23) and in other cases it is being underestimated (e.g. a score of 31 being labelled 37).

I see there as being two main places that this "binned" quality score has consequences, the quality filtering and the error-rate learning step. I'm less worried about the quality filtering as that is pretty easy to adjust the settings on, but I was wondering if you have suggestions about the ways we might alter the parameters of learnErrors to better estimate NovaSeq error rates.

The first problem we encountered was the nbases parameter. NovaSeq runs are so large that with nbases set to 1x10^8 (our usual default) only one sample was being used to judge error rates. Do you have any recommendations for the minimum number of samples that should be used as the basis for error-learning?

The second issue is the error estimation itself. When we run the learnErrors command on both our real NovaSeq data and simulated NovaSeq data (MiSeq data that we converted to have NovaSeq-style binned errors) we see a pretty characteristic error plot.

Simluated data:
simulated_NovaSeq_errR_plot

Real NovaSeq data:
NovaSeq_errR_plot2

Pretty consistently, error plots underestimate the error frequency in certain ranges of the quality score landscape. In particular, they underestimate it in the 30-40 range (error plot models show a consistent "dip" in this region) and vastly over-estimate it in some parts of the 10-25 range. Do you have any recommendations about changes we might make to our analysis pipeline to improve the error estimation at this step?

Thanks so much!
Hannah

@benjjneb
Copy link
Owner

The workaround right now is to enforce monotonicity in the fitted error model. The error model (getErrors(errF)) is just a matrix with columns corresponding to Q=0...41 (usually), and so is easy to modify. You can just assign the value at Q=40 to all other entries in the row that are lower than it. But let me know if that's not something you are comfortable doing in R and I can put some code together. I think there might even be something on the forum before on that.

Long-term, we need to do some testing on NovaSeq data, but haven't had any to work with yet. WIth so few Q scores being used in NovaSeq, the loess approach we are using to share information between nearby quality scores is breaking a little bit, hence those dips you are seeing. So we should do something slightly different when binned quality scores are present.

@hhollandmoritz
Copy link
Author

hhollandmoritz commented Jun 14, 2019 via email

@jgrembi
Copy link

jgrembi commented Nov 10, 2019

@hhollandmoritz Thanks for posting these plots, we are seeing similar results with recent NovaSeq data.

I'd just like to add that our sequencing center/instrument binned the quality scores differently than what you describe above.
Ours were binned at 2, 11, 25, and 37 instead of 2, 12, 23, and 37. This caused a problem when using the truncQ = 11 suggested in the dada2 Big Data workflow as it caused us to lose ~90% of our reads.
Just wanted to let others know to check since the quality score binning doesn't seem to be standardized.

@hhollandmoritz
Copy link
Author

hhollandmoritz commented Nov 10, 2019

@jgrembi Interesting! An Illumina rep, gave me the 2, 12, 23, and 37 bins, so I just assumed that that would be the standard across all sequencing centers. Good to know it can change.

As a follow-up to my original post, we ended up modifying the simulated NovaSeq error rates as suggested above and then ran some community analyses comparing the simulated NovaSeq data to the original MiSeq data. The results were encouraging. As you might expect, there were some slight differences in the rare taxa between the two datasets but overall the results were very similar, and this held true for both alpha and beta diversity metrics. In most cases the differences were not significant, so it seems that overall using dada2-generated ESVs from NovaSeq data is just fine (as long as you don't care about the rare taxa).

@cjfields
Copy link

cjfields commented Nov 10, 2019

@hhollandmoritz did the recommendation mentioned above by @benjjneb (#791 (comment)) act as a reasonable workaround for binning?

EDIT: I should mention I work at the same biotech center that produced the oddly-binned NovaSeq data above for @jgrembi. We're digging into this with Illumina at the moment.

@hhollandmoritz
Copy link
Author

Yes, it worked great! That's what we used to get the results that demonstrated there was little difference (except in rare taxa) between our simulated NovaSeq data and our original MiSeq data.

@cjfields
Copy link

@hhollandmoritz ah missed that in your reply, apologies! We'll implement a binning flag in our workflow for these instances, fingers crossed!

@cjfields
Copy link

For anyone following this: we (@jgrembi, myself, and our seq core) received an update from Illumina. It appears the binning changed in the NVCS v1.1 update to 2, 11, 25 and 37, but they neglected to update the relevant NovaSeq documentation. So this wouldn't just affect our core; it's worth checking the binning in your samples.

This is being rectified by Illumina, but it's definitely worth noting if anyone has been using Q12 or higher as a cutoff for trimming. With relevance to dada2: we have a data set with the newer binning that has the same dip between 30-40 that @hhollandmoritz originally reported with the original binning; the recommended fix mentioned by @benjjneb seems to also alleviate this, but it needs further evaluation with known community data.

@cjfields
Copy link

Speaking of... @benjjneb should this ticket stayed closed or be re-opened?

@benjjneb benjjneb reopened this Nov 11, 2019
@jgrembi
Copy link

jgrembi commented Nov 14, 2019

@hhollandmoritz Did you ever determine an optimal number of bases to sample in the learnErrors step for your NovaSeq run?

@hhollandmoritz
Copy link
Author

We ended up just using 1e8 (our default), and incorporating the changes that @benjjneb suggested after the learnErrors step. At one point we did try running it for 1e9 but we saw little improvement. Since running it for that many bases takes about an hour on our server, we decided to revert back to 1e8. The most important thing, seems to be the fix recommended by @benjjneb.

At the moment, though we only have the simulated data to "verify" our settings. We don't currently have any paired NovaSeq-MiSeq samples. I think you'd need that kind of data set to really be confident about the fix.

@jgrembi
Copy link

jgrembi commented Nov 16, 2019

So, I ran the learnErrors command for a range of nbases, mostly because I was nervous about only using a single sample to learn error rates.

nbases = 10^8
#118265865 total bases in 503259 reads from 1 sample used for learning the error rates, running time was a few minutes

nbases = 10^9
#1080609430 total bases in 4598338 reads from 9 samples used for learning the error rates, running time was about an hour

nbases = 10^10
#10145536090 total bases in 43172494 reads from 81 samples used for learning the error rates, running time was about 24hr

Here are the results:
errPlots

As @hhollandmoritz indicated for her data, the results were very similar.
However, if you look closely at G2C and G2T in the nbases = 10^10 scenario you see the dip in error frequency disappear. There is also a shift in the final error estimate by a factor of 10^2 (G2T at nbases = 10^8 is 3.3x10^-4 but at nbases = 10^10 it is 4.3x10^-6), which seems to be caused by the removal of the final point present in the nbases = 10^8 and nbases = 10^9 plots.

@benjjneb Any thoughts on how to interpret this/ what might be happening?

@cjfields
Copy link

To add a little to this, here is a small test run using default nbases but employing the Q40 adjustment @benjjneb mentioned.

Normal (no correction):
R1 uncorrected err

With Q40 correction:
R1 corrected err

@benjjneb
Copy link
Owner

@jgrembi What we think is happening is that for binned quality scores, there are very few observations at the intermediate consensus quality scores. This isn't shown on these plots, but one way to think of this is that there are really huge points (lots of observations) at the binned quality scores, and really small points at the intermediate scores. The loess fitting accounts for this (it is a weighted loess fit) and so heavily prioritizes the fit at the binned scores, but can act a bit weirdly in between them because the weights on fitting there are very low.

In our very limited testing, this is much less of an issue than it appears, for the reason that few observed bases are at these positions with intermediate consensus qualtiy scores, but we'd like to doo great, not just good enough. The monotonicity enforcement appeared to help in very limited testing data, but with more data we could say something with enough confidence to add it to the package.

@cjfields
Copy link

@benjjneb we do have some mock Zymo NovaSeq, but it is V3-V4; I assume you preferably want V4?

@benjjneb
Copy link
Owner

we do have some mock Zymo NovaSeq, but it is V3-V4; I assume you preferably want V4?

Not at all. We are completely ambivalent about the locus sequenced. That would be a great dataset to test on.

@jgrembi
Copy link

jgrembi commented Nov 25, 2019

@benjjneb
Your explanation above for how the loess model works makes sense. But I'm not sure if it explains why the Q37 point disappears from some of my plots (G2C and G2T error frequencies only) when I sample more bases. This results in the final value at Q36 being used to extrapolate values at Q37-Q40 when nbases = 10^10, which leads to the Q40 value having a difference in magnitude of 10^2 from when sampling nbases = 10^8/10^9. I worry this might not be a trivial difference. If we sample only 10^8 or 10^9 bases, we will enforce monotonicity on a value that provides 10^2 higher error frequency than if we train the model on 10^10 bases. I don't have an intuition for how the dada2 results would change with error estimates that differ by this much.

I wanted to ensure my results from above were not due to random chance, so I also estimated error rates for the reverse reads of the same run at the various nbases values. The result was similar to what was present in the forward reads in the G2C and G2T error frequencies, except in the reverse reads it is found in the C2A and C2G plots (which makes sense):

err_rev_Plots

So, I guess my main questions are:

  • Do you have any idea why sampling 10^10 bases leads to the Q37 point being lost/dropped for these few (G2C/G2T & C2A/C2G) instances? (At first I thought that it was indicative of less Q37 quality scores available for the nucleotide in question across the larger number of bases sampled, but if I understand the loess model correctly, it would just give less weight to the Q37 value if less data is there but not eliminate it completely, right?)

  • How important do you think a difference in magnitude of 10^2 for error frequency is for the final dada ASV results? ( I will happily run nbases = 10^10 and suffer the computational burden if it gives me a more accurate result.)

  • Also, have you had a chance to check the monotonicity enforcement against the data sent by @cjfields?

Thanks!

@benjjneb
Copy link
Owner

Do you have any idea why sampling 10^10 bases leads to the Q37 point being lost/dropped for these few (G2C/G2T & C2A/C2G) instances?

Sorry, but no. I don't fully understand this phenomenon in this case.

How important do you think a difference in magnitude of 10^2 for error frequency is for the final dada ASV results? ( I will happily run nbases = 10^10 and suffer the computational burden if it gives me a more accurate result.)

I'll put it this way: If it was a study I cared about being accurate on, and the additional computational burden was tractable for me, I would do it.

Also, have you had a chance to check the monotonicity enforcement against the data sent by @cjfields?

I haven't yet got any NovaSeq test datasets from samples of known composition.

@cjfields
Copy link

Also, have you had a chance to check the monotonicity enforcement against the data sent by @cjfields?

I haven't yet got any NovaSeq test datasets from samples of known composition.

Awaiting permission from the PI on this one (it's not our data, unfortunately)

@avocj77
Copy link

avocj77 commented Apr 22, 2020

Hi all,
I'm analyzing 16s V3-V4 amplicon sequencing data generated form a Novaseq run and I encounter the same plots as described here. In my case 4 samples were used (I have 14 samples in total) to learn the error rates. I suppose that this amount of samples to learn the error rate is quite ok?

afbeelding

As for the plots, I read about the workaround described here by @benjjneb and I was wondering if you can share some R code to do this (I'm really just a beginner using R)?
I was also wondering what the consequence would be, if I would let the plots untouched and continue the pipeline without the suggested workaround?

Thanks in advance!

@cjfields
Copy link

cjfields commented Apr 22, 2020

@benjjneb I didn't have much luck in getting NovaSeq Zymo data, but I did a search and there is one NovaSeq Zymo control sample in SRA in project PRJEB36316. No obvious notes on which region was analyzed; it appears to be part of a larger ~100 sample study (so it should be feasible to grab all samples).

@hhollandmoritz
Copy link
Author

Here is a snippet of some code that I used when I was trying to run a comparison on Miseq data and simulated NovaSeq data that I created by converting the fastq quality scores.

In this example anything that starts with NS is simulated NovaSeq data. It also requires that the package dplyr, or tidyverse be installed and loaded.

# Learn forward error rates
errF <- learnErrors(filtFs, nbases=1e8, multithread=TRUE)

NSerrF <- learnErrors(NSfiltFs, nbases=1e8, multithread=TRUE)

NSerrF_mon <- NSerrF
NSnew_errF_out <- matrix(rep(getErrors(NSerrF_mon)[,40], length.out = 40*16), ncol = 40)

# Learn reverse error rates
errR <- learnErrors(filtRs, nbases=1e8, multithread=TRUE)
NSerrR <- learnErrors(NSfiltRs, nbases=1e8, multithread=TRUE)
NSerrR_mon <- NSerrR

# assign any value lower than the Q40 probablity to be the Q40 value
NSnew_errR_out <- getErrors(NSerrR_mon) %>%
  data.frame() %>%
  mutate_all(funs(case_when(. < X40 ~ X40,
                            . >= X40 ~ .))) %>% as.matrix()
rownames(NSnew_errR_out) <- rownames(getErrors(NSerrR_mon))
colnames(NSnew_errR_out) <- colnames(getErrors(NSerrR_mon))
NSerrR_mon$err_out <- NSnew_errR_out



#' #### Plot Error Rates

errF_plot <- plotErrors(errF, nominalQ=TRUE)
NSerrF_plot <- plotErrors(NSerrF, nominalQ=TRUE)
NSerrF_mon_plot

errF_plot
NSerrF_plot

errR_plot <- plotErrors(errR, nominalQ=TRUE)
NSerrR_plot <- plotErrors(NSerrR, nominalQ=TRUE)
NSerrR_mon_plot <- plotErrors(NSerrR_mon, nominalQ = TRUE)

errR_plot
NSerrR_plot

@avocj77
Copy link

avocj77 commented Apr 28, 2020

Thanks a lot @hhollandmoritz for sharing!

@wangjiawen2013
Copy link

wangjiawen2013 commented May 3, 2020

Dada2 only supports illumina Hiseq and Miseq platfrom now, and must make some modifications to process X-ten and Novaseq data, is it ? @benjjneb

@benjjneb
Copy link
Owner

benjjneb commented May 4, 2020

@wangjiawen2013 See this comment: #791 (comment)

@wangjiawen2013
Copy link

wangjiawen2013 commented May 8, 2020

@benjjneb Most of the companies are equipped with X-ten and Novaseq here for it's high-throughput, speed and lower sequencing costs. So this issue will emerge more and more later on. I also noticed the similar problems was raised in qiime2 forum (when executed qiime dada2 denoise-paired/single-paired command).

@JacobRPrice
Copy link

Here is a snippet of some code that I used when I was trying to run a comparison on Miseq data and simulated NovaSeq data that I created by converting the fastq quality scores.

In this example anything that starts with NS is simulated NovaSeq data. It also requires that the package dplyr, or tidyverse be installed and loaded.

# Learn forward error rates
errF <- learnErrors(filtFs, nbases=1e8, multithread=TRUE)

NSerrF <- learnErrors(NSfiltFs, nbases=1e8, multithread=TRUE)

NSerrF_mon <- NSerrF
NSnew_errF_out <- matrix(rep(getErrors(NSerrF_mon)[,40], length.out = 40*16), ncol = 40)

# Learn reverse error rates
errR <- learnErrors(filtRs, nbases=1e8, multithread=TRUE)
NSerrR <- learnErrors(NSfiltRs, nbases=1e8, multithread=TRUE)
NSerrR_mon <- NSerrR

# assign any value lower than the Q40 probablity to be the Q40 value
NSnew_errR_out <- getErrors(NSerrR_mon) %>%
  data.frame() %>%
  mutate_all(funs(case_when(. < X40 ~ X40,
                            . >= X40 ~ .))) %>% as.matrix()
rownames(NSnew_errR_out) <- rownames(getErrors(NSerrR_mon))
colnames(NSnew_errR_out) <- colnames(getErrors(NSerrR_mon))
NSerrR_mon$err_out <- NSnew_errR_out



#' #### Plot Error Rates

errF_plot <- plotErrors(errF, nominalQ=TRUE)
NSerrF_plot <- plotErrors(NSerrF, nominalQ=TRUE)
NSerrF_mon_plot

errF_plot
NSerrF_plot

errR_plot <- plotErrors(errR, nominalQ=TRUE)
NSerrR_plot <- plotErrors(NSerrR, nominalQ=TRUE)
NSerrR_mon_plot <- plotErrors(NSerrR_mon, nominalQ = TRUE)

errR_plot
NSerrR_plot

Hi @hhollandmoritz
Thank you for posting your process. I know its been a little since you commented, but would you happen to remember if you investigated the impact of setting the value of error rates < Q40 to be the Q40 value within dada() or learnErrors()? If you did, did you find that to help model fitment?
Thanks,
Jake

@hhollandmoritz
Copy link
Author

Hi @JacobRPrice,

I didn't explore that no. It's been a while, but I believe what ended up happening was that we found that the differences between our simulated NovaSeq data and our MiSeq data were fairly minor, and mostly in the rare part of the community (which we were less concerned about). So we just used the fix suggested above, and continued with our analysis as usual. We never did end up comparing samples that had been sequenced on both platforms (it wasn't really on the cards for the funding available for that project). Hope this helps!

Hannah

@cjfields
Copy link

@JacobRPrice Will add that we did a simple comparison and found the same as @hhollandmoritz, though may revisit this again once we have comparable data. Having a nice defined community sample sequenced on both platforms would be best for comparison, however.

@XinyueZHang-hub
Copy link

@jgrembi Interesting! An Illumina rep, gave me the 2, 12, 23, and 37 bins, so I just assumed that that would be the standard across all sequencing centers. Good to know it can change.

As a follow-up to my original post, we ended up modifying the simulated NovaSeq error rates as suggested above and then ran some community analyses comparing the simulated NovaSeq data to the original MiSeq data. The results were encouraging. As you might expect, there were some slight differences in the rare taxa between the two datasets but overall the results were very similar, and this held true for both alpha and beta diversity metrics. In most cases the differences were not significant, so it seems that overall using dada2-generated ESVs from NovaSeq data is just fine (as long as you don't care about the rare taxa).

hi,
I noticed you said using dada2-generated ESVs from NovaSeq data is just fine, and I'm so happy you can help me about the question confused me several days.
These are my demux-seqs file from company (using NovaSeq to text), fist picture (total 220bp) has already cut the barcode and primer, and second (total 250bp) hasn't yet. I want to know what trim or trunk I should choose when I want to use dada2 to denoise for them. I tried to set trim 0 and trunk 220 for first, and trim 24 (forward), 26 (reserve), trunk 250 (forward), 248 (reserve)for second, because total barcode + primer are 24/26bp. But I don't know if it is reasonable, so I really hope get some suggestions from you. Thank you so much !
Good luck !
截屏2021-05-28 13 10 02
截屏2021-05-28 21 01 34

@anique125
Copy link

The workaround right now is to enforce monotonicity in the fitted error model. The error model (getErrors(errF)) is just a matrix with columns corresponding to Q=0...41 (usually), and so is easy to modify. You can just assign the value at Q=40 to all other entries in the row that are lower than it. But let me know if that's not something you are comfortable doing in R and I can put some code together. I think there might even be something on the forum before on that.

Long-term, we need to do some testing on NovaSeq data, but haven't had any to work with yet. WIth so few Q scores being used in NovaSeq, the loess approach we are using to share information between nearby quality scores is breaking a little bit, hence those dips you are seeing. So we should do something slightly different when binned quality scores are present.

Hi @benjjneb,
Were you able to implement a long-term solution for NovaSeq data in the pipeline? If not can you please share the code to enforce monotonicity in the fitted error model?

@benjjneb
Copy link
Owner

Here's some code that should do the trick:

make.monotone.decreasing <- function(v) sapply(seq_along(v), function(i) max(v[i:length(v)]))
errF.md <- t(apply(getErrors(errF), 1, make.monotone.decreasing))

@emilyvansyoc
Copy link

Here's some code that should do the trick:

make.monotone.decreasing <- function(v) sapply(seq_along(v), function(i) max(v[i:length(v)]))
errF.md <- t(apply(getErrors(errF), 1, make.monotone.decreasing))

I may be running this code incorrectly, but the errF.md object is a matrix, not the named list like the errF object, so plotErrors doesn't work. Is there still a way to make error plots of the enforced monotonicity to check before running the downstream pipeline?

@benjjneb
Copy link
Owner

@EmilyB17 Yes plotErrors doesn't handle that bare error rate matrix. Maybe it should be extended to do that...

The hacky way to do this is to just put this matrix back into a copy of the original richer error model object that plotErrors recognizes.

errF.md.full <- errF
errF.md.full$err_out <- errF.md
plotErrors(errF.md.full)

@alexkrohn
Copy link

Hi @benjjneb. I'm experiencing this problem with my PE 150bp iSeq data. I originally tried to run it through Qiime2, but got sent here to run it through dada2 because of the new iSeq quality scores. My data only have 3 scores: 11, 25, and 37. After reading through this thread, I'm still unclear how to get dada2 to estimate error frequencies with these new Q scores.

Specifically, the two solutions proposed here involve running learnErrors(), then changing the error matrix, but my R2 reads fail at the learnErrors() step.

Is this a different issue? Or am I doing something wrong here?

Note: a few samples have so few reads that they all get dropped during filtering. That's expected. Even when I re-do these steps without those low quality samples, it still fails at learnErrors()

Here is what I've run, and the resulting error frequencies for R1:

# Find R1 and R2 reads
fnFs <- sort(list.files(path.to.files, pattern="_R1_001.fastq", full.names = TRUE))
fnRs <- sort(list.files(path.to.files, pattern="_R2_001.fastq", full.names = TRUE))

# Extract the sample names 
sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)

# Place filtered files in filtered/ subdirectory
filtFs <- file.path(path.to.files, "filtered", paste0(sample.names, "_F_filtered.fastq.gz"))
filtRs <- file.path(path.to.files, "filtered", paste0(sample.names, "_R_filtered.fastq.gz"))
names(filtFs) <- sample.names
names(filtRs) <- sample.names

# Filter and trim
# F primer is 24 bp, R primer is 21 bp
# Min Q score is 11, so 10 bases with Q scores of 11 or less would be an EE score 0.03, so let's set that as a max
tight.filters <- filterAndTrim(fwd = fnFs, filt = filtFs, rev = fnRs, filt.rev = filtRs, trimLeft = c(24, 21), rm.phix = TRUE, multithread = TRUE, maxEE = c(0.3,0.3))

## Learn the error model
errF <- learnErrors(filtFs, multithread = TRUE)
errR <- learnErrors(filtRs, multithread=TRUE)
Error rates could not be estimated (this is usually because of very few reads).
Error in getErrors(err, enforce = TRUE) : Error matrix is NULL.

R1 error frequencies

@cjfields
Copy link

@alexkrohn did you try any of the above methods (e.g. from @JacobRPrice or @hhollandmoritz )? Also, the iSeq doesn't produce a ton of data. How many reads do you get per sample?

@alexkrohn
Copy link

alexkrohn commented Jan 13, 2022

@cjfields I didn't because they both require running learnErrors on the R2 sequences, which fails for me. (E.g. NSerrR <- learnErrors(NSfiltRs, nbases=1e8, multithread=TRUE) for @hhollandmoritz's solution.

We're doing an eDNA project to identify vertebrates using a 16s fragment. Previous work has shown that we're able to detect samples with ~50k reads per sample, so that's what we've aimed for here. This run was a test with iSeq, and some samples obviously failed in the lab work/sequencing phases. I expect to get at least some results from the samples with >50k reads. learnErrors still fails for R2 even when I exclude the low read samples. Here's the output from filterAndTrim():

             reads.in    reads.out
ED3886-16s_S1_L001_R1_001.fastq.gz    335906    125712
ED3887-16s_S2_L001_R1_001.fastq.gz      5816       859
ED3888-16s_S3_L001_R1_001.fastq.gz    557994    223576
ED3889-16s_S4_L001_R1_001.fastq.gz    481661    177693
ED3890-16s_S5_L001_R1_001.fastq.gz    454951    183251
ED3891-16s_S6_L001_R1_001.fastq.gz    489912    178256
ED3892-16s_S7_L001_R1_001.fastq.gz    521042    197634
ED3893-16s_S8_L001_R1_001.fastq.gz    425175    171421
ED3894-16s_S9_L001_R1_001.fastq.gz    431317    159944
ED3896-16s_S11_L001_R1_001.fastq.gz        3         0
ED3898-16s_S13_L001_R1_001.fastq.gz        7         1
ED3899-16s_S14_L001_R1_001.fastq.gz    12596      5481
ED3900-16s_S15_L001_R1_001.fastq.gz        4         0
ED3901-16s_S16_L001_R1_001.fastq.gz   407996    174026
ED3902-16s_S17_L001_R1_001.fastq.gz      177         1
ED3904-16s_S19_L001_R1_001.fastq.gz   541007    211779
ED3905-16s_S20_L001_R1_001.fastq.gz    55399     23540
ED3906-16s_S21_L001_R1_001.fastq.gz     2647      1111
ED3907-16s_S22_L001_R1_001.fastq.gz        6         1
ED3908-16s_S23_L001_R1_001.fastq.gz        9         3
ED3909-16s_S24_L001_R1_001.fastq.gz      239       105
ED3910-16s_S25_L001_R1_001.fastq.gz        1         0
ED3911-16s_S26_L001_R1_001.fastq.gz    19941      1500
ED3912-16s_S27_L001_R1_001.fastq.gz    96821     43858
ED3914-16s_S29_L001_R1_001.fastq.gz        5         0
ED3916-16s_S31_L001_R1_001.fastq.gz      774        67
ED3917-16s_S32_L001_R1_001.fastq.gz        7         0
ED3918-16s_S33_L001_R1_001.fastq.gz       11         0
ED3920-16s_S35_L001_R1_001.fastq.gz        4         0
ED4870-16s_S38_L001_R1_001.fastq.gz        5         0

@jgrembi
Copy link

jgrembi commented Jan 13, 2022

@alexkrohn Have you tried adjusting the nbases argument in learnErrors? You might not have enough bases to hit the default 1e+08 (a quick back of the envelope calculation based on the reads.out numbers above suggests you're close to 1e+08).
Also, I've never used the maxEE argument from filterAndTrim so I'm not positive (i.e. read this next part with a bit of skepticism cause I could be wrong), but your value of 0.3 seems very low to me. Based on what I understand, higher maxEE is equal to lower quality. So, you're setting a VERY HIGH quality threshold for your data, and possibly filtering out lots of decent quality reads. You also don't appear to be doing any trimming of low quality scores near the end of the reads (especially important for Reverse reads which usually have lower quality scores - maybe that's why you're having issues in the R2 files?). When combined with the low maxEE you've set, this might again be further unnecessarily reducing the number of reads that are passing the filterAndTrim step.
Good luck figuring it out!

@alexkrohn
Copy link

alexkrohn commented Jan 13, 2022

Thanks for the tips, @jgrembi. You're right, I was calculating maxEE way wrong, and corrected that. I also now truncate R2 at 125 bp during the filtering step.

Still my error persists even when maxEE is Inf, and when I set nbases as low as 100. There are 4.8 million R2 reads remaining post-filter with maxEE = Inf and truncating at 125bp.

Edit: nbases = Inf still gives the same error.

> errR <- learnErrors(filtRs, nbases = Inf, multithread=TRUE)
501060768 total bases in 4817892 reads from 13 samples will be used for learning the error rates.
Error rates could not be estimated (this is usually because of very few reads).
Error in getErrors(err, enforce = TRUE) : Error matrix is NULL.

@jgrembi
Copy link

jgrembi commented Jan 13, 2022

@alexkrohn Out of curiosity, what is the # total bases, total reads, and # samples used for learning error rates for the Fwd reads?

@alexkrohn
Copy link

@jgrembi For this round of filtering with maxEE = Inf, the forward reads work without a problem even at 1e8 bases.

> errF <- learnErrors(filtFs, nbases = 1e8, multithread = TRUE)
112603680 total bases in 893680 reads from 2 samples will be used for learning the error rates.

@a-h-b
Copy link

a-h-b commented Feb 2, 2022

Dear @benjjneb - are there any developments on this? have you been successful in testing the approaches you suggested on mock data? if you haven't got around to it yet, we have a group of people in a project who are very keen on this and would support you in testing, if you have a data set and some pointers on what are the most promising leads. Please let me know. - Anna

@hhollandmoritz
Copy link
Author

@a-h-b Have you come across this issue? I think the more recent discussion on this issue has been happening in that thread. In my group, at the moment, we're trying a mix of error rate learning modifications discussed in that thread and choosing whichever looks best.
https://github.com/ErnakovichLab/dada2_ernakovichlab#learn-the-error-rates

@a-h-b
Copy link

a-h-b commented Feb 2, 2022

@hhollandmoritz cool thank you

@marwa38
Copy link

marwa38 commented Dec 17, 2023

@benjjneb
is a current pipeline for novoseq analysis?
This might be one of the reasons I am encountering this issue? #1851

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests