ReDeeM filtering strategies
ReDeeM was developed by modifying the single-cell multiome workflow of the 10X Genomics platform to capture mtDNA, ATAC, and RNA from the same cells. The overall ReDeeM methodology has been described previously (ref. 6). Here, we focus on identifying the sources of possible mtDNA mutation artifacts at each experimental stage and on showing how ReDeeM is designed to mitigate artifacts from these diverse origins to achieve high sensitivity and accuracy.
The following major stages of the ReDeeM protocol can introduce artifacts in mtDNA (Fig. 1c).
Stage 1: mild fixation, permeabilization, and Tn5 tagmentation are performed on cells in tubes. We used 0.1% formaldehyde (FA) for mild fixation. Although the chance of mutagenesis by FA is low given the low concentration and short exposure (10 min), interaction with FA could induce single-strand damage that leads to strand-specific errors. Permeabilization uses 0.1% NP40, for which no risk of artifacts has been reported. Enzyme-based fragmentation introduces fewer artifacts than methods based on sonication and end-repair, which cause DNA damage at fragment ends and lead to errors (ref. 9).
Stage 2: in the droplets (10X Genomics) that encapsulate single cells, cell barcodes are ligated onto the tagmented mtDNA fragments (using multiome chemistry), and gap filling is performed after droplet breakdown. Notably, the cell-barcode adapters are double-stranded and add the same barcode to both strands, so both strands can be further amplified. This is an advantage over scATAC chemistry, where linear amplification in droplets amplifies only one of the two strands. Together with the Tn5 cutting ends, this provides a robust double-strand UMI system for consensus correction in the downstream analysis. The Tn5-associated 9-bp gap filling involves DNA synthesis on one of the two strands of the initial molecule, so any polymerase mistake generates an error on only one of the two strands.
Stage 3: PCR amplification for library preparation. PCR errors in library preparation are common. In the ReDeeM protocol, we deviate from the standard 10X Genomics protocol by using high-fidelity PCR polymerases from NEBNext and KAPA, which significantly reduces the errors generated during PCR.
Stage 4: paired-end sequencing. Sequencing errors are another common source of artifacts.
To take full advantage of overlapping paired-end sequencing, we performed 150 × 150 bp paired-end sequencing. The mtDNA fragments generated by Tn5 are short (mostly around or below 100 bp) because mtDNA lacks histones, so the ReDeeM protocol ensures that more than 90% of bases are covered by both reads.
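The relationship between fragment length and read overlap can be illustrated with a short calculation. The sketch below is not part of the ReDeeM pipeline; the function name `overlap_fraction` is hypothetical, and it assumes idealized reads of fixed length with no adapter or quality trimming.

```python
def overlap_fraction(fragment_len: int, read_len: int = 150) -> float:
    """Fraction of a fragment's bases covered by BOTH reads of a pair.

    Idealized model (an assumption, not the ReDeeM implementation):
    read 1 covers the first min(read_len, fragment_len) bases and
    read 2 covers the last min(read_len, fragment_len) bases.
    """
    if fragment_len <= 0 or read_len <= 0:
        raise ValueError("lengths must be positive")
    covered_by_each = min(read_len, fragment_len)
    # Bases seen by both reads: the two covered stretches minus the fragment length.
    covered_by_both = max(0, 2 * covered_by_each - fragment_len)
    return covered_by_both / fragment_len


if __name__ == "__main__":
    for length in (80, 100, 150, 200, 300):
        print(length, overlap_fraction(length))
```

Under this model, any fragment of 150 bp or less is read in full by both mates, which is consistent with short mtDNA fragments being almost entirely double-covered.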
As described above, ReDeeM implements both overlapping paired-end sequencing and consensus correction. The eUMI used in ReDeeM is a double-strand single-molecule tagging system that combines the double-stranded cell barcode with the Tn5 cutting ends. It can correct downstream PCR and sequencing errors and also reduce strand-specific artifacts in the initial molecule. After sequencing, all reads that share the same eUMI are considered copies of the same original molecule and are grouped for comparison. ReDeeM mitigates the different types of errors as follows (Extended Data Fig. 2b). Most sequencing errors (both stochastic and context-dependent) can be removed by comparing the overlapping portions of read 1 and read 2, and eUMI consensus filtering removes remaining sequencing errors that by chance occur identically on both reads. PCR errors are expected to appear in only a small subset of eUMI group members and can thus be filtered out by consensus score. Possible FA-induced errors and 9-bp gap-filling errors occur on one of the two strands of the initial molecule and are therefore expected to show a consensus score distribution centered at 50%. By removing mutations with a consensus score below 75%, ReDeeM further reduces this type of error. We also show that the majority of mutation calls reach 100% consensus, with no significant increase around 50% (Extended Data Fig. 2c, d). Nonetheless, because one strand may fail to amplify, some 9-bp gap-filling errors cannot be removed by consensus, and minimal edge trimming therefore provides additional benefit.
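The two layers of correction described here (read 1 / read 2 agreement, then eUMI consensus) can be sketched as follows. This is a minimal illustration and not the ReDeeM code: the function names and the data layout are hypothetical, and the consensus score is assumed to be the fraction of usable read pairs in an eUMI family that support the variant base. Only the 75% threshold is taken from the text.

```python
MIN_CONSENSUS = 0.75  # threshold stated in the text


def pair_call(base_read1, base_read2):
    """Base call for one read pair at one position.

    Returns the base only if both mates cover the position and agree;
    otherwise None (treated as unusable, e.g. a sequencing error).
    """
    if base_read1 is not None and base_read1 == base_read2:
        return base_read1
    return None


def consensus_score(pair_calls, alt_base):
    """Fraction of usable pair calls in one eUMI family supporting alt_base."""
    usable = [b for b in pair_calls if b is not None]
    if not usable:
        return 0.0
    return usable.count(alt_base) / len(usable)


def passes_consensus(pair_calls, alt_base, threshold=MIN_CONSENSUS):
    """Keep the variant in this family only if its score reaches the threshold."""
    return consensus_score(pair_calls, alt_base) >= threshold


if __name__ == "__main__":
    # Hypothetical eUMI families at a position where the reference is 'A'.
    true_variant = [pair_call("G", "G")] * 4           # all copies agree
    pcr_error = [pair_call("G", "G")] + [pair_call("A", "A")] * 4
    strand_error = [pair_call("G", "G")] * 2 + [pair_call("A", "A")] * 2
    for name, family in [("true", true_variant), ("pcr", pcr_error), ("strand", strand_error)]:
        print(name, consensus_score(family, "G"), passes_consensus(family, "G"))
```

In this toy model a true variant scores 100%, a late PCR error scores well below 50%, and a single-strand error scores about 50%, so a 75% cutoff separates the first case from the other two.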
The nuclear genome contains hundreds of NUMTs that are similar to mtDNA. It is important to control the influence of germline SNPs on NUMTs that arises from misalignment. ReDeeM offers several features that conceptually and practically minimize the impact of NUMTs.
1) ReDeeM is designed as a multiomics framework that captures open chromatin, mtDNA, and RNA in the same cell; that is, only NUMTs in accessible nuclear regions have a chance of being captured. We estimate that only about one NUMT fragment could be captured per cell, based on the number of accessible peaks and the number of NUMTs (NUMTs total approximately 400,000 bp in the nuclear genome, that is, 0.015% of the human genome; this proportion times the number of detectable ATAC fragments (~10,000 per cell) is approximately 1 fragment per cell). Moreover, NUMTs are known to be methylated and largely inactive, so the actual number that can be captured from open chromatin may be even lower (ref. 13).
2) ReDeeM implements an alignment filtering step in which both reads of a pair must map to the mtDNA genome. This effectively removes remaining NUMTs, because most NUMTs are short insertions (median NUMT size is 156 bp; ref. 13), so a NUMT fragment cut by Tn5 is likely to span the insertion breakpoint and be removed by this step.
3) Since the human nuclear genome is diploid, NUMT germline SNPs have been well modeled and validated as 0.5n/(0.5n+m), where n is the nuclear coverage and m is the mtDNA coverage. Inspired by this work, ReDeeM requires an mtDNA mutation to be supported by at least two alleles (molecules) in at least one cell. In fact, more than 75% of the mutations we call show three or more alleles in a cell.
4) As discussed above, the overall mutational signature is an effective consensus validation, since real mtDNA mutations are enriched in transitions. Notably, the mutational signature of NUMTs differs from that of real mtDNA mutations, and thus their influence is clearly controlled.
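The back-of-the-envelope NUMT estimate and the 0.5n/(0.5n+m) expression can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the genome size of 3.1 × 10^9 bp is an assumption that does not come from the text (with it, the NUMT proportion comes out near 0.013%, the same order as the 0.015% quoted above).

```python
NUMT_TOTAL_BP = 400_000           # from the text
ATAC_FRAGMENTS_PER_CELL = 10_000  # from the text
GENOME_BP = 3.1e9                 # ASSUMPTION: approximate human genome size


def expected_numt_fragments(numt_bp=NUMT_TOTAL_BP,
                            genome_bp=GENOME_BP,
                            fragments_per_cell=ATAC_FRAGMENTS_PER_CELL):
    """Expected NUMT-derived fragments per cell if fragments landed uniformly."""
    return numt_bp / genome_bp * fragments_per_cell


def numt_snp_fraction(n, m):
    """Apparent variant fraction 0.5n / (0.5n + m) for a NUMT germline SNP.

    n: nuclear coverage at the NUMT, m: mtDNA coverage at the homologous site.
    """
    if n < 0 or m < 0 or (n == 0 and m == 0):
        raise ValueError("coverages must be non-negative and not both zero")
    return 0.5 * n / (0.5 * n + m)


if __name__ == "__main__":
    print(round(expected_numt_fragments(), 2))  # on the order of 1 fragment per cell
    print(numt_snp_fraction(n=2, m=1000))       # tiny fraction when mtDNA coverage dominates
```

Because mtDNA coverage per cell is typically far higher than nuclear coverage at any single locus, the expression stays small, which is why requiring at least two supporting molecules in a cell is an effective guard.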
ReDeeM filter-2 applies the same consensus filtering strategies and follows the same downstream filtering procedures, with the following two changes. (1) After consensus error filtering, we label every mutation with its distance to the nearest fragment end and remove mutations within distance d of that end. We tested d = 4, 5, and 9 and chose d = 5 for the main analysis, which is sufficient to flatten the relative position distribution across all samples (d = 4 is sufficient in most samples; Extended Data Fig. 5). (2) In the original downstream filtering, a mutation is included only if it is supported by at least two molecules (eUMIs) in at least one cell and is detected in multiple cells (that is, the maximum molecule number per cell across all cells, or max allele, is ≥ 2 and the mutation is detected in ≥ 2 cells, as shown in Extended Data Fig. 1). We further refined this hard cutoff with binomial modeling, which follows the same principle. We assume that the residual noise after consensus filtering follows a binomial distribution. We model the observed distribution of each mutation across cells and test it against the expected binomial distribution (chi-squared test), filtering out mutations for which there is insufficient evidence to reject the null hypothesis of a binomial distribution (adjusted p > 0.05). This modeling-based method is largely equivalent to the max allele ≥ 2 threshold, but it also effectively removes excessive low molecule high connectedness (LMHC) mutations. In this work, we combined the 5 bp trimming and the binomial modeling with adjusted p-value < 0.05 as ReDeeM filter-2. The trimming distance and the binomial modeling p-value threshold can be further fine-tuned in the ReDeeM-R package for optimization in different systems.
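The two filter-2 changes can be sketched in a few lines. This is a hedged illustration, not the ReDeeM-R implementation: all function names are hypothetical, fragment coordinates are assumed to be inclusive, a mutation is assumed to be removed when its distance to the nearest end is less than d, and the goodness-of-fit test uses an assumed binning (0, 1, or ≥ 2 mutant molecules per cell) with a pooled error rate, giving one degree of freedom. Multiple-testing adjustment of the p-values is not shown.

```python
import math


def distance_to_nearest_end(pos, frag_start, frag_end):
    """Distance in bp from pos to the closer fragment end (inclusive coordinates)."""
    if not frag_start <= pos <= frag_end:
        raise ValueError("position lies outside the fragment")
    return min(pos - frag_start, frag_end - pos)


def keep_after_trimming(pos, frag_start, frag_end, d=5):
    """ASSUMPTION: a call is dropped when it lies fewer than d bp from an end."""
    return distance_to_nearest_end(pos, frag_start, frag_end) >= d


def binom_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def binomial_gof_pvalue(mutant, total):
    """Chi-squared goodness-of-fit p-value against a binomial noise model.

    mutant[i], total[i]: mutant and total molecule counts in cell i.
    Cells are binned by 0, 1, or >=2 mutant molecules (assumed binning);
    with one estimated parameter this leaves 1 degree of freedom.
    """
    if len(mutant) != len(total) or sum(total) == 0:
        raise ValueError("need matching, non-empty coverage")
    rate = sum(mutant) / sum(total)  # pooled per-molecule rate under the null
    if rate == 0.0 or rate == 1.0:
        return 1.0
    observed = [0, 0, 0]
    expected = [0.0, 0.0, 0.0]
    for k, n in zip(mutant, total):
        observed[min(k, 2)] += 1
        p0 = binom_pmf(0, n, rate)
        p1 = binom_pmf(1, n, rate) if n >= 1 else 0.0
        expected[0] += p0
        expected[1] += p1
        expected[2] += max(0.0, 1.0 - p0 - p1)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-squared distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))


if __name__ == "__main__":
    coverage = [50] * 100
    clustered = [10] * 5 + [0] * 95            # molecules concentrated in few cells
    diffuse = [1] * 30 + [2] * 10 + [0] * 60   # spread thinly, noise-like
    print(binomial_gof_pvalue(clustered, coverage))
    print(binomial_gof_pvalue(diffuse, coverage))
```

In this toy example both mutations have the same total number of mutant molecules, but only the one concentrated in a few cells rejects the binomial null, which mirrors the intent of replacing the hard max allele cutoff with a model-based test.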