Results

A novel method for detecting mutator alleles

We developed a statistical method, termed "aggregate mutation spectrum distance" (AMSD), to detect loci that are associated with mutation spectrum variation in recombinant inbred lines (RILs) (Figure {@fig:distance-method}; Materials and Methods). Our approach leverages the fact that mutator alleles often leave behind distinct and detectable impressions on the mutation spectrum, even if they increase the overall mutation rate by a relatively small amount. Given a population of haplotypes, we assume that each has been genotyped at the same collection of biallelic loci and that each harbors de novo mutations which have been partitioned by $k$-mer context (Figure @fig:distance-method). At every locus, we calculate a cosine distance between the aggregate mutation spectrum of the haplotypes that inherited one parental allele and that of the haplotypes that inherited the other. Using permutation tests, we then identify loci at which those distances are larger than expected by chance. To account for polygenic effects on the mutation process that might be shared between BXDs, we also regress the cosine distance at each marker against the genetic similarity between haplotype groups, and assess significance using the fitted residuals (which we call the "adjusted" cosine distances) (Materials and Methods).
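The per-marker distance calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used for the analyses: the function names, the list-based data layout, and the toy counts are all hypothetical, and the sketch assumes that both alleles are present at every marker. It omits the regression step that produces the "adjusted" distances.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(theta) for two non-zero count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def aggregate_spectrum(spectra, genotypes, allele):
    """Sum per-haplotype mutation counts over haplotypes carrying `allele`."""
    total = [0] * len(spectra[0])
    for counts, genotype in zip(spectra, genotypes):
        if genotype == allele:
            for i, count in enumerate(counts):
                total[i] += count
    return total

def amsd_scan(spectra, genotype_matrix, alleles=("B", "D")):
    """Cosine distance between the two aggregate spectra at every marker.

    spectra: one list of mutation-type counts per haplotype.
    genotype_matrix: one list of per-haplotype genotypes per marker.
    Haplotypes whose genotype matches neither allele are ignored.
    """
    distances = []
    for genotypes in genotype_matrix:
        spectrum_1 = aggregate_spectrum(spectra, genotypes, alleles[0])
        spectrum_2 = aggregate_spectrum(spectra, genotypes, alleles[1])
        distances.append(cosine_distance(spectrum_1, spectrum_2))
    return distances

# Hypothetical toy data: four haplotypes, two mutation types, three markers.
toy_spectra = [[1, 3], [2, 2], [4, 1], [2, 1]]
toy_genotypes = [
    ["B", "B", "D", "D"],
    ["B", "D", "B", "D"],
    ["D", "B", "B", "D"],
]
toy_distances = amsd_scan(toy_spectra, toy_genotypes)
```

Because every haplotype contributes all of its genome-wide mutations to whichever group its genotype places it in, the scan only needs one pass over the haplotypes per marker.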

Using simulated data, we find that our method's power is primarily limited by the initial mutation rate of the $k$-mer mutation type affected by a mutator allele and the total number of de novo mutations used to detect it (Figure {@fig:simulations}). Given 100 haplotypes with an average of 500 de novo germline mutations each, AMSD has approximately 90% power to detect a mutator allele that increases the C>A de novo mutation rate by as little as 20%. However, the approach has less than 20% power to detect a mutator of identical effect size that augments the C>G mutation rate, since C>G mutations are expected to make up a smaller fraction of all de novo germline mutations to begin with. Simulations also demonstrate that our approach is well-powered to detect large-effect mutator alleles (e.g., those that increase the mutation rate of a specific $k$-mer by 50%), even with a relatively small number of mutations per haplotype (Figure {@fig:simulations}). Both AMSD and traditional quantitative trait locus (QTL) mapping have similar power to detect alleles that augment the rates of individual 1-mer mutation types (Figure {@fig:ihd_vs_qtl_power}), but AMSD has a number of potential advantages for mutator allele discovery. For example, we find that AMSD is better-powered than QTL mapping when the number of simulated de novo mutations is allowed to vary (by a factor of 20) across haplotypes (Figure {@fig:ihd_vs_qtl_power_variable_counts}) and when mutator allele frequencies are less than 50% (Figure {@fig:ihd_vs_qtl_power_variable_afs}). However, we also caution that many of the parameters used in our simulations are specific to the BXD mice (e.g., numbers of haplotypes, average numbers of mutations, expected allele frequencies at markers), and do not necessarily reflect the power of AMSD on other populations.

 Overview of aggregate mutation spectrum distance method for discovering mutator alleles. a) A population of four haplotypes has been genotyped at three informative markers ($g_1$ through $g_3$); each haplotype also harbors unique de novo germline mutations. In practice, de novo mutations are partitioned by $k$-mer context; for simplicity in this toy example, de novo mutations are simply classified into two possible mutation types (grey squares represent C>(A/T/G) mutations, while grey triangles represent A>(C/T/G) mutations). b) At each informative marker $g_n$, we calculate the total number of each mutation type observed on haplotypes that carry either parental allele (i.e., the aggregate mutation spectrum) using all genome-wide de novo mutations. For example, haplotypes with A (orange) genotypes at $g_1$ carry a total of three "triangle" mutations and five "square" mutations, and haplotypes with B (green) genotypes carry a total of six triangle and two square mutations. We then calculate the cosine distance between the two aggregate mutation spectra, which we call the "aggregate mutation spectrum distance." Cosine distance can be defined as $1 - \cos(\theta)$, where $\theta$ is the angle between two vectors; in this case, the two vectors are the two aggregate spectra. We repeat this process for every informative marker $g_n$. c) To assess the significance of any distance peaks in b), we perform permutation tests. In each of $N$ permutations, we shuffle the haplotype labels associated with the de novo mutation data, run a genome-wide distance scan, and record the maximum cosine distance encountered at any locus in the scan. Finally, we calculate the $1 - p$ percentile of the distribution of those maximum distances to obtain a genome-wide cosine distance threshold at the specified value of $p$. {#fig:distance-method width=7.5in}
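The permutation procedure in panel c can be sketched as follows. This is an illustrative outline under stated assumptions, not the implementation used here: the function names, the A/B allele labels (borrowed from the toy example in the figure), the percentile convention, and the toy counts are all hypothetical.

```python
import math
import random

def cosine_distance(u, v):
    """Return 1 - cos(theta) for two non-zero count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def max_distance(spectra, genotype_matrix):
    """Largest cosine distance between A and B aggregate spectra over all markers."""
    n_types = len(spectra[0])
    best = 0.0
    for genotypes in genotype_matrix:
        aggregate = {"A": [0] * n_types, "B": [0] * n_types}
        for counts, genotype in zip(spectra, genotypes):
            if genotype in aggregate:
                for i, count in enumerate(counts):
                    aggregate[genotype][i] += count
        best = max(best, cosine_distance(aggregate["A"], aggregate["B"]))
    return best

def permutation_threshold(spectra, genotype_matrix, n_permutations=1000, p=0.05, seed=0):
    """Genome-wide threshold: the (1 - p) percentile of permuted maximum distances."""
    rng = random.Random(seed)
    shuffled = list(spectra)  # copy so the caller's data are left untouched
    maxima = []
    for _ in range(n_permutations):
        # Shuffling the spectra relative to the fixed genotypes is equivalent
        # to shuffling the haplotype labels attached to the mutation data.
        rng.shuffle(shuffled)
        maxima.append(max_distance(shuffled, genotype_matrix))
    maxima.sort()
    index = max(0, min(len(maxima) - 1, math.ceil((1 - p) * len(maxima)) - 1))
    return maxima[index]

# Hypothetical toy data: six haplotypes, two mutation types, two markers.
toy_spectra = [[12, 30], [15, 28], [40, 10], [11, 33], [38, 12], [14, 29]]
toy_genotypes = [
    ["A", "A", "B", "A", "B", "B"],
    ["A", "B", "A", "B", "A", "B"],
]
toy_threshold = permutation_threshold(toy_spectra, toy_genotypes, n_permutations=200)
```

Recording only the maximum distance from each permuted scan is what makes the resulting threshold genome-wide: an observed peak is compared against the largest value that any marker would produce by chance.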

Re-identifying a mutator allele on chromosome 4 in the BXDs

We applied our aggregate mutation spectrum distance method to 117 BXDs (Materials and Methods) with a total of 65,552 de novo germline mutations [@PMID:35545679]. Using mutation data that were partitioned by 1-mer nucleotide context, we discovered a locus on chromosome 4 that was significantly associated with mutation spectrum variation (Figure {@fig:distance-results}a; maximum adjusted cosine distance of 1.20e-2 at marker ID rs27509845; position 118.28 Mbp in GRCm38/mm10 coordinates; 90% bootstrap confidence interval from 114.79 to 118.75 Mbp).

 Results of aggregate mutation spectrum distance scans in the BXDs. a) Adjusted cosine distances between aggregate 1-mer de novo mutation spectra on BXD haplotypes (n = 117 haplotypes; 65,552 total mutations) with either D or B alleles at 7,128 informative markers. Cosine distance threshold at p = 0.05 was calculated by performing 10,000 permutations of the BXD mutation data, and is shown as a dotted grey line. b) Adjusted cosine distances between aggregate 1-mer de novo mutation spectra on BXD haplotypes with D alleles at rs27509845 (n = 66 haplotypes; 42,171 total mutations) and either D or B alleles at 7,063 informative markers. Cosine distance threshold at p = 0.05 was calculated by performing 10,000 permutations of the BXD mutation data, and is shown as a dotted grey line. c) Adjusted cosine distances between aggregate 1-mer de novo mutation spectra on BXD haplotypes with B alleles at rs27509845 (n = 44 haplotypes; 22,645 total mutations) and either D or B alleles at 7,063 informative markers. Cosine distance threshold at p = 0.05 was calculated by performing 10,000 permutations of the BXD mutation data, and is shown as a dotted grey line. {#fig:distance-results width=7.5in}

Using quantitative trait locus (QTL) mapping, we previously identified a nearly-identical locus on chromosome 4 that was significantly associated with the C>A germline mutation rate in the BXDs [@PMID:35545679]. This locus overlapped 21 protein-coding genes that were annotated by the Gene Ontology as being involved in "DNA repair," but only one of those genes contained nonsynonymous differences between the two parental strains: Mutyh. Mutyh encodes a protein involved in the base-excision repair of 8-oxoguanine (8-oxoG), a DNA lesion caused by oxidative damage, and prevents the accumulation of C>A mutations [@PMID:28551381;@PMID:28127763;@PMID:17581577]. C>A germline mutation fractions are nearly 50% higher in BXDs that inherit D genotypes at marker ID rs27509845 (the marker at which we observed the highest adjusted cosine distance on chromosome 4) than in those that inherit B genotypes (Figure @fig:spectra-comparison) [@PMID:35545679].

An additional germline mutator allele on chromosome 6

After confirming that AMSD could recover the mutator locus overlapping Mutyh, we tested its ability to identify additional mutator loci in the BXDs. To eliminate potential confounding of the mutation spectrum landscape by the large-effect mutator locus on chromosome 4, we performed AMSD scans that were conditional on the presence of either D or B alleles at rs27509845. We also hypothesized that such conditioning might reveal epistatic interactions between alleles at the chromosome 4 locus and mutator alleles elsewhere in the genome. Specifically, we divided the BXDs into those with either D (n = 66) or B (n = 44) genotypes at rs27509845 (n = 7 BXDs were heterozygous) and ran an aggregate mutation spectrum distance scan using each group separately (Figure {@fig:distance-results}b-c). We excluded the BXD68 RIL from these scans, since we previously found that BXD68 harbors a strain-private C>A mutator allele of even larger effect [@PMID:35545679].

Using the BXDs with D genotypes at rs27509845, we identified a locus on chromosome 6 that was significantly associated with mutation spectrum variation (Figure {@fig:distance-results}b; maximum adjusted cosine distance of 3.69e-3 at marker rs46276051; position 111.27 Mbp in GRCm38/mm10 coordinates; 90% bootstrap confidence interval from 95.01 to 114.02 Mbp). This signal was specific to BXDs with D genotypes at the rs27509845 locus, as we did not observe any new mutator loci after performing an AMSD scan using BXDs with B genotypes at rs27509845 (Figure {@fig:distance-results}c). The peak markers on chromosomes 4 and 6 did not exhibit strong linkage disequilibrium ($R^2$ = 4e-5). We also performed QTL scans for the fractions of each 1-mer mutation type using the same mutation data, but none produced a genome-wide significant log-odds score at any locus (Figure {@fig:qtl-scans}; Materials and Methods).
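One common way to quantify linkage disequilibrium between two markers in inbred lines is the squared Pearson correlation of their genotype codes. The sketch below assumes that definition, which may differ from the exact calculation used here. The function name and the 0/1 coding of B and D genotypes are hypothetical, and both markers are assumed to be polymorphic among the haplotypes supplied.

```python
def ld_r_squared(genotypes_1, genotypes_2):
    """Squared Pearson correlation between two markers coded 0 (B) or 1 (D).

    Haplotypes with a missing call (None) at either marker are skipped.
    """
    pairs = [
        (a, b)
        for a, b in zip(genotypes_1, genotypes_2)
        if a is not None and b is not None
    ]
    n = len(pairs)
    mean_1 = sum(a for a, _ in pairs) / n
    mean_2 = sum(b for _, b in pairs) / n
    covariance = sum((a - mean_1) * (b - mean_2) for a, b in pairs)
    variance_1 = sum((a - mean_1) ** 2 for a, _ in pairs)
    variance_2 = sum((b - mean_2) ** 2 for _, b in pairs)
    return (covariance * covariance) / (variance_1 * variance_2)

# Hypothetical genotypes at two markers across four haplotypes.
independent = ld_r_squared([0, 0, 1, 1], [0, 1, 0, 1])
identical = ld_r_squared([0, 1, 1, 0], [0, 1, 1, 0])
```

A value near zero, as reported for the two peak markers, indicates that genotypes at one marker carry almost no information about genotypes at the other.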

We queried the region surrounding the top marker on chromosome 6 (+/- the 90% bootstrap confidence interval) and discovered 64 protein-coding genes, of which four were annotated with a Gene Ontology (GO) [@PMID:10802651;@PMID:33290552] term related to "DNA repair": Fancd2, Ogg1, Setmar, and Rad18. None of the remaining genes were annotated with a cellular function that would obviously contribute to a germline mutator phenotype; however, many of these GO annotations are imperfect and/or incomplete. Although we focus our analysis on DNA repair genes, it remains possible that other genes within the confidence interval could underlie the C>A mutator phenotype we identified in the BXDs.

Of the annotated DNA repair genes within the confidence interval, two harbored nonsynonymous differences between the parental C57BL/6J and DBA/2J strains (Table @tbl:nonsyn-diffs). Ogg1 encodes a key member of the base-excision repair response to oxidative DNA damage (a pathway that also includes Mutyh), and in mice Setmar encodes a SET domain-containing histone methyltransferase; both Ogg1 and Setmar are expressed in mouse gonadal cells. Because the bootstrap can exhibit poor coverage in QTL mapping studies [@PMID:16783000], we also scanned an interval +/- 5 Mbp from the peak AMSD marker on chromosome 6 for additional candidate genes. Although the choice of a 10 Mbp interval is somewhat arbitrary, the interval does contain a plausible candidate: Mbd4, a protein-coding gene involved in base-excision repair that also harbors a nonsynonymous difference between the BXD parental strains (Table @tbl:nonsyn-diffs).

| Gene name | Ensembl transcript name | Nucleotide change | Amino acid change | Position in GRCm38/mm10 coordinates | PhyloP conservation score | SIFT prediction |
|---|---|---|---|---|---|---|
| Setmar | ENSMUST00000049246 | C>T | p.Leu103Phe | chr6:108,075,853 | 0.422 | 0.0 (intolerant/deleterious) |
| Setmar | ENSMUST00000049246 | T>G | p.Ser273Arg | chr6:108,076,365 | -0.355 | 0.3 (tolerant/benign) |
| Ogg1 | ENSMUST00000032406 | A>G | p.Thr95Ala | chr6:113,328,510 | -0.016 | 0.84 (tolerant/benign) |
| Mbd4 | ENSMUST00000032469 | C>T | p.Asp129Asn | chr6:115,849,644 | 2.28 | 0.02 (intolerant/deleterious) |

Table: Nonsynonymous mutations in DNA repair genes near the chr6 peak {#tbl:nonsyn-diffs}

We also considered the possibility that expression quantitative trait loci (eQTLs), rather than nonsynonymous mutations, could contribute to the C>A mutator phenotype associated with the locus on chromosome 6. Using GeneNetwork [@PMID:27933521] we mapped eQTLs for the four aforementioned DNA repair genes (as well as Mbd4) in a number of tissues, though we did not have access to expression data from germline cells. Notably, D alleles near the cosine distance peak on chromosome 6 were significantly associated with decreased Ogg1 expression in kidney, liver, hippocampus, and gastrointestinal tissues (Table @tbl:eqtl-results). Although these cis-eQTLs are challenging to interpret (given their tissue specificity and our lack of access to germline expression data), the presence of strong-effect cis-eQTLs for Ogg1 suggests that the C>A mutator phenotype observed in the BXDs may be mediated by regulatory, rather than protein-altering, variants.

Finally, we queried a dataset of structural variants (SVs) identified via high-quality, long-read assembly of inbred laboratory mouse strains [@doi:10.1016/j.xgen.2023.100291] and found 176 large insertions or deletions (>100 bp) within the 90% bootstrap confidence interval around the cosine distance peak on chromosome 6; none overlapped the exonic sequences of protein-coding genes.

One protein-coding gene involved in DNA repair (Rad18) harbored an intronic deletion within the interval on chromosome 6 (chr6:112,629,618-112,636,619); however, additional experimental evidence will be needed to probe the functional impact of this structural variant.

Evidence of epistasis between germline mutator alleles

Next, we more precisely characterized the effects of the chromosome 4 and 6 mutator alleles on mutation spectra in the BXDs. To pinpoint the mutation type(s) that underlay the significant cosine distance peak on chromosome 6, we compared the aggregate counts of each 1-mer mutation type (plus CpG>TpG) on BXD haplotypes with D genotypes at rs27509845 and either D or B genotypes at rs46276051. We found that C>A mutations were significantly enriched on BXD haplotypes with D genotypes at the chromosome 6 mutator locus, relative to those with B genotypes ($\chi^2$ statistic = 85.36, p = 2.48e-20). On average, C>A germline mutation fractions were significantly higher in BXDs with D alleles at both mutator loci than in BXDs with D alleles at either locus alone (Figure {@fig:spectra-comparison}a and @fig:spectra-comparison-all). Among BXDs with B alleles at the locus overlapping Mutyh, those with D alleles on chromosome 6 did not exhibit significantly elevated C>A mutation fractions (Figure {@fig:spectra-comparison}a). After controlling for inbreeding duration, we observed that C>A de novo mutation counts were always highest in BXDs with D alleles at both mutator loci (Figure {@fig:spectra-comparison}b). After 100 generations of inbreeding, BXDs with D alleles at both mutator loci were predicted to have 238.9 C>A mutations (95% CI: 231.4 - 246.4), about 20% more than the 199.0 mutations (95% CI: 193.3 - 204.7) predicted in those with D and B alleles at the chromosome 4 and chromosome 6 loci, respectively (Figure {@fig:spectra-comparison}b).
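An enrichment test of this kind can be illustrated with a chi-square test on a 2 x 2 table of aggregate counts (C>A versus all other mutation types, on haplotypes with D versus B genotypes). The sketch below is an assumption about the form of the test, not a reproduction of the analysis: it applies no continuity correction, the function names are hypothetical, and the toy counts are invented.

```python
import math

def chi2_2x2(table):
    """Pearson chi-square statistic for a 2 x 2 table (no continuity correction).

    table: ((a, b), (c, d)), e.g. rows are genotype groups and columns are
    C>A versus non-C>A mutation counts.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

def chi2_sf_1df(statistic):
    """Upper-tail p-value of a chi-square statistic with one degree of freedom."""
    # For one degree of freedom, P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(statistic / 2.0))

# Hypothetical aggregate counts: (C>A, other) for two genotype groups.
toy_table = ((30, 70), (10, 90))
toy_statistic = chi2_2x2(toy_table)
toy_p_value = chi2_sf_1df(toy_statistic)
```

Evaluating `chi2_sf_1df` at the reported statistic of 85.36 gives a tail probability on the order of 1e-20, which is in line with the reported p-value and suggests a test with one degree of freedom.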

We also used SigProfilerExtractor [@PMID:36388765] to assign the germline mutations in each BXD to single-base substitution (SBS) mutation signatures from the COSMIC catalog [@PMID:30371878]. Mutation signatures often reflect specific exogenous or endogenous sources of DNA damage, and the proportions of mutations attributable to particular SBS signatures can suggest a genetic or environmental etiology. The SBS1, SBS5, and SBS30 mutation signatures were active in nearly all BXDs, regardless of genotypes at the chromosome 4 and 6 mutator loci (Figure {@fig:spectra-comparison}c). However, the SBS18 signature, which is dominated by C>A mutations and likely reflects unrepaired DNA damage from reactive oxygen species, was almost exclusively active in mice with D alleles at the chromosome 4 locus; the highest SBS18 activity was observed in mice with D alleles at both mutator loci (Figure {@fig:spectra-comparison}c). SBS18 activity was lowest in mice with D alleles at the chromosome 6 mutator locus alone (Figure {@fig:spectra-comparison}c), further demonstrating that D alleles at this locus are not sufficient to cause a mutator phenotype.

To more formally test for statistical epistasis, we fit a generalized (Poisson) linear model predicting counts of C>A mutations in each BXD as a function of genotypes at rs27509845 and rs46276051 (the markers with the largest adjusted cosine distance at the two mutator loci); the model also accounted for differences in inbreeding duration and sequencing coverage between the BXDs (Materials and Methods). A model that included an interaction term between genotypes at the two markers fit the data significantly better than a model including only additive effects (p = 7.92e-7; Materials and Methods), indicating that the combined effects of D genotypes at both loci exceeded the sum of marginal effects of D genotypes at either locus alone.
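To make the idea of effects that exceed the sum of marginal effects concrete, the sketch below computes a descriptive interaction contrast on the additive scale from per-generation C>A rates in the four two-locus genotype groups. It is not the generalized linear model described above and does not yield a p-value. The function names, the record layout, and the toy counts are hypothetical, and sequencing coverage is ignored.

```python
def group_rates(records):
    """C>A mutations per generation for each (chr4, chr6) genotype group.

    records: iterable of (chr4_genotype, chr6_genotype, c_to_a_count, generations).
    Under a Poisson model whose mean is proportional to generations, the pooled
    ratio computed here is the maximum-likelihood rate for each group.
    """
    totals = {}
    for chr4, chr6, count, generations in records:
        count_sum, generation_sum = totals.get((chr4, chr6), (0, 0))
        totals[(chr4, chr6)] = (count_sum + count, generation_sum + generations)
    return {key: c / g for key, (c, g) in totals.items()}

def interaction_contrast(rates):
    """Rate in D-D minus what additive D effects at each locus would predict.

    Zero means the two loci act additively; a positive value means the
    combined effect exceeds the sum of the marginal effects.
    """
    return (
        rates[("D", "D")]
        - rates[("D", "B")]
        - rates[("B", "D")]
        + rates[("B", "B")]
    )

# Hypothetical lines: (chr4 genotype, chr6 genotype, C>A count, generations inbred).
toy_records = [
    ("B", "B", 40, 40), ("B", "B", 60, 60),
    ("B", "D", 55, 50), ("B", "D", 55, 50),
    ("D", "B", 100, 50), ("D", "B", 100, 50),
    ("D", "D", 125, 50), ("D", "D", 125, 50),
]
toy_rates = group_rates(toy_records)
toy_contrast = interaction_contrast(toy_rates)
```

In this toy example the D-D rate is higher than the B-B rate plus the two single-locus increments, so the contrast is positive. The formal test replaces this descriptive contrast with a comparison of nested models that include or omit the interaction term.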

 BXD mutation spectra are affected by alleles at both mutator loci. a) C>A de novo germline mutation fractions in BXDs with either D or B genotypes at markers rs27509845 (chr4 peak) and rs46276051 (chr6 peak). Distributions of C>A mutation fractions were compared with two-sided Mann-Whitney U-tests; annotated p-values are uncorrected. B-B vs. B-D comparison: U-statistic = 149.0, p = 7.58e-2; B-D vs D-D comparison: U-statistic = 21.0, p = 2.61e-8; D-B vs D-D comparison: U-statistic = 232.5, p = 6.99e-5. b) The count of C>A de novo germline mutations in each BXD was plotted against the number of generations for which it was inbred. Lines represent predicted C>A counts in each haplotype group from a generalized linear model (Poisson family, identity link), and shading around each line represents the 95% confidence interval. c) Germline mutations in each BXD were assigned to COSMIC SBS mutation signatures using SigProfilerExtractor [@PMID:36388765]. After grouping BXDs by their genotypes at rs27509845 and rs46276051, we calculated the fraction of mutations in each group that was attributed to each signature. The proposed etiologies of each mutation signature are: SBS1 (spontaneous deamination of methylated cytosine nucleotides at CpG contexts), SBS5 (unknown, clock-like signature), SBS18 (damage by reactive oxygen species, related to SBS36 and defective base-excision repair due to loss-of-function mutations in MUTYH), and SBS30 (defective base-excision repair due to NTHL1 mutations). {#fig:spectra-comparison width=7.5in}

To explore the effects of the two mutator loci in other inbred laboratory mice, we also compared the germline mutation spectra of Sanger Mouse Genomes Project (MGP) strains [@PMID:21921910]. Dumont [@PMID:30753674] previously identified germline mutations that were private to each of the 29 MGP strains; these private variants likely represent recent de novo mutations (Figure {@fig:spectra-comparison-mgp}). Only two of the MGP strains possess D genotypes at both the chromosome 4 and chromosome 6 mutator loci: DBA/1J and DBA/2J. As before, we tested for epistasis in the MGP strains by fitting two linear models predicting C>A mutation counts as a function of genotypes at the two mutator loci. A model incorporating an interaction term did not fit the MGP data significantly better than a model with additive effects alone (p = 0.806), so we are unable to confirm the signal of epistasis; however, this may be due to the smaller number of MGP strains with de novo germline mutation data.

Some of the candidate mutator alleles are segregating in wild mice

To determine whether the candidate mutator alleles on chromosome 6 were segregating in natural populations, we queried previously published sequencing data generated from 67 wild-derived mice [@PMID:27622383]. These data include three subspecies of Mus musculus, as well as the outgroup Mus spretus. We found that the Ogg1 D allele was segregating at an allele frequency of 0.259 in Mus musculus domesticus, the subspecies from which C57BL/6J and DBA/2J derive the majority of their genomes [@PMID:17660819], and was fixed in Mus musculus musculus, Mus musculus castaneus, and the outgroup Mus spretus (Figure @fig:wild-afs). The Setmar p.Ser273Arg D allele was also present at an allele frequency of 0.37 in Mus musculus domesticus, while D alleles at the Setmar p.Leu103Phe variant were not observed in any wild Mus musculus domesticus animals. D alleles at the Mbd4 p.Asp129Asn variant were also absent from all wild mouse populations (Figure @fig:wild-afs).