Kamil S Jaroň edited this page May 20, 2020 · 6 revisions

Imagine a tetraploid genome. Every biallelic heterozygous locus will have one of two possible genotypes: either three haplocopies are the same and the last one carries a variant (what I call AAAB), or there are two pairs of identical haplocopies (AABB). If two of the haplotypes have undergone a deletion, the remaining two haplotypes can carry heterozygous loci in a diploid state (AB). If the deletion is polymorphic in the population, it might happen that, although the genome is tetraploid, several heterozygous loci are triploid-like (AAB). Furthermore, similar logic applies to local (gene) duplications, which lead to loci with more than four copies. The goal of smudgeplot is to visualize the genome structure by quantifying the relative frequencies of the haplotype structures described above.

In smudgeplot we extract all the genomic kmers found in the read set that are the same in all but one nucleotide and form a unique pair (i.e. good candidates for biallelic loci: pairs of kmers one SNP apart from each other). In our tetraploid example, the sum of coverages of most of these kmer pairs will be ~4n (loci in AAAB and AABB states), but we will certainly also find some pairs at ~3n for AAB and ~2n for AB, where n is the sequencing coverage of every haplotype in the genome (sometimes referred to as haploid coverage). The second clue to understanding genome structure is the coverage ratio within the pair, that is, the coverage of the less covered kmer (B) relative to the coverage sum (A + B). For AABB and AB the ratio will be ~0.5, for AAB ~0.33 and for AAAB ~0.25. So every heterozygosity structure has a unique mean in the two-dimensional distribution (the smudgeplot), and by quantifying the intensity (number of kmer pairs) associated with each smudge we can guess the haplotype structure.
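To make the arithmetic concrete, here is a minimal sketch that computes where each smudge is expected to sit. It is only an illustration, not smudgeplot's own code: the function name `expected_smudge` and the haploid coverage of 50 are made up for the example, and the ratio is assumed to be the minor kmer coverage divided by the coverage sum, which is what the values quoted above (0.5, 0.33, 0.25) imply.

```python
def expected_smudge(structure, n):
    """Return the expected (coverage sum, coverage ratio) of a structure.

    structure: a string of A's and B's such as "AAB" (hypothetical notation
               helper, mirroring the labels used in the text).
    n:         haploid coverage, i.e. the coverage of a single haplotype.
    """
    copies = len(structure)  # total number of haplocopies at the locus
    # The less frequent allele defines the "minor" kmer of the pair.
    minor_copies = min(structure.count("A"), structure.count("B"))
    # Assumption: ratio = minor coverage / (minor + major coverage).
    return copies * n, minor_copies / copies


haploid_coverage = 50.0  # assumed value, for illustration only
for structure in ("AB", "AAB", "AAAB", "AABB"):
    total, ratio = expected_smudge(structure, haploid_coverage)
    print(f"{structure}: coverage sum ~{total:.0f}, ratio ~{ratio:.2f}")
```

Note that AAAB and AABB share the same coverage sum and are only separated by the ratio, while AB and AABB share the same ratio and are only separated by the sum, which is why both dimensions are needed.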

Smudgeplot is nothing more than a plot of coverage sums vs coverage ratios of the "heterozygous" kmer pairs. I have quoted the word heterozygous because these kmer pairs might also represent a pair of paralogs, or even a genomic kmer paired with a kmer generated by a sequencing error (see choosing L and U for details about this).
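As a rough illustration of how a single kmer pair maps onto the plot, the sketch below turns the two coverages of a pair into a (coverage sum, coverage ratio) point and then naively guesses the structure by rounding. The helper names `pair_to_point` and `guess_structure` are hypothetical, the haploid coverage is assumed to be known in advance, coverages are assumed to be positive, and the rounding rule is a simplification for this example rather than the way smudgeplot itself assigns pairs to smudges.

```python
def pair_to_point(cov1, cov2):
    """Map the coverages of one kmer pair to (coverage sum, coverage ratio)."""
    minor, major = sorted((cov1, cov2))
    total = minor + major
    # Assumption: ratio = minor coverage / coverage sum, so it lies in (0, 0.5].
    return total, minor / total


def guess_structure(cov1, cov2, n):
    """Naively guess a structure label such as "AAB" for one kmer pair.

    n is the (assumed known) haploid coverage. This is a toy rule:
    round the coverage sum to a whole number of haplocopies, then round
    the share of the minor kmer to a whole number of copies.
    """
    total, ratio = pair_to_point(cov1, cov2)
    copies = max(2, round(total / n))            # a pair needs at least 2 copies
    minor_copies = max(1, round(ratio * copies))  # the minor kmer exists at least once
    return "A" * (copies - minor_copies) + "B" * minor_copies


# Made-up coverages with an assumed haploid coverage of 50:
for cov_a, cov_b in [(51, 49), (98, 52), (148, 52), (101, 99)]:
    print(cov_a, cov_b, pair_to_point(cov_a, cov_b), guess_structure(cov_a, cov_b, 50))
```

A pair of paralogs or a pair involving a sequencing error would still get a label from this toy rule, which is exactly why the word heterozygous is quoted above.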

This is the general idea. For individual examples of smudgeplots and their interpretations, see the tutorials or check out the first talk of this webinar, where the idea behind genome profiling, and smudgeplot in particular, is explained.