-
Notifications
You must be signed in to change notification settings - Fork 24
smudgeplot hetkmers
This is an algorithm that extracts kmer pairs from kmer dump. Although the algorithm does not take any parameters, it's important to realize that the output is heavily dependent on choice of L and U when the kmer dump was generated, look at wikipage chosing L and U for details.
usage: smudgeplot hetkmers [-h] [-o O] [--middle] [infile]
Calculate unique kmer pairs from a Jellyfish or KMC dump file.
positional arguments:
infile Alphabetically sorted Jellyfish or KMC dump file (stdin).
optional arguments:
-h, --help show this help message and exit
-o O The pattern used to name the output (kmerpairs).
--middle Get all kmer pairs that are exactly the same but in the middle nt. When this flag is used, the input dump must be alphabetically sorted/ (default: different by SNP at any position).
The memory required scale linearly with the number of kmers and it is approximately 15x higher than the size of the dump file (for a 20Gb dump file you will need approx ~250Gb of RAM). Alternatively, you can estimate the RAM requirement by number of dumped kmers. It's approximately 350x higher than number of kmers in the dump file. If your file has too many kmers you can decrease computational requirement by rerunning the kmer spectra with a smaller k (i.e. kmer size) or by more strict filtering of the dumped kmers (higher L and smaller U).
We have not calculated the complexity of the algorithm yet. Usually for smaller genomes (<250Gb) it's couple of hours, the longest computation took bit more than one day.
The biggest genome we analyzed so far was a triplod genome with a haploid size 3.5Gbp. We have processed 1.5e9 genomic kmers and it have required 520GB of memory and two days of computation on eight cores.
The default algorithm extracts kmers pairs different by any nt, which mean that a heterozygous SNP might get into k
kmer pairs. Here we simplify algorithm to extract only the kmer pairs that are different by the middle nucleotide. This means that we will miss all the heterozygous loci where the SNPs are closer than k / 2
nts from each other. Long story short, it seems that it moreless works, the preliminary results were very promising, but many species that were very heterozygous or just strange the middle kmer algorithm did not work, including our awesome strawberry example. Therefore I would say, everytime you have the resources to extract kmer pair with the default algorithm, do so.
This algorithm was introduced as a way how to speed up the extraction process and reduce the memory requirements to a moreless constant (it's less than 10G). And in this sense we works very well, we managed to process ~8.5Gbp genome in just a few hours. So if you feel like experimenting, you can fiddle with --middle
.
One really important note, you need to have the kmer dump on input alphabetically sorted (kmc has not problem in generate the dump sorted).