Merged libraries do not show lower-count k-mers #30

bnavarrodominguez · 2024-12-18T16:48:47Z

Dear Gene,

I have a large sequencing library that I needed to split into 10 smaller files so I could run FastK on different nodes. Following the instructions in the README, I ran FastK on the split files with the following command:

for file in library_split_*; do mkdir tmp.${file}; FastK -v -t5 -k31 -M50 -T24 -Ptmp.${file} $file; done

This produced a *.hist and a *.ktab file for each *.split.fastq file. I looked at the k-mer count histogram for each split file:

Histex -G library_split_01.hist > library_split_01.histogram

$ head library_split_01.histogram
1       6062202409
2       3370987439
3       1728287765
4       894614808
5       482057568

I then merged the split files using Fastmerge, and generated histograms for the merged k-mer database:

Fastmerge -T12 -t -h library_fastmerged library_split_*ktab
Histex -G library_fastmerged.hist > library_fastmerged.histogram

$ head library_fastmerged.histogram

4       2032698049
5       522131235
6       134785342
7       33514609
8       420971175

I noticed that there are no k-mers with a count lower than 4 in the merged library histogram. I repeated the process a few times, combining different files, and the merged histograms consistently lack smaller k-mer counts (i.e., they start at 4 or 5). I’m unsure if this behavior is expected, as I do not understand why there are no single-occurrence k-mers. Is this a bug, or am I misunderstanding or misusing the tool?

Thanks for your assistance!

KamilSJaron · 2025-01-20T09:29:32Z

Hi @bnavarrodominguez,

I suspect this is because the -t option filters out k-mers whose count falls below a threshold; by default it is 1. From the README:

One can optionally request, by specifying the ‑t option, that FastK produce a sorted table of all canonical k‑mers along with their counts. If an integer follows then only those k‑mers that occur ‑t or more times where the default threshold is 1. In those applications where low count k‑mers are not needed this can save significant time and space as most such k‑mers are error‑mers.

So, while the histogram of each individual database will be right (the histogram is stored in its entirety), Fastmerge cannot merge the histograms themselves. It needs the tables of k-mer/count pairs, and those are affected by -t.
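To illustrate the effect, here is a toy Python sketch. It is not FastK's or Fastmerge's implementation: the function names, the sequences, and the values of k and t are all made up for the example, and k-mers are not canonicalised. It counts k-mers per split, drops entries below a cutoff the way a table written with -t would (the command in the original post passed -t5), and then sums the surviving tables.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count every k-mer in seq (toy version, no canonicalisation)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def threshold_table(counts, t):
    """Keep only k-mers seen at least t times, mimicking a table cutoff."""
    return Counter({kmer: c for kmer, c in counts.items() if c >= t})


def merge_tables(tables):
    """Sum the counts of matching k-mers across tables."""
    merged = Counter()
    for table in tables:
        merged.update(table)
    return merged


def histogram(counts):
    """Map each count value to the number of distinct k-mers with that count."""
    return dict(sorted(Counter(counts.values()).items()))


# Hypothetical "split files" and parameters, chosen only for illustration.
splits = ["ACGTACGTACGTAC", "ACGTACGTTTGCA", "TTGCATTGCAACG"]
k, t = 3, 2

full_tables = [count_kmers(s, k) for s in splits]
true_merged = merge_tables(full_tables)
thresholded_merged = merge_tables([threshold_table(c, t) for c in full_tables])

print("merged from full tables:       ", histogram(true_merged))
print("merged from thresholded tables:", histogram(thresholded_merged))
```

In this sketch the merge of thresholded tables cannot contain any count below t, and a k-mer that fell under the cutoff in some splits ends up with a lower count than it truly has, or disappears altogether.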

The solution would be to build a single database from the whole library. It will not use much more memory, and the compute time should scale very reasonably; you will just need more disk space. For such a large genome that could be more than a TB, but the run really should not take all that long, and you should be able to free the space afterwards.
