Merged libraries do not show lower-count k-mers #30

Open
bnavarrodominguez opened this issue Dec 18, 2024 · 1 comment

Comments

@bnavarrodominguez

Dear Gene,

I have a large sequencing library that I needed to split into 10 smaller files so I could run FastK on different nodes. Following the instructions in the README, I ran FastK on the split files with the following command:

for file in library_split_*; do mkdir tmp.${file}; FastK -v -t5 -k31 -M50 -T24 -Ptmp.${file} $file; done

This produced a *.hist and a *.ktab file for each *.split.fastq file. I looked at the k-mer count histogram for each split file:

Histex -G library_split_01.hist > library_split_01.histogram
$ head library_split_01.histogram
1       6062202409
2       3370987439
3       1728287765
4       894614808
5       482057568

I then merged the split files using Fastmerge, and generated histograms for the merged k-mer database:

Fastmerge -T12 -t -h library_fastmerged library_split_*ktab
Histex -G library_fastmerged.hist > library_fastmerged.histogram
$ head library_fastmerged.histogram

4       2032698049
5       522131235
6       134785342
7       33514609
8       420971175

I noticed that there are no k-mers with a count lower than 4 in the merged library histogram. I repeated the process a few times, combining different files, and the merged histograms consistently lack smaller k-mer counts (i.e., they start at 4 or 5). I’m unsure if this behavior is expected, as I do not understand why there are no single-occurrence k-mers. Is this a bug, or am I misunderstanding or misusing the tool?

Thanks for your assistance!

@KamilSJaron

Hi @bnavarrodominguez,

I suspect this is because the -t option filters out k-mers whose count falls below a certain threshold (by default it's 1). From the README:

One can optionally request, by specifying the ‑t option, that FastK produce a sorted table of all canonical k‑mers along with their counts. If an integer follows then only those k‑mers that occur ‑t or more times where the default threshold is 1. In those applications where low count k‑mers are not needed this can save significant time and space as most such k‑mers are error‑mers. 

So, while the histogram of each individual database is correct (the histogram is stored in its entirety), Fastmerge cannot work from the histograms: it needs the tables of k-mer/count pairs, and those are affected by -t.

The solution would be to build a single database. It will not use a lot more memory, and the compute time should scale very reasonably; you will just need more disk space. For such a large genome that could be more than a TB, but the run really should not take all that long, and you should be able to free the space afterwards.
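
To make the effect concrete, here is a minimal toy sketch in Python. It is not FastK or Fastmerge code; the k-mers, counts, and helper names (`build_table`, `merge_tables`, `histogram`) are invented for illustration, and it assumes only what the README excerpt above says: a table built with threshold t keeps only k-mers seen t or more times in that one input.

```python
from collections import Counter

def build_table(kmer_counts, threshold):
    """Toy stand-in for a thresholded k-mer table: keep k-mers with count >= threshold."""
    return {kmer: c for kmer, c in kmer_counts.items() if c >= threshold}

def merge_tables(tables):
    """Sum the counts of each k-mer across all tables."""
    merged = Counter()
    for table in tables:
        merged.update(table)  # Counter.update adds counts from a mapping
    return merged

def histogram(table):
    """Map each count value to the number of distinct k-mers with that count."""
    return Counter(table.values())

# Hypothetical per-split counts (made up for this example).
split_a = {"AAAC": 1, "ACGT": 5, "CCGA": 2, "GGTA": 7}
split_b = {"AAAC": 1, "ACGT": 4, "CCGA": 3, "TTGC": 1}

threshold = 4  # assumed threshold, chosen to mirror the merged histogram above

# Merge of thresholded tables: what a table-based merge can see.
merged = merge_tables([build_table(s, threshold) for s in (split_a, split_b)])
hist = histogram(merged)

# Merge of unfiltered counts: what a single run over all the data would see.
true_merged = merge_tables([split_a, split_b])
true_hist = histogram(true_merged)

print("from thresholded tables:", sorted(hist.items()))
print("from unfiltered counts: ", sorted(true_hist.items()))
```

In this toy model the merged histogram cannot contain any count below the threshold, because every surviving table entry is already at or above it. It also shows a second effect: "CCGA" occurs 5 times in total (2 + 3) but is dropped from both tables, so it is missing from the merged result even though its true count is above the threshold.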
