Genome assembly twice the expected size #740

vaillan6 · 2024-12-10T23:46:26Z

Hello,
Thank you in advance for any input on this issue. I am assembling a diploid plant species, and my assembly is twice the estimated size and highly fragmented, with a small N50 and a high duplication rate. I have run numerous hifiasm assemblies (30+) without coming across this issue before, and I am struggling to figure out what is going on with this species. Please note that the results are the same with hifiasm versions 0.19.9-r616, 0.20.0-r639, and 0.23.0-r691. I have also tried adjusting the homozygous read coverage and -s, varying the HiFi read length cutoff (all reads, 5 kb, 10 kb, 15 kb, 20 kb, 25 kb), varying the k-mer size, and assembling with and without Hi-C data. The estimated genome size is 1.1 Gb via flow cytometry. This species has also been assembled twice before, confirming this estimate.

Example assembly statistics (all are quite similar to this), hifiasm v 0.23.0:

haplotype	coverage	homozygous read coverage threshold	Assembly size	Number of contigs	N50	BUSCO
1	59	53	2,934,151,061	8,691	1,034,973
2	59	53	2,533,205,589	10,879	578,920
primary	59	53	2,931,051,231	4,103	2,515,704	C:98.6%[S:5.0%,D:93.6%],F:1.1%,M:0.3%,n:1614,E:2.4%
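
For readers who want to quantify the mismatch, here is a minimal sketch that uses only the numbers quoted above (the 1.1 Gb flow cytometry estimate, the primary assembly size, and the BUSCO summary line). The helper name `parse_busco` is made up for this illustration and is not part of BUSCO or hifiasm.

```python
import re

# Values copied from the post above.
ESTIMATED_SIZE_BP = 1_100_000_000   # 1.1 Gb flow cytometry estimate
PRIMARY_SIZE_BP = 2_931_051_231     # primary assembly size from the table
BUSCO = "C:98.6%[S:5.0%,D:93.6%],F:1.1%,M:0.3%,n:1614,E:2.4%"


def parse_busco(summary):
    """Return the percentage fields of a BUSCO one-line summary as floats.

    Hypothetical helper for illustration; fields without a '%' sign
    (such as n:1614) are skipped.
    """
    return {key: float(val) for key, val in re.findall(r"([A-Z]):([\d.]+)%", summary)}


busco = parse_busco(BUSCO)

# How many times larger the primary assembly is than the estimate.
size_ratio = PRIMARY_SIZE_BP / ESTIMATED_SIZE_BP

# Share of the complete BUSCOs that were found in more than one copy.
dup_fraction = busco["D"] / busco["C"]

print(f"primary / estimated size: {size_ratio:.2f}x")
print(f"complete BUSCOs that are duplicated: {dup_fraction:.1%}")
```

With these inputs the primary assembly comes out at roughly 2.7 times the estimate, and nearly all complete BUSCOs are duplicated, so most of the gene space appears to be present in at least two copies in the primary contigs.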

The kmer plot is not something I have seen before either:

I have also run contamination checks (Kraken), GenomeScope, PanDepth (HiFi reads vs. the published assembly), Smudgeplot, etc., and can share any of those results if they would be helpful.

Genomescope of HiFi reads, kmer of 51

Thank you for any help/tips anyone may be able to provide.

HuiyangYu · 2024-12-12T13:22:49Z

Based on the information you provided and the three images, it appears that your HiFi data contain a significant number of low-quality reads. To validate this hypothesis, rerun GenomeScope using the k-mer frequencies of Illumina short reads. If the hypothesis is correct, you can filter your HiFi reads using Filtlong in combination with the short reads. Finally, you can reassemble the filtered reads with hifiasm.

dnawhisperer · 2025-01-27T23:57:15Z

From my experience with plants, this is likely a polyploid. A doubled genome size and a BUSCO score of 5% single and 94% duplicated are an indication: you have a tetraploid and it's being output as a diploid. Most likely it is an autotetraploid, as you can't see multiple peaks in the GenomeScope plot. But that is also probably because the coverage per haplotype is too low for GenomeScope (and Smudgeplot), and that peak of 'errors' is actually true unique k-mers. These tools are ideally run on high-coverage Illumina data.
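
To make the "low coverage per haplotype" point concrete, here is a rough sketch. It assumes the threshold of 53 reported in the original post is the read depth at which all haplotype copies coincide, which is an assumption and is exactly what is in doubt in this thread. The function name `per_haplotype_coverage` is made up for this illustration.

```python
# Homozygous read coverage threshold from the table in the original post.
# Assumption: this is the depth where all haplotypes collapse together.
HOM_COV = 53


def per_haplotype_coverage(hom_cov, ploidy):
    """Approximate read depth available to each individual haplotype."""
    return hom_cov / ploidy


for ploidy in (2, 4):
    depth = per_haplotype_coverage(HOM_COV, ploidy)
    print(f"ploidy {ploidy}: ~{depth:.2f}x per haplotype")
```

Under that assumption a diploid would have about 26.5x per haplotype and a tetraploid about 13.25x, which is the range where k-mer profiling of HiFi reads can struggle to separate real haplotype-specific k-mers from errors.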

Try --n-hap 4, then run BUSCO.

I could be wrong: if --hom-cov is incorrect (too low), hifiasm outputs too much of the same sequence. Check how that value was calculated, given that GenomeScope could be giving unreliable results here.

Alternatively, a disaster has happened in the lab and leaves of two specimens/species have been mixed together.

All the best
