Genome assembly twice the expected size #740

Open
vaillan6 opened this issue Dec 10, 2024 · 2 comments

Comments

@vaillan6

Hello,
Thank you in advance for any input on this issue. I am assembling a diploid plant species, and my assembly is twice the estimated size, highly fragmented, with a small N50 and a high duplication rate. I have run more than 30 hifiasm assemblies on other species without coming across this problem, and I am struggling to work out what is going on with this one. The results are the same with hifiasm versions 0.19.9-r616, 0.20.0-r639, and 0.23.0-r691. I have also tried adjusting the homozygous read coverage and -s, applying different HiFi read length cutoffs (all reads, 5 kb, 10 kb, 15 kb, 20 kb, 25 kb), varying the k-mer size, and running with and without Hi-C data. The estimated genome size is 1.1 Gb by flow cytometry. This species has also been assembled twice before, and both assemblies confirm this estimate.

Example assembly statistics (all are quite similar to this), hifiasm v 0.23.0:

| Haplotype | Coverage | Homozygous read coverage threshold | Assembly size (bp) | Number of contigs | N50 (bp) | BUSCO |
|---|---|---|---|---|---|---|
| 1 | 59 | 53 | 2,934,151,061 | 8,691 | 1,034,973 | |
| 2 | 59 | 53 | 2,533,205,589 | 10,879 | 578,920 | |
| primary | 59 | 53 | 2,931,051,231 | 4,103 | 2,515,704 | C:98.6%[S:5.0%,D:93.6%],F:1.1%,M:0.3%,n:1614,E:2.4% |
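
For anyone who wants to put numbers on the size excess and the duplication rate, here is a minimal Python sketch. It uses only the figures quoted in this post (the 1.1 Gb flow-cytometry estimate, the three assembly sizes, and the BUSCO summary line). The helper name `parse_busco` is made up for illustration and is not part of BUSCO or hifiasm.

```python
import re

# Flow-cytometry estimate quoted in the post, in bp.
EXPECTED_GENOME_SIZE = 1.1e9

# Assembly sizes copied from the statistics above, in bp.
assembly_sizes = {
    "hap1": 2_934_151_061,
    "hap2": 2_533_205_589,
    "primary": 2_931_051_231,
}


def parse_busco(summary: str) -> dict:
    """Hypothetical helper: pull the C, S, D, F, M percentages out of a
    BUSCO one-line summary string."""
    return {key: float(val) for key, val in re.findall(r"([CSDFM]):([\d.]+)%", summary)}


busco = parse_busco("C:98.6%[S:5.0%,D:93.6%],F:1.1%,M:0.3%,n:1614,E:2.4%")

# How many times larger than the estimate is each assembly?
for name, size in assembly_sizes.items():
    print(f"{name}: {size / EXPECTED_GENOME_SIZE:.2f}x the expected size")

# Share of complete BUSCOs that are present in more than one copy.
dup_fraction = busco["D"] / busco["C"]
print(f"{dup_fraction:.1%} of complete BUSCOs are duplicated")
```

With these inputs the ratios come out at roughly 2.3x to 2.7x the flow-cytometry estimate, and about 95% of the complete BUSCOs are duplicated.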

The kmer plot is not something I have seen before either:
[2 images: k-mer plot]

I have also run contamination checks (Kraken), GenomeScope, PanDepth (HiFi reads vs. the published assembly), Smudgeplot, etc., and can share any of those results if they would be helpful.

GenomeScope of the HiFi reads, k-mer size 51:
[2 images: GenomeScope output]

Thank you for any help/tips anyone may be able to provide.

@HuiyangYu

Based on the information you provided and the three images, it looks like your HiFi data contain a significant number of low-quality reads. To test this hypothesis, rerun GenomeScope on k-mer frequencies from Illumina short reads. If the hypothesis holds, you can filter the HiFi reads with Filtlong, using the short reads as a reference, and then reassemble the filtered reads with hifiasm.

@dnawhisperer

dnawhisperer commented Jan 27, 2025

From my experience with plants, this is likely a polyploid. A doubled genome size and a BUSCO score of 5% single-copy and 94% duplicated both point that way: you have a tetraploid and it's being output as a diploid. Most likely it's an autotetraploid, since you can't see multiple peaks in the GenomeScope plot. But that is probably also because the coverage per haplotype is low for GenomeScope (and Smudgeplot), and that peak of 'errors' is actually true unique k-mers. These tools are ideally run on high-coverage Illumina data.

Try --n-hap 4, then run BUSCO.

I could be wrong; if --hom-cov is incorrect (too low), hifiasm outputs too much of the same sequence. Check how you calculated that, given that GenomeScope could be giving unreliable results here.
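
One rough way to sanity-check a homozygous coverage value is plain arithmetic: for a diploid, the depth over homozygous regions should be close to total read bases divided by haploid genome size. The Python sketch below assumes the 1.1 Gb flow-cytometry figure is the haploid size and uses the 59 from the "coverage" column of the original post as the value to check. Both are assumptions, and the function names are made up for illustration.

```python
def expected_hom_cov(total_read_bases: float, haploid_genome_size: float) -> float:
    """Approximate depth over homozygous regions of a diploid:
    total sequenced bases divided by haploid genome size."""
    return total_read_bases / haploid_genome_size


def implied_read_bases(hom_cov: float, haploid_genome_size: float) -> float:
    """Inverse check: total read bases that a given homozygous coverage implies."""
    return hom_cov * haploid_genome_size


HAPLOID_SIZE = 1.1e9  # assumption: the flow-cytometry estimate is the haploid size (bp)
REPORTED_COV = 59     # assumption: the "coverage" column is the homozygous peak

implied = implied_read_bases(REPORTED_COV, HAPLOID_SIZE)
print(f"A homozygous coverage of {REPORTED_COV} implies about {implied / 1e9:.1f} Gb of HiFi reads")
```

If the actual HiFi yield is far from the implied total, either the coverage value or the genome size assumption deserves a second look.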

Alternatively, a disaster has happened in the lab and leaves from two specimens or species have been mixed together.

All the best
