Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Salix RAD-data smudgeplots #122

Closed
KamilSJaron opened this issue Aug 7, 2023 · 6 comments
Closed

Salix RAD-data smudgeplots #122

KamilSJaron opened this issue Aug 7, 2023 · 6 comments

Comments

@KamilSJaron
Copy link
Owner

KamilSJaron commented Aug 7, 2023

Hi, I see that the issue is closed long before. But i need some recommendations regarding my Smudgeplots. Because they are not making sense and I am lost at the moment. So i though it might be helpful to write you all. Please help me!

I have a 120bp single-end RadSeq data (>30x coverage) for a number of Salix species, which I want to determine the ploidy level using the Smudgeplot. I run following commands:

kmc -k21 -t16 -m64 -ci1 -cs10000 "${file}.fastq" "${file}_kmcdb" tmp
    kmc_tools transform "${file}_kmcdb" histogram "${file}_kmcdb_k21.hist" -cx10000
    L=$(smudgeplot.py cutoff "${file}_kmcdb_k21.hist" L)      # Determines automatically; L should be like 20 - 200
    U=$(smudgeplot.py cutoff "${file}_kmcdb_k21.hist" U)     # U should be like 500 - 3000			    
    kmc_tools transform "${file}_kmcdb" -ci"$L" -cx"$U" reduce "${file}_kmcdb_L${L}_U${U}"
    kmc_dump "${file}_kmcdb_L${L}_U${U}" "${file}_kmcdb_L${L}_U${U}_coverages.tsv" "${file}_kmcdb_L${L}_U${U}_pairs.tsv" > "${file}_kmcdb_L${L}_U${U}_familysizes.tsv"
    kmc_tools transform "${file}_kmcdb" -ci"$L" -cx"$U" dump -s "${file}_kmcdb_L${L}_U${U}.dump"
    smudgeplot.py hetkmers -o "${file}_kmcdb_L${L}_U${U}" < "${file}_kmcdb_L${L}_U${U}.dump"
    smudgeplot.py plot "${file}_kmcdb_L${L}_U${U}_coverages.tsv" -o "${file}_kmcdb_L${L}_U${U}_smudgeplot" -t "${file}" -q 0.99

and then got this result for Salix retusa (NW17_076_L10_U520 (x41 coverage) and T2221 (x46) :
S_retusa_NW17_076_kmcdb_L10_U520_smudgeplot_smudgeplot_log10
S_retusa_NW17_076_kmcdb_L10_U520_smudgeplot_smudgeplot

This species/individuals should be an octoploid according to our flow cytometry, but the Smudgeplot suggests triploid or diploid.

I did not give a constant number for L and U, because it showed me an error:
"detecting two smudges at the same positions, not enough data for this number of bins lowering number of bins to 35
detecting two smudges at the same positions, not enough data for this number of bins lowering number of bins to 30 ...".

When I checked GenomeScope, for both individuals, it failed to converge:
Screen Shot 2023-08-07 at 10 54 51 AM
Screen Shot 2023-08-07 at 10 56 05 AM

I have more than 100 individuals that are not fitting to my expectation of ploidy level.
So my question is, what am I possibly doing wrong?
What could be adjusted to get more precise estimation?

Originally posted by @OyukaKh in #17 (comment)

@KamilSJaron
Copy link
Owner Author

The other reply by @OyukaKh.

Here are the Smudgeplots for S_retusa_T2221 (second example):

S_retusa_T2221_kmcdb_L54_U400_smudgeplot_smudgeplot_log10 S_retusa_T2221_kmcdb_L54_U400_smudgeplot_smudgeplot

@KamilSJaron
Copy link
Owner Author

Hi @OyukaKh,

for new problems we like to have new opened issues, so I moved your messages here.

Nor smudgeplot or genomescope is designed to work on RAD data, sorry it was no on the wiki, I added a section it to FAQ now.

The last sample S_retusa_T2221 looks like it could work, but for some reason the y axis is very stretched and the smudge is not well separated from the error line, is there a chance you could post genomescope of that sample as well?

@KamilSJaron
Copy link
Owner Author

Duplicit to #121

@OyukaKh
Copy link

OyukaKh commented Aug 7, 2023

Thanks for the prompt reply and sad to hear that it does not work (easily) with RadSeq.

But, sure second individual's GenomeScope is here again:

Screen Shot 2023-08-07 at 10 56 05 AM

  • Nevertheless, I have some individuals, where GenomeScope and Smudgeplot both functioned properly!
    (or error rate is too high???, see GenomeScope)

E.g., S_alpina_10620 (diploid) (77x coverage)

S_alp_10620_IT_1_kmcdb_L20_U500_smudgeplot_smudgeplot_log10
S_alp_10620_IT_1_kmcdb_L20_U500_smudgeplot_smudgeplot

Screen Shot 2023-08-07 at 4 47 16 PM

You wrote: "In principle with removed optical duplicates and with careful consideration of the data one could use the same principles to determine ploidy",

Could you suggest how to do it? How can I remove duplicates optically?

@KamilSJaron
Copy link
Owner Author

I am by no means expert on RADseq, and as a matter of fact, I foolishly wrote optical duplicates, instead of just "duplicates", because in RAD most of them will be PCR duplicates. Meaning, it's the same template DNA that was sequenced twice, those should not be weighted as the two independent observation of the same genomic location and they do contribute to coverage variation.

Is it ddRAD or sdRAD? I think for ddRAD it's quite complicated (if not impossible), but for the single end, you can remove all the fragments of the exactly same size (you expect the trailing end will vary in size). I really don't know what tools people use for this (I know it from talks only).

But most of all, I would really try some reference approach. There is at least one chromosomal willow out there (https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_029030765.1/). I think you would have easier times looking at aligned piles or reads than k-mers. The gneome is not too large anyway, so hopefully it will easy to align too.

@OyukaKh
Copy link

OyukaKh commented Aug 8, 2023

I see! Thanks a lot for the suggestion and prompt replies!
We have sdRAd data. So, we would go for aligning :-).
Thanks Kamil!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants