Information on logarithm base value #2

Closed
minor7b5 opened this issue Jan 11, 2019 · 1 comment

@minor7b5

minor7b5 commented Jan 11, 2019

Could I please ask for a bit more information on the function of the base value: what it does, what impact higher or lower values will have, and correspondingly how I should choose a setting based on the structure of my dataset?

For extra information, my RNA-seq dataset comprises multiple tissues from various heterozygous individuals. I intend to take a multiple-k approach. Should I run ORNA separately with different values of k to generate a set of normalised reads for each assembly run, or would a single normalisation using the smallest k suffice?

Best wishes,
Reza

@SchulzLab
Owner

Dear Reza,
the base parameter sets the logarithm base that is used to determine the minimum number of times a k-mer must be retained in the reduced dataset. For example, consider a k-mer that occurs in 8 reads. With base=2, log_2(8)=3, so at least 3 reads must be kept. With base=10, log_10(8)≈0.9, so at least 1 read must be kept, because every value lower than one is set to 1.
Thus, the higher the base, the stronger the reduction.
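
The arithmetic above can be sketched in a few lines of Python. This is only an illustration, not ORNA's actual code: the function name `min_retained` is hypothetical, and rounding the logarithm up to the next integer is an assumption that happens to match the numbers quoted in this thread. The loop uses repeated multiplication instead of a floating-point logarithm so that exact powers (such as 8 = 2^3) are not affected by rounding error.

```python
def min_retained(abundance, base):
    """Hypothetical helper: smallest t with base**t >= abundance, but at least 1.

    This equals ceil(log_base(abundance)) clamped to a minimum of 1.
    Rounding up is an assumption, not a documented ORNA behaviour.
    """
    if abundance < 1:
        raise ValueError("abundance must be at least 1")
    if base <= 1:
        raise ValueError("base must be greater than 1")
    t = 0
    power = 1
    while power < abundance:
        power *= base
        t += 1
    return max(1, t)


# The same k-mer abundance under increasingly aggressive bases.
for base in (2, 3, 10):
    print(f"abundance=8, base={base}: keep at least {min_retained(8, base)} read(s)")
```

Calling the helper with a range of bases for the abundances typical of your data gives a quick feel for how sharply the retained coverage drops as the base grows.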

If you have a very large dataset, as it sounds in your case, you can easily go higher with the base parameter. For example, with base=3, a k-mer that occurs in 1000 reads would be retained in at least 7 of them (log_3(1000)≈6.3, rounded up).

Concerning multi-k assemblies, I would recommend using the same normalised data for all of them. I guess you are thinking that redoing the normalisation with the k-mer size of each individual assembly would ensure that k-mer connectivity is preserved for each of them. However, larger k-mer values also lead to less reduction if you use the same base parameter: the higher the value of k, the more unique k-mers there are in the dataset, and thus the more reads get preserved. To speed things up, I would stick with the smallest k-mer value used in the multi-k assembly, assuming of course that this is a reasonable value for your data.
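
To make the point about k and the number of unique k-mers concrete, here is a toy sketch in Python. The read and the helper `distinct_kmers` are made up for illustration. The sketch ignores reverse complements and sequencing errors, and on real data the size of the effect depends on read length and on how repetitive the transcripts are.

```python
def distinct_kmers(reads, k):
    """Count distinct k-mers across all reads (forward strand only)."""
    seen = set()
    for read in reads:
        # A read of length L contains L - k + 1 overlapping k-mers.
        for i in range(len(read) - k + 1):
            seen.add(read[i:i + k])
    return len(seen)


# Hypothetical repetitive toy read, not real data.
toy_reads = ["ACACACGTGTGT"]

for k in (2, 3, 4):
    print(f"k={k}: {distinct_kmers(toy_reads, k)} distinct k-mers")
```

On this toy read the count rises from 5 to 6 to 7 as k goes from 2 to 4, because short k-mers collapse the repeats while longer ones tell them apart.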

Hope that helps,
Marcel
