Could I please ask for a bit more information on the function of the base value - what it does and what impact higher or lower values will have, and correspondingly how I should decide a setting based on the structure of my dataset?
For extra information, my RNA-seq dataset comprises multiple tissues from various heterozygous individuals. I intend to use a multiple-k approach. Should I run ORNA several times with different values of k to generate a separate set of normalised reads for each assembly run, or would a single normalisation using the smallest k suffice?
Best wishes,
Reza
Dear Reza,
The base parameter sets the logarithm base used to determine the minimum number of reads in which a k-mer must be retained in the reduced dataset. For example, consider a k-mer that occurs in 8 reads. With base=2, log_2(8)=3, so at least 3 of those reads must be kept. With base=10, log_10(8)≈0.9, so at least 1 read must be kept, because every value lower than 1 is set to 1.
Thus, the higher the base, the stronger the reduction.
If you have a very large dataset, as seems to be the case here, you can comfortably go higher with the base parameter. For example, with base=3, a k-mer that occurs in 1000 reads would be retained in at least 7 of them (log_3(1000)≈6.3, rounded up).
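To get a feeling for how the base affects the retention threshold, here is a small Python sketch of the arithmetic above. It is not ORNA's code: the function name `min_reads_to_keep` is made up for illustration, and rounding the logarithm up to the next integer is an assumption inferred from the base=3 example (1000 reads giving 7).

```python
def min_reads_to_keep(abundance: int, base: float) -> int:
    """Smallest integer t >= 1 with base**t >= abundance.

    This equals ceil(log_base(abundance)) with a floor of 1.
    Hypothetical helper for illustration only, not part of ORNA;
    the round-up rule is an assumption based on the examples above.
    """
    if abundance < 1:
        raise ValueError("abundance must be at least 1")
    if base <= 1:
        raise ValueError("base must be greater than 1")

    threshold = 1
    power = base
    # Multiply instead of calling math.log to avoid floating-point
    # surprises when abundance is an exact power of the base.
    while power < abundance:
        power *= base
        threshold += 1
    return threshold


if __name__ == "__main__":
    # Compare a few bases for k-mers of different abundance.
    for abundance in (8, 1000, 100000):
        for base in (2, 3, 10):
            kept = min_reads_to_keep(abundance, base)
            print(f"abundance={abundance:>6} base={base:>2} -> keep at least {kept}")
```

Running it for your own expected abundances shows how quickly the threshold shrinks as the base grows, which may help when picking a value for a large dataset.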
Concerning multi-k assemblies: I would recommend using the same normalised data for all of them. I guess your reasoning is that redoing the normalisation with the k value of each individual assembly would ensure that k-mer connectivity is preserved for each of them. However, larger k values also lead to less reduction for the same base parameter: the higher the value of k, the more unique k-mers there are in the dataset, and thus the more reads are preserved. To speed things up, I would stick with the smallest k value used in the multi-k assembly, assuming of course that this is a reasonable value for your data.