Question about when to use CLR or ILR #11
Hey Josh! Thank you for the kind words. Yes, that podcast was some time ago; it was good fun to do, though I've learned a lot since then and sometimes wish I could cut a second take ^_^. But anyway, here is a very warm welcome to the exciting world of CoDa! These are all good questions, and I'll do my best to answer them here.

(1) The IQLR is a construct introduced by ALDEx2. I think it's a very cool idea, although it's not, strictly speaking, an orthodox CoDa method. The IQLR attempts to normalize the data, not unlike the effective library size normalization (ELSN) used by edgeR/DESeq2 (we discuss the similarities between CLR/IQLR and ELSN in the supplement of our Bioinformatics review). Like ELSN, a single reference is computed for the whole data set, so I don't think it would be considered "sample agnostic" as you say. Below is my implementation of it (as used in the propr package), which may make it clearer how exactly it works.
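In outline, the IQLR reference is computed like this (a rough sketch of the same logic rather than the exact propr source; it assumes a samples-by-features count matrix `counts` with no zero counts):

```r
# Rough sketch of the IQLR reference (illustration only; see propr for the real code).
# `counts`: samples-by-features matrix of strictly positive counts.
iqlr_sketch <- function(counts){
  logx <- log(counts)
  # 1) CLR-transform each sample, then take the per-feature variance
  clr <- logx - rowMeans(logx)
  v <- apply(clr, 2, var)
  # 2) Features whose CLR variance falls inside the inter-quartile range
  #    form a single reference set shared by the whole data set
  q <- quantile(v, probs = c(0.25, 0.75))
  ref <- which(v >= q[1] & v <= q[2])
  # 3) Re-center each sample by the log geometric mean of that reference set
  logx - rowMeans(logx[, ref, drop = FALSE])
}
```

The point of step 2 is that only "stable" features (as judged by CLR variance) end up in the denominator, which is what makes the IQLR behave a bit like an effective library size normalization.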
(2) I would be inclined to apply the CLR to each sample as produced from a single experiment. You can then fraction off the features into sub-groups after the fact, e.g., by clustering the features. That said, if your features are related by a phylogenetic tree, you could first add similar features together before transforming the data. For example, when I analyze 16S data, I will CLR-transform all species and do an analysis. Meanwhile, separately, I will add the species-level features to get genus-level features, then CLR-transform the genera and do an analysis. Adding before the CLR is called amalgamation. (Note that adding after is actually multiplication because of the log-transform.) One exception here is if you have multi-omics data, in which case I strongly recommend you apply the CLR to each data modality separately before integrating the analysis. FYI, we've recently written specifically about multi-omics data integration here if that's your cup of tea: https://www.biorxiv.org/content/10.1101/847475v1

(3) Very good question. In both cases, you do end up with the same NxN distance matrix. I'm not enough of a pure mathematician to speak to why they are the same. However, the Aitchison distance is also equivalent to the total distance between all pairwise log-ratios. Apparently, it's possible to refactor each form (i.e., the pairwise LR distance, the CLR distance, and the ILR distance) to be equivalent, though I admit I never tried. My rough intuition is that since each transformation represents an equivalent view of the entire data, albeit from a different perspective, the distance remains the same regardless of the transformation. (Note that, on the other hand, the ALR distance is actually different...) FYI, this is a short and sweet article on compositional distance that I enjoyed reading: https://www.researchgate.net/publication/226002937_Logratio_Analysis_and_Compositional_Distance

I hope this helps, and do let me know if you have some more questions!
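For anyone who wants to convince themselves numerically that the CLR and an ILR give the same distances, here is a tiny check on made-up data (the Helmert-based basis below is just one of many valid orthonormal bases):

```r
# Numerical check: the Aitchison distance is the same from the CLR and from an ILR.
set.seed(1)
X <- matrix(rexp(5 * 4), nrow = 5)        # 5 toy samples, 4 parts
X <- X / rowSums(X)                       # close the data to proportions
clr <- t(apply(X, 1, function(x) log(x) - mean(log(x))))
# One orthonormal basis of the CLR plane, built from Helmert contrasts
V <- apply(contr.helmert(ncol(X)), 2, function(v) v / sqrt(sum(v^2)))
ilr <- clr %*% V                          # ILR coordinates (D - 1 columns)
max(abs(as.matrix(dist(clr)) - as.matrix(dist(ilr))))   # ~ 0: same N x N distances
```

The algebraic reason matches the intuition above: CLR differences all lie in the hyperplane orthogonal to the all-ones vector, and an ILR is just an orthonormal change of coordinates within that hyperplane, so Euclidean distances are preserved. The ALR is not an orthonormal change of basis, which is why its distances differ.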
Thanks for the intro to CoDa! These seem to be relatively new concepts to mainstream informaticians, but I feel like CoDa methods will be industry standard soon, hopefully sooner rather than later. The biggest hurdle will be convincing some of the older generation to abandon the scripts they've used for analysis for 15+ years. That really helped a lot in understanding how the IQLR works. Reading code in Python and R is SO much easier for me to understand than some of the statistics equations, unfortunately. Essentially, we are looking at: 1) the variance of the CLR-transformed compositions; 2) the components whose variance falls in the IQR, which serve as the reference when we take the geometric mean; and 3) that reference vector, which we use to center the components again.
Extra) I'm trying to understand Line 202 in 2bd7c44.
Rho seems to be the most useful metric to serve as a stand-in replacement for existing association-based network analyses like WGCNA.
I do hope CoDa will grow in popularity. The reason I like it so much is that I think it brings a coherent theoretical framework to a field currently relying on ad hoc procedures. As for your questions:

(1) Yes, this is true for me too! Even when I encounter a maths equation now, I try to break it down into pseudo-code to understand it :-)

(2) I don't see why it wouldn't work for meta-transcriptomics. All these data are produced from the same assay, after all. I started out looking at RNA-Seq data, comparing edgeR/DESeq2 with ALDEx2, and finding a surprising amount of convergence: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2261-8 Note, though, that the CLR is not a silver bullet. It has some nice properties, but it is a transformation, not a normalization. The new features you work with are the counts of the gene relative to the sample's geometric mean. They are not absolute counts in any way, shape, or form!

(3) The IQLR distance is not the same as the CLR/ILR distance. I think of the IQLR as a compromise between the CLR and the edgeR/DESeq2 approach. It's useful because it is similar to traditional methods, and so it is easy to explain to reviewers. However, it sort of invokes the same problematic assumptions used by the edgeR/DESeq2 approach. These days I prefer balances and pairwise log-ratios for my own analyses (though I often still use the IQLR/CLR when working with collaborators).

(4) C++ is fast, but I agree it is hard to read. I have saved the original (slow) base R functions, which I still use in my unit tests. Have a look at the propr:::proprRho() function at https://github.com/tpq/propr/blob/master/R/9-deprecated.R I think rho is a safe choice (and was ranked #1 by an independent benchmark using single-cell data! -- https://www.ncbi.nlm.nih.gov/pubmed/30962620)
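For a sense of what rho actually measures, here is a base R sketch written from the published definition of proportionality rho (an illustration rather than the propr source; `clr` is assumed to be a samples-by-features CLR matrix):

```r
# Rough sketch of proportionality rho, written in base R for readability;
# compare with propr:::proprRho(). `clr`: samples-by-features CLR matrix.
rho_sketch <- function(clr){
  D <- ncol(clr)
  v <- apply(clr, 2, var)                 # per-feature variance of the CLR values
  out <- matrix(1, D, D, dimnames = list(colnames(clr), colnames(clr)))
  for(i in seq_len(D - 1)){
    for(j in (i + 1):D){
      # rho(i, j) = 1 - var(clr_i - clr_j) / (var(clr_i) + var(clr_j))
      out[i, j] <- out[j, i] <- 1 - var(clr[, i] - clr[, j]) / (v[i] + v[j])
    }
  }
  out
}
```

A rho near 1 means two features move together across samples (their log-ratio is nearly constant), which is what makes it a natural stand-in for correlation in network analyses like WGCNA.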
(2) OK, great! I fear that I've been overthinking this issue. Whenever I see 1,000,000+ ORFs I start thinking twice about everything, haha. Good to know that it's still applicable.

(3) I just calculated a PCA for the IQLR and it is different from the PCA for the ILR and CLR (as you mentioned and as expected). Are you talking about ILR and then splitting out the balances downstream?

(4) Thanks for the links to more human-readable versions (one day I'll take a crack at learning C++). I'm going to suggest presenting this paper in lab meeting soon because it'll introduce some people in my lab to fundamental concepts that have been ignored by most "traditional" (for lack of a better word) methods. Are the results between the two methods the same or comparable?

Last question: I've been using propr::propr(X, metric="rho") where X is the abundance matrix. If I already have the CLR computed (X_clr), can I use that directly instead?
"Are you talking about ILR and then splitting out the balances downstream?" I view balances as a subset of the ILR. The ILR requires an "orthonormal basis" to perform a transformation. There are many such bases! Balances satisfy this definition too, so one can think of balances as a more interpretable ILR. selbal is a very cool package, but quite slow when you have 100s+ features. https://msystems.asm.org/content/3/4/e00053-18 We recently published in this area too with an alternative that is faster for big data. https://msystems.asm.org/content/5/2/e00230-19 "Are the results for between the two methods the same or comparable?" I assume here you are asking about the two ways of calculating rho. Yes, they are equivalent. One can be derived from the other! "Last question, I've been using propr::propr(X, metric="rho") where X is the abundance. If I already have clr computed, X_clr" I've actually implemented exactly this just a few weeks ago! I made it for multi-omics problems where you want to CLR each data set separately (see https://www.biorxiv.org/content/10.1101/847475v1 for a discussion). Sorry, it's not documented yet, but here is a code snippet. Basically just set
This was a fun thread to read, thanks! When I CLR two relative abundance data sets (shotgun metagenomics and flow cytometry), the CLR-transformed data does have negative values; how do you suggest I go ahead? PS: running
Could you update propr with
Oh, I think
Ah, I think it is implemented on GitHub only, not CRAN...
Thanks for taking a look! I updated, and the error still persists.
Thanks for checking. Does this code work?
Also, please let me know what
But this will do; I can run them locally. Thank you so much!
Hm, my guess is that the cluster is not loading 4.2.8 correctly for some reason. In an older version of R,
I think you are right. I'll look into this; thank you for your help! :)
Hi Thom,
I remember listening to your bioinformatics chat podcast (https://bioinformatics.chat/propr) a while back and recently dove into the literature on several compositional data analysis reviews/methods/papers, such as https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6755255/ and https://academic.oup.com/bioinformatics/article/34/16/2870/4956011, where you just so happened to be the lead author! First off, I just want to thank you for publishing on this topic, because I have been actively trying to incorporate it into my everyday analysis of environmental/human metagenomics and to convince my lab that non-CoDa-aware methods should not be used for most (all?) NGS analysis.
Anyways, I had a few questions:
In the IQLR, is the "interquartile range" (IQR) based on (case 1) a consensus set from all the samples, or (case 2) is it feature/component-agnostic per sample?
For example, the latter would be sample_A having {OTU_X, OTU_Y, OTU_Z} in the IQR but sample_B having {OTU_A, OTU_X, OTU_Y} in the IQR. Would the IQLR transformation use {OTU_X, OTU_Y, OTU_Z} for sample_A and {OTU_A, OTU_X, OTU_Y} for sample_B (case 2)? Would the components used for the reference be only the shared {OTU_X, OTU_Y} (case 1)? Or something completely different?
When using CoDa methods for metagenomics, can we use the CLR for the ENTIRE metagenome, or do we need to group it by some predefined category? For example, let's say we (tried to) assign taxonomy to each ORF and then propagated that up to the contig (perhaps MAG/bin in the best case): would we need to use the CLR/IQLR on each "group" (e.g., species mapping) (case 1)? Or would it be sufficient to use the CLR/IQLR on the global metagenome, not subset by any grouping (case 2)?
Why does the Aitchison distance use the CLR instead of the ILR if the individual components do not matter? For example, if you have a matrix X with n samples as rows and m components as columns, and you end up with a distance matrix that is (n, n)... wouldn't the ILR be a better option?
Best,
-Josh