Question about dimension reduction for single-cell and other data #21

Open
SilasK opened this issue Jan 5, 2021 · 3 comments


SilasK commented Jan 5, 2021

As outlined in "A field guide for the compositional analysis of any-omics data", CoDa should be the way to analyse single-cell RNA-seq data.

I'd like to cluster cells (into cell types) using CoDA.
The first step is to do a dimensionality reduction (DR), then clustering.

What would be the best way to do the DR? I see two ways of doing this:

  1. Apply the CLR (or some variant, e.g. the IQLR) and perform a PCA (see the sketch after this list).
  2. Use propr's phi as the association strength between cells (usually propr is used for the association between genes), then do a t-SNE.
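For reference, option 1 can be sketched in a few lines of base R. This assumes a cells × genes count matrix `counts`; the pseudocount and the number of retained PCs are arbitrary illustration choices, not recommendations.

```r
# Option 1 sketch: CLR per cell (row), then PCA on the transformed matrix.
pseudo  <- 0.5                                   # naive zero replacement, for illustration only
clr_mat <- t(apply(counts + pseudo, 1,
                   function(x) log(x) - mean(log(x))))
pca     <- prcomp(clr_mat, center = TRUE, scale. = FALSE)
scores  <- pca$x[, 1:20]                         # keep the first 20 PCs for downstream clustering
```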

As far as I understand it, the 'most common' way this is done in scSeq would be:

  • Do some normalisation (there are many different ways)
  • Take the most important PCA components
  • Build an NN-graph and apply t-SNE/UMAP (and do some graph-based clustering); see the sketch after this list
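For reference, that conventional workflow corresponds roughly to a standard Seurat pipeline. The sketch below is only for orientation; Seurat, its default parameters, and the assumed genes × cells matrix `counts_matrix` are my own illustration, not something prescribed here.

```r
library(Seurat)

obj <- CreateSeuratObject(counts = counts_matrix)  # genes x cells raw count matrix (assumed)
obj <- NormalizeData(obj)                          # one of the many possible normalisations
obj <- FindVariableFeatures(obj)
obj <- ScaleData(obj)
obj <- RunPCA(obj)                                 # take the most important PCs
obj <- FindNeighbors(obj, dims = 1:30)             # build the NN graph on the top PCs
obj <- FindClusters(obj)                           # graph-based clustering
obj <- RunUMAP(obj, dims = 1:30)                   # 2-D embedding for visualisation
```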

SilasK commented Jan 5, 2021

I tried the CLR-PCA and found that the dominant first PC is strongly correlated with the number of counts. I hoped that the CoDa-based method would remove (at least part of) this bias.
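(For reference, the correlation can be checked with something like the following, using `counts` and `pca` from the earlier sketch:)

```r
# How strongly does PC1 track sequencing depth per cell?
lib_size <- rowSums(counts)
cor(pca$x[, 1], lib_size)        # a value near +/-1 indicates the bias described above
cor(pca$x[, 1], log(lib_size))   # often clearer on the log scale
```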


tpq commented Jan 5, 2021

Hey SilasK, thanks for your interest in propr!

Regarding dimensional reduction in CoDa, I would tend to run a CLR (or some variant) on the sample rows, then perform a PCA. You can use phi as a kind of distance measure, but typically phi is used to describe distances between features rather than distances between samples. When calculating phi on the transpose, the features (not the samples) would get CLR-transformed which doesn't make much sense to me.
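To make the orientation point concrete: phi is defined between CLR-transformed features, e.g. phi(x, y) = var(clr_x − clr_y) / var(clr_x) for two genes x and y, computed across cells. A minimal sketch, assuming `clr_mat` (cells × genes) from above and two hypothetical gene columns:

```r
# phi between two FEATURES (genes), computed across the samples (cells) -- the usual orientation.
# Computing phi on the transposed matrix would CLR-transform the genes instead of the cells.
phi <- function(clr_x, clr_y) var(clr_x - clr_y) / var(clr_x)
phi(clr_mat[, "geneA"], clr_mat[, "geneB"])   # "geneA"/"geneB" are hypothetical column names
```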

I think the scSeq workflow is pretty much the same, except that the normalization step is replaced with a CLR. In fact, the CLR can be thought of as a kind of normalization that isn't too different from effective library size normalization. We tried to make this a bit clearer in the section "The Quest for a Common Scale" https://academic.oup.com/nargab/article/2/4/lqaa103/6028739 . I have not thought much about using t-SNE/UMAP for CLR-transformed data, but I have fit multi-layer perceptrons (NN) on CLR-transformed data, which works nicely https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-020-6652-7 .
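As one concrete way to "replace the normalization step with a CLR" in the workflow sketched earlier, Seurat's NormalizeData offers a CLR option, so only one line changes. Again just a sketch; in particular, check the margin argument so that the CLR is taken within each cell rather than within each feature.

```r
# Swap the normalisation step of the earlier Seurat sketch for a CLR.
# margin controls whether the CLR runs across features or across cells -- verify before use.
obj <- NormalizeData(obj, normalization.method = "CLR", margin = 2)
```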


tpq commented Jan 5, 2021

Regarding "I tried the CLR-PCA and found that the dominating 1 PC is strongly correlated with the number of counts. I hoped that the CoDa based method would remove (at least part) of this bias." -- This is a very tricky one. I'll jot down some possible causes and solutions below.

I have seen this before when the number of zeros differs greatly between samples. During zero imputation, zeros are replaced with a very small number. The CLR requires a geometric mean of the sample. When a sample has more zeros, it gets imputed to have more small numbers, which pulls the whole geometric mean down. If this effect is large, the geometric mean "normalizing factor" begins to correlate with total counts (as do many genes). My guess is that the first PC reflects this process.
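A tiny worked example of the mechanism (made-up numbers, purely for illustration):

```r
# Two toy cells with the same non-zero counts, but the second has 20 imputed zeros.
a <- c(100, 50, 10)
b <- c(100, 50, 10, rep(0.5, 20))   # zeros replaced with 0.5
exp(mean(log(a)))   # geometric mean ~ 36.8
exp(mean(log(b)))   # geometric mean ~ 0.88 -- pulled down by the many small imputed values
```

Because the CLR divides by this geometric mean, the same real counts end up with inflated CLR values in the zero-rich cell, and that inflation tracks the number of zeros (and hence, often, total counts).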

What to do about it? Assuming the problem is in fact due to differences in the number of zeros between samples, the trick here is to even out the influence of zeros somehow. A few ideas:

(1) Use a different reference. Martino et al. propose the "robust CLR", which replaces the geometric-mean reference of the CLR with one computed from the non-zero elements only. IIRC this is somewhat similar to what DESeq2 recommends.
https://msystems.asm.org/content/4/1/e00016-19
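A minimal sketch of the idea, computed naively from the definition rather than via the authors' implementation (their robust CLR additionally leaves zeros as missing values and handles them with matrix completion, which this sketch omits):

```r
# Robust CLR: reference = geometric mean of the NON-ZERO entries of each cell.
rclr <- function(x) {
  ref <- mean(log(x[x > 0]))        # log geometric mean of the non-zero parts only
  out <- rep(NA_real_, length(x))   # zeros stay undefined rather than being imputed
  out[x > 0] <- log(x[x > 0]) - ref
  out
}
rclr_mat <- t(apply(counts, 1, rclr))   # cells x genes, assuming `counts` as before
```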

(2) Rarefaction! I know this is a bit taboo, but if you down-sample all your data to have the same total sequencing depth, you remove the effect of sequencing depth altogether (though you do not remove the effect of compositionality). If sequencing depth is the cause of the differences in zeros, this should "even out" the number of zeros present in your data.
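A sketch of the down-sampling idea, using multinomial down-sampling as a simple stand-in (classical rarefaction samples without replacement, e.g. via vegan::rrarefy; the multinomial version below samples with replacement):

```r
# Down-sample every cell to the same total depth.
depth <- min(rowSums(counts))
down  <- t(apply(counts, 1, function(x) rmultinom(1, size = depth, prob = x)))
rowSums(down)   # every cell now has exactly `depth` counts
```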

(3) Use a method that does not depend on the CLR or zero imputation. We propose data-driven amalgamation to learn useful lower-dimensional representations of the data. In short, it sums parts (e.g., genes) into groups to form a smaller simplex that approximately represents the larger simplex (by minimizing a loss function). It has an R package, `amalgam`.
https://academic.oup.com/nargab/article/2/4/lqaa076/5917300
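To illustrate the basic operation (not the optimization the amalgam package performs when it learns the grouping), amalgamation just sums parts within groups and re-closes; the grouping vector below is random, purely for illustration:

```r
# Amalgamation: sum parts (genes) within groups to form a smaller composition per cell.
groups <- sample(1:10, ncol(counts), replace = TRUE)    # arbitrary grouping, for illustration
amalg  <- t(apply(counts, 1, function(x) tapply(x, groups, sum)))
amalg  <- amalg / rowSums(amalg)                        # re-close each cell to proportions
# The amalgam package learns `groups` by minimizing a loss, instead of picking it at random.
```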

@tpq tpq changed the title from "Question about how to use propr for single-cell classification" to "Question about dimension reduction for single-cell data" Jan 5, 2021
@tpq tpq added the helpful and question labels Jan 5, 2021
@tpq tpq changed the title from "Question about dimension reduction for single-cell data" to "Question about dimension reduction for single-cell and other data" Jan 5, 2021