
Using stdeconvolve with normalized / integrated data #25

Open

Acaro12 opened this issue Oct 27, 2022 · 6 comments


Acaro12 commented Oct 27, 2022

Dear Brendan,

I am using STdeconvolve with an integrated Seurat object containing 8 Visium samples. All samples were individually normalized with Seurat's SCTransform algorithm before anchor-based integration (also with the Seurat toolkit).

The data output from SCTransform (and consequently also after integration) can contain negative values and is stored as doubles. Hence, it cannot be used with STdeconvolve.

I have two questions:

  1. Why are non-negative integers required for STdeconvolve?
  2. Could you think of a way to transform the data in accordance with the algorithm's requirements? Would a simple as.integer() + x be a valid way to do this?

Thank you so much in advance for your time!
Best,
Christoph

Collaborator

bmill3r commented Oct 28, 2022

Hi @Acaro12,

Thanks so much for using STdeconvolve and for your questions!

The reason why STdeconvolve requires non-negative integers basically boils down to the fact that latent Dirichlet allocation requires frequency counts of words or terms, specified as a matrix of nonnegative integers. In this case, our terms are genes, but the same idea holds.
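
To make the "frequency counts of words" analogy concrete, here is a toy collapsed Gibbs sampler for LDA in Python/NumPy (a minimal sketch, not the STdeconvolve R implementation): the count matrix is expanded into individual gene "tokens", which is only well-defined when every entry is a non-negative integer — fractional or negative values such as SCTransform residuals have no token interpretation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Spots x genes matrix of raw counts: LDA treats each spot as a "document"
# and each unit of a gene's count as one "word" token.
counts = np.array([
    [5, 0, 3, 1],
    [0, 7, 1, 2],
    [4, 1, 0, 6],
])

K, alpha, beta = 2, 0.1, 0.1          # topics and Dirichlet priors
n_spots, n_genes = counts.shape

# Expand counts into tokens: gene index g repeated counts[d, g] times.
docs = [np.repeat(np.arange(n_genes), row) for row in counts]
z = [rng.integers(K, size=len(d)) for d in docs]  # random topic init

# Sufficient statistics: topic counts per spot, gene counts per topic.
ndk = np.zeros((n_spots, K))
nkw = np.zeros((K, n_genes))
for d, (genes, topics) in enumerate(zip(docs, z)):
    for g, t in zip(genes, topics):
        ndk[d, t] += 1
        nkw[t, g] += 1

# A few sweeps of collapsed Gibbs sampling.
for _ in range(50):
    for d, (genes, topics) in enumerate(zip(docs, z)):
        for i, g in enumerate(genes):
            t = topics[i]
            ndk[d, t] -= 1
            nkw[t, g] -= 1
            p = (ndk[d] + alpha) * (nkw[:, g] + beta) \
                / (nkw.sum(axis=1) + n_genes * beta)
            t = rng.choice(K, p=p / p.sum())
            topics[i] = t
            ndk[d, t] += 1
            nkw[t, g] += 1

# Per-spot topic (cell-type) proportions: each row is a probability vector.
theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
print(theta)
```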

With respect to combining multiple datasets, you could follow our strategy when analyzing the 4 breast cancer sections. Essentially, we take the union of overdispersed genes determined for each of the sections separately, then fit LDA models on the merged dataset, which is all the spots and the combined set of overdispersed genes. Note that in this case, all of the sections were taken from the same biopsy and so it is reasonable to assume that the technical variation between them should be low. If the sections are from different samples, then it might be more appropriate to analyze each separately. We have done this on datasets generated from different samples from the same tissue type (mouse olfactory bulb) and we have found high concordance between the deconvolved cell types (see Supplementary Figure S7). So although each sample is processed separately, STdeconvolve will likely find similar cell types if their gene expression profiles are distinct in the different datasets.
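
The merging strategy above can be sketched roughly like this (Python/NumPy toy; the variance-greater-than-mean rule below is a hypothetical stand-in for STdeconvolve's actual overdispersion test, used only to illustrate the union-then-merge flow):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for per-section feature selection: call a gene
# "overdispersed" if its sample variance exceeds its mean (i.e. more
# variable than Poisson noise). STdeconvolve's real test is more involved.
def overdispersed_genes(counts):
    mean = counts.mean(axis=0)
    var = counts.var(axis=0)
    return set(np.flatnonzero(var > mean))

# Three hypothetical sections sharing the same gene panel (columns).
sections = [rng.poisson(lam=2.0, size=(50, 100)) for _ in range(3)]
sections[0][:, 10] *= 4   # inflate one gene's dispersion in section 0
sections[1][:, 20] *= 4   # and another gene's in section 1

# Take the union of overdispersed genes determined per section...
keep = sorted(set().union(*(overdispersed_genes(s) for s in sections)))

# ...then merge all spots and subset to that combined gene set before
# fitting the LDA models on the merged dataset.
merged = np.vstack(sections)[:, keep]
print(merged.shape)
```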

Hope this helps and let me know if you have any other questions,
Brendan

@joachimsiaw

Hi,
I am also using STdeconvolve and I find it very interesting. Thanks for the great work on this tool. I have a question, though, regarding normalisation of gene expression across the spots. In Seurat, it is recommended to first normalize the data in order to account for variance in sequencing depth across spots/data points. It is known that, for instance in 10X Visium, variance in molecular counts per spot can be substantial, particularly if there are differences in cell density across the tissue. I haven't seen any such normalisation in the STdeconvolve pipeline. Could you please explain how the concern of variance in molecular counts or uneven cell density across the tissue is accounted for in your pipeline, or why it may not be needed?

How do you think such variance could potentially affect the gene expression profiles of the STdeconvolve resolved topics or cell types?
Thank you in advance.

Collaborator

bmill3r commented Apr 17, 2023

Hi @joachimsiaw

Thanks for your question!

Essentially, the total counts per spot are treated as independent from all the other data generating variables in the LDA model. Therefore there is no need to depth normalize the total counts in each spot like there is for scRNA-seq data, for example. Additionally, LDA requires frequency counts of words or terms, specified as a matrix of nonnegative integers and so transforming the values to non-integers would be incompatible.
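
A small numerical illustration of both points (hypothetical numbers, in Python rather than the R package):

```python
import numpy as np

# Two spots with the same gene composition but 10x different depth.
shallow = np.array([5, 1, 3, 2])
deep = shallow * 10

# LDA draws each token from a spot-level gene distribution, so only the
# proportions matter; total depth enters only as the number of draws.
print(np.allclose(shallow / shallow.sum(), deep / deep.sum()))  # True

# Depth normalization (e.g. CPM-style scaling) would also produce
# non-integer values, which no longer have a token-count interpretation.
cpm = shallow / shallow.sum() * 1e4
print(cpm)  # fractional values -> invalid LDA input
```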

We do, however, preprocess the data to remove poorly captured genes and low-quality spots. We also feature-select for overdispersed genes across spots as a proxy for cell-type-specific gene expression. It's possible that large variations in cell density, and thus total gene counts in spots, could affect the genes that are detected as being overdispersed.

I'll also add that we tested STdeconvolve on the same simulated dataset using different spot sizes (thus varying the cell density range from 1 to 20+ cells) and observed that the accuracy was stable across spot resolutions. So it seems that cell density does not have a major effect on the deconvolution as long as cell-type-specific groups of co-occurring genes are captured efficiently.

Hope this helps,
Brendan

@joachimsiaw

@bmill3r
Thank you for your quick and insightful response. It is clear to me now why normalization is not needed for the deconvolution.

Can you comment on how the gene expression profiles of the topics are generated?

  1. E.g., for each gene, does the expression value represent a mean or median expression across all spots?
  2. And if so, how do you think varying cell density could influence this?
  3. If the expression values are means or medians, don't you think this could lead to a loss of spatially constrained cell-cell communication information? I want to use the tool MERINGUE from your group to perform spatially informed transcriptional clustering, and I was wondering how this could be possible, in light of my question in 1 above.
    Thank you in advance.

Collaborator

bmill3r commented Apr 19, 2023

Hi @joachimsiaw

The gene expression profiles of the deconvolved cell types are essentially probability distributions of each deconvolved cell type over the genes (and not means or medians). In the context of STdeconvolve, they can be thought of as the probability of a gene being expressed by a given cell type. I would recommend checking out some background on LDA for more information.
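
As a small illustration (hypothetical numbers, sketched in Python; in the R package the fitted topic-gene matrix comes from the LDA model itself):

```python
import numpy as np

# Hypothetical topic-by-gene weights from a fitted LDA model.
topic_gene_weights = np.array([
    [40.0, 2.0, 8.0, 0.5],   # deconvolved cell type 1
    [1.0, 30.0, 4.0, 15.0],  # deconvolved cell type 2
])

# The expression "profile" of each deconvolved cell type is its row
# normalized into a probability distribution over genes -- interpretable
# as P(gene | cell type), not a mean or median across spots.
beta = topic_gene_weights / topic_gene_weights.sum(axis=1, keepdims=True)
print(beta.sum(axis=1))  # each row sums to 1
```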

Hope this answers your question,
Brendan

Collaborator

JEFworks commented May 4, 2023

Hi everyone,

This blog post and accompanying video walking through a simulation-based approach for exploring why we don't need normalization with STdeconvolve may be useful for you as you explore these interesting questions in the context of your own research pursuits: https://jef.works/blog/2023/05/04/normalization-clustering-vs-deconvolution/

Hope it helps,
Jean
