Example of scATAC-seq pseudobulk analysis for differential accessibility testing #11

jeremymsimon · 2023-10-26T15:25:56Z

Could be downstream of #1

Work through multi-sample multi-condition differential test using pseudobulked counts, a la a hybrid of muscat and csaw

paupaiz · 2023-11-18T15:04:17Z

Can I use ArchR for this example?

jeremymsimon · 2023-11-20T14:42:08Z

@paupaiz possibly, could you briefly describe more? My intention here is to be able to compare accessibility between experimental conditions (e.g. WT and mutant) where we have multiple replicates of each. The test would produce cluster-by-cluster results of genomic regions, log-fold-changes, and adjusted p-values.

The ArchR documentation on this here uses an example of one cell type vs another (i.e. finding marker regions that are specific to one cell type), which IMO is a slightly different problem, but as long as it does this in a replicate-aware fashion it could also work?

stemangiola · 2023-11-21T22:12:09Z

Could be downstream of #1

Work through multi-sample multi-condition differential test using pseudobulked counts, a la a hybrid of muscat and csaw

For multilevel pseudobulk analyses, tidybulk has incorporated glmmSeq and optimised for very large datasets.

https://github.com/stemangiola/tidybulk/blob/512a534bb3bc6e933a78ebbfa1bfaea6d2f6cb38/tests/testthat/test-bulk_methods_SummarizedExperiment.R#L481

see ?test_differential_abundance in the new github release

HPCell (https://github.com/stemangiola/HPCell), in development but functioning, scales multilevel analyses, to the cluster, with no coding overload. We will announce this functionality soon.

paupaiz · 2023-11-22T14:34:40Z

@stemangiola @jeremymsimon I developed condition-aware pseudobulking for ArchR, motivated by this paper. We will include this in the next release (1.0.3). Let me know if it would be useful to have an example here!

AmelZulji · 2023-11-27T13:59:16Z

I have a working example using Signac and DESeq2. Am I eligible to contribute/provide the code? I am sorry for the basic question, but I could not find anything in the contribution guidelines.

AmelZulji · 2023-12-01T10:16:25Z

@jeremymsimon, @stemangiola, @paupaiz
What is your opinion: is there any benefit in going with a Signac/DESeq2 example?

paupaiz · 2023-12-01T15:27:39Z

@AmelZulji would love to check out your post so I can answer!

AmelZulji · 2023-12-06T13:23:52Z

Hi @paupaiz, @jeremymsimon, @stemangiola

Here is the code for a reproducible example:

# create directory for downloading files
dir.create("tmp")

# download files into the tmp directory
download.file(url = "https://cf.10xgenomics.com/samples/cell-atac/1.0.1/atac_v1_pbmc_10k/atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5",
              destfile = "tmp/atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5")

download.file(url = "https://cf.10xgenomics.com/samples/cell-atac/1.0.1/atac_v1_pbmc_10k/atac_v1_pbmc_10k_singlecell.csv",
              destfile = "tmp/atac_v1_pbmc_10k_singlecell.csv")

download.file(url = "https://cf.10xgenomics.com/samples/cell-atac/1.0.1/atac_v1_pbmc_10k/atac_v1_pbmc_10k_fragments.tsv.gz",
              destfile = "tmp/atac_v1_pbmc_10k_fragments.tsv.gz")

download.file(url = "https://cf.10xgenomics.com/samples/cell-atac/1.0.1/atac_v1_pbmc_10k/atac_v1_pbmc_10k_fragments.tsv.gz.tbi",
              destfile = "tmp/atac_v1_pbmc_10k_fragments.tsv.gz.tbi")


library(Seurat) # for convenient functions
library(Signac) # for handling ATAC data

# construct assay with ATAC data
counts <- Read10X_h5(filename = "tmp/atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5")
metadata <- read.csv(
  file = "tmp/atac_v1_pbmc_10k_singlecell.csv",
  header = TRUE,
  row.names = 1
)

chrom_assay <- CreateChromatinAssay(
  counts = counts,
  sep = c(":", "-"),
  fragments = 'tmp/atac_v1_pbmc_10k_fragments.tsv.gz',
  min.cells = 10,
  min.features = 200
)

pbmc <- CreateSeuratObject(
  counts = chrom_assay,
  assay = "peaks",
  meta.data = metadata
)

# basic processing adapted from https://stuartlab.org/signac/articles/pbmc_vignette

# extract gene annotations from EnsDb
# (v75 is the GRCh37/hg19 release, which matches this hg19-mapped dataset; v86 is GRCh38)
library(EnsDb.Hsapiens.v75)
annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v75)

# change to UCSC style since the data was mapped to hg19
seqlevels(annotations) <- paste0('chr', seqlevels(annotations))
genome(annotations) <- "hg19"

# add the gene information to the object
Annotation(pbmc) <- annotations
pbmc <- subset(x = pbmc, subset = nCount_peaks > 3000 &  nCount_peaks < 30000)
pbmc <- RunTFIDF(pbmc)
pbmc <- FindTopFeatures(pbmc, min.cutoff = 'q0')
pbmc <- RunSVD(pbmc)
pbmc <- RunUMAP(object = pbmc, reduction = 'lsi', dims = 2:30)
pbmc <- FindNeighbors(object = pbmc, reduction = 'lsi', dims = 2:30)
pbmc <- FindClusters(object = pbmc, verbose = FALSE, algorithm = 3, resolution = 0.1)
DimPlot(object = pbmc, label = TRUE) + NoLegend()
pbmc[["cell_type"]] <- Idents(pbmc)

# subset to speed up the process
pbmc_sub <- subset(pbmc, cell_type %in% c(0,1))

# add column with donor, sex and condition to simulate groups for testing
set.seed(1) # make the random donor assignment reproducible
obj_meta <- pbmc_sub@meta.data
obj_meta[["donor"]] <- sample(x = paste0("donor_", 1:4), size = nrow(obj_meta), replace = TRUE)
obj_meta[["barcode"]] <- rownames(obj_meta)

df <-
  tibble::tibble(
    donor = unique(obj_meta$donor),
    condition = rep(c("Control", "Disease"), c(2,2)),
    sex = rep(c("F", "M"), c(2))
  )

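# join on donor explicitly; left_join drops row names, so restore them from the barcode column
obj_meta <- dplyr::left_join(obj_meta, df, by = "donor")
rownames(obj_meta) <- obj_meta$barcode
pbmc_sub@meta.data <- obj_meta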

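# aggregate counts by summing up counts per donor
# (note: some Seurat versions may replace "_" with "-" in the aggregated column names;
#  if so, the donor names need to be harmonised before matching to the metadata below)
pb_counts <- AggregateExpression(pbmc_sub, assays = "peaks", slot = "counts", group.by = "donor")$peaks

# build sample-level metadata (one row per donor)
pb_meta <- pbmc_sub@meta.data[, c("condition", "sex", "donor")] |> unique()
rownames(pb_meta) <- pb_meta$donor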

library(DESeq2)

# match ordering in metadata and count data
sample_meta <- pb_meta[match(colnames(pb_counts), rownames(pb_meta)),]

dds <- DESeqDataSetFromMatrix(pb_counts, 
                              colData = sample_meta, 
                              design = ~ sex + condition)
dds <- DESeq(dds)

res <- results(dds) |> as.data.frame()
head(res)
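
The example above sums counts per donor across the whole subset. For the cluster-by-cluster results requested at the top of this thread, the aggregation would need one pseudobulk column per cluster and donor. A minimal sketch of that step on simulated data follows; the toy matrix, the `cluster`/`donor` columns and the `__` separator are assumptions for illustration, not part of the Signac/DESeq2 example.

```r
library(Matrix)

set.seed(1)

# Hypothetical toy data: 50 peaks x 200 cells of sparse counts
n_peaks <- 50
n_cells <- 200
counts <- Matrix(
  rpois(n_peaks * n_cells, lambda = 0.3),
  nrow = n_peaks, ncol = n_cells, sparse = TRUE,
  dimnames = list(paste0("peak_", seq_len(n_peaks)),
                  paste0("cell_", seq_len(n_cells)))
)

# Hypothetical per-cell metadata: cluster label and donor of origin
cell_meta <- data.frame(
  cluster = sample(c("0", "1"), n_cells, replace = TRUE),
  donor   = sample(paste0("donor_", 1:4), n_cells, replace = TRUE),
  row.names = colnames(counts)
)

# One pseudobulk group per cluster x donor combination
group <- factor(paste(cell_meta$cluster, cell_meta$donor, sep = "__"))

# Sparse cells x groups indicator matrix; multiplying sums counts within each group
indicator <- sparse.model.matrix(~ 0 + group)
colnames(indicator) <- levels(group)
pb <- as.matrix(counts %*% indicator)  # peaks x (cluster__donor)

# Split into one peaks x donors matrix per cluster, ready for a per-cluster test
cluster_of_col <- sub("__.*$", "", colnames(pb))
pb_by_cluster <- lapply(
  split(colnames(pb), cluster_of_col),
  function(cols) pb[, cols, drop = FALSE]
)

str(lapply(pb_by_cluster, dim))
```

Each element of `pb_by_cluster` could then be passed to `DESeqDataSetFromMatrix()` with the matching donor-level metadata, one cluster at a time.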

jeremymsimon · 2023-12-06T13:28:39Z

Thanks @AmelZulji this is a helpful place to start and may work for many cases. When I first posted this challenge I was envisioning including a local background correction as well like csaw does with broader sliding windows. What do you all think about this? Would it be easy to compute and construct an equivalent test here for scATAC data?

AmelZulji · 2023-12-06T15:08:35Z

Thanks @jeremymsimon. In my opinion, the correction in csaw is needed because of the way it counts (it divides the genome into equal-sized bins and counts reads within each bin). In ATAC, on the other hand, data are aggregated across all cells to increase the signal-to-noise ratio, and then empirical peaks are called for the dataset.

The only concern might be "double dipping", as mentioned at the end of the first paragraph of this chapter of the csaw book https://bioconductor.org/books/3.13/csawBook/counting-reads-into-windows.html#background and as explained in this paper https://www.biorxiv.org/content/10.1101/2023.07.21.550107v1.full.pdf (though that is for single-cell data and in general relates more to cluster markers than to DEGs between conditions within the same cluster).

Please correct me if I'm wrong or have misunderstood your points.

AmelZulji · 2023-12-12T13:11:28Z

Any comments on this, @jeremymsimon? I would be interested in working on this further if you have suggestions.

Regards,
Amel

jeremymsimon · 2023-12-12T14:39:17Z

Hi @AmelZulji I'm not a statistician but I think these are separate issues. The issue regarding "double-dipping" is related to constructing your peak set in a condition-aware fashion, over which the differential test will then be conducted. In other words, if you call peaks in ConditionA, then call peaks in ConditionB, then merge into a union set and test over those windows, you risk losing error control.

csaw's binning (like 10kb windows) would be separate from this and is a way of correcting for composition bias, which I suspect may be even more important in scATAC-seq data given how sparse the true enrichment is.

What I'm proposing here is to do the following:

1. Pseudobulk signals such that you have a matrix of all peaks (and/or small windows) by all replicates/conditions, for each scATAC cluster. If using peaks, they would have been already identified in a condition-agnostic fashion, at least if you follow the typical published workflows.
2. In parallel, construct 10kb (?) windows and compute those counts.
3. Test for DA just like csaw does, incorporating the 10kb window counts as background correction.

@stemangiola apparently has methods (mentioned above) for doing some of this, but IMO it would be nice to work through a solution that includes the local background/composition bias correction that broad window counts provide.

None of this has been published before for scATAC-seq, AFAIK, so this would provide a means for us to evaluate whether there is any benefit to testing in this fashion and context
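
The three steps proposed above could be sketched with edgeR roughly as follows. This is a minimal sketch on simulated counts for a single cluster, not a validated method: `peak_counts`, `bin_counts`, the matrix sizes, and the use of bin column sums as library sizes are all assumptions for illustration. It follows the csaw idea of estimating TMM factors on broad background bins and then reusing those factors, with the same library sizes, in the peak-level test.

```r
library(edgeR)

set.seed(42)

samples   <- paste0("donor_", 1:4)
condition <- factor(c("Control", "Control", "Disease", "Disease"))

# Hypothetical pseudobulk counts for ONE cluster (simulated, no true differences):
# peaks x samples, and 10 kb background bins x samples
n_peaks <- 500
n_bins  <- 2000
peak_counts <- matrix(
  rnbinom(n_peaks * 4, mu = 50, size = 10), ncol = 4,
  dimnames = list(paste0("peak_", seq_len(n_peaks)), samples)
)
bin_counts <- matrix(
  rnbinom(n_bins * 4, mu = 20, size = 10), ncol = 4,
  dimnames = list(paste0("bin_", seq_len(n_bins)), samples)
)

# Assumption: bins tile the genome, so their column sums stand in for the
# total fragments per pseudobulk sample
lib_size <- colSums(bin_counts)

# Step 2: TMM factors from the broad bins capture composition bias
bg <- DGEList(counts = bin_counts, lib.size = lib_size)
bg <- calcNormFactors(bg, method = "TMM")

# Step 3: reuse the same library sizes and the background-derived factors
# when testing the peaks
y <- DGEList(
  counts       = peak_counts,
  lib.size     = lib_size,
  norm.factors = bg$samples$norm.factors,
  group        = condition
)

design <- model.matrix(~ condition)
y   <- estimateDisp(y, design)
fit <- glmQLFit(y, design, robust = TRUE)
res <- glmQLFTest(fit, coef = "conditionDisease")

# Per-peak log-fold-changes, p-values and FDR for this cluster
da <- topTags(res, n = Inf)$table
head(da)
```

Repeating this per cluster, with real peak and bin counts in place of the simulated matrices, would give the cluster-by-cluster table of regions, log-fold-changes and adjusted p-values. Comparing it against a run that uses default peak-based normalization is one way to evaluate whether the background correction changes the results.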

jeremymsimon added this to tidyomics open challenges Oct 26, 2023

jeremymsimon converted this from a draft issue Oct 26, 2023

stemangiola moved this to Todo in tidyomics open challenges Nov 10, 2023

mikelove added the documentation label Jun 10, 2024
