Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Example of scATAC-seq pseudobulk analysis for differential accessibility testing #11

Open
jeremymsimon opened this issue Oct 26, 2023 · 12 comments
Labels
documentation Improvements or additions to documentation

Comments

@jeremymsimon
Copy link

Could be downstream of #1

Work through multi-sample multi-condition differential test using pseudobulked counts, a la a hybrid of muscat and csaw

@paupaiz
Copy link

paupaiz commented Nov 18, 2023

Can I use ArchR for this example?

@jeremymsimon
Copy link
Author

@paupaiz possibly, could you briefly describe more? My intention here is to be able to compare accessibility between experimental conditions (e.g. WT and mutant) where we have multiple replicates of each. The test would produce cluster-by-cluster results of genomic regions, log-fold-changes, and adjusted p-values.

The ArchR documentation on this here uses an example of one cell type vs another (ie finding marker regions that are specific to one cell type), which IMO is a slightly different problem, but as long as it does this in a replicate-aware fashion it could also work?

@stemangiola
Copy link

stemangiola commented Nov 21, 2023

Could be downstream of #1

Work through multi-sample multi-condition differential test using pseudobulked counts, a la a hybrid of muscat and csaw

For multilevel pseudobulk analyses, tidybulk has incorporated glmmSeq and optimised for very large datasets.

https://github.com/stemangiola/tidybulk/blob/512a534bb3bc6e933a78ebbfa1bfaea6d2f6cb38/tests/testthat/test-bulk_methods_SummarizedExperiment.R#L481

see ?test_differential_abundance in the new github release

HPCell (https://github.com/stemangiola/HPCell), in development but functioning, scales multilevel analyses, to the cluster, with no coding overload. We will announce this functionality soon.

@paupaiz
Copy link

paupaiz commented Nov 22, 2023

@stemangiola @jeremymsimon I developed condition-aware pseudobulking for ArchR motivated by this paper We will include this in the next release (1.0.3). Let me know if it would be useful to have an example here!

@AmelZulji
Copy link

I have working example using Signac and Deseq2. Am I eligible to contribute/provide the code? I am sorry for the basic question but could not find anything in the guidelines for contribution.

@AmelZulji
Copy link

@jeremymsimon, @stemangiola, @paupaiz
What is your opinion - is there any benefit to go with SIgnac/DEseq2 example?

@paupaiz
Copy link

paupaiz commented Dec 1, 2023

@AmelZulji would love to check out your post so I can answer!

@AmelZulji
Copy link

Hi @paupaiz, @jeremymsimon, @stemangiola

Here is the reproducible example the code:

# create directory for downloading files
dir.create("tmp")

# downmload files in the tmp directory
download.file(url = "https://cf.10xgenomics.com/samples/cell-atac/1.0.1/atac_v1_pbmc_10k/atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5",
              destfile = "tmp/atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5")

download.file(url = "https://cf.10xgenomics.com/samples/cell-atac/1.0.1/atac_v1_pbmc_10k/atac_v1_pbmc_10k_singlecell.csv",
              destfile = "tmp/atac_v1_pbmc_10k_singlecell.csv")

download.file(url = "https://cf.10xgenomics.com/samples/cell-atac/1.0.1/atac_v1_pbmc_10k/atac_v1_pbmc_10k_fragments.tsv.gz",
              destfile = "tmp/atac_v1_pbmc_10k_fragments.tsv.gz")

download.file(url = "https://cf.10xgenomics.com/samples/cell-atac/1.0.1/atac_v1_pbmc_10k/atac_v1_pbmc_10k_fragments.tsv.gz.tbi",
              destfile = "tmp/atac_v1_pbmc_10k_fragments.tsv.gz.tbi")


library(Seurat) # for convinient functions
library(Signac) # for handling ATAC data

# construct assay with ATAC data
counts <- Read10X_h5(filename = "tmp/atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5")
metadata <- read.csv(
  file = "tmp/atac_v1_pbmc_10k_singlecell.csv",
  header = TRUE,
  row.names = 1
)

chrom_assay <- CreateChromatinAssay(
  counts = counts,
  sep = c(":", "-"),
  fragments = 'tmp/atac_v1_pbmc_10k_fragments.tsv.gz',
  min.cells = 10,
  min.features = 200
)

pbmc <- CreateSeuratObject(
  counts = chrom_assay,
  assay = "peaks",
  meta.data = metadata
)

# a basic processing adapted from https://stuartlab.org/signac/articles/pbmc_vignette 

# extract gene annotations from EnsDb
library(EnsDb.Hsapiens.v86)
annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)

# change to UCSC style since the data was mapped to hg19
seqlevels(annotations) <- paste0('chr', seqlevels(annotations))
genome(annotations) <- "hg19"

# add the gene information to the object
Annotation(pbmc) <- annotations
pbmc <- subset(x = pbmc, subset = nCount_peaks > 3000 &  nCount_peaks < 30000)
pbmc <- RunTFIDF(pbmc)
pbmc <- FindTopFeatures(pbmc, min.cutoff = 'q0')
pbmc <- RunSVD(pbmc)
pbmc <- RunUMAP(object = pbmc, reduction = 'lsi', dims = 2:30)
pbmc <- FindNeighbors(object = pbmc, reduction = 'lsi', dims = 2:30)
pbmc <- FindClusters(object = pbmc, verbose = FALSE, algorithm = 3, resolution = 0.1)
DimPlot(object = pbmc, label = TRUE) + NoLegend()
pbmc[["cell_type"]] <- Idents(pbmc)

# subset to speed up the process
pbmc_sub <- subset(pbmc, cell_type %in% c(0,1))

# add column with donor, sex and condition to simulate groups for testing
obj_meta <- [email protected]
obj_meta[["donor"]] <- sample(x = paste0("donor_", 1:4), size = nrow(obj_meta), replace = T)
obj_meta[["barcode"]] <- rownames(obj_meta)

df <-
  tibble::tibble(
    donor = unique(obj_meta$donor),
    condition = rep(c("Control", "Disease"), c(2,2)),
    sex = rep(c("F", "M"), c(2))
  )

obj_meta <- dplyr::left_join(obj_meta, df)
rownames(obj_meta) <- obj_meta$barcode
[email protected] <- obj_meta

# aggregate counts by summing up counts
pb_counts <- AggregateExpression(pbmc_sub, assays = "peaks", slot = "counts",group.by = "donor")$peaks

# adjust metadata
pb_meta <- [email protected][,c("condition", "sex", "donor")] |> unique()
rownames(pb_meta) <- pb_meta$donor

library(DESeq2)

# match ordering in metadata and count data
sample_meta <- pb_meta[match(colnames(pb_counts), rownames(pb_meta)),]

dds <- DESeqDataSetFromMatrix(pb_counts, 
                              colData = sample_meta, 
                              design = ~ sex + condition)
dds <- DESeq(dds)

res <- results(dds) |> as.data.frame()
head(res)

@jeremymsimon
Copy link
Author

Thanks @AmelZulji this is a helpful place to start and may work for many cases. When I first posted this challenge I was envisioning including a local background correction as well like csaw does with broader sliding windows. What do you all think about this? Would it be easy to compute and construct an equivalent test here for scATAC data?

@AmelZulji
Copy link

AmelZulji commented Dec 6, 2023

Thanks @jeremymsimon, in my opinion, the correction in csawis needed because of the way how it counts (divide genome on equal bins and count within each bin). In ATAC from the other hand, data are aggregated from all cells in order to increase singal-to-noise ratio and then empirical peaks are called for the dataset.

The only concern might be "double dipping" as mentioned at the end of the first paragraph in this chapter of csaw book https://bioconductor.org/books/3.13/csawBook/counting-reads-into-windows.html#background and as explained in this paper https://www.biorxiv.org/content/10.1101/2023.07.21.550107v1.full.pdf (though for single cell data and in genral related more to clusters markers rather than DEGs between condition within the same cluster).

Please correct me if im wrong or misunderstood your points

@AmelZulji
Copy link

Any comments on this @jeremymsimon ? I would be interested to work it further on this if you have suggestion.

Regards,
Amel

@jeremymsimon
Copy link
Author

Hi @AmelZulji I'm not a statistician but I think these are separate issues. The issue regarding "double-dipping" is related to constructing your peak set in a condition-aware fashion, over which the differential test will then be conducted. In other words, if you call peaks in ConditionA, then call peaks in ConditionB, then merge into a union set and test over those windows, you risk losing error control.

csaw's binning (like 10kb windows) would be separate from this and is a way of correcting for composition bias, which I suspect may be even more important in scATAC-seq data given how sparse the true enrichment is.

What I'm proposing here is to do the following:

  • Pseudobulk signals such that you have a matrix of all peaks (and/or small windows) by all replicates/conditions, for each scATAC cluster. If using peaks, they would have been already identified in a condition-agnostic fashion, at least if you follow the typical published workflows
  • In parallel, construct 10kb (?) windows and compute those counts
  • Test for DA just like csaw does, incorporating the 10kb window counts as background correction

@stemangiola apparently has methods (mentioned above) for doing some of this, but IMO it would be nice to work through a solution that includes the local background/composition bias correction that broad window counts provides

None of this has been published before for scATAC-seq, AFAIK, so this would provide a means for us to evaluate whether there is any benefit to testing in this fashion and context

@mikelove mikelove added the documentation Improvements or additions to documentation label Jun 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
Status: Todo
Development

No branches or pull requests

5 participants