Created by: Ahmed Mahfouz
In this tutorial we will look at different ways of integrating multiple single cell RNA-seq datasets. We will explore two different methods to correct for batch effects across datasets. We will also look at a quantitative measure to assess the quality of the integrated data.
For this tutorial we will use 3 different human pancreatic islet cell datasets from four technologies, CelSeq (GSE81076) CelSeq2 (GSE85241), Fluidigm C1 (GSE86469), and SMART-Seq2 (E-MTAB-5061). The raw data matrix and associated metadata are available here.
Load required packages:
suppressMessages(require(Seurat))
suppressMessages(require(ggplot2))
suppressMessages(require(cowplot))
suppressMessages(require(scater))
suppressMessages(require(scran))
suppressMessages(require(BiocParallel))
suppressMessages(require(BiocNeighbors))
First we will use the data integration method presented in Comprehensive Integration of Single Cell Data.
Load in expression matrix and metadata. The metadata file contains the
technology (tech
column) and cell type annotations (cell type
column) for each cell in the four
datasets.
pancreas.data <- readRDS(file = "session-integration_files/pancreas_expression_matrix.rds")
metadata <- readRDS(file = "session-integration_files/pancreas_metadata.rds")
Create a Seurat object with all datasets.
pancreas <- CreateSeuratObject(pancreas.data, meta.data = metadata)
Let’s first look at the datasets before applying any batch correction.
We perform standard preprocessing (log-normalization), and identify
variable features based on a variance stabilizing transformation
("vst"
). Next, we scale the integrated data, run PCA, and visualize
the results with UMAP. The integrated datasets cluster by cell type,
instead of by technology
# Normalize and find variable features
pancreas <- NormalizeData(pancreas, verbose = FALSE)
pancreas <- FindVariableFeatures(pancreas, selection.method = "vst", nfeatures = 2000, verbose = FALSE)
# Run the standard workflow for visualization and clustering
pancreas <- ScaleData(pancreas, verbose = FALSE)
pancreas <- RunPCA(pancreas, npcs = 30, verbose = FALSE)
pancreas <- RunUMAP(pancreas, reduction = "pca", dims = 1:30)
p1 <- DimPlot(pancreas, reduction = "umap", group.by = "tech")
p2 <- DimPlot(pancreas, reduction = "umap", group.by = "celltype", label = TRUE, repel = TRUE) +
NoLegend()
plot_grid(p1, p2)
We split the combined object into a list, with each dataset as an
element. We perform standard preprocessing (log-normalization), and
identify variable features individually for each dataset based on a
variance stabilizing transformation ("vst"
).
pancreas.list <- SplitObject(pancreas, split.by = "tech")
for (i in 1:length(pancreas.list)) {
pancreas.list[[i]] <- NormalizeData(pancreas.list[[i]], verbose = FALSE)
pancreas.list[[i]] <- FindVariableFeatures(pancreas.list[[i]], selection.method = "vst", nfeatures = 2000,
verbose = FALSE)
}
We identify anchors using the FindIntegrationAnchors
function, which
takes a list of Seurat objects as
input.
reference.list <- pancreas.list[c("celseq", "celseq2", "smartseq2", "fluidigmc1")]
pancreas.anchors <- FindIntegrationAnchors(object.list = reference.list, dims = 1:30)
## Computing 2000 integration features
## Scaling features for provided objects
## Finding all pairwise anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
## Found 3499 anchors
## Filtering anchors
## Retained 2821 anchors
## Extracting within-dataset neighbors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
## Found 3515 anchors
## Filtering anchors
## Retained 2701 anchors
## Extracting within-dataset neighbors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
## Found 6173 anchors
## Filtering anchors
## Retained 4634 anchors
## Extracting within-dataset neighbors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
## Found 2176 anchors
## Filtering anchors
## Retained 1841 anchors
## Extracting within-dataset neighbors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
## Found 2774 anchors
## Filtering anchors
## Retained 2478 anchors
## Extracting within-dataset neighbors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
## Found 2723 anchors
## Filtering anchors
## Retained 2410 anchors
## Extracting within-dataset neighbors
We then pass these anchors to the IntegrateData
function, which
returns a Seurat
object.
pancreas.integrated <- IntegrateData(anchorset = pancreas.anchors, dims = 1:30)
## Merging dataset 4 into 2
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data
## Merging dataset 1 into 2 4
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data
## Merging dataset 3 into 2 4 1
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data
After running IntegrateData
, the Seurat
object will contain a new
Assay
with the integrated (or ‘batch-corrected’) expression matrix.
Note that the original (uncorrected values) are still stored in the
object in the “RNA” assay, so you can switch back and forth.
We can then use this new integrated matrix for downstream analysis and visualization. Here we scale the integrated data, run PCA, and visualize the results with UMAP. The integrated datasets cluster by cell type, instead of by technology.
# switch to integrated assay. The variable features of this assay are automatically set during
# IntegrateData
DefaultAssay(pancreas.integrated) <- "integrated"
# Run the standard workflow for visualization and clustering
pancreas.integrated <- ScaleData(pancreas.integrated, verbose = FALSE)
pancreas.integrated <- RunPCA(pancreas.integrated, npcs = 30, verbose = FALSE)
pancreas.integrated <- RunUMAP(pancreas.integrated, reduction = "pca", dims = 1:30)
p1 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "tech")
p2 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "celltype", label = TRUE, repel = TRUE) +
NoLegend()
plot_grid(p1, p2)
An alternative approach to integrate single cell RNA-seq data is to use the Mutual Nearest Neighbor (MNN) batch-correction method by Haghverdi et al..
You can either create a SingleCellExperiment
(SCE) object directly
from the count matrices, or convert directly from Seurat to SCE.
celseq.data <- as.SingleCellExperiment(pancreas.list$celseq)
celseq2.data <- as.SingleCellExperiment(pancreas.list$celseq2)
fluidigmc1.data <- as.SingleCellExperiment(pancreas.list$fluidigmc1)
smartseq2.data <- as.SingleCellExperiment(pancreas.list$smartseq2)
Find common genes and reduce each dataset to those common genes
keep_genes <- Reduce(intersect, list(rownames(celseq.data),rownames(celseq2.data),
rownames(fluidigmc1.data),rownames(smartseq2.data)))
celseq.data <- celseq.data[match(keep_genes, rownames(celseq.data)), ]
celseq2.data <- celseq2.data[match(keep_genes, rownames(celseq2.data)), ]
fluidigmc1.data <- fluidigmc1.data[match(keep_genes, rownames(fluidigmc1.data)), ]
smartseq2.data <- smartseq2.data[match(keep_genes, rownames(smartseq2.data)), ]
Calculate quality control characteristics using calculateQCMetrics()
to determine low quality cells by finding outliers with
uncharacteristically low total counts or total number of features
(genes) detected.
## celseq.data
celseq.data <- calculateQCMetrics(celseq.data)
low_lib_celseq.data <- isOutlier(celseq.data$log10_total_counts, type="lower", nmad=3)
low_genes_celseq.data <- isOutlier(celseq.data$log10_total_features_by_counts, type="lower", nmad=3)
celseq.data <- celseq.data[, !(low_lib_celseq.data | low_genes_celseq.data)]
## celseq2.data
celseq2.data <- calculateQCMetrics(celseq2.data)
low_lib_celseq2.data <- isOutlier(celseq2.data$log10_total_counts, type="lower", nmad=3)
low_genes_celseq2.data <- isOutlier(celseq2.data$log10_total_features_by_counts, type="lower", nmad=3)
celseq2.data <- celseq2.data[, !(low_lib_celseq2.data | low_genes_celseq2.data)]
## fluidigmc1.data
fluidigmc1.data <- calculateQCMetrics(fluidigmc1.data)
low_lib_fluidigmc1.data <- isOutlier(fluidigmc1.data$log10_total_counts, type="lower", nmad=3)
low_genes_fluidigmc1.data <- isOutlier(fluidigmc1.data$log10_total_features_by_counts, type="lower", nmad=3)
fluidigmc1.data <- fluidigmc1.data[, !(low_lib_fluidigmc1.data | low_genes_fluidigmc1.data)]
## smartseq2.data
smartseq2.data <- calculateQCMetrics(smartseq2.data)
low_lib_smartseq2.data <- isOutlier(smartseq2.data$log10_total_counts, type="lower", nmad=3)
low_genes_smartseq2.data <- isOutlier(smartseq2.data$log10_total_features_by_counts, type="lower", nmad=3)
smartseq2.data <- smartseq2.data[, !(low_lib_smartseq2.data | low_genes_smartseq2.data)]
Normalize the data by calculating size using the computeSumFactors()
and normalize()
functions from scran
# Compute sizefactors
celseq.data <- computeSumFactors(celseq.data)
celseq2.data <- computeSumFactors(celseq2.data)
fluidigmc1.data <- computeSumFactors(fluidigmc1.data)
smartseq2.data <- computeSumFactors(smartseq2.data)
# Normalize
celseq.data <- normalize(celseq.data)
celseq2.data <- normalize(celseq2.data)
fluidigmc1.data <- normalize(fluidigmc1.data)
smartseq2.data <- normalize(smartseq2.data)
Feature selection: we use the trendVar()
and decomposeVar()
functions to calculate the per gene variance and separate it into
technical versus biological components.
## celseq.data
fit_celseq.data <- trendVar(celseq.data, use.spikes=FALSE)
dec_celseq.data <- decomposeVar(celseq.data, fit_celseq.data)
dec_celseq.data$Symbol_TENx <- rowData(celseq.data)$Symbol_TENx
dec_celseq.data <- dec_celseq.data[order(dec_celseq.data$bio, decreasing = TRUE), ]
## celseq2.data
fit_celseq2.data <- trendVar(celseq2.data, use.spikes=FALSE)
dec_celseq2.data <- decomposeVar(celseq2.data, fit_celseq2.data)
dec_celseq2.data$Symbol_TENx <- rowData(celseq2.data)$Symbol_TENx
dec_celseq2.data <- dec_celseq2.data[order(dec_celseq2.data$bio, decreasing = TRUE), ]
## fluidigmc1.data
fit_fluidigmc1.data <- trendVar(fluidigmc1.data, use.spikes=FALSE)
dec_fluidigmc1.data <- decomposeVar(fluidigmc1.data, fit_fluidigmc1.data)
dec_fluidigmc1.data$Symbol_TENx <- rowData(fluidigmc1.data)$Symbol_TENx
dec_fluidigmc1.data <- dec_fluidigmc1.data[order(dec_fluidigmc1.data$bio, decreasing = TRUE), ]
## smartseq2.data
fit_smartseq2.data <- trendVar(smartseq2.data, use.spikes=FALSE)
dec_smartseq2.data <- decomposeVar(smartseq2.data, fit_smartseq2.data)
dec_smartseq2.data$Symbol_TENx <- rowData(smartseq2.data)$Symbol_TENx
dec_smartseq2.data <- dec_smartseq2.data[order(dec_smartseq2.data$bio, decreasing = TRUE), ]
# select the most informative genes that are shared across all datasets:
universe <- Reduce(intersect, list(rownames(dec_celseq.data),rownames(dec_celseq2.data),
rownames(dec_fluidigmc1.data),rownames(dec_smartseq2.data)))
mean.bio <- (dec_celseq.data[universe,"bio"] + dec_celseq2.data[universe,"bio"] +
dec_fluidigmc1.data[universe,"bio"] + dec_smartseq2.data[universe,"bio"])/4
hvg_genes <- universe[mean.bio > 0]
Combine the datasets into a unified SingleCellExperiment.
# total raw counts
counts_pancreas <- cbind(counts(celseq.data), counts(celseq2.data),
counts(fluidigmc1.data), counts(smartseq2.data))
# total normalized counts (with multibatch normalization)
logcounts_pancreas <- cbind(logcounts(celseq.data), logcounts(celseq2.data),
logcounts(fluidigmc1.data), logcounts(smartseq2.data))
# sce object of the combined data
sce <- SingleCellExperiment(
assays = list(counts = counts_pancreas, logcounts = logcounts_pancreas),
rowData = rowData(celseq.data), # same as rowData(pbmc4k)
colData = rbind(colData(celseq.data), colData(celseq2.data),
colData(fluidigmc1.data), colData(smartseq2.data))
)
# store the hvg_genes from the prior section into the sce object’s metadata slot via:
metadata(sce)$hvg_genes <- hvg_genes
Let’s first look at the datasets before applying MNN batch correction.
sce <- runPCA(sce,
ncomponents = 20,
feature_set = hvg_genes,
method = "irlba")
names(reducedDims(sce)) <- "PCA_naive"
p1 <- plotReducedDim(sce, use_dimred = "PCA_naive", colour_by = "tech") +
ggtitle("PCA Without batch correction")
p2 <- plotReducedDim(sce, use_dimred = "PCA_naive", colour_by = "celltype") +
ggtitle("PCA Without batch correction")
plot_grid(p1, p2)
The mutual nearest neighbors (MNN) approach within the scran package
utilizes a novel approach to adjust for batch effects. The fastMNN()
function returns a representation of the data with reduced
dimensionality, which can be used in a similar fashion to other
lower-dimensional representations such as PCA. In particular, this
representation can be used for downstream methods such as clustering.
Prior to running fastMNN()
, we rescale each batch to adjust for
differences in sequencing depth between batches. The multiBatchNorm()
function from the scran
package recomputes log-normalized expression
values after adjusting the size factors for systematic differences in
coverage between SingleCellExperiment objects. The previously computed
size factors only remove biases between cells within a single batch.
This improves the quality of the correction step by removing one aspect
of the technical differences between
batches.
rescaled <- multiBatchNorm(celseq.data, celseq2.data, fluidigmc1.data, smartseq2.data)
celseq.data_rescaled <- rescaled[[1]]
celseq2.data_rescaled <- rescaled[[2]]
fluidigmc1.data_rescaled <- rescaled[[3]]
smartseq2.data_rescaled <- rescaled[[4]]
We now run fastMNN. The BNPARAM
can be used to specify the
specific nearest neighbors method to use from the BiocNeighbors package.
Here we make use of the [Annoy library](https://github.com/spotify/annoy)
via the
BiocNeighbors::AnnoyParam()
argument. We save the reduced-dimension
MNN representation into the reducedDims
slot of our sce
object.
Optional: We can enable parallelization via the BPPARAM
argument,
enabling multicore processing.
mnn_out <- fastMNN(celseq.data_rescaled,
celseq2.data_rescaled,
fluidigmc1.data_rescaled,
smartseq2.data_rescaled,
subset.row = metadata(sce)$hvg_genes,
k = 20, d = 50, approximate = TRUE,
# BPPARAM = BiocParallel::MulticoreParam(8),
BNPARAM = BiocNeighbors::AnnoyParam())
reducedDim(sce, "MNN") <- mnn_out$correct
NOTE: fastMNN() does not produce a batch-corrected expression matrix. Thus, the result from fastMNN() should solely be treated as a reduced dimensionality representation, suitable for direct plotting, TSNE/UMAP, clustering, and trajectory analysis that relies on such results.
Plot the batch-corrected data.
p1 <- plotReducedDim(sce, use_dimred = "MNN", colour_by = "tech") + ggtitle("MNN Ouput Reduced Dimensions")
p2 <- plotReducedDim(sce, use_dimred = "MNN", colour_by = "celltype") + ggtitle("MNN Ouput Reduced Dimensions")
plot_grid(p1, p2)
sessionInfo()
## R version 3.5.3 (2019-03-11)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] BiocNeighbors_1.0.0 scran_1.10.2
## [3] scater_1.10.1 SingleCellExperiment_1.4.1
## [5] SummarizedExperiment_1.12.0 DelayedArray_0.8.0
## [7] BiocParallel_1.16.6 matrixStats_0.54.0
## [9] Biobase_2.42.0 GenomicRanges_1.34.0
## [11] GenomeInfoDb_1.18.2 IRanges_2.16.0
## [13] S4Vectors_0.20.1 BiocGenerics_0.28.0
## [15] cowplot_0.9.4 ggplot2_3.1.1
## [17] Seurat_3.0.0
##
## loaded via a namespace (and not attached):
## [1] Rtsne_0.15 ggbeeswarm_0.6.0
## [3] colorspace_1.4-1 ggridges_0.5.1
## [5] dynamicTreeCut_1.63-1 XVector_0.22.0
## [7] listenv_0.7.0 npsurv_0.4-0
## [9] ggrepel_0.8.1 codetools_0.2-16
## [11] splines_3.5.3 R.methodsS3_1.7.1
## [13] lsei_1.2-0 knitr_1.22
## [15] jsonlite_1.6 ica_1.0-2
## [17] cluster_2.0.7-1 png_0.1-7
## [19] R.oo_1.22.0 HDF5Array_1.10.1
## [21] sctransform_0.2.0 compiler_3.5.3
## [23] httr_1.4.0 assertthat_0.2.1
## [25] Matrix_1.2-15 lazyeval_0.2.2
## [27] limma_3.38.3 htmltools_0.3.6
## [29] tools_3.5.3 rsvd_1.0.0
## [31] igraph_1.2.4.1 gtable_0.3.0
## [33] glue_1.3.1 GenomeInfoDbData_1.2.0
## [35] RANN_2.6.1 reshape2_1.4.3
## [37] dplyr_0.8.0.1 Rcpp_1.0.1
## [39] gdata_2.18.0 ape_5.3
## [41] nlme_3.1-137 DelayedMatrixStats_1.4.0
## [43] gbRd_0.4-11 lmtest_0.9-37
## [45] xfun_0.6 stringr_1.4.0
## [47] globals_0.12.4 irlba_2.3.3
## [49] gtools_3.8.1 statmod_1.4.30
## [51] future_1.13.0 edgeR_3.24.3
## [53] MASS_7.3-51.1 zlibbioc_1.28.0
## [55] zoo_1.8-5 scales_1.0.0
## [57] rhdf5_2.26.2 RColorBrewer_1.1-2
## [59] yaml_2.2.0 reticulate_1.12
## [61] pbapply_1.4-0 gridExtra_2.3
## [63] stringi_1.4.3 caTools_1.17.1.2
## [65] bibtex_0.4.2 Rdpack_0.11-0
## [67] SDMTools_1.1-221.1 rlang_0.3.4
## [69] pkgconfig_2.0.2 bitops_1.0-6
## [71] evaluate_0.13 lattice_0.20-38
## [73] ROCR_1.0-7 purrr_0.3.2
## [75] Rhdf5lib_1.4.3 labeling_0.3
## [77] htmlwidgets_1.3 tidyselect_0.2.5
## [79] plyr_1.8.4 magrittr_1.5
## [81] R6_2.4.0 gplots_3.0.1.1
## [83] pillar_1.3.1 withr_2.1.2
## [85] fitdistrplus_1.0-14 survival_2.43-3
## [87] RCurl_1.95-4.12 tibble_2.1.1
## [89] future.apply_1.2.0 tsne_0.1-3
## [91] crayon_1.3.4 KernSmooth_2.23-15
## [93] plotly_4.9.0 rmarkdown_1.12
## [95] viridis_0.5.1 locfit_1.5-9.1
## [97] grid_3.5.3 data.table_1.12.2
## [99] metap_1.1 digest_0.6.18
## [101] tidyr_0.8.3 R.utils_2.8.0
## [103] munsell_0.5.0 beeswarm_0.2.3
## [105] viridisLite_0.3.0 vipor_0.4.5