From 8aeeabcea42ed11d1a0adc988a44af9a9a027048 Mon Sep 17 00:00:00 2001 From: Andrew Ghazi <6763470+andrewGhazi@users.noreply.github.com> Date: Fri, 6 Sep 2024 13:48:10 -0400 Subject: [PATCH 1/2] callout about data provenance --- episodes/eda_qc.Rmd | 5 ++++- episodes/intro-sce.Rmd | 2 +- 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/episodes/eda_qc.Rmd b/episodes/eda_qc.Rmd index 0647b3c..e499b54 100644 --- a/episodes/eda_qc.Rmd +++ b/episodes/eda_qc.Rmd @@ -56,7 +56,6 @@ This is the same data we examined in the previous lesson. From the experiment, we expect to have only a few thousand cells, while we can see that we have data for more than 500,000 droplets. It is likely that most of these droplets are empty and are capturing only ambient or background RNA. - ```{r} library(DropletUtils) library(ggplot2) @@ -84,6 +83,10 @@ The distribution of total counts (called the unique molecular identifier or UMI A simple approach would be to apply a threshold on the total count to only retain those barcodes with large totals. However, this may unnecessarily discard libraries derived from cell types with low RNA content. +::: callout +Depending on your data source, identifying and discarding empty droplets may not be necessary. Some academic institutions have research cores dedicated to single cell work that perform the sample preparation and sequencing. Many of these cores will also perform empty droplet filtering and other initial QC steps. If the sequencing outputs were provided to you by someone else, make sure to communicate with them about what pre-processing steps have been performed, if any. +::: + :::: challenge What is the median number of total counts in the raw data? diff --git a/episodes/intro-sce.Rmd b/episodes/intro-sce.Rmd index acf2ba5..f27c601 100644 --- a/episodes/intro-sce.Rmd +++ b/episodes/intro-sce.Rmd @@ -45,7 +45,7 @@ install.packages("ggplot2") ``` In our case, however, we want to install Bioconductor packages. -These packages are located in a separate repository (see comments below) so we first install the `r CRANpkg("BiocManager")` package to easily connect to the Bioconductor servers. +These packages are located in a separate repository hosted by Bioconductor, so we first install the `r CRANpkg("BiocManager")` package to easily connect to the Bioconductor servers. ```{r, eval=FALSE} install.packages("BiocManager") From 3c5cfe83cbd897cf0f3345448219436d5e0cbe5c Mon Sep 17 00:00:00 2001 From: Andrew Ghazi <6763470+andrewGhazi@users.noreply.github.com> Date: Fri, 6 Sep 2024 16:04:24 -0400 Subject: [PATCH 2/2] minor eda wording changes --- episodes/eda_qc.Rmd | 110 ++++++++++++++++++++++++++++---------------- 1 file changed, 71 insertions(+), 39 deletions(-) diff --git a/episodes/eda_qc.Rmd b/episodes/eda_qc.Rmd index e499b54..d2d8acd 100644 --- a/episodes/eda_qc.Rmd +++ b/episodes/eda_qc.Rmd @@ -7,7 +7,7 @@ exercises: 15 # Minutes of exercises in the lesson :::::::::::::::::::::::::::::::::::::: questions - How do I examine the quality of single-cell data? -- What data visualizations should I use during quality-control in a single-cell analysis? +- What data visualizations should I use during quality control in a single-cell analysis? - How do I prepare single-cell data for analysis? :::::::::::::::::::::::::::::::::::::::::::::::: @@ -45,7 +45,7 @@ We start our analysis by selecting only sample 5, which contains the injected ce ```{r, message = FALSE} library(MouseGastrulationData) -sce <- WTChimeraData(samples=5, type="raw") +sce <- WTChimeraData(samples = 5, type = "raw") sce <- sce[[1]] sce ``` @@ -147,13 +147,13 @@ Retaining these low-quality samples in the analysis could be problematic as they - interfere with variance estimation and principal component analysis - contain contaminating transcripts from ambient RNA -To mitigate these problems, we can check a few quality-control metrics and, if needed, remove low-quality samples. +To mitigate these problems, we can check a few quality control (QC) metrics and, if needed, remove low-quality samples. -### Choice of quality-control metrics +### Choice of quality control metrics -There are many possible ways to define a set of quality-control metrics, see for instance [Cole 2019](learners/reference.md#litref). Here, we keep it simple and consider only: +There are many possible ways to define a set of quality control metrics, see for instance [Cole 2019](learners/reference.md#litref). Here, we keep it simple and consider only: -- the _library size_, defined as the total sum of counts across all relevant features for each cell; +- the _library size_, defined as the total sum of counts across all relevant features *for each cell*; - the number of expressed features in each cell, defined as the number of endogenous genes with non-zero counts for that cell; - the proportion of reads mapped to genes in the mitochondrial genome. @@ -171,17 +171,18 @@ chr.loc <- mapIds(EnsDb.Mmusculus.v79, keytype = "GENEID", column = "SEQNAME") -is.mito <- which(chr.loc=="MT") +is.mito <- which(chr.loc == "MT") ``` -We can use the `scuttle` package to compute a set of quality-control metrics, specifying that we want to use the mitochondrial genes as a special set of features. +We can use the `scuttle` package to compute a set of quality control metrics, specifying that we want to use the mitochondrial genes as a special set of features. ```{r} library(scuttle) -df <- perCellQCMetrics(sce, subsets=list(Mito=is.mito)) -# include them in the object +df <- perCellQCMetrics(sce, subsets = list(Mito = is.mito)) + colData(sce) <- cbind(colData(sce), df) + colData(sce) ``` @@ -202,7 +203,8 @@ summary(df$subsets_Mito_percent) We can use the `perCellQCFilters` function to apply a set of common adaptive filters to identify low-quality cells. By default, we consider a value to be an outlier if it is more than 3 median absolute deviations (MADs) from the median in the "problematic" direction. This is loosely motivated by the fact that such a filter will retain 99% of non-outlier values that follow a normal distribution. ```{r} -reasons <- perCellQCFilters(df, sub.fields="subsets_Mito_percent") +reasons <- perCellQCFilters(df, sub.fields = "subsets_Mito_percent") + reasons sce$discard <- reasons$discard @@ -224,21 +226,31 @@ There are `r unname(table(sce$discard)[1])` cells that *haven't* been flagged to ### Diagnostic plots -It is always a good idea to check the distribution of the QC metrics and to visualize the cells that were removed, to identify possible problems with the procedure. -In particular, we expect to have few outliers and with a marked difference from "regular" cells (e.g., a bimodal distribution or a long tail). Moreover, if there are too many discarded cells, further exploration might be needed. +It is always a good idea to check the distribution of the QC metrics and to +visualize the cells that were removed, to identify possible problems with the +procedure. In particular, we expect to have few outliers and with a marked +difference from "regular" cells (e.g., a bimodal distribution or a long tail). +Moreover, if there are too many discarded cells, further exploration might be +needed. ```{r} library(scater) -plotColData(sce, y = "sum", colour_by = "discard") + ggtitle("Total count") -plotColData(sce, y = "detected", colour_by = "discard") + ggtitle("Detected features") -plotColData(sce, y = "subsets_Mito_percent", colour_by = "discard") + ggtitle("Mito percent") +plotColData(sce, y = "sum", colour_by = "discard") + + labs(title = "Total count") + +plotColData(sce, y = "detected", colour_by = "discard") + + labs(title = "Detected features") + +plotColData(sce, y = "subsets_Mito_percent", colour_by = "discard") + + labs(title = "Mito percent") + ``` While the univariate distribution of QC metrics can give some insight on the quality of the sample, often looking at the bivariate distribution of QC metrics is useful, e.g., to confirm that there are no cells with both large total counts and large mitochondrial counts, to ensure that we are not inadvertently removing high-quality cells that happen to be highly metabolically active. ```{r} -plotColData(sce, x="sum", y="subsets_Mito_percent", colour_by="discard") +plotColData(sce, x ="sum", y = "subsets_Mito_percent", colour_by = "discard") ``` It could also be a good idea to perform a differential expression analysis between retained and discarded cells to check wether we are removing an unusual cell population rather than low-quality libraries (see [Section 1.5 of OSCA advanced](https://bioconductor.org/books/release/OSCA.advanced/quality-control-redux.html#qc-discard-cell-types)). @@ -263,6 +275,7 @@ The _library size factor_ for each cell is directly proportional to its library ```{r} lib.sf <- librarySizeFactors(sce) + summary(lib.sf) sf_df = data.frame(size_factor = lib.sf) @@ -286,16 +299,21 @@ Note that while we focus on normalization by deconvolution here, many other meth ```{r} library(scran) + set.seed(100) + clust <- quickCluster(sce) + table(clust) ``` ```{r} -deconv.sf <- calculateSumFactors(sce, cluster=clust) +deconv.sf <- calculateSumFactors(sce, cluster = clust) + summary(deconv.sf) sf_df$deconv_sf = deconv.sf + sf_df$clust = clust ggplot(sf_df, aes(size_factor, deconv_sf)) + @@ -310,7 +328,9 @@ Once we have computed the size factors, we compute the normalized expression val ```{r} sizeFactors(sce) <- deconv.sf + sce <- logNormCounts(sce) + sce ``` @@ -320,7 +340,7 @@ Some sophisticated experiments perform additional steps so that they can estimat ::: solution -Spike-ins are deliberately introduced exogeneous RNA from an exotic or synthetic source at a known concentration. This provides a known signal to normalize to. Exotic or synthetic RNA (e.g. soil bacteria RNA in a study of human cells) is used in order to avoid confusing spike-in RNA with sample RNA. This has the obvious advantage of accounting for cell-wise variation, but adds additional sample-preparation work. +Spike-ins are deliberately-introduced exogeneous RNA from an exotic or synthetic source at a known concentration. This provides a known signal to normalize to. Exotic or synthetic RNA (e.g. soil bacteria RNA in a study of human cells) is used in order to avoid confusing spike-in RNA with sample RNA. This has the obvious advantage of accounting for cell-wise variation, but adds additional sample-preparation work. ::: @@ -334,16 +354,17 @@ The choice of genes to use in this calculation has a major impact on the results ### Quantifying per-gene variation -The simplest approach to feature selection is to select the most variable genes based on their log-normalized expression across the population. +The simplest approach to feature selection is to select the most variable genes based on their log-normalized expression across the population. This is motivated by practical idea that if we're going to try to explain variation in gene expression by biological factors, those genes need to have variance to explain. -Calculation of the per-gene variance is simple but feature selection requires modelling of the mean-variance relationship. The log-transformation is not a variance stabilizing transformation in most cases, which means that the total variance of a gene is driven more by its abundance than its underlying biological heterogeneity. To account for this, the `modelGeneVar` function fits a trend to the variance with respect to abundance across all genes. +Calculation of the per-gene variance is simple but feature selection requires modeling of the mean-variance relationship. The log-transformation is not a variance stabilizing transformation in most cases, which means that the total variance of a gene is driven more by its abundance than its underlying biological heterogeneity. To account for this, the `modelGeneVar` function fits a trend to the variance with respect to abundance across all genes. ```{r} dec.sce <- modelGeneVar(sce) + fit.sce <- metadata(dec.sce) mean_var_df = data.frame(mean = fit.sce$mean, - var = fit.sce$var) + var = fit.sce$var) ggplot(mean_var_df, aes(mean, var)) + geom_point() + @@ -358,10 +379,11 @@ The blue line represents the uninteresting "technical" variance for any given ge ### Selecting highly variable genes -The next step is to select the subset of HVGs to use in downstream analyses. A larger set will assure that we do not remove important genes, at the cost of potentially increasing noise. Typically, we restrict ourselves to the top $n$ genes, here we chose $n=1000$, but this choice should be guided by prior biological knowledge; for instance, we may expect that only about 10% of genes to be differentially expressed across our cell populations and hence select 10% of genes as higly variable (e.g., by setting `prop=0.1`). +The next step is to select the subset of HVGs to use in downstream analyses. A larger set will assure that we do not remove important genes, at the cost of potentially increasing noise. Typically, we restrict ourselves to the top $n$ genes, here we chose $n = 1000$, but this choice should be guided by prior biological knowledge; for instance, we may expect that only about 10% of genes to be differentially expressed across our cell populations and hence select 10% of genes as higly variable (e.g., by setting `prop = 0.1`). ```{r} -hvg.sce.var <- getTopHVGs(dec.sce, n=1000) +hvg.sce.var <- getTopHVGs(dec.sce, n = 1000) + head(hvg.sce.var) ``` @@ -386,11 +408,12 @@ Without getting into the technical details, one nice feature of PCA is that the One simple way to maximize our chance of capturing biological variation is by computing the PCs starting from the highly variable genes identified before. ```{r} -sce <- runPCA(sce, subset_row=hvg.sce.var) +sce <- runPCA(sce, subset_row = hvg.sce.var) + sce ``` -By default, `runPCA` computes the first 50 principal components. We can check how much original variability they explain. +By default, `runPCA` computes the first 50 principal components. We can check how much original variability they explain. These values are stored in the attributes of the `percentVar` reducedDim: ```{r} pct_var_df = data.frame(PC = 1:50, @@ -407,13 +430,13 @@ You can see the first two PCs capture the largest amount of variation, but in th And we can of course visualize the first 2-3 components, perhaps color-coding each point by an interesting feature, in this case the total number of UMIs per cell. ```{r} -plotPCA(sce, colour_by="sum") +plotPCA(sce, colour_by = "sum") ``` It can be helpful to compare pairs of PCs. This can be done with the `ncomponents` argument to `plotReducedDim()`. For example if one batch or cell type splits off on a particular PC, this can help visualize the effect of that. ```{r} -plotReducedDim(sce, dimred="PCA", ncomponents=3) +plotReducedDim(sce, dimred = "PCA", ncomponents = 3) ``` ### Non-linear methods @@ -424,13 +447,17 @@ These methods attempt to find a low-dimensional representation of the data that ```{r} set.seed(100) -sce <- runTSNE(sce, dimred="PCA") + +sce <- runTSNE(sce, dimred = "PCA") + plotTSNE(sce) ``` ```{r} set.seed(111) -sce <- runUMAP(sce, dimred="PCA") + +sce <- runUMAP(sce, dimred = "PCA") + plotUMAP(sce) ``` @@ -478,30 +505,35 @@ At a high level, the algorithm can be defined by the following steps: Intuitively, if a "cell" is surrounded only by simulated doublets is very likely to be a doublet itself. -This approach is implemented below, where we visualize the scores in a t-SNE plot. +This approach is implemented below using the `scDblFinder` library. We then visualize the scores in a t-SNE plot. ```{r} library(scDblFinder) + set.seed(100) -dbl.dens <- computeDoubletDensity(sce, subset.row=hvg.sce.var, - d=ncol(reducedDim(sce))) +dbl.dens <- computeDoubletDensity(sce, subset.row = hvg.sce.var, + d = ncol(reducedDim(sce))) summary(dbl.dens) sce$DoubletScore <- dbl.dens -plotTSNE(sce, colour_by="DoubletScore") +plotTSNE(sce, colour_by = "DoubletScore") ``` We can explicitly convert this into doublet calls by identifying large outliers for the score within each sample. Here we use the "griffiths" method to do so. ```{r} -dbl.calls <- doubletThresholding(data.frame(score=dbl.dens), - method="griffiths", returnType="call") +dbl.calls <- doubletThresholding(data.frame(score = dbl.dens), + method = "griffiths", + returnType = "call") summary(dbl.calls) + sce$doublet <- dbl.calls -plotColData(sce, y="DoubletScore", colour_by="doublet") -plotTSNE(sce, colour_by="doublet") + +plotColData(sce, y = "DoubletScore", colour_by = "doublet") + +plotTSNE(sce, colour_by = "doublet") ``` One way to determine whether a cell is in a real transient state or it is a doublet is to check the number of detected genes and total UMI counts. @@ -531,7 +563,7 @@ An alternative to PCA for dimensionality reduction is the [NewWave method](https :::::::::::::: hint -First subset the object to include only highly variable genes (`sce2 <- sce[hvg.sce.var,]`) and then apply the `NewWave` function to the new object setting `K=10` to obtain the first 10 dimensions. +First subset the object to include only highly variable genes (`sce2 <- sce[hvg.sce.var,]`) and then apply the `NewWave` function to the new object setting `K = 10` to obtain the first 10 dimensions. :::::::::::::::::::::::