Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add exercises #24

Merged
merged 2 commits into from
Sep 6, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
115 changes: 75 additions & 40 deletions episodes/eda_qc.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@ exercises: 15 # Minutes of exercises in the lesson
:::::::::::::::::::::::::::::::::::::: questions

- How do I examine the quality of single-cell data?
- What data visualizations should I use during quality-control in a single-cell analysis?
- What data visualizations should I use during quality control in a single-cell analysis?
- How do I prepare single-cell data for analysis?

::::::::::::::::::::::::::::::::::::::::::::::::
Expand Down Expand Up @@ -45,7 +45,7 @@ We start our analysis by selecting only sample 5, which contains the injected ce

```{r, message = FALSE}
library(MouseGastrulationData)
sce <- WTChimeraData(samples=5, type="raw")
sce <- WTChimeraData(samples = 5, type = "raw")
sce <- sce[[1]]
sce
```
Expand All @@ -56,7 +56,6 @@ This is the same data we examined in the previous lesson.

From the experiment, we expect to have only a few thousand cells, while we can see that we have data for more than 500,000 droplets. It is likely that most of these droplets are empty and are capturing only ambient or background RNA.


```{r}
library(DropletUtils)
library(ggplot2)
Expand Down Expand Up @@ -84,6 +83,10 @@ The distribution of total counts (called the unique molecular identifier or UMI

A simple approach would be to apply a threshold on the total count to only retain those barcodes with large totals. However, this may unnecessarily discard libraries derived from cell types with low RNA content.

::: callout
Depending on your data source, identifying and discarding empty droplets may not be necessary. Some academic institutions have research cores dedicated to single cell work that perform the sample preparation and sequencing. Many of these cores will also perform empty droplet filtering and other initial QC steps. If the sequencing outputs were provided to you by someone else, make sure to communicate with them about what pre-processing steps have been performed, if any.
:::

:::: challenge

What is the median number of total counts in the raw data?
Expand Down Expand Up @@ -144,13 +147,13 @@ Retaining these low-quality samples in the analysis could be problematic as they
- interfere with variance estimation and principal component analysis
- contain contaminating transcripts from ambient RNA

To mitigate these problems, we can check a few quality-control metrics and, if needed, remove low-quality samples.
To mitigate these problems, we can check a few quality control (QC) metrics and, if needed, remove low-quality samples.

### Choice of quality-control metrics
### Choice of quality control metrics

There are many possible ways to define a set of quality-control metrics, see for instance [Cole 2019](learners/reference.md#litref). Here, we keep it simple and consider only:
There are many possible ways to define a set of quality control metrics, see for instance [Cole 2019](learners/reference.md#litref). Here, we keep it simple and consider only:

- the _library size_, defined as the total sum of counts across all relevant features for each cell;
- the _library size_, defined as the total sum of counts across all relevant features *for each cell*;
- the number of expressed features in each cell, defined as the number of endogenous genes with non-zero counts for that cell;
- the proportion of reads mapped to genes in the mitochondrial genome.

Expand All @@ -168,17 +171,18 @@ chr.loc <- mapIds(EnsDb.Mmusculus.v79,
keytype = "GENEID",
column = "SEQNAME")

is.mito <- which(chr.loc=="MT")
is.mito <- which(chr.loc == "MT")
```

We can use the `scuttle` package to compute a set of quality-control metrics, specifying that we want to use the mitochondrial genes as a special set of features.
We can use the `scuttle` package to compute a set of quality control metrics, specifying that we want to use the mitochondrial genes as a special set of features.

```{r}
library(scuttle)
df <- perCellQCMetrics(sce, subsets=list(Mito=is.mito))

# include them in the object
df <- perCellQCMetrics(sce, subsets = list(Mito = is.mito))

colData(sce) <- cbind(colData(sce), df)

colData(sce)
```

Expand All @@ -199,7 +203,8 @@ summary(df$subsets_Mito_percent)
We can use the `perCellQCFilters` function to apply a set of common adaptive filters to identify low-quality cells. By default, we consider a value to be an outlier if it is more than 3 median absolute deviations (MADs) from the median in the "problematic" direction. This is loosely motivated by the fact that such a filter will retain 99% of non-outlier values that follow a normal distribution.

```{r}
reasons <- perCellQCFilters(df, sub.fields="subsets_Mito_percent")
reasons <- perCellQCFilters(df, sub.fields = "subsets_Mito_percent")

reasons

sce$discard <- reasons$discard
Expand All @@ -221,21 +226,31 @@ There are `r unname(table(sce$discard)[1])` cells that *haven't* been flagged to

### Diagnostic plots

It is always a good idea to check the distribution of the QC metrics and to visualize the cells that were removed, to identify possible problems with the procedure.
In particular, we expect to have few outliers and with a marked difference from "regular" cells (e.g., a bimodal distribution or a long tail). Moreover, if there are too many discarded cells, further exploration might be needed.
It is always a good idea to check the distribution of the QC metrics and to
visualize the cells that were removed, to identify possible problems with the
procedure. In particular, we expect to have few outliers and with a marked
difference from "regular" cells (e.g., a bimodal distribution or a long tail).
Moreover, if there are too many discarded cells, further exploration might be
needed.

```{r}
library(scater)

plotColData(sce, y = "sum", colour_by = "discard") + ggtitle("Total count")
plotColData(sce, y = "detected", colour_by = "discard") + ggtitle("Detected features")
plotColData(sce, y = "subsets_Mito_percent", colour_by = "discard") + ggtitle("Mito percent")
plotColData(sce, y = "sum", colour_by = "discard") +
labs(title = "Total count")

plotColData(sce, y = "detected", colour_by = "discard") +
labs(title = "Detected features")

plotColData(sce, y = "subsets_Mito_percent", colour_by = "discard") +
labs(title = "Mito percent")

```

While the univariate distribution of QC metrics can give some insight on the quality of the sample, often looking at the bivariate distribution of QC metrics is useful, e.g., to confirm that there are no cells with both large total counts and large mitochondrial counts, to ensure that we are not inadvertently removing high-quality cells that happen to be highly metabolically active.

```{r}
plotColData(sce, x="sum", y="subsets_Mito_percent", colour_by="discard")
plotColData(sce, x ="sum", y = "subsets_Mito_percent", colour_by = "discard")
```

It could also be a good idea to perform a differential expression analysis between retained and discarded cells to check wether we are removing an unusual cell population rather than low-quality libraries (see [Section 1.5 of OSCA advanced](https://bioconductor.org/books/release/OSCA.advanced/quality-control-redux.html#qc-discard-cell-types)).
Expand All @@ -260,6 +275,7 @@ The _library size factor_ for each cell is directly proportional to its library

```{r}
lib.sf <- librarySizeFactors(sce)

summary(lib.sf)

sf_df = data.frame(size_factor = lib.sf)
Expand All @@ -283,16 +299,21 @@ Note that while we focus on normalization by deconvolution here, many other meth

```{r}
library(scran)

set.seed(100)

clust <- quickCluster(sce)

table(clust)
```

```{r}
deconv.sf <- calculateSumFactors(sce, cluster=clust)
deconv.sf <- calculateSumFactors(sce, cluster = clust)

summary(deconv.sf)

sf_df$deconv_sf = deconv.sf

sf_df$clust = clust

ggplot(sf_df, aes(size_factor, deconv_sf)) +
Expand All @@ -307,7 +328,9 @@ Once we have computed the size factors, we compute the normalized expression val

```{r}
sizeFactors(sce) <- deconv.sf

sce <- logNormCounts(sce)

sce
```

Expand All @@ -317,7 +340,7 @@ Some sophisticated experiments perform additional steps so that they can estimat

::: solution

Spike-ins are deliberately introduced exogeneous RNA from an exotic or synthetic source at a known concentration. This provides a known signal to normalize to. Exotic or synthetic RNA (e.g. soil bacteria RNA in a study of human cells) is used in order to avoid confusing spike-in RNA with sample RNA. This has the obvious advantage of accounting for cell-wise variation, but adds additional sample-preparation work.
Spike-ins are deliberately-introduced exogeneous RNA from an exotic or synthetic source at a known concentration. This provides a known signal to normalize to. Exotic or synthetic RNA (e.g. soil bacteria RNA in a study of human cells) is used in order to avoid confusing spike-in RNA with sample RNA. This has the obvious advantage of accounting for cell-wise variation, but adds additional sample-preparation work.

:::

Expand All @@ -331,16 +354,17 @@ The choice of genes to use in this calculation has a major impact on the results

### Quantifying per-gene variation

The simplest approach to feature selection is to select the most variable genes based on their log-normalized expression across the population.
The simplest approach to feature selection is to select the most variable genes based on their log-normalized expression across the population. This is motivated by practical idea that if we're going to try to explain variation in gene expression by biological factors, those genes need to have variance to explain.

Calculation of the per-gene variance is simple but feature selection requires modelling of the mean-variance relationship. The log-transformation is not a variance stabilizing transformation in most cases, which means that the total variance of a gene is driven more by its abundance than its underlying biological heterogeneity. To account for this, the `modelGeneVar` function fits a trend to the variance with respect to abundance across all genes.
Calculation of the per-gene variance is simple but feature selection requires modeling of the mean-variance relationship. The log-transformation is not a variance stabilizing transformation in most cases, which means that the total variance of a gene is driven more by its abundance than its underlying biological heterogeneity. To account for this, the `modelGeneVar` function fits a trend to the variance with respect to abundance across all genes.

```{r}
dec.sce <- modelGeneVar(sce)

fit.sce <- metadata(dec.sce)

mean_var_df = data.frame(mean = fit.sce$mean,
var = fit.sce$var)
var = fit.sce$var)

ggplot(mean_var_df, aes(mean, var)) +
geom_point() +
Expand All @@ -355,10 +379,11 @@ The blue line represents the uninteresting "technical" variance for any given ge

### Selecting highly variable genes

The next step is to select the subset of HVGs to use in downstream analyses. A larger set will assure that we do not remove important genes, at the cost of potentially increasing noise. Typically, we restrict ourselves to the top $n$ genes, here we chose $n=1000$, but this choice should be guided by prior biological knowledge; for instance, we may expect that only about 10% of genes to be differentially expressed across our cell populations and hence select 10% of genes as higly variable (e.g., by setting `prop=0.1`).
The next step is to select the subset of HVGs to use in downstream analyses. A larger set will assure that we do not remove important genes, at the cost of potentially increasing noise. Typically, we restrict ourselves to the top $n$ genes, here we chose $n = 1000$, but this choice should be guided by prior biological knowledge; for instance, we may expect that only about 10% of genes to be differentially expressed across our cell populations and hence select 10% of genes as higly variable (e.g., by setting `prop = 0.1`).

```{r}
hvg.sce.var <- getTopHVGs(dec.sce, n=1000)
hvg.sce.var <- getTopHVGs(dec.sce, n = 1000)

head(hvg.sce.var)
```

Expand All @@ -383,11 +408,12 @@ Without getting into the technical details, one nice feature of PCA is that the
One simple way to maximize our chance of capturing biological variation is by computing the PCs starting from the highly variable genes identified before.

```{r}
sce <- runPCA(sce, subset_row=hvg.sce.var)
sce <- runPCA(sce, subset_row = hvg.sce.var)

sce
```

By default, `runPCA` computes the first 50 principal components. We can check how much original variability they explain.
By default, `runPCA` computes the first 50 principal components. We can check how much original variability they explain. These values are stored in the attributes of the `percentVar` reducedDim:

```{r}
pct_var_df = data.frame(PC = 1:50,
Expand All @@ -404,13 +430,13 @@ You can see the first two PCs capture the largest amount of variation, but in th
And we can of course visualize the first 2-3 components, perhaps color-coding each point by an interesting feature, in this case the total number of UMIs per cell.

```{r}
plotPCA(sce, colour_by="sum")
plotPCA(sce, colour_by = "sum")
```

It can be helpful to compare pairs of PCs. This can be done with the `ncomponents` argument to `plotReducedDim()`. For example if one batch or cell type splits off on a particular PC, this can help visualize the effect of that.

```{r}
plotReducedDim(sce, dimred="PCA", ncomponents=3)
plotReducedDim(sce, dimred = "PCA", ncomponents = 3)
```

### Non-linear methods
Expand All @@ -421,13 +447,17 @@ These methods attempt to find a low-dimensional representation of the data that

```{r}
set.seed(100)
sce <- runTSNE(sce, dimred="PCA")

sce <- runTSNE(sce, dimred = "PCA")

plotTSNE(sce)
```

```{r}
set.seed(111)
sce <- runUMAP(sce, dimred="PCA")

sce <- runUMAP(sce, dimred = "PCA")

plotUMAP(sce)
```

Expand Down Expand Up @@ -475,30 +505,35 @@ At a high level, the algorithm can be defined by the following steps:

Intuitively, if a "cell" is surrounded only by simulated doublets is very likely to be a doublet itself.

This approach is implemented below, where we visualize the scores in a t-SNE plot.
This approach is implemented below using the `scDblFinder` library. We then visualize the scores in a t-SNE plot.

```{r}
library(scDblFinder)

set.seed(100)

dbl.dens <- computeDoubletDensity(sce, subset.row=hvg.sce.var,
d=ncol(reducedDim(sce)))
dbl.dens <- computeDoubletDensity(sce, subset.row = hvg.sce.var,
d = ncol(reducedDim(sce)))
summary(dbl.dens)

sce$DoubletScore <- dbl.dens

plotTSNE(sce, colour_by="DoubletScore")
plotTSNE(sce, colour_by = "DoubletScore")
```

We can explicitly convert this into doublet calls by identifying large outliers for the score within each sample. Here we use the "griffiths" method to do so.

```{r}
dbl.calls <- doubletThresholding(data.frame(score=dbl.dens),
method="griffiths", returnType="call")
dbl.calls <- doubletThresholding(data.frame(score = dbl.dens),
method = "griffiths",
returnType = "call")
summary(dbl.calls)

sce$doublet <- dbl.calls
plotColData(sce, y="DoubletScore", colour_by="doublet")
plotTSNE(sce, colour_by="doublet")

plotColData(sce, y = "DoubletScore", colour_by = "doublet")

plotTSNE(sce, colour_by = "doublet")
```

One way to determine whether a cell is in a real transient state or it is a doublet is to check the number of detected genes and total UMI counts.
Expand Down Expand Up @@ -528,7 +563,7 @@ An alternative to PCA for dimensionality reduction is the [NewWave method](https

:::::::::::::: hint

First subset the object to include only highly variable genes (`sce2 <- sce[hvg.sce.var,]`) and then apply the `NewWave` function to the new object setting `K=10` to obtain the first 10 dimensions.
First subset the object to include only highly variable genes (`sce2 <- sce[hvg.sce.var,]`) and then apply the `NewWave` function to the new object setting `K = 10` to obtain the first 10 dimensions.

:::::::::::::::::::::::

Expand Down
2 changes: 1 addition & 1 deletion episodes/intro-sce.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -45,7 +45,7 @@ install.packages("ggplot2")
```

In our case, however, we want to install Bioconductor packages.
These packages are located in a separate repository (see comments below) so we first install the `r CRANpkg("BiocManager")` package to easily connect to the Bioconductor servers.
These packages are located in a separate repository hosted by Bioconductor, so we first install the `r CRANpkg("BiocManager")` package to easily connect to the Bioconductor servers.

```{r, eval=FALSE}
install.packages("BiocManager")
Expand Down
Loading