diff --git a/aio.html b/aio.html index 35e727a..132445f 100644 --- a/aio.html +++ b/aio.html @@ -435,8 +435,8 @@
In our case, however, we want to install Bioconductor packages. These -packages are located in a separate repository (see comments below) so we -first install the BiocManager +packages are located in a separate repository hosted by Bioconductor, so +we first install the BiocManager package to easily connect to the Bioconductor servers.
library(MouseGastrulationData)
-sce <- WTChimeraData(samples=5, type="raw")
+sce <- WTChimeraData(samples = 5, type = "raw")
sce <- sce[[1]]
sce
Depending on your data source, identifying and discarding empty +droplets may not be necessary. Some academic institutions have research +cores dedicated to single cell work that perform the sample preparation +and sequencing. Many of these cores will also perform empty droplet +filtering and other initial QC steps. If the sequencing outputs were +provided to you by someone else, make sure to communicate with them +about what pre-processing steps have been performed, if any.
+Whenever your code involves the generation of random numbers, it’s a good practice to set the random seed in R with @@ -1130,17 +1147,17 @@
To mitigate these problems, we can check a few quality-control +
To mitigate these problems, we can check a few quality control (QC) metrics and, if needed, remove low-quality samples.
There are many possible ways to define a set of quality-control +
There are many possible ways to define a set of quality control metrics, see for instance Cole 2019. Here, we keep it simple and consider only:
We can use the scuttle
package to compute a set of
-quality-control metrics, specifying that we want to use the
+quality control metrics, specifying that we want to use the
mitochondrial genes as a special set of features.
library(scuttle)
-df <- perCellQCMetrics(sce, subsets=list(Mito=is.mito))
-# include them in the object
+df <- perCellQCMetrics(sce, subsets = list(Mito = is.mito))
+
colData(sce) <- cbind(colData(sce), df)
+
colData(sce)
-reasons <- perCellQCFilters(df, sub.fields="subsets_Mito_percent")
+reasons <- perCellQCFilters(df, sub.fields = "subsets_Mito_percent")
+
reasons
library(scater)
-plotColData(sce, y = "sum", colour_by = "discard") + ggtitle("Total count")
+plotColData(sce, y = "sum", colour_by = "discard") +
+ labs(title = "Total count")
-plotColData(sce, y = "detected", colour_by = "discard") + ggtitle("Detected features")
+plotColData(sce, y = "detected", colour_by = "discard") +
+ labs(title = "Detected features")
-plotColData(sce, y = "subsets_Mito_percent", colour_by = "discard") + ggtitle("Mito percent")
+plotColData(sce, y = "subsets_Mito_percent", colour_by = "discard") +
+ labs(title = "Mito percent")
While the univariate distribution of QC metrics can give some insight on the quality of the sample, often looking at the bivariate @@ -1391,7 +1413,7 @@
-plotColData(sce, x="sum", y="subsets_Mito_percent", colour_by="discard")
+plotColData(sce, x ="sum", y = "subsets_Mito_percent", colour_by = "discard")
It could also be a good idea to perform a differential expression analysis between retained and discarded cells to check wether we are @@ -1462,6 +1484,7 @@
lib.sf <- librarySizeFactors(sce)
+
summary(lib.sf)
library(scran)
+
set.seed(100)
+
clust <- quickCluster(sce)
+
table(clust)
-deconv.sf <- calculateSumFactors(sce, cluster=clust)
+deconv.sf <- calculateSumFactors(sce, cluster = clust)
+
summary(deconv.sf)
sf_df$deconv_sf = deconv.sf
+
sf_df$clust = clust
ggplot(sf_df, aes(size_factor, deconv_sf)) +
@@ -1556,7 +1584,9 @@ R
sizeFactors(sce) <- deconv.sf
+
sce <- logNormCounts(sce)
+
sce
Spike-ins are deliberately introduced exogeneous RNA from an exotic +
Spike-ins are deliberately-introduced exogeneous RNA from an exotic or synthetic source at a known concentration. This provides a known signal to normalize to. Exotic or synthetic RNA (e.g. soil bacteria RNA in a study of human cells) is used in order to avoid confusing spike-in @@ -1625,9 +1655,11 @@
dec.sce <- modelGeneVar(sce)
+
fit.sce <- metadata(dec.sce)
mean_var_df = data.frame(mean = fit.sce$mean,
- var = fit.sce$var)
+ var = fit.sce$var)
ggplot(mean_var_df, aes(mean, var)) +
geom_point() +
@@ -1661,17 +1694,18 @@ Selecting highly variable genes\(n\)
-genes, here we chose \(n=1000\), but
+genes, here we chose \(n = 1000\), but
this choice should be guided by prior biological knowledge; for
instance, we may expect that only about 10% of genes to be
differentially expressed across our cell populations and hence select
10% of genes as higly variable (e.g., by setting
-prop=0.1
).
+prop = 0.1
).
R
-hvg.sce.var <- getTopHVGs(dec.sce, n=1000)
+hvg.sce.var <- getTopHVGs(dec.sce, n = 1000)
+
head(hvg.sce.var)
@@ -1735,7 +1769,8 @@ Principal Component Analysis (PCA)R
-sce <- runPCA(sce, subset_row=hvg.sce.var)
+sce <- runPCA(sce, subset_row = hvg.sce.var)
+
sce
@@ -1756,7 +1791,9 @@ OUTPUT<
altExpNames(0):
By default, runPCA
computes the first 50 principal
-components. We can check how much original variability they explain.
+components. We can check how much original variability they explain.
+These values are stored in the attributes of the percentVar
+reducedDim:
R
@@ -1779,7 +1816,7 @@ RR
-plotPCA(sce, colour_by="sum")
+plotPCA(sce, colour_by = "sum")
It can be helpful to compare pairs of PCs. This can be done with the
ncomponents
argument to plotReducedDim()
. For
@@ -1789,7 +1826,7 @@
RR
-plotReducedDim(sce, dimred="PCA", ncomponents=3)
+plotReducedDim(sce, dimred = "PCA", ncomponents = 3)
set.seed(100)
-sce <- runTSNE(sce, dimred="PCA")
+
+sce <- runTSNE(sce, dimred = "PCA")
+
plotTSNE(sce)
set.seed(111)
-sce <- runUMAP(sce, dimred="PCA")
+
+sce <- runUMAP(sce, dimred = "PCA")
+
plotUMAP(sce)
It is easy to over-interpret t-SNE and UMAP plots. We note that the @@ -1879,7 +1920,7 @@
Mathematically, this would require the data to fall on a two-dimensional plane (for linear methods like PCA) or a smooth 2D @@ -1930,17 +1971,18 @@
library(scDblFinder)
+
set.seed(100)
-dbl.dens <- computeDoubletDensity(sce, subset.row=hvg.sce.var,
- d=ncol(reducedDim(sce)))
+dbl.dens <- computeDoubletDensity(sce, subset.row = hvg.sce.var,
+ d = ncol(reducedDim(sce)))
summary(dbl.dens)
sce$DoubletScore <- dbl.dens
-plotTSNE(sce, colour_by="DoubletScore")
+plotTSNE(sce, colour_by = "DoubletScore")
We can explicitly convert this into doublet calls by identifying large outliers for the score within each sample. Here we use the @@ -1964,8 +2006,9 @@
-dbl.calls <- doubletThresholding(data.frame(score=dbl.dens),
- method="griffiths", returnType="call")
+dbl.calls <- doubletThresholding(data.frame(score = dbl.dens),
+ method = "griffiths",
+ returnType = "call")
summary(dbl.calls)
sce$doublet <- dbl.calls
-plotColData(sce, y="DoubletScore", colour_by="doublet")
+
+plotColData(sce, y = "DoubletScore", colour_by = "doublet")
-plotTSNE(sce, colour_by="doublet")
+plotTSNE(sce, colour_by = "doublet")
One way to determine whether a cell is in a real transient state or it is a doublet is to check the number of detected genes and total UMI @@ -2046,12 +2090,12 @@
First subset the object to include only highly variable genes
(sce2 <- sce[hvg.sce.var,]
) and then apply the
NewWave
function to the new object setting
-K=10
to obtain the first 10 dimensions.
K = 10
to obtain the first 10 dimensions.
We see in the help documentation for ?clusterCells
that
all of the clustering algorithm details are handled through the
@@ -2474,7 +2518,7 @@
One important reason why is because averages over all other clusters can be sensitive to the cell type composition. If a rare cell type shows @@ -2768,7 +2812,7 @@
Use BiocParallel
and the BPPARAM
argument!
This example will set it to use four cores on your laptop, but you can
@@ -3227,7 +3271,7 @@
The example that jumps out most strongly to the eye is ExE endoderm, which doesn’t show clear separate modes. Simultaneously, Endothelium @@ -3376,7 +3420,7 @@
The NNGraphParam
constructor has an argument
cluster.args
. This allows to specify arguments passed on to
@@ -3393,7 +3437,7 @@
TODO
Use the goana()
function from the limma
package to identify GO BP terms that are overrepresented in the list of
@@ -3436,7 +3480,7 @@
TODO
TODO
From ?MulticoreParam
:
@@ -5330,7 +5374,7 @@Challenge
-+This code block calculates the exact PCA coordinates. Another thing to note: PC vectors are only identified up to a sign flip. We can see @@ -5727,7 +5771,7 @@
Exercise 1: Out of memory representation
-+See the
HDF5Array
function for reading from HDF5 and thewriteHDF5Array
function for writing to HDF5 from the HDF5Array @@ -5741,7 +5785,7 @@Give me a hint
-+R @@ -5820,7 +5864,7 @@
Exercise 2: Parallelization
-+Use the function
@@ -5833,7 +5877,7 @@system.time
to obtain the runtime of each job.Give me a hint
-+R @@ -6402,7 +6446,7 @@
Exercise 1
-+R @@ -6435,7 +6479,7 @@
Exercise 2
-+R @@ -6467,7 +6511,7 @@
Exercise 3
-+R @@ -6508,7 +6552,7 @@
Exercise 4
-+R diff --git a/cell_type_annotation.html b/cell_type_annotation.html index d19f7f2..8c21e85 100644 --- a/cell_type_annotation.html +++ b/cell_type_annotation.html @@ -507,7 +507,7 @@
Challenge
-+We see in the help documentation for
?clusterCells
that all of the clustering algorithm details are handled through the @@ -656,7 +656,7 @@Challenge
-+One important reason why is because averages over all other clusters can be sensitive to the cell type composition. If a rare cell type shows @@ -947,7 +947,7 @@
Challenge
-+Use
BiocParallel
and theBPPARAM
argument! This example will set it to use four cores on your laptop, but you can @@ -1405,7 +1405,7 @@Challenge
-+The example that jumps out most strongly to the eye is ExE endoderm, which doesn’t show clear separate modes. Simultaneously, Endothelium @@ -1550,7 +1550,7 @@
Exercise 1: Clustering
-+The
NNGraphParam
constructor has an argumentcluster.args
. This allows to specify arguments passed on to @@ -1567,7 +1567,7 @@Give me a hint
-+@@ -1596,7 +1596,7 @@TODO
Exercise 2: Cluster annotation
-+Use the
goana()
function from the limma package to identify GO BP terms that are overrepresented in the list of @@ -1610,7 +1610,7 @@Give me a hint
-+@@ -1640,7 +1640,7 @@TODO
Exercise 3: Workflow
-+@@ -453,6 +453,23 @@diff --git a/eda_qc.html b/eda_qc.html index 5e2321c..06217c8 100644 --- a/eda_qc.html +++ b/eda_qc.html @@ -353,7 +353,7 @@TODO
Overview
@@ -393,7 +393,7 @@Questions
- How do I examine the quality of single-cell data?
-- What data visualizations should I use during quality-control in a +
- What data visualizations should I use during quality control in a single-cell analysis?
- How do I prepare single-cell data for analysis?
R
library(MouseGastrulationData) -sce <- WTChimeraData(samples=5, type="raw") +sce <- WTChimeraData(samples = 5, type = "raw") sce <- sce[[1]] sce
R +
++ ++++Callout
+++Depending on your data source, identifying and discarding empty +droplets may not be necessary. Some academic institutions have research +cores dedicated to single cell work that perform the sample preparation +and sequencing. Many of these cores will also perform empty droplet +filtering and other initial QC steps. If the sequencing outputs were +provided to you by someone else, make sure to communicate with them +about what pre-processing steps have been performed, if any.
+@@ -469,7 +486,7 @@Challenge
-+R @@ -548,7 +565,7 @@
OUTPUT< -
+Whenever your code involves the generation of random numbers, it’s a good practice to set the random seed in R with @@ -575,15 +592,15 @@
interfere with variance estimation and principal component analysiscontain contaminating transcripts from ambient RNA -To mitigate these problems, we can check a few quality-control +
To mitigate these problems, we can check a few quality control (QC) metrics and, if needed, remove low-quality samples.
-Choice of quality-control metrics
-There are many possible ways to define a set of quality-control +
Choice of quality control metrics
+There are many possible ways to define a set of quality control metrics, see for instance Cole 2019. Here, we keep it simple and consider only:
- the library size, defined as the total sum of counts across -all relevant features for each cell;
+all relevant features for each cell;- the number of expressed features in each cell, defined as the number of endogenous genes with non-zero counts for that cell;
- the proportion of reads mapped to genes in the mitochondrial @@ -616,20 +633,21 @@
R keytype = "GENEID", column = "SEQNAME") -is.mito <- which(chr.loc=="MT") +is.mito <- which(chr.loc == "MT")
We can use the
scuttle
package to compute a set of -quality-control metrics, specifying that we want to use the +quality control metrics, specifying that we want to use the mitochondrial genes as a special set of features.R
library(scuttle) -df <- perCellQCMetrics(sce, subsets=list(Mito=is.mito)) -# include them in the object +df <- perCellQCMetrics(sce, subsets = list(Mito = is.mito)) + colData(sce) <- cbind(colData(sce), df) + colData(sce)
@@ -728,7 +746,8 @@OUTPUT<
R
-reasons <- perCellQCFilters(df, sub.fields="subsets_Mito_percent") +
reasons <- perCellQCFilters(df, sub.fields = "subsets_Mito_percent") + reasons
@@ -772,7 +791,7 @@Challenge
-+R @@ -808,19 +827,22 @@
R
library(scater) -plotColData(sce, y = "sum", colour_by = "discard") + ggtitle("Total count")
+plotColData(sce, y = "sum", colour_by = "discard") + + labs(title = "Total count")R
-+plotColData(sce, y = "detected", colour_by = "discard") + ggtitle("Detected features")
plotColData(sce, y = "detected", colour_by = "discard") + + labs(title = "Detected features")
R
-+plotColData(sce, y = "subsets_Mito_percent", colour_by = "discard") + ggtitle("Mito percent")
plotColData(sce, y = "subsets_Mito_percent", colour_by = "discard") + + labs(title = "Mito percent")
While the univariate distribution of QC metrics can give some insight on the quality of the sample, often looking at the bivariate @@ -832,7 +854,7 @@
RR
-+plotColData(sce, x="sum", y="subsets_Mito_percent", colour_by="discard")
plotColData(sce, x ="sum", y = "subsets_Mito_percent", colour_by = "discard")
It could also be a good idea to perform a differential expression analysis between retained and discarded cells to check wether we are @@ -901,6 +923,7 @@
R
lib.sf <- librarySizeFactors(sce) + summary(lib.sf)
@@ -946,8 +969,11 @@R
library(scran) + set.seed(100) + clust <- quickCluster(sce) + table(clust)
@@ -961,7 +987,8 @@OUTPUT<
R
-deconv.sf <- calculateSumFactors(sce, cluster=clust) +
deconv.sf <- calculateSumFactors(sce, cluster = clust) + summary(deconv.sf)
@@ -975,6 +1002,7 @@R
sf_df$deconv_sf = deconv.sf + sf_df$clust = clust ggplot(sf_df, aes(size_factor, deconv_sf)) + @@ -994,7 +1022,9 @@
R
sizeFactors(sce) <- deconv.sf + sce <- logNormCounts(sce) + sce
@@ -1033,9 +1063,9 @@Challenge
-+-Spike-ins are deliberately introduced exogeneous RNA from an exotic +
Spike-ins are deliberately-introduced exogeneous RNA from an exotic or synthetic source at a known concentration. This provides a known signal to normalize to. Exotic or synthetic RNA (e.g. soil bacteria RNA in a study of human cells) is used in order to avoid confusing spike-in @@ -1060,9 +1090,11 @@
Show me the solution
Quantifying per-gene variation
The simplest approach to feature selection is to select the most variable genes based on their log-normalized expression across the -population.
+population. This is motivated by practical idea that if we’re going to +try to explain variation in gene expression by biological factors, those +genes need to have variance to explain.Calculation of the per-gene variance is simple but feature selection -requires modelling of the mean-variance relationship. The +requires modeling of the mean-variance relationship. The log-transformation is not a variance stabilizing transformation in most cases, which means that the total variance of a gene is driven more by its abundance than its underlying biological heterogeneity. To account @@ -1073,10 +1105,11 @@
R
dec.sce <- modelGeneVar(sce) + fit.sce <- metadata(dec.sce) mean_var_df = data.frame(mean = fit.sce$mean, - var = fit.sce$var) + var = fit.sce$var) ggplot(mean_var_df, aes(mean, var)) + geom_point() + @@ -1095,17 +1128,18 @@
Selecting highly variable genes\(n\) -genes, here we chose \(n=1000\), but +genes, here we chose \(n = 1000\), but this choice should be guided by prior biological knowledge; for instance, we may expect that only about 10% of genes to be differentially expressed across our cell populations and hence select 10% of genes as higly variable (e.g., by setting -
prop=0.1
). +prop = 0.1
).R
-hvg.sce.var <- getTopHVGs(dec.sce, n=1000) +
hvg.sce.var <- getTopHVGs(dec.sce, n = 1000) + head(hvg.sce.var)
@@ -1166,7 +1200,8 @@Principal Component Analysis (PCA)R
-sce <- runPCA(sce, subset_row=hvg.sce.var) +
sce <- runPCA(sce, subset_row = hvg.sce.var) + sce
@@ -1187,7 +1222,9 @@OUTPUT< altExpNames(0):
By default,
+components. We can check how much original variability they explain. +These values are stored in the attributes of therunPCA
computes the first 50 principal -components. We can check how much original variability they explain.percentVar
+reducedDim:R
@@ -1210,7 +1247,7 @@RR
-+plotPCA(sce, colour_by="sum")
plotPCA(sce, colour_by = "sum")
It can be helpful to compare pairs of PCs. This can be done with the
ncomponents
argument toplotReducedDim()
. For @@ -1220,7 +1257,7 @@RR
-+plotReducedDim(sce, dimred="PCA", ncomponents=3)
plotReducedDim(sce, dimred = "PCA", ncomponents = 3)
@@ -1237,7 +1274,9 @@R
set.seed(100) -sce <- runTSNE(sce, dimred="PCA") + +sce <- runTSNE(sce, dimred = "PCA") + plotTSNE(sce)
@@ -1245,7 +1284,9 @@R
set.seed(111) -sce <- runUMAP(sce, dimred="PCA") + +sce <- runUMAP(sce, dimred = "PCA") + plotUMAP(sce)
It is easy to over-interpret t-SNE and UMAP plots. We note that the @@ -1308,7 +1349,7 @@
Challenge
-+Mathematically, this would require the data to fall on a two-dimensional plane (for linear methods like PCA) or a smooth 2D @@ -1354,17 +1395,18 @@
Computing doublet densities
R
library(scDblFinder) + set.seed(100) -dbl.dens <- computeDoubletDensity(sce, subset.row=hvg.sce.var, - d=ncol(reducedDim(sce))) +dbl.dens <- computeDoubletDensity(sce, subset.row = hvg.sce.var, + d = ncol(reducedDim(sce))) summary(dbl.dens)
@@ -1379,7 +1421,7 @@R
sce$DoubletScore <- dbl.dens -plotTSNE(sce, colour_by="DoubletScore")
+plotTSNE(sce, colour_by = "DoubletScore")We can explicitly convert this into doublet calls by identifying large outliers for the score within each sample. Here we use the @@ -1388,8 +1430,9 @@
RR
-dbl.calls <- doubletThresholding(data.frame(score=dbl.dens), - method="griffiths", returnType="call") +
dbl.calls <- doubletThresholding(data.frame(score = dbl.dens), + method = "griffiths", + returnType = "call") summary(dbl.calls)
@@ -1403,13 +1446,14 @@R
+ +plotColData(sce, y = "DoubletScore", colour_by = "doublet")sce$doublet <- dbl.calls -plotColData(sce, y="DoubletScore", colour_by="doublet")
R
-+plotTSNE(sce, colour_by="doublet")
plotTSNE(sce, colour_by = "doublet")
One way to determine whether a cell is in a real transient state or it is a doublet is to check the number of detected genes and total UMI @@ -1468,12 +1512,12 @@
Exercise 2: Dimensionality Reduction
-+diff --git a/hca.html b/hca.html index 19b0c8f..5497ff8 100644 --- a/hca.html +++ b/hca.html @@ -788,7 +788,7 @@First subset the object to include only highly variable genes (
+sce2 <- sce[hvg.sce.var,]
) and then apply theNewWave
function to the new object setting -K=10
to obtain the first 10 dimensions.K = 10
to obtain the first 10 dimensions.Exercise 1
-+R @@ -821,7 +821,7 @@
Exercise 2
-+R @@ -853,7 +853,7 @@
Exercise 3
-+R @@ -894,7 +894,7 @@
Exercise 4
-+R diff --git a/instructor/aio.html b/instructor/aio.html index 330241d..5fab750 100644 --- a/instructor/aio.html +++ b/instructor/aio.html @@ -438,8 +438,8 @@
Rinstall.packages("ggplot2")
In our case, however, we want to install Bioconductor packages. These -packages are located in a separate repository (see comments below) so we -first install the BiocManager +packages are located in a separate repository hosted by Bioconductor, so +we first install the BiocManager package to easily connect to the Bioconductor servers.
@@ -1007,6 +1007,23 @@R @@ -898,7 +898,7 @@
Overview
Questions
@@ -945,7 +945,7 @@
- How do I examine the quality of single-cell data?
-- What data visualizations should I use during quality-control in a +
- What data visualizations should I use during quality control in a single-cell analysis?
- How do I prepare single-cell data for analysis?
R
library(MouseGastrulationData) -sce <- WTChimeraData(samples=5, type="raw") +sce <- WTChimeraData(samples = 5, type = "raw") sce <- sce[[1]] sce
R +
++ ++++Callout
+++Depending on your data source, identifying and discarding empty +droplets may not be necessary. Some academic institutions have research +cores dedicated to single cell work that perform the sample preparation +and sequencing. Many of these cores will also perform empty droplet +filtering and other initial QC steps. If the sequencing outputs were +provided to you by someone else, make sure to communicate with them +about what pre-processing steps have been performed, if any.
+@@ -1023,7 +1040,7 @@Challenge
-+R @@ -1103,7 +1120,7 @@
OUTPUT< -
+Whenever your code involves the generation of random numbers, it’s a good practice to set the random seed in R with @@ -1134,17 +1151,17 @@
contain contaminating transcripts from ambient RNA -To mitigate these problems, we can check a few quality-control +
To mitigate these problems, we can check a few quality control (QC) metrics and, if needed, remove low-quality samples.
-Choice of quality-control metrics +
Choice of quality control metrics
-There are many possible ways to define a set of quality-control +
There are many possible ways to define a set of quality control metrics, see for instance Cole 2019. Here, we keep it simple and consider only:
- the library size, defined as the total sum of counts across -all relevant features for each cell;
+all relevant features for each cell;- the number of expressed features in each cell, defined as the number of endogenous genes with non-zero counts for that cell;
- the proportion of reads mapped to genes in the mitochondrial @@ -1178,20 +1195,21 @@
R keytype = "GENEID", column = "SEQNAME") -is.mito <- which(chr.loc=="MT") +is.mito <- which(chr.loc == "MT")
We can use the
scuttle
package to compute a set of -quality-control metrics, specifying that we want to use the +quality control metrics, specifying that we want to use the mitochondrial genes as a special set of features.R
library(scuttle) -df <- perCellQCMetrics(sce, subsets=list(Mito=is.mito)) -# include them in the object +df <- perCellQCMetrics(sce, subsets = list(Mito = is.mito)) + colData(sce) <- cbind(colData(sce), df) + colData(sce)
@@ -1290,7 +1308,8 @@OUTPUT<
R
-reasons <- perCellQCFilters(df, sub.fields="subsets_Mito_percent") +
reasons <- perCellQCFilters(df, sub.fields = "subsets_Mito_percent") + reasons
@@ -1334,7 +1353,7 @@Challenge
-+R @@ -1371,19 +1390,22 @@
R
library(scater) -plotColData(sce, y = "sum", colour_by = "discard") + ggtitle("Total count")
+plotColData(sce, y = "sum", colour_by = "discard") + + labs(title = "Total count")R
-+plotColData(sce, y = "detected", colour_by = "discard") + ggtitle("Detected features")
plotColData(sce, y = "detected", colour_by = "discard") + + labs(title = "Detected features")
R
-+plotColData(sce, y = "subsets_Mito_percent", colour_by = "discard") + ggtitle("Mito percent")
plotColData(sce, y = "subsets_Mito_percent", colour_by = "discard") + + labs(title = "Mito percent")
While the univariate distribution of QC metrics can give some insight on the quality of the sample, often looking at the bivariate @@ -1395,7 +1417,7 @@
RR
-+plotColData(sce, x="sum", y="subsets_Mito_percent", colour_by="discard")
plotColData(sce, x ="sum", y = "subsets_Mito_percent", colour_by = "discard")
It could also be a good idea to perform a differential expression analysis between retained and discarded cells to check wether we are @@ -1466,6 +1488,7 @@
R
lib.sf <- librarySizeFactors(sce) + summary(lib.sf)
@@ -1512,8 +1535,11 @@R
library(scran) + set.seed(100) + clust <- quickCluster(sce) + table(clust)
@@ -1527,7 +1553,8 @@OUTPUT<
R
-deconv.sf <- calculateSumFactors(sce, cluster=clust) +
deconv.sf <- calculateSumFactors(sce, cluster = clust) + summary(deconv.sf)
@@ -1541,6 +1568,7 @@R
sf_df$deconv_sf = deconv.sf + sf_df$clust = clust ggplot(sf_df, aes(size_factor, deconv_sf)) + @@ -1560,7 +1588,9 @@
R
sizeFactors(sce) <- deconv.sf + sce <- logNormCounts(sce) + sce
@@ -1599,9 +1629,9 @@Challenge
-+@@ -1812,7 +1849,9 @@-Spike-ins are deliberately introduced exogeneous RNA from an exotic +
Spike-ins are deliberately-introduced exogeneous RNA from an exotic or synthetic source at a known concentration. This provides a known signal to normalize to. Exotic or synthetic RNA (e.g. soil bacteria RNA in a study of human cells) is used in order to avoid confusing spike-in @@ -1629,9 +1659,11 @@
Quantifying per-gene variationR
dec.sce <- modelGeneVar(sce) + fit.sce <- metadata(dec.sce) mean_var_df = data.frame(mean = fit.sce$mean, - var = fit.sce$var) + var = fit.sce$var) ggplot(mean_var_df, aes(mean, var)) + geom_point() + @@ -1665,17 +1698,18 @@
Selecting highly variable genes\(n\) -genes, here we chose \(n=1000\), but +genes, here we chose \(n = 1000\), but this choice should be guided by prior biological knowledge; for instance, we may expect that only about 10% of genes to be differentially expressed across our cell populations and hence select 10% of genes as higly variable (e.g., by setting -
prop=0.1
). +prop = 0.1
).R
-hvg.sce.var <- getTopHVGs(dec.sce, n=1000) +
hvg.sce.var <- getTopHVGs(dec.sce, n = 1000) + head(hvg.sce.var)
@@ -1739,7 +1773,8 @@Principal Component Analysis (PCA)R
-sce <- runPCA(sce, subset_row=hvg.sce.var) +
sce <- runPCA(sce, subset_row = hvg.sce.var) + sce
@@ -1760,7 +1795,9 @@OUTPUT< altExpNames(0):
By default,
+components. We can check how much original variability they explain. +These values are stored in the attributes of therunPCA
computes the first 50 principal -components. We can check how much original variability they explain.percentVar
+reducedDim:R
@@ -1783,7 +1820,7 @@RR
-+plotPCA(sce, colour_by="sum")
plotPCA(sce, colour_by = "sum")
It can be helpful to compare pairs of PCs. This can be done with the
ncomponents
argument toplotReducedDim()
. For @@ -1793,7 +1830,7 @@RR
-+plotReducedDim(sce, dimred="PCA", ncomponents=3)
plotReducedDim(sce, dimred = "PCA", ncomponents = 3)
R
set.seed(100) -sce <- runTSNE(sce, dimred="PCA") + +sce <- runTSNE(sce, dimred = "PCA") + plotTSNE(sce)
@@ -1820,7 +1859,9 @@R
set.seed(111) -sce <- runUMAP(sce, dimred="PCA") + +sce <- runUMAP(sce, dimred = "PCA") + plotUMAP(sce)
It is easy to over-interpret t-SNE and UMAP plots. We note that the @@ -1883,7 +1924,7 @@
Challenge
-+Mathematically, this would require the data to fall on a two-dimensional plane (for linear methods like PCA) or a smooth 2D @@ -1934,17 +1975,18 @@
Computing doublet densities
R
library(scDblFinder) + set.seed(100) -dbl.dens <- computeDoubletDensity(sce, subset.row=hvg.sce.var, - d=ncol(reducedDim(sce))) +dbl.dens <- computeDoubletDensity(sce, subset.row = hvg.sce.var, + d = ncol(reducedDim(sce))) summary(dbl.dens)
@@ -1959,7 +2001,7 @@R
sce$DoubletScore <- dbl.dens -plotTSNE(sce, colour_by="DoubletScore")
+plotTSNE(sce, colour_by = "DoubletScore")We can explicitly convert this into doublet calls by identifying large outliers for the score within each sample. Here we use the @@ -1968,8 +2010,9 @@
RR
-dbl.calls <- doubletThresholding(data.frame(score=dbl.dens), - method="griffiths", returnType="call") +
dbl.calls <- doubletThresholding(data.frame(score = dbl.dens), + method = "griffiths", + returnType = "call") summary(dbl.calls)
@@ -1983,13 +2026,14 @@R
+ +plotColData(sce, y = "DoubletScore", colour_by = "doublet")sce$doublet <- dbl.calls -plotColData(sce, y="DoubletScore", colour_by="doublet")
R
-+plotTSNE(sce, colour_by="doublet")
plotTSNE(sce, colour_by = "doublet")
One way to determine whether a cell is in a real transient state or it is a doublet is to check the number of detected genes and total UMI @@ -2050,12 +2094,12 @@
Exercise 2: Dimensionality Reduction
-+@@ -2328,7 +2372,7 @@First subset the object to include only highly variable genes (
+sce2 <- sce[hvg.sce.var,]
) and then apply theNewWave
function to the new object setting -K=10
to obtain the first 10 dimensions.K = 10
to obtain the first 10 dimensions.Challenge
-+We see in the help documentation for
?clusterCells
that all of the clustering algorithm details are handled through the @@ -2479,7 +2523,7 @@Challenge
-+One important reason why is because averages over all other clusters can be sensitive to the cell type composition. If a rare cell type shows @@ -2773,7 +2817,7 @@
Challenge
-+Use
BiocParallel
and theBPPARAM
argument! This example will set it to use four cores on your laptop, but you can @@ -3232,7 +3276,7 @@Challenge
-+The example that jumps out most strongly to the eye is ExE endoderm, which doesn’t show clear separate modes. Simultaneously, Endothelium @@ -3381,7 +3425,7 @@
Exercise 1: Clustering
-+The
NNGraphParam
constructor has an argumentcluster.args
. This allows to specify arguments passed on to @@ -3398,7 +3442,7 @@Give me a hint
-+@@ -3427,7 +3471,7 @@TODO
Exercise 2: Cluster annotation
-+Use the
goana()
function from the limma package to identify GO BP terms that are overrepresented in the list of @@ -3441,7 +3485,7 @@Give me a hint
-+@@ -3471,7 +3515,7 @@TODO
Exercise 3: Workflow
-+@@ -5121,7 +5165,7 @@TODO
Challenge
-+From
?MulticoreParam
:@@ -5337,7 +5381,7 @@Challenge
-+This code block calculates the exact PCA coordinates. Another thing to note: PC vectors are only identified up to a sign flip. We can see @@ -5734,7 +5778,7 @@
Exercise 1: Out of memory representation
-+See the
HDF5Array
function for reading from HDF5 and thewriteHDF5Array
function for writing to HDF5 from the HDF5Array @@ -5748,7 +5792,7 @@Give me a hint
-+R @@ -5827,7 +5871,7 @@
Exercise 2: Parallelization
-+Use the function
@@ -5840,7 +5884,7 @@system.time
to obtain the runtime of each job.Give me a hint
-+R @@ -6410,7 +6454,7 @@
Exercise 1
-+R @@ -6443,7 +6487,7 @@
Exercise 2
-+R @@ -6475,7 +6519,7 @@
Exercise 3
-+R @@ -6516,7 +6560,7 @@
Exercise 4
-+R diff --git a/instructor/cell_type_annotation.html b/instructor/cell_type_annotation.html index 685cabc..c42eb17 100644 --- a/instructor/cell_type_annotation.html +++ b/instructor/cell_type_annotation.html @@ -509,7 +509,7 @@
Challenge
-+We see in the help documentation for
?clusterCells
that all of the clustering algorithm details are handled through the @@ -658,7 +658,7 @@Challenge
-+One important reason why is because averages over all other clusters can be sensitive to the cell type composition. If a rare cell type shows @@ -949,7 +949,7 @@
Challenge
-+Use
BiocParallel
and theBPPARAM
argument! This example will set it to use four cores on your laptop, but you can @@ -1407,7 +1407,7 @@Challenge
-+The example that jumps out most strongly to the eye is ExE endoderm, which doesn’t show clear separate modes. Simultaneously, Endothelium @@ -1552,7 +1552,7 @@
Exercise 1: Clustering
-+The
NNGraphParam
constructor has an argumentcluster.args
. This allows to specify arguments passed on to @@ -1569,7 +1569,7 @@Give me a hint
-+@@ -1598,7 +1598,7 @@TODO
Exercise 2: Cluster annotation
-+