diff --git a/fig/multi-sample-rendered-unnamed-chunk-10-1.png b/fig/multi-sample-rendered-unnamed-chunk-10-1.png
index 27398d6..1591c64 100644
Binary files a/fig/multi-sample-rendered-unnamed-chunk-10-1.png and b/fig/multi-sample-rendered-unnamed-chunk-10-1.png differ
diff --git a/fig/multi-sample-rendered-unnamed-chunk-11-1.png b/fig/multi-sample-rendered-unnamed-chunk-11-1.png
new file mode 100644
index 0000000..27398d6
Binary files /dev/null and b/fig/multi-sample-rendered-unnamed-chunk-11-1.png differ
diff --git a/fig/multi-sample-rendered-unnamed-chunk-14-1.png b/fig/multi-sample-rendered-unnamed-chunk-14-1.png
new file mode 100644
index 0000000..057948b
Binary files /dev/null and b/fig/multi-sample-rendered-unnamed-chunk-14-1.png differ
diff --git a/fig/multi-sample-rendered-unnamed-chunk-16-1.png b/fig/multi-sample-rendered-unnamed-chunk-16-1.png
new file mode 100644
index 0000000..b92deff
Binary files /dev/null and b/fig/multi-sample-rendered-unnamed-chunk-16-1.png differ
diff --git a/fig/multi-sample-rendered-unnamed-chunk-2-2.png b/fig/multi-sample-rendered-unnamed-chunk-2-2.png
index ddb5f7b..3d6373a 100644
Binary files a/fig/multi-sample-rendered-unnamed-chunk-2-2.png and b/fig/multi-sample-rendered-unnamed-chunk-2-2.png differ
diff --git a/fig/multi-sample-rendered-unnamed-chunk-24-1.png b/fig/multi-sample-rendered-unnamed-chunk-24-1.png
index 508457f..e0aea89 100644
Binary files a/fig/multi-sample-rendered-unnamed-chunk-24-1.png and b/fig/multi-sample-rendered-unnamed-chunk-24-1.png differ
diff --git a/fig/multi-sample-rendered-unnamed-chunk-24-2.png b/fig/multi-sample-rendered-unnamed-chunk-24-2.png
index 1220eb3..3396154 100644
Binary files a/fig/multi-sample-rendered-unnamed-chunk-24-2.png and b/fig/multi-sample-rendered-unnamed-chunk-24-2.png differ
diff --git a/fig/multi-sample-rendered-unnamed-chunk-4-1.png b/fig/multi-sample-rendered-unnamed-chunk-4-1.png
new file mode 100644
index 0000000..734e4cf
Binary files /dev/null and b/fig/multi-sample-rendered-unnamed-chunk-4-1.png differ
diff --git a/md5sum.txt b/md5sum.txt
index dadd932..a083d42 100644
--- a/md5sum.txt
+++ b/md5sum.txt
@@ -7,7 +7,7 @@
"episodes/intro-sce.Rmd" "88de9550fb00214022d4d0ada77a964b" "site/built/intro-sce.md" "2024-11-11"
"episodes/eda_qc.Rmd" "1ea032eab862e75d5f3adefa23197a3d" "site/built/eda_qc.md" "2024-11-11"
"episodes/cell_type_annotation.Rmd" "68a299eb32d85b6a30af9215e7a941ca" "site/built/cell_type_annotation.md" "2024-11-11"
-"episodes/multi-sample.Rmd" "4422d860318f365ac88f7d7f0253b47a" "site/built/multi-sample.md" "2024-11-11"
+"episodes/multi-sample.Rmd" "bba478a623adc5dcfd260ae0529cc49c" "site/built/multi-sample.md" "2024-11-12"
"episodes/large_data.Rmd" "f64ef24d0547fa7a29c3a57f100f77ab" "site/built/large_data.md" "2024-11-11"
"episodes/hca.Rmd" "873df251787c01aff0a1e7671f463880" "site/built/hca.md" "2024-11-11"
"instructors/instructor-notes.md" "79f31f78e0c09e7771975b1a14d6cd08" "site/built/instructor-notes.md" "2024-10-11"
diff --git a/multi-sample.md b/multi-sample.md
index 5ae7311..d672ae6 100644
--- a/multi-sample.md
+++ b/multi-sample.md
@@ -34,7 +34,7 @@ As before, we will use the the wild-type data from the Tal1 chimera experiment:
Note that this is a paired design in which for each biological replicate (pool 3, 4, and 5), we have both host and injected cells.
-We start by loading the data and doing a quick exploratory analysis, essentially applying the normalization and visualization techniques that we have seen in the previous lectures to all samples. Note that this time we're selecting samples 5 to 10, not just 5 by itself.
+We start by loading the data and doing a quick exploratory analysis, essentially applying the normalization and visualization techniques that we have seen in the previous lectures to all samples. Note that this time we're selecting samples 5 to 10, not just 5 by itself. Also note the `type = "processed"` argument: we are explicitly selecting the version of the data that has already been QC processed.
@@ -116,7 +116,7 @@ cell_30702 0.00108837 0.550807
cell_30703 0.82369305 1.184919
```
-To speed up computations, after removing doublets, we randomly select 50% cells per sample.
+For the sake of making these examples run faster, we drop some problematic types (stripped nuclei and doublets) and also randomly select 50% cells per sample.
``` r
@@ -134,7 +134,7 @@ idx <- unlist(tapply(colnames(sce), sce$sample, function(x) {
sce <- sce[,idx]
```
-We now normalize the data, run some dimensionality reduction steps, and visualize them in a tSNE plot.
+We now normalize the data, run some dimensionality reduction steps, and visualize them in a tSNE plot. In this case we happen to have a ton of cell types to visualize, so we define a custom palette with a lot of visually distinct colors (adapted from the `polychrome` palette in the [`pals` package](https://cran.r-project.org/web/packages/pals/vignettes/pals_examples.html)).
``` r
@@ -156,8 +156,15 @@ plotTSNE(sce, colour_by = "sample")
``` r
+color_vec <- c("#5A5156", "#E4E1E3", "#F6222E", "#FE00FA", "#16FF32", "#3283FE",
+ "#FEAF16", "#B00068", "#1CFFCE", "#90AD1C", "#2ED9FF", "#DEA0FD",
+ "#AA0DFE", "#F8A19F", "#325A9B", "#C4451C", "#1C8356", "#85660D",
+ "#B10DA1", "#3B00FB", "#1CBE4F", "#FA0087", "#333333", "#F7E1A0",
+ "#C075A6", "#782AB6", "#AAF400", "#BDCDFF", "#822E1C", "#B5EFB5",
+ "#7ED7D1", "#1C7F93", "#D85FF7", "#683B79", "#66B0FF", "#FBE426")
+
plotTSNE(sce, colour_by = "celltype.mapped") +
- scale_color_discrete() +
+ scale_color_manual(values = color_vec) +
theme(legend.position = "bottom")
```
@@ -208,6 +215,18 @@ plotTSNE(merged, colour_by = "batch")
+We can also see that when coloring by cell type, the cell types are now nicely confined to their own clusters for the most part:
+
+
+``` r
+plotTSNE(merged, colour_by = "celltype.mapped") +
+ scale_color_manual(values = color_vec) +
+ theme(legend.position = "bottom")
+```
+
+
+
+
Once we removed the sample batch effect, we can proceed with the Differential
Expression Analysis.
@@ -424,7 +443,7 @@ for (i in seq_len(ncol(y))) {
}
```
-
+
``` r
par(mfrow = c(1,1))
@@ -441,7 +460,7 @@ limma::plotMDS(cpm(y, log = TRUE),
col = ifelse(y$samples$tomato, "red", "blue"))
```
-
+
We then construct a design matrix by including both the pool and the tomato as factors.
This design indicates which samples belong to which pool and condition, so we can
@@ -496,7 +515,7 @@ Additionally, the Common and Trend BCV are shown in `red` and `blue`.
plotBCV(y)
```
-
+
We then fit a Quasi-Likelihood (QL) negative binomial generalized linear model for each gene.
The `robust = TRUE` parameter avoids distortions from highly variable clusters.
@@ -530,7 +549,7 @@ QL dispersion estimates for each gene as a function of abundance. Raw estimates
plotQLDisp(fit)
```
-
+
We then use an empirical Bayes quasi-likelihood F-test to test for differential expression (due to tomato injection) per each gene at a False Discovery Rate (FDR) of 5%.
The low amount of DGEs highlights that the tomato injection effect has a low
@@ -814,7 +833,7 @@ Use the `pheatmap` package to create a heatmap of the abundances table. Does it
:::::::::::::: hint
-You can simply hand `pheatmap()` a matrix as its only argument. `pheatmap()` has a million options you can tweak, but the defaults are usually pretty good.
+You can simply hand `pheatmap()` a matrix as its only argument. `pheatmap()` has a million options you can adjust, but the defaults are usually pretty good. Try to overlay sample-level information with the `annotation_col` argument for an extra challenge.
:::::::::::::::::::::::
@@ -825,7 +844,22 @@ You can simply hand `pheatmap()` a matrix as its only argument. `pheatmap()` has
pheatmap(y.ab$counts)
```
-
+
+
+``` r
+anno_df <- y.ab$samples[,c("tomato", "pool")]
+
+anno_df$pool = as.character(anno_df$pool)
+
+anno_df$tomato <- ifelse(anno_df$tomato,
+ "tomato+",
+ "tomato-")
+
+pheatmap(y.ab$counts,
+ annotation_col = anno_df)
+```
+
+
The top DA result was a decrease in ExE ectoderm in the tomato condition, which you can sort of see, especially if you `log1p()` the counts or discard rows that show much higher values. ExE ectoderm counts were much higher in samples 8 and 10 compared to 5, 7, and 9.
@@ -875,7 +909,7 @@ ggplot(comp_df, aes(logFC.x, logFC.y)) +
geom_point()
```
-
+
``` r
# Reshape to long format for ggplot facets. This is 1000x times easier to do
@@ -893,7 +927,7 @@ ggplot(pval_df, aes(Pvalue)) +
facet_wrap("pool_factor")
```
-
+
We can see that in this case, the logFC estimates are strongly consistent between the two models, which tells us that the inclusion of the `pool` factor in the model doesn't strongly influence the estimate of the `tomato` coefficients in this case.