diff --git a/fig/multi-sample-rendered-unnamed-chunk-10-1.png b/fig/multi-sample-rendered-unnamed-chunk-10-1.png index 27398d6..1591c64 100644 Binary files a/fig/multi-sample-rendered-unnamed-chunk-10-1.png and b/fig/multi-sample-rendered-unnamed-chunk-10-1.png differ diff --git a/fig/multi-sample-rendered-unnamed-chunk-11-1.png b/fig/multi-sample-rendered-unnamed-chunk-11-1.png new file mode 100644 index 0000000..27398d6 Binary files /dev/null and b/fig/multi-sample-rendered-unnamed-chunk-11-1.png differ diff --git a/fig/multi-sample-rendered-unnamed-chunk-14-1.png b/fig/multi-sample-rendered-unnamed-chunk-14-1.png new file mode 100644 index 0000000..057948b Binary files /dev/null and b/fig/multi-sample-rendered-unnamed-chunk-14-1.png differ diff --git a/fig/multi-sample-rendered-unnamed-chunk-16-1.png b/fig/multi-sample-rendered-unnamed-chunk-16-1.png new file mode 100644 index 0000000..b92deff Binary files /dev/null and b/fig/multi-sample-rendered-unnamed-chunk-16-1.png differ diff --git a/fig/multi-sample-rendered-unnamed-chunk-2-2.png b/fig/multi-sample-rendered-unnamed-chunk-2-2.png index ddb5f7b..3d6373a 100644 Binary files a/fig/multi-sample-rendered-unnamed-chunk-2-2.png and b/fig/multi-sample-rendered-unnamed-chunk-2-2.png differ diff --git a/fig/multi-sample-rendered-unnamed-chunk-24-1.png b/fig/multi-sample-rendered-unnamed-chunk-24-1.png index 508457f..e0aea89 100644 Binary files a/fig/multi-sample-rendered-unnamed-chunk-24-1.png and b/fig/multi-sample-rendered-unnamed-chunk-24-1.png differ diff --git a/fig/multi-sample-rendered-unnamed-chunk-24-2.png b/fig/multi-sample-rendered-unnamed-chunk-24-2.png index 1220eb3..3396154 100644 Binary files a/fig/multi-sample-rendered-unnamed-chunk-24-2.png and b/fig/multi-sample-rendered-unnamed-chunk-24-2.png differ diff --git a/fig/multi-sample-rendered-unnamed-chunk-4-1.png b/fig/multi-sample-rendered-unnamed-chunk-4-1.png new file mode 100644 index 0000000..734e4cf Binary files /dev/null and b/fig/multi-sample-rendered-unnamed-chunk-4-1.png differ diff --git a/md5sum.txt b/md5sum.txt index dadd932..a083d42 100644 --- a/md5sum.txt +++ b/md5sum.txt @@ -7,7 +7,7 @@ "episodes/intro-sce.Rmd" "88de9550fb00214022d4d0ada77a964b" "site/built/intro-sce.md" "2024-11-11" "episodes/eda_qc.Rmd" "1ea032eab862e75d5f3adefa23197a3d" "site/built/eda_qc.md" "2024-11-11" "episodes/cell_type_annotation.Rmd" "68a299eb32d85b6a30af9215e7a941ca" "site/built/cell_type_annotation.md" "2024-11-11" -"episodes/multi-sample.Rmd" "4422d860318f365ac88f7d7f0253b47a" "site/built/multi-sample.md" "2024-11-11" +"episodes/multi-sample.Rmd" "bba478a623adc5dcfd260ae0529cc49c" "site/built/multi-sample.md" "2024-11-12" "episodes/large_data.Rmd" "f64ef24d0547fa7a29c3a57f100f77ab" "site/built/large_data.md" "2024-11-11" "episodes/hca.Rmd" "873df251787c01aff0a1e7671f463880" "site/built/hca.md" "2024-11-11" "instructors/instructor-notes.md" "79f31f78e0c09e7771975b1a14d6cd08" "site/built/instructor-notes.md" "2024-10-11" diff --git a/multi-sample.md b/multi-sample.md index 5ae7311..d672ae6 100644 --- a/multi-sample.md +++ b/multi-sample.md @@ -34,7 +34,7 @@ As before, we will use the the wild-type data from the Tal1 chimera experiment: Note that this is a paired design in which for each biological replicate (pool 3, 4, and 5), we have both host and injected cells. -We start by loading the data and doing a quick exploratory analysis, essentially applying the normalization and visualization techniques that we have seen in the previous lectures to all samples. Note that this time we're selecting samples 5 to 10, not just 5 by itself. +We start by loading the data and doing a quick exploratory analysis, essentially applying the normalization and visualization techniques that we have seen in the previous lectures to all samples. Note that this time we're selecting samples 5 to 10, not just 5 by itself. Also note the `type = "processed"` argument: we are explicitly selecting the version of the data that has already been QC processed. @@ -116,7 +116,7 @@ cell_30702 0.00108837 0.550807 cell_30703 0.82369305 1.184919 ``` -To speed up computations, after removing doublets, we randomly select 50% cells per sample. +For the sake of making these examples run faster, we drop some problematic types (stripped nuclei and doublets) and also randomly select 50% cells per sample. ``` r @@ -134,7 +134,7 @@ idx <- unlist(tapply(colnames(sce), sce$sample, function(x) { sce <- sce[,idx] ``` -We now normalize the data, run some dimensionality reduction steps, and visualize them in a tSNE plot. +We now normalize the data, run some dimensionality reduction steps, and visualize them in a tSNE plot. In this case we happen to have a ton of cell types to visualize, so we define a custom palette with a lot of visually distinct colors (adapted from the `polychrome` palette in the [`pals` package](https://cran.r-project.org/web/packages/pals/vignettes/pals_examples.html)). ``` r @@ -156,8 +156,15 @@ plotTSNE(sce, colour_by = "sample") ``` r +color_vec <- c("#5A5156", "#E4E1E3", "#F6222E", "#FE00FA", "#16FF32", "#3283FE", + "#FEAF16", "#B00068", "#1CFFCE", "#90AD1C", "#2ED9FF", "#DEA0FD", + "#AA0DFE", "#F8A19F", "#325A9B", "#C4451C", "#1C8356", "#85660D", + "#B10DA1", "#3B00FB", "#1CBE4F", "#FA0087", "#333333", "#F7E1A0", + "#C075A6", "#782AB6", "#AAF400", "#BDCDFF", "#822E1C", "#B5EFB5", + "#7ED7D1", "#1C7F93", "#D85FF7", "#683B79", "#66B0FF", "#FBE426") + plotTSNE(sce, colour_by = "celltype.mapped") + - scale_color_discrete() + + scale_color_manual(values = color_vec) + theme(legend.position = "bottom") ``` @@ -208,6 +215,18 @@ plotTSNE(merged, colour_by = "batch") +We can also see that when coloring by cell type, the cell types are now nicely confined to their own clusters for the most part: + + +``` r +plotTSNE(merged, colour_by = "celltype.mapped") + + scale_color_manual(values = color_vec) + + theme(legend.position = "bottom") +``` + + + + Once we removed the sample batch effect, we can proceed with the Differential Expression Analysis. @@ -424,7 +443,7 @@ for (i in seq_len(ncol(y))) { } ``` - + ``` r par(mfrow = c(1,1)) @@ -441,7 +460,7 @@ limma::plotMDS(cpm(y, log = TRUE), col = ifelse(y$samples$tomato, "red", "blue")) ``` - + We then construct a design matrix by including both the pool and the tomato as factors. This design indicates which samples belong to which pool and condition, so we can @@ -496,7 +515,7 @@ Additionally, the Common and Trend BCV are shown in `red` and `blue`. plotBCV(y) ``` - + We then fit a Quasi-Likelihood (QL) negative binomial generalized linear model for each gene. The `robust = TRUE` parameter avoids distortions from highly variable clusters. @@ -530,7 +549,7 @@ QL dispersion estimates for each gene as a function of abundance. Raw estimates plotQLDisp(fit) ``` - + We then use an empirical Bayes quasi-likelihood F-test to test for differential expression (due to tomato injection) per each gene at a False Discovery Rate (FDR) of 5%. The low amount of DGEs highlights that the tomato injection effect has a low @@ -814,7 +833,7 @@ Use the `pheatmap` package to create a heatmap of the abundances table. Does it :::::::::::::: hint -You can simply hand `pheatmap()` a matrix as its only argument. `pheatmap()` has a million options you can tweak, but the defaults are usually pretty good. +You can simply hand `pheatmap()` a matrix as its only argument. `pheatmap()` has a million options you can adjust, but the defaults are usually pretty good. Try to overlay sample-level information with the `annotation_col` argument for an extra challenge. ::::::::::::::::::::::: @@ -825,7 +844,22 @@ You can simply hand `pheatmap()` a matrix as its only argument. `pheatmap()` has pheatmap(y.ab$counts) ``` - + + +``` r +anno_df <- y.ab$samples[,c("tomato", "pool")] + +anno_df$pool = as.character(anno_df$pool) + +anno_df$tomato <- ifelse(anno_df$tomato, + "tomato+", + "tomato-") + +pheatmap(y.ab$counts, + annotation_col = anno_df) +``` + + The top DA result was a decrease in ExE ectoderm in the tomato condition, which you can sort of see, especially if you `log1p()` the counts or discard rows that show much higher values. ExE ectoderm counts were much higher in samples 8 and 10 compared to 5, 7, and 9. @@ -875,7 +909,7 @@ ggplot(comp_df, aes(logFC.x, logFC.y)) + geom_point() ``` - + ``` r # Reshape to long format for ggplot facets. This is 1000x times easier to do @@ -893,7 +927,7 @@ ggplot(pval_df, aes(Pvalue)) + facet_wrap("pool_factor") ``` - + We can see that in this case, the logFC estimates are strongly consistent between the two models, which tells us that the inclusion of the `pool` factor in the model doesn't strongly influence the estimate of the `tomato` coefficients in this case.