From a588cf2e046add326405ee399cd955d2ea5a8f98 Mon Sep 17 00:00:00 2001
From: Andrew Ghazi <6763470+andrewGhazi@users.noreply.github.com>
Date: Mon, 30 Sep 2024 10:14:51 -0400
Subject: [PATCH] answer end exercises, remove workflow ex

---
 episodes/cell_type_annotation.Rmd | 69 +++++++++++++++++--------------
 1 file changed, 37 insertions(+), 32 deletions(-)

diff --git a/episodes/cell_type_annotation.Rmd b/episodes/cell_type_annotation.Rmd
index dec0f70..e8cbaac 100644
--- a/episodes/cell_type_annotation.Rmd
+++ b/episodes/cell_type_annotation.Rmd
@@ -572,47 +572,52 @@ of 0.5.
 :::
 
 ::: solution
-TODO
-:::
-:::
-
-::: challenge
-#### Exercise 2: Cluster annotation
+```{r}
+arg_list <- list(objective_function = "modularity",
+                 resolution_parameter = .5)
 
-Another strategy for annotating the clusters is to perform a gene set
-enrichment analysis on the marker genes defining each cluster. This
-identifies the pathways and processes that are (relatively) active in
-each cluster based on upregulation of the associated genes compared to
-other clusters. Focus on the top 100 up-regulated genes in a cluster of
-your choice and perform a gene set enrichment analysis of biological
-process (BP) gene sets from the Gene Ontology (GO).
+sce$leiden_clust <- clusterCells(sce, use.dimred = "PCA",
+                               BLUSPARAM = NNGraphParam(cluster.fun = "leiden",
+                                                        cluster.args = arg_list))
 
-::: hint
-Use the `goana()` function from the `r Biocpkg("limma")` package to
-identify GO BP terms that are overrepresented in the list of marker
-genes.
-:::
+plotReducedDim(sce, "UMAP", color_by = "leiden_clust")
+```
 
-::: solution
-TODO
 :::
 :::
 
 ::: challenge
-#### Exercise 3: Workflow
-
-The [scRNAseq](https://bioconductor.org/packages/scRNAseq) package
-provides gene-level counts for a collection of public scRNA-seq
-datasets, stored as `SingleCellExperiment` objects with annotated cell-
-and gene-level metadata. Consult the vignette of the
-[scRNAseq](https://bioconductor.org/packages/scRNAseq) package to
-inspect all available datasets and select a dataset of your choice.
-Perform a typical scRNA-seq analysis on this dataset including QC,
-normalization, feature selection, dimensionality reduction, clustering,
-and marker gene detection.
+#### Exercise 2: Reference marker genes
+
+Identify the marker genes in the reference single cell experiment, using the `celltype` labels that come with the dataset as the groups. Compare the top 100 marker genes of two cell types that are close in UMAP space. Do they share similar marker sets?
 
 ::: solution
-TODO
+
+```{r}
+markers <- scoreMarkers(ref, groups = ref$celltype)
+
+markers
+
+# It comes with UMAP precomputed too
+plotReducedDim(ref, dimred = "umap", color_by = "celltype")
+
+# Repetitive work -> write a function
+order_marker_df <- function(m_df, n = 100) {
+  
+  ord <- order(m_df$mean.AUC, decreasing = TRUE)
+  
+  head(rownames(m_df)[ord], n) # head() avoids NA rows if n exceeds nrow(m_df)
+}
+
+x <- order_marker_df(markers[["Erythroid2"]])
+
+y <- order_marker_df(markers[["Erythroid3"]])
+
+length(intersect(x,y)) / 100
+```
+
+There is substantial overlap between the top marker genes of `Erythroid2` and `Erythroid3`. It would also be interesting to plot the expression of the genes in the set difference to confirm that these are the genes that distinguish the two types from each other.
+
 :::
 :::