Merge pull request #20 from ccb-hms/add_exercises
Add exercises
andrewGhazi authored Aug 26, 2024
2 parents 4426fa5 + 85de456 commit a189724
237 changes: 153 additions & 84 deletions episodes/large_data.Rmd

:::::::::::::::::::::::::::::::::::::: questions

- How do we work with single-cell datasets that are too large to fit in memory?
- How do we speed up single-cell analysis workflows for large datasets?
- How do we convert between popular single-cell data formats?

::::::::::::::::::::::::::::::::::::::::::::::::


```{r chunk-opts, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
options(digits = 3)
library(BiocStyle)
```

## Out of memory representation

For example, the 1.3 million brain cell data set from 10X Genomics
would require over 100 GB of RAM to hold as a `matrix` and around 30 GB as a `dgCMatrix`.
This makes it challenging to explore the data on anything less than an HPC system.
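
As a quick sanity check on those numbers, here is a back-of-the-envelope sketch;
the gene count and the ~7% non-zero density are assumptions for illustration:

```{r memory-estimate, eval = FALSE}
n_genes <- 27998      # assumed size of the 10X mouse reference
n_cells <- 1306127

# dense double-precision matrix: well over 100 GB
n_genes * n_cells * 8 / 1024^3

# a dgCMatrix stores only non-zero entries (8-byte value plus 4-byte row
# index per entry, plus a 4-byte pointer per column): roughly 30 GB
nnz <- 0.07 * n_genes * n_cells
(nnz * (8 + 4) + (n_cells + 1) * 4) / 1024^3
```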

The obvious solution is to use a file-backed matrix representation where the
data are held on disk and subsets are retrieved into memory as requested. While
a number of implementations of file-backed matrices are available (e.g.,
[bigmemory](https://cran.r-project.org/web/packages/bigmemory/index.html),
[matter](https://bioconductor.org/packages/matter)), we will be using the
implementation from the [HDF5Array](https://bioconductor.org/packages/HDF5Array)
package. This uses the popular HDF5 format as the underlying data store, which
provides a measure of standardization and portability across systems. We
demonstrate with a subset of 20,000 cells from the 1.3 million brain cell data
set, as provided by the
[TENxBrainData](https://bioconductor.org/packages/TENxBrainData) package.

```{r tenx-brain}
library(TENxBrainData)
library(scater)

# load the 20,000-cell subset as a file-backed SingleCellExperiment
sce.brain <- TENxBrainData20k()
sce.brain

# QC runs as usual; "^mt-" assumes mouse mitochondrial gene symbols
is.mito <- grepl("^mt-", rowData(sce.brain)$Symbol)
qcstats <- perCellQCMetrics(sce.brain, subsets = list(Mt = is.mito))
qcstats
```

Needless to say, data access from file-backed representations is slower than
that from in-memory representations. The time spent retrieving data from disk is
an unavoidable cost of reducing memory usage. Whether this is tolerable depends
on the application. One example usage pattern involves performing the heavy
computing quickly with in-memory representations on HPC systems with plentiful
memory, and then distributing file-backed counterparts to individual users for
exploration and visualization on their personal machines.

## Parallelization

Parallelization across multiple cores or cluster nodes in Bioconductor is
coordinated through the `r Biocpkg("BiocParallel")` package, which exposes
different backends through `*Param` objects. For example, on an HPC system we
can configure a backend that submits jobs to a Slurm scheduler (a sketch; the
resource values are illustrative):

```{r batchtools, eval = FALSE}
library(BiocParallel)

# 10 workers, each requesting 2 h walltime, 8 GB of memory, and 1 CPU
rs <- list(walltime = 7200, memory = 8000, ncpus = 1)
bpp <- BatchtoolsParam(10, cluster = "slurm", resources = rs)
```

Parallelization is best suited for independent, CPU-intensive tasks where the
division of labor results in a concomitant reduction in compute time. It is not
suited for tasks that are bounded by other compute resources, e.g., memory or
file I/O (though the latter is less of an issue on HPC systems with parallel
read/write). In particular, R itself is inherently single-core, so many of the
parallelization backends involve (i) setting up one or more separate R sessions,
(ii) loading the relevant packages and (iii) transmitting the data to that
session. Depending on the nature and size of the task, this overhead may
outweigh any benefit from parallel computing. While the default behavior of the
parallel job managers often works well for simple cases, it is sometimes
necessary to explicitly specify what data/libraries are sent to / loaded on the
parallel workers in order to avoid unnecessary overhead.
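
As a toy illustration of that overhead (a sketch; `MulticoreParam` is assumed
to be available, i.e. not on Windows):

```{r overhead-demo, eval = FALSE}
library(BiocParallel)

# for a task this cheap, worker setup and data transfer dominate,
# so the serial version will typically win
system.time(bplapply(1:100, sqrt, BPPARAM = MulticoreParam(4)))
system.time(lapply(1:100, sqrt))
```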

:::: challenge

How do you turn on progress bars with parallel processing?

::: solution

From `?MulticoreParam`:

> `progressbar` logical(1) Enable progress bar (based on plyr:::progress_text). Enabling the progress bar changes the default value of tasks to .Machine$integer.max, so that progress is reported for each element of X.

Progress bars are a helpful way to gauge whether a task is going to take 5 minutes or 5 hours.
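
For example (a minimal sketch using `MulticoreParam`; other backends such as
`SnowParam` accept the same argument):

```{r progress-demo, eval = FALSE}
param <- MulticoreParam(workers = 2, progressbar = TRUE)
bplapply(1:1000, sqrt, BPPARAM = param)
```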

:::

::::

## Fast approximations

### Nearest neighbor searching

Nearest neighbor detection is a core step of many single-cell procedures,
including graph-based clustering. Here we cluster on a nearest neighbor graph
built from the PCA coordinates (assuming `sce` is a processed
`SingleCellExperiment` from earlier in the lesson):

```{r cluster-exact}
library(scran)
library(bluster)

colLabels(sce) <- clusterCells(sce, use.dimred = "PCA",
    BLUSPARAM = NNGraphParam(cluster.fun = "louvain"))
```

The above clusters on a nearest neighbor graph generated with an exact neighbor
search. We repeat this below using an approximate search, implemented using the
[Annoy](https://github.com/spotify/Annoy) algorithm. This involves constructing
an `AnnoyParam` object to specify the search algorithm and then passing it to
the `NNGraphParam()` function. The results from the exact and approximate
searches are consistent, with most clusters from the former re-appearing in the
latter. This suggests that the inaccuracy from the approximation can be largely
ignored.

```{r cluster-annoy}
library(scran)
library(BiocNeighbors)

# use the Annoy approximate search when building the nearest neighbor graph
clusters <- clusterCells(sce, use.dimred = "PCA",
    BLUSPARAM = NNGraphParam(cluster.fun = "louvain", BNPARAM = AnnoyParam()))
table(exact = colLabels(sce), approx = clusters)
```

This can be quantified by calculating the pairwise Rand index:

```{r check-annoy}
library(bluster)
rand <- pairwiseRand(colLabels(sce), clusters, mode = "index")
rand

# sanity check that the two clusterings are very similar
stopifnot(rand > 0.85)
```
The neighbor identities themselves can also be compared directly between the
exact and approximate searches.
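
A sketch under assumptions (`findKNN()` from `r Biocpkg("BiocNeighbors")`, with
an arbitrary `k = 10`):

```{r knn-compare, eval = FALSE}
exact <- findKNN(reducedDim(sce, "PCA"), k = 10)
approx <- findKNN(reducedDim(sce, "PCA"), k = 10, BNPARAM = AnnoyParam())

# fraction of neighbor indices that differ between the two searches
mean(exact$index != approx$index)
```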
### Singular value decomposition

The singular value decomposition (SVD) underlies the PCA used throughout our
analyses, e.g., in `denoisePCA()`, `fastMNN()`, `doubletCells()`. (Briefly, the
right singular vectors are the eigenvectors of the gene-gene covariance matrix,
where each eigenvector represents the axis of maximum remaining variation in the
PCA.) The default `base::svd()` function performs an exact SVD that is not
performant for large datasets. Instead, we use fast approximate methods from the
`r CRANpkg("irlba")` and `r CRANpkg("rsvd")` packages, conveniently wrapped into
the `r Biocpkg("BiocSingular")` package for ease of use and package development.
Specifically, we can change the SVD algorithm used in any of these functions by
simply specifying an alternative value for the `BSPARAM` argument.

```{r pca-approx}
library(BiocSingular)

# approximate SVDs are randomized, so set a seed for reproducibility
set.seed(1000)

# randomized SVD (RSVD); r.out is reused in the challenge below
r.out <- runPCA(sce, ncomponents = 20, BSPARAM = RandomParam())
str(reducedDim(r.out, "PCA"))

i.out <- runPCA(sce, ncomponents = 20, BSPARAM = IrlbaParam())
str(reducedDim(i.out, "PCA"))
```

Both IRLBA and randomized SVD (RSVD) are much faster than the exact SVD and
usually yield only a negligible loss of accuracy. This motivates their default
use in many `r Biocpkg("scran")` and `r Biocpkg("scater")` functions, at the
cost of requiring users to set the seed to guarantee reproducibility. IRLBA can
occasionally fail to converge and require more iterations (passed via `maxit=`
in `IrlbaParam()`), while RSVD involves an explicit trade-off between accuracy
and speed based on its oversampling parameter (`p=`) and number of power
iterations (`q=`). We tend to prefer IRLBA as its default behavior is more
accurate, though RSVD is much faster for file-backed matrices.
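
For instance, these parameters can be supplied through the corresponding
`BSPARAM` constructors (a sketch; the specific values are arbitrary):

```{r tune-svd, eval = FALSE}
# give IRLBA more iterations if it fails to converge
i.out2 <- runPCA(sce, ncomponents = 20, BSPARAM = IrlbaParam(maxit = 2000))

# trade speed for accuracy in RSVD via oversampling and power iterations
r.out2 <- runPCA(sce, ncomponents = 20, BSPARAM = RandomParam(p = 20, q = 2))
```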

:::: challenge

The uncertainty from approximation error is sometimes psychologically
objectionable. "Why can't my computer just give me the right answer?" One way to
alleviate this feeling is to quantify the approximation error on a small test
set like the sce we have here. Using the `ExactParam()` class, visualize the
error in PC1 coordinates compared to the RSVD results.

::: solution
This code block calculates the exact PCA coordinates. Note that PC vectors are
only identified up to a sign flip, and here the RSVD PC1 vector points in the
opposite direction of the exact PC1 vector:
```{r}
set.seed(123)
e.out <- runPCA(sce, ncomponents = 20, BSPARAM = ExactParam())
str(reducedDim(e.out, "PCA"))
reducedDim(e.out, "PCA")[1:5, 1:3]
reducedDim(r.out, "PCA")[1:5, 1:3]
```

For the sake of visualizing the error, we can flip the sign of the RSVD coordinates:

```{r}
reducedDim(r.out, "PCA") = -1 * reducedDim(r.out, "PCA")
```

From there we can visualize the error with a histogram:

```{r}
library(ggplot2)

error <- reducedDim(r.out, "PCA")[, "PC1"] -
    reducedDim(e.out, "PCA")[, "PC1"]

data.frame(approx_error = error) |>
    ggplot(aes(approx_error)) +
    geom_histogram()
```

It's almost never more than 0.001 in this case.

:::

::::

## Interoperability with popular single-cell analysis ecosystems

```{r load-seurat}
library(Seurat)
```

Although the basic processing of single-cell data with Bioconductor packages
(described in the [OSCA book](https://bioconductor.org/books/release/OSCA/)) and
with Seurat is very similar and will produce overall roughly identical results,
there is also complementary functionality with regard to cell type annotation,
dataset integration, and downstream analysis. To make the most of both
ecosystems it is therefore beneficial to be able to easily switch between a
`SeuratObject` and a `SingleCellExperiment`. See also the Seurat [conversion
vignette](https://satijalab.org/seurat/articles/conversion_vignette.html) for
conversion to/from other popular single cell formats such as the AnnData format
used by [scanpy](https://scanpy.readthedocs.io/en/stable/).

Here, we demonstrate converting the Seurat object produced in Seurat's
[PBMC tutorial](https://satijalab.org/seurat/articles/pbmc3k_tutorial.html)
to a `SingleCellExperiment` object.
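
A minimal sketch of the conversion, assuming `sobj` holds the Seurat object
from that tutorial:

```{r seurat-roundtrip, eval = FALSE}
sobj

# Seurat -> SingleCellExperiment
sce.pbmc <- as.SingleCellExperiment(sobj)

# ...and back again
sobj2 <- as.Seurat(sce.pbmc)
```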

[Scanpy](https://scanpy.readthedocs.io) is a scalable toolkit for analyzing
single-cell gene expression data built jointly with
[anndata](https://anndata.readthedocs.io/). It includes preprocessing,
visualization, clustering, trajectory inference and differential expression
testing. The Python-based implementation efficiently deals with datasets of more
than one million cells. Scanpy is developed and maintained by the Theis lab and
is released under a
[BSD-3-Clause license](https://github.com/scverse/scanpy/blob/master/LICENSE).
Scanpy is part
of the [scverse](https://scverse.org/), a Python-based ecosystem for single-cell
omics data analysis.

At the core of scanpy's single-cell functionality is the `anndata` data structure,
scanpy's integrated single-cell data container, which is conceptually very
similar to a `SingleCellExperiment`.

We can also write a `SingleCellExperiment` to an H5AD file with the
`writeH5AD()` function, demonstrated here with the chimera mouse gastrulation
dataset.

```{r write-h5ad, message = FALSE}
library(zellkonverter)

out.file <- tempfile(fileext = ".h5ad")
writeH5AD(sce, file = out.file)
```
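
The file can be read back into R with `readH5AD()`, which makes for a quick
round-trip check:

```{r read-h5ad, eval = FALSE}
sce.back <- readH5AD(out.file)
sce.back
```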

For reference, the session information for this lesson:
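
```{r session-info}
sessionInfo()
```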

#### Exercise 1: Out of memory representation

Write the counts matrix of the wild-type chimera mouse gastrulation dataset to
an HDF5 file. Create another counts matrix that reads the data from the HDF5
file. Compare memory usage of holding the entire matrix in memory as opposed to
holding the data out of memory.

:::::::::::::: hint

See the `writeHDF5Array()` function for writing to HDF5 from the
`r Biocpkg("HDF5Array")` package.

::::::::::::::

:::::::::::::: solution

```{r}
library(HDF5Array)
library(MouseGastrulationData)

wt_out <- tempfile(fileext = ".h5")
wt_counts <- counts(WTChimeraData())

writeHDF5Array(wt_counts,
               name = "wt_counts",
               filepath = wt_out)

# the HDF5-backed matrix keeps the data on disk, so its in-memory
# footprint is tiny compared to the full sparse matrix
oom_wt <- HDF5Array(wt_out, "wt_counts")

object.size(wt_counts)
object.size(oom_wt)
```

:::::::::::::::::::::::

:::::::::::::::::::::::::::::::::::::::::::::
