Merge pull request #20 from ccb-hms/add_exercises
Add exercises
andrewGhazi authored Aug 26, 2024
2 parents 4426fa5 + 85de456 commit a189724
237 changes: 153 additions & 84 deletions episodes/large_data.Rmd

:::::::::::::::::::::::::::::::::::::: questions

- How do we work with single-cell datasets that are too large to fit in memory?
- How do we speed up single-cell analysis workflows for large datasets?
- How do we convert between popular single-cell data formats?

::::::::::::::::::::::::::::::::::::::::::::::::


```{r chunk-opts, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
options(digits = 3)
library(BiocStyle)
```

## Out of memory representation

For example, the 1.3 million brain cell data set from 10X Genomics
would require over 100 GB of RAM to hold as a `matrix` and around 30 GB as a `dgCMatrix`.
This makes it challenging to explore the data on anything less than an HPC system.
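
As a quick sanity check on those numbers, here is a back-of-the-envelope sketch;
the gene count and the ~7% non-zero density are assumptions for illustration:

```{r memory-estimate, eval = FALSE}
n_genes <- 27998      # assumed size of the 10X mouse reference
n_cells <- 1306127

# dense double-precision matrix: well over 100 GB
n_genes * n_cells * 8 / 1024^3

# a dgCMatrix stores only non-zero entries (8-byte value plus 4-byte row
# index per entry, plus a 4-byte pointer per column): roughly 30 GB
nnz <- 0.07 * n_genes * n_cells
(nnz * (8 + 4) + (n_cells + 1) * 4) / 1024^3
```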

The obvious solution is to use a file-backed matrix representation where the
data are held on disk and subsets are retrieved into memory as requested. While
a number of implementations of file-backed matrices are available (e.g.,
[bigmemory](https://cran.r-project.org/web/packages/bigmemory/index.html),
[matter](https://bioconductor.org/packages/matter)), we will be using the
implementation from the [HDF5Array](https://bioconductor.org/packages/HDF5Array)
package. This uses the popular HDF5 format as the underlying data store, which
provides a measure of standardization and portability across systems. We
demonstrate with a subset of 20,000 cells from the 1.3 million brain cell data
set, as provided by the
[TENxBrainData](https://bioconductor.org/packages/TENxBrainData) package.

```{r tenx-brain}
library(TENxBrainData)
library(scater)

# load the 20,000-cell subset as a file-backed SingleCellExperiment
sce.brain <- TENxBrainData20k()
sce.brain

# QC runs as usual; "^mt-" assumes mouse mitochondrial gene symbols
is.mito <- grepl("^mt-", rowData(sce.brain)$Symbol)
qcstats <- perCellQCMetrics(sce.brain, subsets = list(Mt = is.mito))
qcstats
```

Needless to say, data access from file-backed representations is slower than
that from in-memory representations. The time spent retrieving data from disk is
an unavoidable cost of reducing memory usage. Whether this is tolerable depends
on the application. One example usage pattern involves performing the heavy
computing quickly with in-memory representations on HPC systems with plentiful
memory, and then distributing file-backed counterparts to individual users for
exploration and visualization on their personal machines.

## Parallelization

Parallelization across multiple cores or cluster nodes in Bioconductor is
coordinated through the `r Biocpkg("BiocParallel")` package, which exposes
different backends through `*Param` objects. For example, on an HPC system we
can configure a backend that submits jobs to a Slurm scheduler (a sketch; the
resource values are illustrative):

```{r batchtools, eval = FALSE}
library(BiocParallel)

# 10 workers, each requesting 2 h walltime, 8 GB of memory, and 1 CPU
rs <- list(walltime = 7200, memory = 8000, ncpus = 1)
bpp <- BatchtoolsParam(10, cluster = "slurm", resources = rs)
```

Parallelization is best suited for independent, CPU-intensive tasks where the
division of labor results in a concomitant reduction in compute time. It is not
suited for tasks that are bounded by other compute resources, e.g., memory or
file I/O (though the latter is less of an issue on HPC systems with parallel
read/write). In particular, R itself is inherently single-core, so many of the
parallelization backends involve (i) setting up one or more separate R sessions,
(ii) loading the relevant packages and (iii) transmitting the data to that
session. Depending on the nature and size of the task, this overhead may
outweigh any benefit from parallel computing. While the default behavior of the
parallel job managers often works well for simple cases, it is sometimes
necessary to explicitly specify what data/libraries are sent to / loaded on the
parallel workers in order to avoid unnecessary overhead.
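
As a toy illustration of that overhead (a sketch; `MulticoreParam` is assumed
to be available, i.e. not on Windows):

```{r overhead-demo, eval = FALSE}
library(BiocParallel)

# for a task this cheap, worker setup and data transfer dominate,
# so the serial version will typically win
system.time(bplapply(1:100, sqrt, BPPARAM = MulticoreParam(4)))
system.time(lapply(1:100, sqrt))
```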

:::: challenge

How do you turn on progress bars with parallel processing?

::: solution

From `?MulticoreParam`:

> `progressbar` logical(1) Enable progress bar (based on plyr:::progress_text). Enabling the progress bar changes the default value of tasks to .Machine$integer.max, so that progress is reported for each element of X.

Progress bars are a helpful way to gauge whether a task is going to take 5 minutes or 5 hours.
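
For example (a minimal sketch using `MulticoreParam`; other backends such as
`SnowParam` accept the same argument):

```{r progress-demo, eval = FALSE}
param <- MulticoreParam(workers = 2, progressbar = TRUE)
bplapply(1:1000, sqrt, BPPARAM = param)
```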

:::

::::

## Fast approximations

### Nearest neighbor searching

Nearest neighbor detection is a core step of many single-cell procedures,
including graph-based clustering. Here we cluster on a nearest neighbor graph
built from the PCA coordinates (assuming `sce` is a processed
`SingleCellExperiment` from earlier in the lesson):

```{r cluster-exact}
library(scran)
library(bluster)

colLabels(sce) <- clusterCells(sce, use.dimred = "PCA",
    BLUSPARAM = NNGraphParam(cluster.fun = "louvain"))
```

The above clusters on a nearest neighbor graph generated with an exact neighbor
search. We repeat this below using an approximate search, implemented using the
[Annoy](https://github.com/spotify/Annoy) algorithm. This involves constructing
an `AnnoyParam` object to specify the search algorithm and then passing it to
the `NNGraphParam()` function. The results from the exact and approximate
searches are consistent, with most clusters from the former re-appearing in the
latter. This suggests that the inaccuracy from the approximation can be largely
ignored.

```{r cluster-annoy}
library(scran)
library(BiocNeighbors)

# use the Annoy approximate search when building the nearest neighbor graph
clusters <- clusterCells(sce, use.dimred = "PCA",
    BLUSPARAM = NNGraphParam(cluster.fun = "louvain", BNPARAM = AnnoyParam()))
table(exact = colLabels(sce), approx = clusters)
```

This can be quantified by calculating the pairwise Rand index:

```{r check-annoy}
library(bluster)
rand <- pairwiseRand(colLabels(sce), clusters, mode = "index")
rand

# sanity check that the two clusterings are very similar
stopifnot(rand > 0.85)
```
The neighbor identities themselves can also be compared directly between the
exact and approximate searches.
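
A sketch under assumptions (`findKNN()` from `r Biocpkg("BiocNeighbors")`, with
an arbitrary `k = 10`):

```{r knn-compare, eval = FALSE}
exact <- findKNN(reducedDim(sce, "PCA"), k = 10)
approx <- findKNN(reducedDim(sce, "PCA"), k = 10, BNPARAM = AnnoyParam())

# fraction of neighbor indices that differ between the two searches
mean(exact$index != approx$index)
```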
### Singular value decomposition

The singular value decomposition (SVD) underlies the PCA used throughout our
analyses, e.g., in `denoisePCA()`, `fastMNN()`, `doubletCells()`. (Briefly, the
right singular vectors are the eigenvectors of the gene-gene covariance matrix,
where each eigenvector represents the axis of maximum remaining variation in the
PCA.) The default `base::svd()` function performs an exact SVD that is not
performant for large datasets. Instead, we use fast approximate methods from the
`r CRANpkg("irlba")` and `r CRANpkg("rsvd")` packages, conveniently wrapped into
the `r Biocpkg("BiocSingular")` package for ease of use and package development.
Specifically, we can change the SVD algorithm used in any of these functions by
simply specifying an alternative value for the `BSPARAM` argument.

```{r pca-approx}
library(BiocSingular)

# approximate SVDs are randomized, so set a seed for reproducibility
set.seed(1000)

# randomized SVD (RSVD); r.out is reused in the challenge below
r.out <- runPCA(sce, ncomponents = 20, BSPARAM = RandomParam())
str(reducedDim(r.out, "PCA"))

i.out <- runPCA(sce, ncomponents = 20, BSPARAM = IrlbaParam())
str(reducedDim(i.out, "PCA"))
```

Both IRLBA and randomized SVD (RSVD) are much faster than the exact SVD and
usually yield only a negligible loss of accuracy. This motivates their default
use in many `r Biocpkg("scran")` and `r Biocpkg("scater")` functions, at the
cost of requiring users to set the seed to guarantee reproducibility. IRLBA can
occasionally fail to converge and require more iterations (passed via `maxit=`
in `IrlbaParam()`), while RSVD involves an explicit trade-off between accuracy
and speed based on its oversampling parameter (`p=`) and number of power
iterations (`q=`). We tend to prefer IRLBA as its default behavior is more
accurate, though RSVD is much faster for file-backed matrices.
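
For instance, these parameters can be supplied through the corresponding
`BSPARAM` constructors (a sketch; the specific values are arbitrary):

```{r tune-svd, eval = FALSE}
# give IRLBA more iterations if it fails to converge
i.out2 <- runPCA(sce, ncomponents = 20, BSPARAM = IrlbaParam(maxit = 2000))

# trade speed for accuracy in RSVD via oversampling and power iterations
r.out2 <- runPCA(sce, ncomponents = 20, BSPARAM = RandomParam(p = 20, q = 2))
```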

:::: challenge

The uncertainty from approximation error is sometimes psychologically
objectionable. "Why can't my computer just give me the right answer?" One way to
alleviate this feeling is to quantify the approximation error on a small test
set like the sce we have here. Using the `ExactParam()` class, visualize the
error in PC1 coordinates compared to the RSVD results.

::: solution
This code block calculates the exact PCA coordinates. Note that PC vectors are
only identified up to a sign flip, and here the RSVD PC1 vector points in the
opposite direction of the exact PC1 vector:
```{r}
set.seed(123)
e.out <- runPCA(sce, ncomponents = 20, BSPARAM = ExactParam())
str(reducedDim(e.out, "PCA"))
reducedDim(e.out, "PCA")[1:5, 1:3]
reducedDim(r.out, "PCA")[1:5, 1:3]
```

For the sake of visualizing the error, we can flip the sign of the RSVD coordinates:

```{r}
reducedDim(r.out, "PCA") = -1 * reducedDim(r.out, "PCA")
```

From there we can visualize the error with a histogram:

```{r}
library(ggplot2)

error <- reducedDim(r.out, "PCA")[, "PC1"] -
    reducedDim(e.out, "PCA")[, "PC1"]

data.frame(approx_error = error) |>
    ggplot(aes(approx_error)) +
    geom_histogram()
```

It's almost never more than 0.001 in this case.

:::

::::

## Interoperability with popular single-cell analysis ecosystems

```{r load-seurat}
library(Seurat)
```

Although the basic processing of single-cell data with Bioconductor packages
(described in the [OSCA book](https://bioconductor.org/books/release/OSCA/)) and
with Seurat is very similar and will produce overall roughly identical results,
there is also complementary functionality with regard to cell type annotation,
dataset integration, and downstream analysis. To make the most of both
ecosystems it is therefore beneficial to be able to easily switch between a
`SeuratObject` and a `SingleCellExperiment`. See also the Seurat [conversion
vignette](https://satijalab.org/seurat/articles/conversion_vignette.html) for
conversion to/from other popular single cell formats such as the AnnData format
used by [scanpy](https://scanpy.readthedocs.io/en/stable/).

Here, we demonstrate converting the Seurat object produced in Seurat's
[PBMC tutorial](https://satijalab.org/seurat/articles/pbmc3k_tutorial.html)
to a `SingleCellExperiment` object.
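
A minimal sketch of the conversion, assuming `sobj` holds the Seurat object
from that tutorial:

```{r seurat-roundtrip, eval = FALSE}
sobj

# Seurat -> SingleCellExperiment
sce.pbmc <- as.SingleCellExperiment(sobj)

# ...and back again
sobj2 <- as.Seurat(sce.pbmc)
```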

[Scanpy](https://scanpy.readthedocs.io) is a scalable toolkit for analyzing
single-cell gene expression data built jointly with
[anndata](https://anndata.readthedocs.io/). It includes preprocessing,
visualization, clustering, trajectory inference and differential expression
testing. The Python-based implementation efficiently deals with datasets of more
than one million cells. Scanpy is developed and maintained by the Theis lab and
is released under a
[BSD-3-Clause license](https://github.com/scverse/scanpy/blob/master/LICENSE).
Scanpy is part
of the [scverse](https://scverse.org/), a Python-based ecosystem for single-cell
omics data analysis.

At the core of scanpy's single-cell functionality is the `anndata` data structure,
scanpy's integrated single-cell data container, which is conceptually very
similar to a `SingleCellExperiment`.

We can also write a `SingleCellExperiment` to an H5AD file with the
`writeH5AD()` function, demonstrated here with the chimera mouse gastrulation
dataset.

```{r write-h5ad, message = FALSE}
library(zellkonverter)

out.file <- tempfile(fileext = ".h5ad")
writeH5AD(sce, file = out.file)
```
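
The file can be read back into R with `readH5AD()`, which makes for a quick
round-trip check:

```{r read-h5ad, eval = FALSE}
sce.back <- readH5AD(out.file)
sce.back
```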

For reference, the session information for this lesson:
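
```{r session-info}
sessionInfo()
```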

#### Exercise 1: Out of memory representation

Write the counts matrix of the wild-type chimera mouse gastrulation dataset to
an HDF5 file. Create another counts matrix that reads the data from the HDF5
file. Compare memory usage of holding the entire matrix in memory as opposed to
holding the data out of memory.

:::::::::::::: hint

See the `writeHDF5Array()` function for writing to HDF5 from the
`r Biocpkg("HDF5Array")` package.

::::::::::::::

:::::::::::::: solution

```{r}
library(HDF5Array)
library(MouseGastrulationData)

wt_out <- tempfile(fileext = ".h5")
wt_counts <- counts(WTChimeraData())

writeHDF5Array(wt_counts,
               name = "wt_counts",
               filepath = wt_out)

# the HDF5-backed matrix keeps the data on disk, so its in-memory
# footprint is tiny compared to the full sparse matrix
oom_wt <- HDF5Array(wt_out, "wt_counts")

object.size(wt_counts)
object.size(oom_wt)
```

:::::::::::::::::::::::

:::::::::::::::::::::::::::::::::::::::::::::
