---
title: Working with large data
teaching: 10 # Minutes of teaching in the lesson
exercises: 2 # Minutes of exercises in the lesson
---
:::::::::::::::::::::::::::::::::::::: questions
- How do we work with single-cell datasets that are too large to fit in memory?
- How do we speed up single-cell analysis workflows for large datasets?
- How do we convert between popular single-cell data formats?
::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::: objectives
- Work with out-of-memory data representations such as HDF5.
- Speed up single-cell analysis with parallel computation.
- Invoke fast approximations for essential analysis steps.
- Convert `SingleCellExperiment` objects to `SeuratObject`s and `AnnData` objects.
::::::::::::::::::::::::::::::::::::::::::::::::
```{r chunk-opts, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, digits = 3)
library(BiocStyle)
```
## Motivation
Advances in scRNA-seq technologies have increased the number of cells that can
be assayed in routine experiments.
Public databases such as [GEO](https://www.ncbi.nlm.nih.gov/geo/) are continually
expanding with more scRNA-seq studies,
while large-scale projects such as the
[Human Cell Atlas](https://www.humancellatlas.org/) are expected to generate
data for billions of cells.
For effective data analysis, the computational methods need to scale with the
increasing size of scRNA-seq data sets.
This section discusses how we can use various aspects of the Bioconductor
ecosystem to tune our analysis pipelines for greater speed and efficiency.
## Out of memory representations
The count matrix is the central structure around which our analyses are based.
In most of the previous chapters, this has been held fully in memory as a dense
`matrix` or as a sparse `dgCMatrix`.
However, in-memory representations may not be feasible for very large data sets,
especially on machines with limited memory.
For example, the 1.3 million brain cell data set from 10X Genomics
([Zheng et al., 2017](https://doi.org/10.1038/ncomms14049))
would require over 100 GB of RAM to hold as a `matrix` and around 30 GB as a `dgCMatrix`.
This makes it challenging to explore the data on anything less than an HPC system.
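To see where these numbers come from, a back-of-the-envelope calculation helps.
The sketch below is ours: it assumes approximate dimensions of 28,000 genes by
1.3 million cells, with 4 bytes per entry for integer counts and 8 bytes for
double precision.
```{r memory-footprint}
# Approximate dense footprint of the full data set: genes x cells x bytes/entry.
n_genes <- 28000    # assumed, approximate
n_cells <- 1.3e6    # assumed, approximate
n_genes * n_cells * 4 / 2^30  # ~136 GB as an integer matrix
n_genes * n_cells * 8 / 2^30  # ~271 GB as a double-precision matrix
```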
The obvious solution is to use a file-backed matrix representation where the
data are held on disk and subsets are retrieved into memory as requested. While
a number of implementations of file-backed matrices are available (e.g.,
[bigmemory](https://cran.r-project.org/web/packages/bigmemory/index.html),
[matter](https://bioconductor.org/packages/matter)), we will be using the
implementation from the [HDF5Array](https://bioconductor.org/packages/HDF5Array)
package. This uses the popular HDF5 format as the underlying data store, which
provides a measure of standardization and portability across systems. We
demonstrate with a subset of 20,000 cells from the 1.3 million brain cell data
set, as provided by the
[TENxBrainData](https://bioconductor.org/packages/TENxBrainData) package.
```{r tenx-brain}
library(TENxBrainData)
sce.brain <- TENxBrainData20k()
sce.brain
```
Examination of the `SingleCellExperiment` object indicates that the count matrix
is an `HDF5Matrix`.
From a comparison of the memory usage, it is clear that this matrix object is
simply a stub that points to the much larger HDF5 file that actually contains
the data.
This avoids the need for large RAM availability during analyses.
```{r tenx-brain-size}
counts(sce.brain)
object.size(counts(sce.brain))
file.info(path(counts(sce.brain)))$size
```
Manipulation of the count matrix will generally result in the creation of a
`DelayedArray` object from the
[DelayedArray](https://bioconductor.org/packages/DelayedArray) package.
This remembers the operations to be applied to the counts and stores them in
the object, to be executed when the modified matrix values are realized for use
in calculations.
The use of delayed operations avoids the need to write the modified values to a
new file at every operation, which would unnecessarily require time-consuming disk I/O.
```{r}
tmp <- counts(sce.brain)
tmp <- log2(tmp + 1)
tmp
```
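To pull a small piece of the delayed result into ordinary memory, we can realize
just that piece. The snippet below is a minimal illustration we have added for
clarity:
```{r realize-subset}
# Realizing a small subset executes the delayed log-transformation
# only for the values that are actually requested.
as.matrix(tmp[1:5, 1:3])
```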
Many functions described in the previous workflows are capable of accepting
`HDF5Matrix` objects.
This is powered by the availability of common methods for all matrix
representations (e.g., subsetting, combining, methods from
[DelayedMatrixStats](https://bioconductor.org/packages/DelayedMatrixStats))
as well as representation-agnostic C++ code
using [beachmat](https://bioconductor.org/packages/beachmat).
For example, we compute QC metrics below with the same `perCellQCMetrics()`
function that we used in the other workflows.
```{r, message = FALSE}
library(scater)
is.mito <- grepl("^mt-", rowData(sce.brain)$Symbol)
qcstats <- perCellQCMetrics(sce.brain, subsets = list(Mt = is.mito))
qcstats
```
Needless to say, data access from file-backed representations is slower than
that from in-memory representations. The time spent retrieving data from disk is
an unavoidable cost of reducing memory usage. Whether this is tolerable depends
on the application. One example usage pattern involves performing the heavy
computing quickly with in-memory representations on HPC systems with plentiful
memory, and then distributing file-backed counterparts to individual users for
exploration and visualization on their personal machines.
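As a rough illustration of this trade-off (a sketch of ours; timings depend
heavily on hardware and disk speed), we can time the same operation on a
file-backed slice and on an in-memory copy of it:
```{r disk-vs-memory}
slice <- counts(sce.brain)[, 1:1000]  # still file-backed (delayed subset)
system.time(sum(slice))               # values must be read from disk
in.mem <- as.matrix(slice)            # realize the slice in memory
system.time(sum(in.mem))              # typically much faster
```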
## Parallelization
Parallelization of calculations across genes or cells is an obvious strategy for
speeding up scRNA-seq analysis workflows.
The `r Biocpkg("BiocParallel")` package provides a common interface for parallel
computing throughout the Bioconductor ecosystem, manifesting as a `BPPARAM`
argument in compatible functions. We can also use `BiocParallel` with more
expressive functions directly through the package's interface.
### Basic use
```{r,include=TRUE,results="hide",message=FALSE,warning=FALSE}
library(BiocParallel)
```
`BiocParallel` makes it quite easy to iterate over a vector and distribute the
computation across workers using the `bplapply` function. Basic knowledge
of `lapply` is required.
In this example, we find the square root of a vector of numbers in parallel
by indicating the `BPPARAM` argument in `bplapply`.
```{r}
param <- MulticoreParam(workers = 1)
bplapply(
    X = c(4, 9, 16, 25),
    FUN = sqrt,
    BPPARAM = param
)
```
**Note:** The number of workers is set to 1 here due to resource limitations in
continuous integration testing.
There exists a diverse set of parallelization backends depending on available
hardware and operating systems.
For example, we might use forking across two cores to parallelize the variance
calculations on a Unix system:
```{r parallel-mc, message = FALSE}
library(MouseGastrulationData)
library(scran)
sce <- WTChimeraData(samples = 5, type = "processed")
sce <- logNormCounts(sce)
dec.mc <- modelGeneVar(sce, BPPARAM = MulticoreParam(2))
dec.mc
```
Another approach would be to distribute jobs across a network of computers,
which yields the same result:
```{r parallel-snow, eval=FALSE}
dec.snow <- modelGeneVar(sce, BPPARAM = SnowParam(2))
```
For high-performance computing (HPC) systems with a cluster of compute nodes,
we can distribute jobs via the job scheduler using the `BatchtoolsParam` class.
The example below assumes a SLURM cluster, though the settings can be easily
configured for a particular system
(see `r Biocpkg("BiocParallel", "BiocParallel_BatchtoolsParam.pdf", "here")` for
details).
```{r batch-param, eval=FALSE}
# 2 hours, 8 GB, 1 CPU per task, for 10 tasks.
rs <- list(walltime = 7200, memory = 8000, ncpus = 1)
bpp <- BatchtoolsParam(10, cluster = "slurm", resources = rs)
```
Parallelization is best suited for independent, CPU-intensive tasks where the
division of labor results in a concomitant reduction in compute time. It is not
suited for tasks that are bounded by other compute resources, e.g., memory or
file I/O (though the latter is less of an issue on HPC systems with parallel
read/write). In particular, R itself is inherently single-core, so many of the
parallelization backends involve (i) setting up one or more separate R sessions,
(ii) loading the relevant packages and (iii) transmitting the data to that
session. Depending on the nature and size of the task, this overhead may
outweigh any benefit from parallel computing. While the default behavior of the
parallel job managers often works well for simple cases, it is sometimes
necessary to explicitly specify what data/libraries are sent to / loaded on the
parallel workers in order to avoid unnecessary overhead.
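As a minimal sketch of that last point (our illustration, not a prescription):
load packages inside the worker function and pass only the data each task needs,
rather than relying on automatic export of the calling environment.
```{r explicit-worker-setup, eval=FALSE}
param <- SnowParam(workers = 2)
res <- bplapply(seq_len(2), function(i, x) {
    library(Matrix)  # loaded explicitly on each worker
    sum(x[, i])      # each task only touches the data it was given
}, x = Matrix::rsparsematrix(100, 2, density = 0.1), BPPARAM = param)
```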
:::: challenge
How do you turn on progress bars with parallel processing?
::: solution
From `?MulticoreParam` :
> `progressbar` logical(1) Enable progress bar (based on plyr:::progress_text). Enabling the progress bar changes the default value of tasks to .Machine$integer.max, so that progress is reported for each element of X.
Progress bars are a helpful way to gauge whether a task is going to take 5 minutes or 5 hours.
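For example, a minimal sketch (not evaluated here):
```{r progressbar-example, eval=FALSE}
param <- MulticoreParam(workers = 2, progressbar = TRUE)
bplapply(X = c(4, 9, 16, 25), FUN = sqrt, BPPARAM = param)
```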
:::
::::
## Fast approximations
### Nearest neighbor searching
Identifying neighboring cells in PC or expression space is a common procedure
used in many functions, e.g., `buildSNNGraph()`, `doubletCells()`.
The default is to favor accuracy over speed by using an exact nearest neighbor
(NN) search, implemented with the $k$-means for $k$-nearest neighbors algorithm.
However, for large data sets, it may be preferable to use a faster approximate
approach.
The `r Biocpkg("BiocNeighbors")` framework makes it easy to switch between search
options by simply changing the `BNPARAM` argument in compatible functions.
To demonstrate, we will use the wild-type chimera data for which we had applied
graph-based clustering using the Louvain algorithm for community detection:
```{r pca-and-cluster}
library(bluster)
sce <- runPCA(sce)
colLabels(sce) <- clusterCells(sce, use.dimred = "PCA",
    BLUSPARAM = NNGraphParam(cluster.fun = "louvain"))
```
The above clusters on a nearest neighbor graph generated with an exact neighbor
search. We repeat this below using an approximate search, implemented using the
[Annoy](https://github.com/spotify/Annoy) algorithm. This involves constructing
an `AnnoyParam` object to specify the search algorithm and then passing it to the
`BNPARAM=` argument of `NNGraphParam()`. The results from the exact
and approximate searches are consistent, with most clusters from the former
re-appearing in the latter. This suggests that the inaccuracy from the
approximation can be largely ignored.
```{r cluster-annoy}
library(scran)
library(BiocNeighbors)
clusters <- clusterCells(sce, use.dimred = "PCA",
BLUSPARAM = NNGraphParam(cluster.fun = "louvain",
BNPARAM = AnnoyParam()))
table(exact = colLabels(sce), approx = clusters)
```
The similarity of the two clusterings can be quantified by calculating the pairwise Rand index:
```{r check-annoy}
rand <- pairwiseRand(colLabels(sce), clusters, mode = "index")
stopifnot(rand > 0.8)
```
Note that Annoy writes the NN index to disk prior to performing the search.
Thus, it may not actually be faster than the default exact algorithm for small
datasets, depending on whether the overhead of disk write is offset by the
computational complexity of the search.
It is also not difficult to find situations where the approximation deteriorates,
especially at high dimensions, though this may not have an appreciable impact on
the biological conclusions.
```{r bad-approx}
set.seed(1000)
y1 <- matrix(rnorm(50000), nrow = 1000)
y2 <- matrix(rnorm(50000), nrow = 1000)
Y <- rbind(y1, y2)
exact <- findKNN(Y, k = 20)
approx <- findKNN(Y, k = 20, BNPARAM = AnnoyParam())
mean(exact$index != approx$index)
```
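If the approximation is a concern, Annoy's accuracy/speed trade-off can be
tuned; for example, increasing `ntrees=` in `AnnoyParam()` builds a larger index
that should recover more of the true neighbors, at the cost of speed. A brief
sketch continuing the example above:
```{r annoy-ntrees}
# More trees generally yields a more accurate (but slower) Annoy search.
approx2 <- findKNN(Y, k = 20, BNPARAM = AnnoyParam(ntrees = 200))
mean(exact$index != approx2$index)
```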
### Singular value decomposition
The singular value decomposition (SVD) underlies the PCA used throughout our
analyses, e.g., in `denoisePCA()`, `fastMNN()`, `doubletCells()`. (Briefly, the
right singular vectors are the eigenvectors of the gene-gene covariance matrix,
where each eigenvector represents the axis of maximum remaining variation in the
PCA.) The default `base::svd()` function performs an exact SVD that is not
performant for large datasets. Instead, we use fast approximate methods from the
`r CRANpkg("irlba")` and `r CRANpkg("rsvd")` packages, conveniently wrapped into
the `r Biocpkg("BiocSingular")` package for ease of use and package development.
Specifically, we can change the SVD algorithm used in any of these functions by
simply specifying an alternative value for the `BSPARAM` argument.
```{r fast-svd}
library(scater)
library(BiocSingular)
# As the name suggests, it is random, so we need to set the seed.
set.seed(101000)
r.out <- runPCA(sce, ncomponents = 20, BSPARAM = RandomParam())
str(reducedDim(r.out, "PCA"))
set.seed(101001)
i.out <- runPCA(sce, ncomponents = 20, BSPARAM = IrlbaParam())
str(reducedDim(i.out, "PCA"))
```
Both IRLBA and randomized SVD (RSVD) are much faster than the exact SVD and
usually yield only a negligible loss of accuracy. This motivates their default
use in many `r Biocpkg("scran")` and `r Biocpkg("scater")` functions, at the
cost of requiring users to set the seed to guarantee reproducibility. IRLBA can
occasionally fail to converge and require more iterations (passed via `maxit=`
in `IrlbaParam()`), while RSVD involves an explicit trade-off between accuracy
and speed based on its oversampling parameter (`p=`) and number of power
iterations (`q=`). We tend to prefer IRLBA as its default behavior is more
accurate, though RSVD is much faster for file-backed matrices.
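For instance, if IRLBA emits convergence warnings, a sketch like the following
grants it more iterations (`maxit=` is forwarded to the underlying `irlba()`
call):
```{r irlba-maxit, eval=FALSE}
set.seed(101002)
# Allow IRLBA up to 2000 iterations instead of the default.
i.out2 <- runPCA(sce, ncomponents = 20, BSPARAM = IrlbaParam(maxit = 2000))
```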
:::: challenge
The uncertainty from approximation error is sometimes aggravating. "Why can't my
computer just give me the right answer?" One way to alleviate this feeling is to
quantify the approximation error on a small test set like the `sce` object we
have here. Using the `ExactParam()` class, visualize the error in PC1 coordinates
compared to the RSVD results.
::: solution
This code block calculates the exact PCA coordinates. Another thing to note: PC
vectors are only identified up to a sign flip. We can see that the RSVD PC1
vector points in the opposite direction of the exact PC1 vector.
```{r}
set.seed(123)
e.out <- runPCA(sce, ncomponents = 20, BSPARAM = ExactParam())
str(reducedDim(e.out, "PCA"))
reducedDim(e.out, "PCA")[1:5,1:3]
reducedDim(r.out, "PCA")[1:5,1:3]
```
For the sake of visualizing the error we can just flip the PC1 coordinates:
```{r}
reducedDim(r.out, "PCA") = -1 * reducedDim(r.out, "PCA")
```
From there we can visualize the error with a histogram:
```{r}
error <- reducedDim(r.out, "PCA")[, "PC1"] -
    reducedDim(e.out, "PCA")[, "PC1"]

data.frame(approx_error = error) |>
    ggplot(aes(approx_error)) +
    geom_histogram()
```
It's almost never more than 0.001 in this case.
:::
::::
## Interoperability with popular single-cell analysis ecosystems
### Seurat
[Seurat](https://satijalab.org/seurat) is an R package designed for QC, analysis,
and exploration of single-cell RNA-seq data. Seurat can be used to identify and
interpret sources of heterogeneity from single-cell transcriptomic measurements,
and to integrate diverse types of single-cell data. Seurat is developed and
maintained by the [Satija lab](https://satijalab.org/seurat/authors.html)
and is released under the [MIT license](https://opensource.org/license/mit/).
```{r load-seurat, message = FALSE}
library(Seurat)
```
Although the basic processing of single-cell data with Bioconductor packages
(described in the [OSCA book](https://bioconductor.org/books/release/OSCA/)) and
with Seurat is very similar and will produce roughly identical results overall,
the two ecosystems offer complementary functionality for cell type annotation,
dataset integration, and downstream analysis. To make the most of both
ecosystems, it is therefore beneficial to be able to easily switch between a
`SeuratObject` and a `SingleCellExperiment`. See also the Seurat [conversion
vignette](https://satijalab.org/seurat/articles/conversion_vignette.html) for
conversion to/from other popular single-cell formats such as the AnnData format
used by [scanpy](https://scanpy.readthedocs.io/en/stable/).
Here, we demonstrate converting the Seurat object produced in Seurat's
[PBMC tutorial](https://satijalab.org/seurat/articles/pbmc3k_tutorial.html)
to a `SingleCellExperiment` for further analysis with functionality from
OSCA/Bioconductor.
We first need to install the
[SeuratData](https://github.com/satijalab/seurat-data) package, which is
available from GitHub only.
```{r, eval = FALSE}
BiocManager::install("satijalab/seurat-data")
```
We then proceed by loading all required packages and installing the PBMC dataset:
```{r seurat-data, eval = FALSE}
library(SeuratData)
InstallData("pbmc3k")
```
We then load the dataset as a `SeuratObject` and convert it to a
`SingleCellExperiment`.
```{r seurat_singlecell, eval = FALSE}
# Use PBMC3K from SeuratData
pbmc <- LoadData(ds = "pbmc3k", type = "pbmc3k.final")
pbmc <- UpdateSeuratObject(pbmc)
pbmc
pbmc.sce <- as.SingleCellExperiment(pbmc)
pbmc.sce
```
Seurat also allows conversion from `SingleCellExperiment` objects to Seurat objects;
we demonstrate this here on the wild-type chimera mouse gastrulation dataset.
```{r sce2sobj, message = FALSE, eval = FALSE}
sce <- WTChimeraData(samples = 5, type = "processed")
assay(sce) <- as.matrix(assay(sce))
sce <- logNormCounts(sce)
sce
```
After some processing of the dataset, the actual conversion is carried out with
the `as.Seurat` function.
```{r, warning = FALSE, eval = FALSE}
sobj <- as.Seurat(sce)
Idents(sobj) <- "celltype.mapped"
sobj
```
### Scanpy
[Scanpy](https://scanpy.readthedocs.io) is a scalable toolkit for analyzing
single-cell gene expression data built jointly with
[anndata](https://anndata.readthedocs.io/). It includes preprocessing,
visualization, clustering, trajectory inference and differential expression
testing. The Python-based implementation efficiently deals with datasets of more
than one million cells. Scanpy is developed and maintained by the Theis lab
and is released under a [BSD-3-Clause
license](https://github.com/scverse/scanpy/blob/master/LICENSE). Scanpy is part
of the [scverse](https://scverse.org/), a Python-based ecosystem for single-cell
omics data analysis.
At the core of scanpy's single-cell functionality is the `anndata` data structure,
scanpy's integrated single-cell data container, which is conceptually very similar
to Bioconductor's `SingleCellExperiment` class.
Bioconductor's `r Biocpkg("zellkonverter")` package provides a lightweight
interface between the Bioconductor `SingleCellExperiment` data structure and the
Python `AnnData`-based single-cell analysis environment. The idea is to enable
users and developers to easily move data between these frameworks to construct a
multi-language analysis pipeline across R/Bioconductor and Python.
```{r load-zellkonv, message = FALSE}
library(zellkonverter)
```
The `readH5AD()` function can be used to read a `SingleCellExperiment` from an
H5AD file. Here, we use an example H5AD file contained in the `r Biocpkg("zellkonverter")`
package.
```{r read-h5ad, message = FALSE, warning = FALSE}
example_h5ad <- system.file("extdata", "krumsiek11.h5ad",
package = "zellkonverter")
readH5AD(example_h5ad)
```
We can also write a `SingleCellExperiment` to an H5AD file with the
`writeH5AD()` function. This is demonstrated below on the wild-type
chimera mouse gastrulation dataset.
```{r write-h5ad, message = FALSE}
out.file <- tempfile(fileext = ".h5ad")
writeH5AD(sce, file = out.file)
```
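As a quick sanity check, the file we just wrote can be read straight back into R:
```{r read-back-h5ad, message = FALSE}
sce.back <- readH5AD(out.file)
sce.back
```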
The resulting H5AD file can then be read into Python using scanpy's
[read_h5ad](https://scanpy.readthedocs.io/en/stable/generated/scanpy.read_h5ad.html)
function and then directly used in compatible Python-based analysis frameworks.
## Exercises
:::::::::::::::::::::::::::::::::: challenge
#### Exercise 1: Out of memory representation
Write the counts matrix of the wild-type chimera mouse gastrulation dataset to
an HDF5 file. Create another counts matrix that reads the data from the HDF5
file. Compare memory usage of holding the entire matrix in memory as opposed to
holding the data out of memory.
:::::::::::::: hint
See the `HDF5Array` function for reading from HDF5 and the `writeHDF5Array`
function for writing to HDF5 from the `r Biocpkg("HDF5Array")` package.
:::::::::::::::::::::::
:::::::::::::: solution
```{r}
library(HDF5Array)

wt_out <- tempfile(fileext = ".h5")
wt_counts <- counts(WTChimeraData())

writeHDF5Array(wt_counts,
               filepath = wt_out,
               name = "wt_counts")

oom_wt <- HDF5Array(wt_out, "wt_counts")

object.size(wt_counts)
object.size(oom_wt)
```
:::::::::::::::::::::::
:::::::::::::::::::::::::::::::::::::::::::::
:::::::::::::::::::::::::::::::::: challenge
#### Exercise 2: Parallelization
Perform a PCA analysis of the 20,000-cell subset of the 1.3 million brain cell
dataset (`sce.brain`) using a multicore backend for parallel computation.
Compare the runtime of performing the PCA in serial execution mode, in multicore
execution mode with 2 workers, and in multicore execution mode with 3 workers.
:::::::::::::: hint
Use the function `system.time` to obtain the runtime of each job.
:::::::::::::::::::::::
:::::::::::::: solution
```{r eval=FALSE}
sce.brain <- logNormCounts(sce.brain)
system.time({i.out <- runPCA(sce.brain,
                             ncomponents = 20,
                             BSPARAM = ExactParam(),
                             BPPARAM = SerialParam())})

system.time({i.out <- runPCA(sce.brain,
                             ncomponents = 20,
                             BSPARAM = ExactParam(),
                             BPPARAM = MulticoreParam(workers = 2))})

system.time({i.out <- runPCA(sce.brain,
                             ncomponents = 20,
                             BSPARAM = ExactParam(),
                             BPPARAM = MulticoreParam(workers = 3))})
```
:::::::::::::::::::::::
:::::::::::::::::::::::::::::::::::::::::::::
:::::::::::::: checklist
## Further Reading
* OSCA book, [Chapter 14](https://bioconductor.org/books/release/OSCA.advanced/dealing-with-big-data.html): Dealing with big data
* The `BiocParallel` `r Biocpkg("BiocParallel", vignette = "Introduction_To_BiocParallel.html", label = "intro vignette")`.
::::::::::::::
::::::::::::::::::::::::::::::::::::: keypoints
- Out-of-memory representations can be used to work with single-cell datasets that are too large to fit in memory.
- Parallelization of calculations across genes or cells is an effective strategy for speeding up analysis of large single-cell datasets.
- Fast approximations for nearest neighbor search and singular value decomposition can speed up essential steps of single-cell analysis with minimal loss of accuracy.
- Converter functions between existing single-cell data formats enable analysis workflows that leverage complementary functionality from popular single-cell analysis ecosystems.
::::::::::::::::::::::::::::::::::::::::::::::::
## Session Info
```{r, tidy=TRUE}
sessionInfo()
```