diff --git a/README.md b/README.md index c7135bb..aa83c85 100644 --- a/README.md +++ b/README.md @@ -8,14 +8,15 @@ the analysis of single-cell data has been developed, making it hard to understan the critical steps in the analysis workflow and the best methods for each objective of one’s study. -This tutorial aims to provide a solid foundation in using Bioconductor tools -for single-cell RNA-seq analysis by walking through various steps of typical -workflows using example datasets. +This [Carpentries-style](https://carpentries.org/) tutorial aims to provide a +solid foundation in using [Bioconductor](https://bioconductor.org) tools for +single-cell RNA-seq analysis by walking through various steps of typical workflows +using example datasets. This tutorial uses as a "text-book" the online book "Orchestrating Single-Cell -Analysis with Bioconductor" -([OSCA](https://bioconductor.org/books/release/OSCA/)), -started in 2018 and continuously updated by many contributors from the Bioconductor +Analysis with Bioconductor" ([OSCA](https://bioconductor.org/books/release/OSCA/)), +[published in 2020](https://doi.org/10.1038%2Fs41592-019-0654-x), +and continuously updated by many contributors from the Bioconductor community. Like the book, this tutorial strives to be of interest to the experimental biologists wanting to analyze their data and to the bioinformaticians approaching single-cell data. @@ -35,11 +36,45 @@ In particular, participants will learn: * How to correct for batch effects and integrate multiple samples. * How to perform differential expression and differential abundance analysis between conditions. * How to work with large out-of-memory datasets. +* How to interoperate with other popular single-cell analysis ecosystems. -## Source +## Other tools and tutorials for single-cell analysis + +The focus of this tutorial is on single-cell analysis with R packages from the +[Bioconductor](https://bioconductor.org) repository. 
Bioconductor packages are +collaboratively developed by an international community of developers that agree +on data and software standards to promote interoperability between packages, +extensibility of analysis workflows, and reproducibility of published research. + +Other popular tools for single-cell analysis include: + +* [Seurat](https://satijalab.org/seurat/), a stand-alone R package that has +pioneered elementary steps of typical single-cell analysis workflows, and +* [scverse](https://scverse.org/), a collection of Python packages for single-cell +omics data analysis including [scanpy](https://scanpy.readthedocs.io) and +[scvi-tools](https://scvi-tools.org/). -This lesson uses [The Carpentries Workbench](https://carpentries.github.io/sandpaper-docs/) and is based on materials from the [OSCA tutorial at the ISMB 2023](https://bioconductor.github.io/ISMB.OSCA/). +Tutorials for working with these tools are available elsewhere and are not covered +in this tutorial. A demonstration of how to interoperate with `Seurat` and packages +from the `scverse` is given in [Session 5](https://ccb-hms.github.io/osca-workbench/large_data.html) +of this tutorial. + +Other Carpentries-style tutorials for single-cell analysis with a different scope include: + +- a [community-developed lesson](https://carpentries-incubator.github.io/scrna-seq-analysis/) + that makes use of command-line utilities and `scanpy` for basic preprocessing steps, +- and a [tutorial proposal](https://github.com/carpentries-incubator/proposals/issues/178) +based on `Seurat`. + +## Source -As individual vignettes are converted into lessons, they can be added to `config.yaml` to be rendered and shown in the final Github Pages lesson. +This lesson uses [The Carpentries Workbench](https://carpentries.github.io/sandpaper-docs/) +and is based on materials from the [OSCA tutorial at the ISMB 2023](https://bioconductor.github.io/ISMB.OSCA/). 
+## Citation +Amezquita RA, Lun ATL, Becht E, Carey VJ, Carpp LN, Geistlinger L, Marini F, +Rue-Albrecht K, Risso D, Soneson C, Waldron L, Pagès H, Smith ML, Huber W, +Morgan M, Gottardo R, Hicks SC. Orchestrating single-cell analysis with +Bioconductor. *Nature Methods*, 2020. +doi: [10.1038/s41592-019-0654-x](https://doi.org/10.1038/s41592-019-0654-x) diff --git a/episodes/intro-sce.Rmd b/episodes/intro-sce.Rmd index 95e7dd8..612404b 100644 --- a/episodes/intro-sce.Rmd +++ b/episodes/intro-sce.Rmd @@ -31,7 +31,7 @@ options(digits = 3) ### Overview -Within the R ecosystem, the Bioconductor project provides tools for the analysis and comprehension of high-throughput genomics data. +Within the [R](https://www.r-project.org/) ecosystem, the [Bioconductor](https://bioconductor.org/) project provides tools for the analysis and comprehension of high-throughput genomics data. The scope of the project covers microarray data, various forms of sequencing (RNA-seq, ChIP-seq, bisulfite, genotyping, etc.), proteomics, flow cytometry and more. One of Bioconductor's main selling points is the use of common data structures to promote interoperability between packages, allowing code written by different people (from different organizations, in different countries) to work together seamlessly in complex analyses. @@ -89,20 +89,21 @@ This will check for more recent versions of each package (within a Bioconductor BiocManager::install() ``` -Be careful: if you have a lot of packages to update, this can take a long time. +This might take some time if many packages need to be updated, but is typically +recommended to avoid issues resulting from outdated package versions. ## The `SingleCellExperiment` class ### Setup -First start by loading some libraries we'll be using: +We start by loading some libraries we'll be using: ```{r setup} library(SingleCellExperiment) library(MouseGastrulationData) ``` -It's normal to see lot of startup messages when loading these packages. 
+It is normal to see a lot of startup messages when loading these packages. ### Motivation and overview @@ -118,11 +119,15 @@ knitr::include_graphics("http://bioconductor.org/books/release/OSCA.intro/images :::: spoiler -### Before `SingleCellExperiment` +### Benefits of using the integrated `SingleCellExperiment` data container -Before `SingleCellExperiment`, coders working with single cell data would sometimes keep all of these components in separate objects e.g. a matrix of counts, a data.frame of sample metadata, a data.frame of gene annotations and so on. There were two main disadvantages to this sort of "from scratch" approach: +The complexity of the `SingleCellExperiment` container might be a little bit intimidating in the beginning. +One might be tempted to use a simpler approach by just keeping all of these components in separate objects, +e.g. a `matrix` of counts, a `data.frame` of sample metadata, a `data.frame` of gene annotations, and so on. -1. Tons of book-keeping. If you performed a QC step that removed dead cells, you also had to remember to remove that same set of cells from the cell-wise metadata. Un-expressed genes were dropped? Don't forget to filter the gene metadata table too. +There are two main disadvantages to this "from-scratch" approach: + +1. It requires a substantial amount of manual bookkeeping to keep the different data components in sync. If you performed a QC step that removed dead cells from the count matrix, you also had to remember to remove that same set of cells from the cell-wise metadata. Did you filter out genes that did not display sufficient expression levels to be retained for further analysis? Then you would need to remember to filter the gene metadata table too. 2. All the downstream steps had to be "from scratch" as well. All the data munging, analysis, and visualization code had to be customized to the idiosyncrasies of a given input set. 
:::: @@ -141,11 +146,11 @@ Depending on the object, slots can contain different types of data (e.g., numeri :::: challenge -Try to get the data for a different sample from `WTChimeraData` (other than the fifth one). +Get the data for a different sample from `WTChimeraData` (other than the fifth one). ::: solution -Here we assign the sixth sample to `sce6`: +Here we obtain the sixth sample and assign it to `sce6`: ```{r, message = FALSE, warning=FALSE, eval=FALSE} sce6 <- WTChimeraData(samples = 6) @@ -167,7 +172,11 @@ names(assays(sce)) counts(sce)[1:3, 1:3] ``` -You will notice that in this case we have a sparse matrix of class "dgTMatrix" inside the object. More generally, any "matrix-like" object can be used, e.g., dense matrices or HDF5-backed matrices (see the "Working with large data" episode). +You will notice that in this case we have a sparse matrix of class `dgTMatrix` +inside the object. More generally, any "matrix-like" object can be used, e.g., +dense matrices or HDF5-backed matrices (as we will explore later in the +[Working with large data](https://ccb-hms.github.io/osca-workbench/large_data.html) +episode). ### `colData` and `rowData` @@ -181,7 +190,7 @@ colData(sce)[1:3, 1:4] rowData(sce)[1:3, 1:2] ``` -You can access columns of the colData with the `$` accessor to quickly add cell-wise metadata to the colData +You can access columns of the colData with the `$` accessor to quickly add cell-wise metadata to the `colData`. ```{r} sce$my_sum <- colSums(counts(sce)) @@ -191,17 +200,18 @@ colData(sce)[1:3,] :::: challenge -Try to add a column of gene-wise metadata to the rowData. +Add a column of gene-wise metadata to the `rowData`. ::: solution -Here we add a column called "conservation" that is just an integer sequence from 1 to the number of genes. +Here, we add a column named `conservation` that could represent an evolutionary conservation score. 
```{r} -rowData(sce)$conservation = rnorm(nrow(sce)) +rowData(sce)$conservation <- rnorm(nrow(sce)) ``` -This is just a made-up example with a simple sequence of numbers, but in practice its convenient to store any sort of gene-wise information in the columns of the rowData. +This is just an example for demonstration purposes, but in practice it is convenient +and simplifies data management to store any sort of gene-wise information in the columns of the `rowData`. ::: @@ -209,7 +219,7 @@ This is just a made-up example with a simple sequence of numbers, but in practic ### The `reducedDims` -Everything that we have described so far (except for the `counts` getter) is part of the `SummarizedExperiment` class that SingleCellExperiment extends. You can find a complete lesson on the `SummarizedExperiment` class in [Introduction to data analysis with R and Bioconductor](https://carpentries-incubator.github.io/bioc-intro/60-next-steps.html) course. +Everything that we have described so far (except for the `counts` getter) is part of the `SummarizedExperiment` class that `SingleCellExperiment` extends. You can find a complete lesson on the `SummarizedExperiment` class in the [Introduction to data analysis with R and Bioconductor](https://carpentries-incubator.github.io/bioc-intro/60-next-steps.html) course. One peculiarity of `SingleCellExperiment` is its ability to store reduced dimension matrices within the object. These may include PCA, t-SNE, UMAP, etc. @@ -233,7 +243,7 @@ plotReducedDim(sce, "pca.corrected.E8.5", colour_by = "celltype.mapped") #### Exercise 1 -Create a `SingleCellExperiment` object: Try and create a `SingleCellExperiment` object "from scratch". Start from a `matrix` (either randomly generated or with some fake data in it) and add one or more columns as `colData`. +Create a `SingleCellExperiment` object "from scratch". That means: start from a `matrix` (either randomly generated or with some fake data in it) and add one or more columns as `colData`. 
:::::::::::::: hint @@ -262,7 +272,7 @@ my_sce #### Exercise 2 -Combining two objects: The `MouseGastrulationData` package contains several datasets. Download sample 6 of the chimera experiment by running `sce6 <- WTChimeraData(samples=6)`. Use the `cbind` function to combine the new data with the `sce` object created before. +Combine two `SingleCellExperiment` objects. The `MouseGastrulationData` package contains several datasets. Download sample 6 of the chimera experiment. Use the `cbind` function to combine the new data with the `sce` object created before. ::: solution diff --git a/index.md b/index.md index 15b10ba..0bb13d7 100644 --- a/index.md +++ b/index.md @@ -8,21 +8,18 @@ in individual cells has become routine. Consequently, a plethora of tools for the analysis of single-cell data has been developed, making it hard to understand the critical steps in the analysis workflow and the best methods for each objective of one’s study. -This tutorial aims to provide a solid foundation in using [Bioconductor](https://bioconductor.org) +This [Carpentries-style](https://carpentries.org/) tutorial aims to provide a +solid foundation in using [Bioconductor](https://bioconductor.org) tools for single-cell RNA-seq (scRNA-seq) analysis by walking through various steps of typical workflows using example datasets. This tutorial is based on the the online book "Orchestrating Single-Cell Analysis with Bioconductor" ([OSCA](https://bioconductor.org/books/release/OSCA/)), -started in 2018 and continuously updated by many contributors from the Bioconductor community. +[published in 2020](https://doi.org/10.1038%2Fs41592-019-0654-x), +and continuously updated by many contributors from the Bioconductor community. Like the book, this tutorial strives to be of interest to the experimental biologists wanting to analyze their data and to the bioinformaticians approaching single-cell data. -This is a new lesson built with [The Carpentries Workbench][workbench]. 
- - -[workbench]: https://carpentries.github.io/sandpaper-docs - :::::::::::::::::::::::::::::::::::::::::: prereq ## Prerequisites @@ -30,10 +27,20 @@ This is a new lesson built with [The Carpentries Workbench][workbench]. - Familiarity with R/Bioconductor, such as the [Introduction to data analysis with R and Bioconductor](https://carpentries-incubator.github.io/bioc-intro/) lesson. +- Familiarity with multivariate analysis and dimensionality reduction, such as +[Chapter 7](https://www.huber.embl.de/msmb/07-chap.html) of the book +*Modern Statistics for Modern Biology* by Holmes and Huber. - Familiarity with the biology of gene expression and scRNA-seq, such as the review article [A practical guide to single-cell RNA-sequencing](https://doi.org/10.1186/s13073-017-0467-4) by Haque et.al. :::::::::::::::::::::::::::::::::::::::::::::::::: +If you use materials of this lesson in published research, please cite: + +Amezquita RA, Lun ATL, Becht E, Carey VJ, Carpp LN, Geistlinger L, Marini F, +Rue-Albrecht K, Risso D, Soneson C, Waldron L, Pagès H, Smith ML, Huber W, +Morgan M, Gottardo R, Hicks SC. Orchestrating single-cell analysis with +Bioconductor. *Nature Methods*, 2020. +doi: [10.1038/s41592-019-0654-x](https://doi.org/10.1038/s41592-019-0654-x)