Merge branch 'main' into upd_install
andrewGhazi authored Oct 7, 2024
2 parents 69269e7 + d87f1f0 commit 8f633b8
Showing 3 changed files with 86 additions and 34 deletions.
53 changes: 44 additions & 9 deletions README.md
Original file line number Diff line number Diff line change
@@ -8,14 +8,15 @@ the analysis of single-cell data has been developed, making it hard to understan
the critical steps in the analysis workflow and the best methods for each objective
of one’s study.

This tutorial aims to provide a solid foundation in using Bioconductor tools
for single-cell RNA-seq analysis by walking through various steps of typical
workflows using example datasets.
This [Carpentries-style](https://carpentries.org/) tutorial aims to provide a
solid foundation in using [Bioconductor](https://bioconductor.org) tools for
single-cell RNA-seq analysis by walking through various steps of typical workflows
using example datasets.

This tutorial uses as a "text-book" the online book "Orchestrating Single-Cell
Analysis with Bioconductor"
([OSCA](https://bioconductor.org/books/release/OSCA/)),
started in 2018 and continuously updated by many contributors from the Bioconductor
Analysis with Bioconductor" ([OSCA](https://bioconductor.org/books/release/OSCA/)),
[published in 2020](https://doi.org/10.1038%2Fs41592-019-0654-x),
and continuously updated by many contributors from the Bioconductor
community. Like the book, this tutorial strives to be of interest to the
experimental biologists wanting to analyze their data and to the bioinformaticians
approaching single-cell data.
@@ -35,11 +36,45 @@ In particular, participants will learn:
* How to correct for batch effects and integrate multiple samples.
* How to perform differential expression and differential abundance analysis between conditions.
* How to work with large out-of-memory datasets.
* How to interoperate with other popular single-cell analysis ecosystems.

## Source
## Other tools and tutorials for single-cell analysis

The focus of this tutorial is on single-cell analysis with R packages from the
[Bioconductor](https://bioconductor.org) repository. Bioconductor packages are
collaboratively developed by an international community of developers that agree
on data and software standards to promote interoperability between packages,
extensibility of analysis workflows, and reproducibility of published research.

Other popular tools for single-cell analysis include:

* [Seurat](https://satijalab.org/seurat/), a stand-alone R package that has
pioneered elementary steps of typical single-cell analysis workflows, and
* [scverse](https://scverse.org/), a collection of Python packages for single-cell
omics data analysis including [scanpy](https://scanpy.readthedocs.io) and
[scvi-tools](https://scvi-tools.org/).

This lesson uses [The Carpentries Workbench](https://carpentries.github.io/sandpaper-docs/) and is based on materials from the [OSCA tutorial at the ISMB 2023](https://bioconductor.github.io/ISMB.OSCA/).
Tutorials for working with these tools are available elsewhere and are not covered
in this tutorial. A demonstration of how to interoperate with `Seurat` and packages
from the `scverse` is given in [Session 5](https://ccb-hms.github.io/osca-workbench/large_data.html)
of this tutorial.

Other Carpentries-style tutorials for single-cell analysis with a different scope include:

- a [community-developed lesson](https://carpentries-incubator.github.io/scrna-seq-analysis/)
that makes use of command-line utilities and `scanpy` for basic preprocessing steps,
- and a [tutorial proposal](https://github.com/carpentries-incubator/proposals/issues/178)
based on `Seurat`.

## Source

As individual vignettes are converted into lessons, they can be added to `config.yaml` to be rendered and shown in the final Github Pages lesson.
This lesson uses [The Carpentries Workbench](https://carpentries.github.io/sandpaper-docs/)
and is based on materials from the [OSCA tutorial at the ISMB 2023](https://bioconductor.github.io/ISMB.OSCA/).

## Citation

Amezquita RA, Lun ATL, Becht E, Carey VJ, Carpp LN, Geistlinger L, Marini F,
Rue-Albrecht K, Risso D, Soneson C, Waldron L, Pagès H, Smith ML, Huber W,
Morgan M, Gottardo R, Hicks SC. Orchestrating single-cell analysis with
Bioconductor. *Nature Methods*, 2020.
doi: [10.1038/s41592-019-0654-x](https://doi.org/10.1038/s41592-019-0654-x)
46 changes: 28 additions & 18 deletions episodes/intro-sce.Rmd
@@ -31,7 +31,7 @@ options(digits = 3)

### Overview

Within the R ecosystem, the Bioconductor project provides tools for the analysis and comprehension of high-throughput genomics data.
Within the [R](https://www.r-project.org/) ecosystem, the [Bioconductor](https://bioconductor.org/) project provides tools for the analysis and comprehension of high-throughput genomics data.

Check warning on line 34 in episodes/intro-sce.Rmd (GitHub Actions / Build markdown source files if valid): [link text too short]: [R](https://www.r-project.org/)

The scope of the project covers microarray data, various forms of sequencing (RNA-seq, ChIP-seq, bisulfite, genotyping, etc.), proteomics, flow cytometry and more.
One of Bioconductor's main selling points is the use of common data structures to promote interoperability between packages, allowing code written by different people (from different organizations, in different countries) to work together seamlessly in complex analyses.

@@ -89,20 +89,21 @@ This will check for more recent versions of each package (within a Bioconductor
BiocManager::install()
```

Be careful: if you have a lot of packages to update, this can take a long time.
This might take some time if many packages need to be updated, but is typically
recommended to avoid issues resulting from outdated package versions.

## The `SingleCellExperiment` class

### Setup

First start by loading some libraries we'll be using:
We start by loading some libraries we'll be using:

```{r setup}
library(SingleCellExperiment)
library(MouseGastrulationData)
```

It's normal to see lot of startup messages when loading these packages.
It is normal to see a lot of startup messages when loading these packages.

### Motivation and overview

@@ -118,11 +119,15 @@ knitr::include_graphics("http://bioconductor.org/books/release/OSCA.intro/images

:::: spoiler

### Before `SingleCellExperiment`
### Benefits of using the integrated `SingleCellExperiment` data container

Before `SingleCellExperiment`, coders working with single cell data would sometimes keep all of these components in separate objects e.g. a matrix of counts, a data.frame of sample metadata, a data.frame of gene annotations and so on. There were two main disadvantages to this sort of "from scratch" approach:
The complexity of the `SingleCellExperiment` container might be a little bit intimidating in the beginning.
One might be tempted to use a simpler approach by just keeping all of these components in separate objects,
e.g. a `matrix` of counts, a `data.frame` of sample metadata, a `data.frame` of gene annotations, and so on.

1. Tons of book-keeping. If you performed a QC step that removed dead cells, you also had to remember to remove that same set of cells from the cell-wise metadata. Un-expressed genes were dropped? Don't forget to filter the gene metadata table too.
There are two main disadvantages to this "from-scratch" approach:

1. It requires a substantial amount of manual bookkeeping to keep the different data components in sync. If you performed a QC step that removed dead cells from the count matrix, you also had to remember to remove that same set of cells from the cell-wise metadata. Did you filter out genes that did not display sufficient expression levels to be retained for further analysis? Then you had to make sure to filter the gene metadata table too.
2. All the downstream steps had to be "from scratch" as well. All the data munging, analysis, and visualization code had to be customized to the idiosyncrasies of a given input set.
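
The bookkeeping problem from the first point can be sketched in base R. This is a minimal illustration under stated assumptions: the toy objects `counts`, `cell_meta`, and `keep` are hypothetical and not part of the lesson data.

```r
# Hypothetical toy data: 4 genes x 5 cells, kept in separate objects.
set.seed(1)
counts <- matrix(rpois(20, lambda = 5), nrow = 4,
                 dimnames = list(paste0("gene", 1:4), paste0("cell", 1:5)))
cell_meta <- data.frame(batch = c("a", "a", "b", "b", "b"),
                        row.names = colnames(counts))

# QC step: drop two cells from the count matrix only.
keep <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
counts_qc <- counts[, keep]

# The metadata still describes all five cells, so the objects are out of sync.
ncol(counts_qc) == nrow(cell_meta)  # FALSE

# The matching subset has to be applied by hand.
cell_meta_qc <- cell_meta[keep, , drop = FALSE]
identical(colnames(counts_qc), rownames(cell_meta_qc))  # TRUE
```

With a `SingleCellExperiment`, a single subsetting call such as `sce[, keep]` applies to the assays and the `colData` together, so this manual step is not needed.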

::::
@@ -141,11 +146,11 @@ Depending on the object, slots can contain different types of data (e.g., numeri

:::: challenge

Try to get the data for a different sample from `WTChimeraData` (other than the fifth one).
Get the data for a different sample from `WTChimeraData` (other than the fifth one).

::: solution

Here we assign the sixth sample to `sce6`:
Here we obtain the sixth sample and assign it to `sce6`:

```{r, message = FALSE, warning=FALSE, eval=FALSE}
sce6 <- WTChimeraData(samples = 6)
@@ -167,7 +172,11 @@ names(assays(sce))
counts(sce)[1:3, 1:3]
```

You will notice that in this case we have a sparse matrix of class "dgTMatrix" inside the object. More generally, any "matrix-like" object can be used, e.g., dense matrices or HDF5-backed matrices (see the "Working with large data" episode).
You will notice that in this case we have a sparse matrix of class `dgTMatrix`
inside the object. More generally, any "matrix-like" object can be used, e.g.,
dense matrices or HDF5-backed matrices (as we will explore later in the
[Working with large data](https://ccb-hms.github.io/osca-workbench/large_data.html)
episode).
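
To make the difference between dense and sparse "matrix-like" objects concrete, here is a small sketch using the `Matrix` package, which provides the `dgTMatrix` class. The toy matrix `dense` and its dimension names are made up for illustration and are not taken from `WTChimeraData`.

```r
library(Matrix)

# Hypothetical toy count matrix with mostly zeros (3 genes x 4 cells).
dense <- matrix(c(0, 0, 3,
                  0, 1, 0,
                  0, 0, 0,
                  2, 0, 0), nrow = 3,
                dimnames = list(paste0("gene", 1:3), paste0("cell", 1:4)))

# Triplet-format sparse representation: only non-zero entries are stored.
sparse <- as(dense, "TsparseMatrix")
class(sparse)

# Both representations hold the same values and support the same operations.
all(as.matrix(sparse) == dense)
colSums(sparse)
```

Because both objects respond to the same accessors and summary functions, code written against `counts(sce)` generally does not need to know which representation is stored in the assay.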

### `colData` and `rowData`

@@ -181,7 +190,7 @@ colData(sce)[1:3, 1:4]
rowData(sce)[1:3, 1:2]
```

You can access columns of the colData with the `$` accessor to quickly add cell-wise metadata to the colData
You can access columns of the colData with the `$` accessor to quickly add cell-wise metadata to the `colData`.

```{r}
sce$my_sum <- colSums(counts(sce))
@@ -191,25 +200,26 @@ colData(sce)[1:3,]

:::: challenge

Try to add a column of gene-wise metadata to the rowData.
Add a column of gene-wise metadata to the `rowData`.

::: solution

Here we add a column called "conservation" that is just an integer sequence from 1 to the number of genes.
Here, we add a column named `conservation` that could represent an evolutionary conservation score.

```{r}
rowData(sce)$conservation = rnorm(nrow(sce))
rowData(sce)$conservation <- rnorm(nrow(sce))
```

This is just a made-up example with a simple sequence of numbers, but in practice its convenient to store any sort of gene-wise information in the columns of the rowData.
This is just an example for demonstration purposes, but in practice it is convenient
and simplifies data management to store any sort of gene-wise information in the columns of the `rowData`.

:::

::::

### The `reducedDims`

Everything that we have described so far (except for the `counts` getter) is part of the `SummarizedExperiment` class that SingleCellExperiment extends. You can find a complete lesson on the `SummarizedExperiment` class in [Introduction to data analysis with R and Bioconductor](https://carpentries-incubator.github.io/bioc-intro/60-next-steps.html) course.
Everything that we have described so far (except for the `counts` getter) is part of the `SummarizedExperiment` class that `SingleCellExperiment` extends. You can find a complete lesson on the `SummarizedExperiment` class in the [Introduction to data analysis with R and Bioconductor](https://carpentries-incubator.github.io/bioc-intro/60-next-steps.html) course.

One peculiarity of `SingleCellExperiment` is its ability to store reduced dimension matrices within the object. These may include PCA, t-SNE, UMAP, etc.

@@ -233,7 +243,7 @@ plotReducedDim(sce, "pca.corrected.E8.5", colour_by = "celltype.mapped")

#### Exercise 1

Create a `SingleCellExperiment` object: Try and create a `SingleCellExperiment` object "from scratch". Start from a `matrix` (either randomly generated or with some fake data in it) and add one or more columns as `colData`.
Create a `SingleCellExperiment` object "from scratch". That means: start from a `matrix` (either randomly generated or with some fake data in it) and add one or more columns as `colData`.

:::::::::::::: hint

@@ -262,7 +272,7 @@ my_sce

#### Exercise 2

Combining two objects: The `MouseGastrulationData` package contains several datasets. Download sample 6 of the chimera experiment by running `sce6 <- WTChimeraData(samples=6)`. Use the `cbind` function to combine the new data with the `sce` object created before.
Combine two `SingleCellExperiment` objects. The `MouseGastrulationData` package contains several datasets. Download sample 6 of the chimera experiment. Use the `cbind` function to combine the new data with the `sce` object created before.

::: solution

21 changes: 14 additions & 7 deletions index.md
@@ -8,32 +8,39 @@ in individual cells has become routine. Consequently, a plethora of tools for
the analysis of single-cell data has been developed, making it hard to understand
the critical steps in the analysis workflow and the best methods for each objective of one’s study.

This tutorial aims to provide a solid foundation in using [Bioconductor](https://bioconductor.org)
This [Carpentries-style](https://carpentries.org/) tutorial aims to provide a
solid foundation in using [Bioconductor](https://bioconductor.org)
tools for single-cell RNA-seq (scRNA-seq) analysis by walking through various steps of
typical workflows using example datasets.

This tutorial is based on the online book "Orchestrating Single-Cell
Analysis with Bioconductor" ([OSCA](https://bioconductor.org/books/release/OSCA/)),
started in 2018 and continuously updated by many contributors from the Bioconductor community.
[published in 2020](https://doi.org/10.1038%2Fs41592-019-0654-x),
and continuously updated by many contributors from the Bioconductor community.
Like the book, this tutorial strives to be of interest to the experimental biologists
wanting to analyze their data and to the bioinformaticians approaching single-cell data.

This is a new lesson built with [The Carpentries Workbench][workbench].


[workbench]: https://carpentries.github.io/sandpaper-docs

:::::::::::::::::::::::::::::::::::::::::: prereq

## Prerequisites

- Familiarity with R/Bioconductor, such as the
[Introduction to data analysis with R and Bioconductor](https://carpentries-incubator.github.io/bioc-intro/)
lesson.
- Familiarity with multivariate analysis and dimensionality reduction, such as
[Chapter 7](https://www.huber.embl.de/msmb/07-chap.html) of the book
*Modern Statistics for Modern Biology* by Holmes and Huber.
- Familiarity with the biology of gene expression and scRNA-seq, such as the review article
[A practical guide to single-cell RNA-sequencing](https://doi.org/10.1186/s13073-017-0467-4) by Haque et al.


::::::::::::::::::::::::::::::::::::::::::::::::::

If you use materials of this lesson in published research, please cite:

Amezquita RA, Lun ATL, Becht E, Carey VJ, Carpp LN, Geistlinger L, Marini F,
Rue-Albrecht K, Risso D, Soneson C, Waldron L, Pagès H, Smith ML, Huber W,
Morgan M, Gottardo R, Hicks SC. Orchestrating single-cell analysis with
Bioconductor. *Nature Methods*, 2020.
doi: [10.1038/s41592-019-0654-x](https://doi.org/10.1038/s41592-019-0654-x)
