Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add exercises #21

Merged
merged 4 commits into from
Aug 26, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
106 changes: 46 additions & 60 deletions episodes/hca.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -46,28 +46,15 @@ https://chanzuckerberg.github.io/cellxgene-census/.

## The CuratedAtlasQueryR Project

To systematically characterize the immune system across tissues, demographics
and multiple studies, single cell transcriptomics data was harmonized from the
CELLxGENE database. Data from 28,975,366 cells that cover 156 tissues (excluding
cell cultures), 12,981 samples, and 324 studies were collected. The metadata was
standardized, including sample identifiers, tissue labels (based on anatomy) and
age. Also, the gene-transcript abundance of all samples was harmonized by
putting values on the positive natural scale (i.e. non-logarithmic).

To model the immune system across studies, we adopted a consistent immune
cell-type ontology appropriate for lymphoid and non-lymphoid tissues. We applied
a consensus cell labeling strategy between the Seurat blueprint and Monaco[^1]
to minimize biases in immune cell classification from
study-specific standards.
The `CuratedAtlasQueryR` is an alternative package that can also be used to access the CELLxGENE data from R through a tidy API. The data has also been harmonized, curated, and re-annotated across studies.

`CuratedAtlasQueryR` supports data access and programmatic exploration of the
harmonized atlas. Cells of interest can be selected based on ontology, tissue of
origin, demographics, and disease. For example, the user can select CD4 T helper
cells across healthy and diseased lymphoid tissue. The data for the selected
cells can be downloaded locally into popular single-cell data containers. Pseudo
cells can be downloaded locally into SingleCellExperiment objects. Pseudo
bulk counts are also available to facilitate large-scale, summary analyses of
transcriptional profiles. This platform offers a standardized workflow for
accessing atlas-level datasets programmatically and reproducibly.
transcriptional profiles.

```{r, echo = FALSE}
knitr::include_graphics("https://raw.githubusercontent.com/ccb-hms/osca-workbench/main/episodes/figures/curatedAtlasQuery.png")
Expand Down Expand Up @@ -111,7 +98,8 @@ via the package. In this example, we are using the sample database URL which
allows us to get a small and quick subset of the available metadata.

```{r, message = FALSE}
metadata <- get_metadata(remote_url = CuratedAtlasQueryR::SAMPLE_DATABASE_URL)
metadata <- get_metadata(remote_url = CuratedAtlasQueryR::SAMPLE_DATABASE_URL) |>
collect()
```

Get a view of the first 10 columns in the metadata with `glimpse`
Expand All @@ -127,20 +115,23 @@ metadata |>
The vignette materials provided by `CuratedAtlasQueryR` show the use of the
'native' R pipe (implemented after R version `4.1.0`). For those not familiar
with the pipe operator (`|>`), it allows you to chain functions by passing the
left-hand side (LHS) to the first input (typically) on the right-hand side
(RHS).
left-hand side as the first argument to the function on the right-hand side. It is used extensively in the `tidyverse` dialect of R, especially within the [`dplyr` package](https://dplyr.tidyverse.org/).

In this example, we are extracting the `iris` data set from the `datasets`
package and 'then' taking a subset where the sepal lengths are greater than 5
and 'then' summarizing the data for each level in the `Species` variable with a
`mean`. The pipe operator can be read as 'then'.
The pipe operator can be read as "and then". Thankfully, R doesn't care about whitespace, so it's common to start a new line after a pipe. Together these points enable users to "chain" complex sequences of commands into readable blocks.

In this example, we start with the built-in `mtcars` dataset and then filter to rows where `cyl` is not equal to 4, and then compute the mean `disp` value by each unique `cyl` value.

```{r}
data("iris", package = "datasets")
mtcars |>
filter(cyl != 4) |>
summarise(avg_disp = mean(disp),
.by = cyl)
```

iris |>
subset(Sepal.Length > 5) |>
aggregate(. ~ Species, data = _, mean)
This command is equivalent to the following:

```{r eval=FALSE}
summarise(filter(mtcars, cyl != 4), mean_disp = mean(disp), .by = cyl)
```

## Summarizing the metadata
Expand All @@ -160,6 +151,16 @@ metadata |>
head(names(metadata), 10)
```

:::: challenge

Glance over the full list of metadata column names. Do any other metadata columns jump out as interesting to you for your work?

```{r eval=FALSE}
metadata |> names() |> sort()
```

::::

## Available assays

```{r}
Expand All @@ -183,17 +184,23 @@ by the `assays` argument in the `get_single_cell_experiment()` function. By
default, the `SingleCellExperiment` provided will contain only the 'counts'
data.

#### Query raw counts
For the sake of demonstration, we'll focus this small subset of samples:

```{r, message = FALSE}
single_cell_counts <-
metadata |>
dplyr::filter(
```{r}
sample_subset = metadata |>
filter(
ethnicity == "African" &
stringr::str_like(assay, "%10x%") &
tissue == "lung parenchyma" &
stringr::str_like(cell_type, "%CD4%")
) |>
)
```


#### Query raw counts

```{r, message = FALSE}
single_cell_counts <- sample_subset |>
get_single_cell_experiment()

single_cell_counts
Expand All @@ -205,27 +212,14 @@ This is helpful if just few genes are of interest, as they can be compared
across samples.

```{r, message = FALSE}
metadata |>
dplyr::filter(
ethnicity == "African" &
stringr::str_like(assay, "%10x%") &
tissue == "lung parenchyma" &
stringr::str_like(cell_type, "%CD4%")
) |>
sample_subset |>
get_single_cell_experiment(assays = "cpm")
```

#### Extract only a subset of genes

```{r, message = FALSE}
single_cell_counts <-
metadata |>
dplyr::filter(
ethnicity == "African" &
stringr::str_like(assay, "%10x%") &
tissue == "lung parenchyma" &
stringr::str_like(cell_type, "%CD4%")
) |>
single_cell_counts <- sample_subset |>
get_single_cell_experiment(assays = "cpm", features = "PUM1")

single_cell_counts
Expand All @@ -238,14 +232,7 @@ Note that it may take a long time and use a lot of memory depending on how many
cells you are requesting.

```{r,eval=FALSE}
single_cell_counts <-
metadata |>
dplyr::filter(
ethnicity == "African" &
stringr::str_like(assay, "%10x%") &
tissue == "lung parenchyma" &
stringr::str_like(cell_type, "%CD4%")
) |>
single_cell_counts <- sample_subset |>
get_seurat()

single_cell_counts
Expand Down Expand Up @@ -295,8 +282,7 @@ numerous type of cells?

```{r,eval=FALSE}
metadata |>
group_by(tissue, cell_type) |>
count() |>
count(tissue, cell_type) |>
arrange(-n)
```
:::::::::::::::::::::::
Expand Down Expand Up @@ -341,7 +327,7 @@ possible.

```{r, message = FALSE}
metadata |>
dplyr::filter(
filter(
sex == "female" &
age_days > 80 * 365 &
stringr::str_like(assay, "%10x%") &
Expand All @@ -362,4 +348,4 @@ metadata |>

::::::::::::::::::::::::::::::::::::::::::::::::

[^1]: [Monaco 2019](learners/reference.md#litref)

35 changes: 13 additions & 22 deletions episodes/large_data.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -559,43 +559,34 @@ Use the function `system.time` to obtain the runtime of each job.

:::::::::::::: solution

TODO
:::::::::::::::::::::::
```{r eval=FALSE}

:::::::::::::::::::::::::::::::::::::::::::::
sce.brain = logNormCounts(sce.brain)

:::::::::::::::::::::::::::::::::: challenge
system.time({i.out <- runPCA(sce.brain, ncomponents = 20,
BSPARAM = ExactParam(),
BPPARAM = SerialParam())})

#### Exercise 3: Conversion to Seurat
system.time({i.out <- runPCA(sce.brain, ncomponents = 20,
BSPARAM = ExactParam(),
BPPARAM = MulticoreParam(workers = 2))})

The [scRNAseq](https://bioconductor.org/packages/scRNAseq)
package provides gene-level counts for a collection of public scRNA-seq datasets,
stored as `SingleCellExperiment` objects with annotated cell- and gene-level metadata.
Consult the vignette of the [scRNAseq](https://bioconductor.org/packages/scRNAseq)
package to inspect all available datasets and select a dataset of your choice.
Convert the chosen dataset to a Seurat object and produce a PCA plot with cells
colored by a cell metadata column of your choice.
system.time({i.out <- runPCA(sce.brain, ncomponents = 20,
BSPARAM = ExactParam(),
BPPARAM = MulticoreParam(workers = 3))})

:::::::::::::: hint

Use Seurat's `DimPlot` function.
```

:::::::::::::::::::::::

:::::::::::::: solution

Use Seurat's `DimPlot` function.

:::::::::::::::::::::::
:::::::::::::::::::::::::::::::::::::::::::::

:::::::::::::::::::::::

:::::::::::::: checklist
## Further Reading

* OSCA book, [Chapter 14](https://bioconductor.org/books/release/OSCA.advanced/dealing-with-big-data.html): Dealing with big data
* The `BiocParallel` `r Biocpkg("BiocParallel", vignette = "Introduction_To_BiocParallel.html", label = "intro vignette")`.

::::::::::::::

::::::::::::::::::::::::::::::::::::: keypoints
Expand Down
Loading