diff --git a/episodes/hca.Rmd b/episodes/hca.Rmd index 242a168..ed42d2a 100644 --- a/episodes/hca.Rmd +++ b/episodes/hca.Rmd @@ -46,28 +46,15 @@ https://chanzuckerberg.github.io/cellxgene-census/. ## The CuratedAtlasQueryR Project -To systematically characterize the immune system across tissues, demographics -and multiple studies, single cell transcriptomics data was harmonized from the -CELLxGENE database. Data from 28,975,366 cells that cover 156 tissues (excluding -cell cultures), 12,981 samples, and 324 studies were collected. The metadata was -standardized, including sample identifiers, tissue labels (based on anatomy) and -age. Also, the gene-transcript abundance of all samples was harmonized by -putting values on the positive natural scale (i.e. non-logarithmic). - -To model the immune system across studies, we adopted a consistent immune -cell-type ontology appropriate for lymphoid and non-lymphoid tissues. We applied -a consensus cell labeling strategy between the Seurat blueprint and Monaco[^1] -to minimize biases in immune cell classification from -study-specific standards. +The `CuratedAtlasQueryR` is an alternative package that can also be used to access the CELLxGENE data from R through a tidy API. The data has also been harmonized, curated, and re-annotated across studies. `CuratedAtlasQueryR` supports data access and programmatic exploration of the harmonized atlas. Cells of interest can be selected based on ontology, tissue of origin, demographics, and disease. For example, the user can select CD4 T helper cells across healthy and diseased lymphoid tissue. The data for the selected -cells can be downloaded locally into popular single-cell data containers. Pseudo +cells can be downloaded locally into SingleCellExperiment objects. Pseudo bulk counts are also available to facilitate large-scale, summary analyses of -transcriptional profiles. This platform offers a standardized workflow for -accessing atlas-level datasets programmatically and reproducibly. +transcriptional profiles. ```{r, echo = FALSE} knitr::include_graphics("https://raw.githubusercontent.com/ccb-hms/osca-workbench/main/episodes/figures/curatedAtlasQuery.png") @@ -111,7 +98,8 @@ via the package. In this example, we are using the sample database URL which allows us to get a small and quick subset of the available metadata. ```{r, message = FALSE} -metadata <- get_metadata(remote_url = CuratedAtlasQueryR::SAMPLE_DATABASE_URL) +metadata <- get_metadata(remote_url = CuratedAtlasQueryR::SAMPLE_DATABASE_URL) |> + collect() ``` Get a view of the first 10 columns in the metadata with `glimpse` @@ -127,20 +115,23 @@ metadata |> The vignette materials provided by `CuratedAtlasQueryR` show the use of the 'native' R pipe (implemented after R version `4.1.0`). For those not familiar with the pipe operator (`|>`), it allows you to chain functions by passing the -left-hand side (LHS) to the first input (typically) on the right-hand side -(RHS). +left-hand side as the first argument to the function on the right-hand side. It is used extensively in the `tidyverse` dialect of R, especially within the [`dplyr` package](https://dplyr.tidyverse.org/). -In this example, we are extracting the `iris` data set from the `datasets` -package and 'then' taking a subset where the sepal lengths are greater than 5 -and 'then' summarizing the data for each level in the `Species` variable with a -`mean`. The pipe operator can be read as 'then'. +The pipe operator can be read as "and then". Thankfully, R doesn't care about whitespace, so it's common to start a new line after a pipe. Together these points enable users to "chain" complex sequences of commands into readable blocks. + +In this example, we start with the built-in `mtcars` dataset and then filter to rows where `cyl` is not equal to 4, and then compute the mean `disp` value by each unique `cyl` value. ```{r} -data("iris", package = "datasets") +mtcars |> + filter(cyl != 4) |> + summarise(avg_disp = mean(disp), + .by = cyl) +``` -iris |> - subset(Sepal.Length > 5) |> - aggregate(. ~ Species, data = _, mean) +This command is equivalent to the following: + +```{r eval=FALSE} +summarise(filter(mtcars, cyl != 4), mean_disp = mean(disp), .by = cyl) ``` ## Summarizing the metadata @@ -160,6 +151,16 @@ metadata |> head(names(metadata), 10) ``` +:::: challenge + +Glance over the full list of metadata column names. Do any other metadata columns jump out as interesting to you for your work? + +```{r eval=FALSE} +metadata |> names() |> sort() +``` + +:::: + ## Available assays ```{r} @@ -183,17 +184,23 @@ by the `assays` argument in the `get_single_cell_experiment()` function. By default, the `SingleCellExperiment` provided will contain only the 'counts' data. -#### Query raw counts +For the sake of demonstration, we'll focus this small subset of samples: -```{r, message = FALSE} -single_cell_counts <- - metadata |> - dplyr::filter( +```{r} +sample_subset = metadata |> + filter( ethnicity == "African" & stringr::str_like(assay, "%10x%") & tissue == "lung parenchyma" & stringr::str_like(cell_type, "%CD4%") - ) |> + ) +``` + + +#### Query raw counts + +```{r, message = FALSE} +single_cell_counts <- sample_subset |> get_single_cell_experiment() single_cell_counts @@ -205,27 +212,14 @@ This is helpful if just few genes are of interest, as they can be compared across samples. ```{r, message = FALSE} -metadata |> - dplyr::filter( - ethnicity == "African" & - stringr::str_like(assay, "%10x%") & - tissue == "lung parenchyma" & - stringr::str_like(cell_type, "%CD4%") - ) |> +sample_subset |> get_single_cell_experiment(assays = "cpm") ``` #### Extract only a subset of genes ```{r, message = FALSE} -single_cell_counts <- - metadata |> - dplyr::filter( - ethnicity == "African" & - stringr::str_like(assay, "%10x%") & - tissue == "lung parenchyma" & - stringr::str_like(cell_type, "%CD4%") - ) |> +single_cell_counts <- sample_subset |> get_single_cell_experiment(assays = "cpm", features = "PUM1") single_cell_counts @@ -238,14 +232,7 @@ Note that it may take a long time and use a lot of memory depending on how many cells you are requesting. ```{r,eval=FALSE} -single_cell_counts <- - metadata |> - dplyr::filter( - ethnicity == "African" & - stringr::str_like(assay, "%10x%") & - tissue == "lung parenchyma" & - stringr::str_like(cell_type, "%CD4%") - ) |> +single_cell_counts <- sample_subset |> get_seurat() single_cell_counts @@ -295,8 +282,7 @@ numerous type of cells? ```{r,eval=FALSE} metadata |> - group_by(tissue, cell_type) |> - count() |> + count(tissue, cell_type) |> arrange(-n) ``` ::::::::::::::::::::::: @@ -341,7 +327,7 @@ possible. ```{r, message = FALSE} metadata |> - dplyr::filter( + filter( sex == "female" & age_days > 80 * 365 & stringr::str_like(assay, "%10x%") & @@ -362,4 +348,4 @@ metadata |> :::::::::::::::::::::::::::::::::::::::::::::::: -[^1]: [Monaco 2019](learners/reference.md#litref) + diff --git a/episodes/large_data.Rmd b/episodes/large_data.Rmd index b5ed2f5..00fa424 100644 --- a/episodes/large_data.Rmd +++ b/episodes/large_data.Rmd @@ -559,43 +559,34 @@ Use the function `system.time` to obtain the runtime of each job. :::::::::::::: solution -TODO -::::::::::::::::::::::: +```{r eval=FALSE} -::::::::::::::::::::::::::::::::::::::::::::: +sce.brain = logNormCounts(sce.brain) -:::::::::::::::::::::::::::::::::: challenge +system.time({i.out <- runPCA(sce.brain, ncomponents = 20, + BSPARAM = ExactParam(), + BPPARAM = SerialParam())}) -#### Exercise 3: Conversion to Seurat +system.time({i.out <- runPCA(sce.brain, ncomponents = 20, + BSPARAM = ExactParam(), + BPPARAM = MulticoreParam(workers = 2))}) -The [scRNAseq](https://bioconductor.org/packages/scRNAseq) -package provides gene-level counts for a collection of public scRNA-seq datasets, -stored as `SingleCellExperiment` objects with annotated cell- and gene-level metadata. -Consult the vignette of the [scRNAseq](https://bioconductor.org/packages/scRNAseq) -package to inspect all available datasets and select a dataset of your choice. -Convert the chosen dataset to a Seurat object and produce a PCA plot with cells -colored by a cell metadata column of your choice. +system.time({i.out <- runPCA(sce.brain, ncomponents = 20, + BSPARAM = ExactParam(), + BPPARAM = MulticoreParam(workers = 3))}) -:::::::::::::: hint - -Use Seurat's `DimPlot` function. +``` ::::::::::::::::::::::: -:::::::::::::: solution - -Use Seurat's `DimPlot` function. - -::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::: :::::::::::::: checklist ## Further Reading * OSCA book, [Chapter 14](https://bioconductor.org/books/release/OSCA.advanced/dealing-with-big-data.html): Dealing with big data * The `BiocParallel` `r Biocpkg("BiocParallel", vignette = "Introduction_To_BiocParallel.html", label = "intro vignette")`. - :::::::::::::: ::::::::::::::::::::::::::::::::::::: keypoints