From d635f6dfde4364b282f129e83b1c4abba192ff67 Mon Sep 17 00:00:00 2001
From: csmagnano
Date: Wed, 21 Feb 2024 10:27:51 -0500
Subject: [PATCH] Scaling back - removing HCA lesson temporarily

---
 config.yaml      |   1 -
 episodes/hca.Rmd | 395 -----------------------------------------------
 2 files changed, 396 deletions(-)
 delete mode 100644 episodes/hca.Rmd

diff --git a/config.yaml b/config.yaml
index 78dfe50..65d6cd1 100644
--- a/config.yaml
+++ b/config.yaml
@@ -65,7 +65,6 @@ episodes:
 - cell_type_annotation.Rmd
 - multi-sample.Rmd
 - large_data.Rmd
-- hca.Rmd
 
 # Information for Learners
 learners:
diff --git a/episodes/hca.Rmd b/episodes/hca.Rmd
deleted file mode 100644
index 1b8e990..0000000
--- a/episodes/hca.Rmd
+++ /dev/null
@@ -1,395 +0,0 @@
----
-title: Accessing data from the Human Cell Atlas (HCA)
-teaching: 10 # Minutes of teaching in the lesson
-exercises: 2 # Minutes of exercises in the lesson
----
-
-:::::::::::::::::::::::::::::::::::::: questions
-
-- TODO
-
-::::::::::::::::::::::::::::::::::::::::::::::::
-
-::::::::::::::::::::::::::::::::::::: objectives
-
-- TODO
-
-::::::::::::::::::::::::::::::::::::::::::::::::
-
-```{r load-styles, include=FALSE}
-library(BiocStyle)
-```
-
-# HCA Project
-
-The Human Cell Atlas (HCA) is a large project that aims to learn from and map
-every cell type in the human body. The project extracts spatial and molecular
-characteristics in order to understand cellular function and networks. It is an
-international collaborative that charts healthy cells in the human body at all
-ages. There are about 37.2 trillion cells in the human body. To read more about
-the project, head over to their website at https://www.humancellatlas.org.
-
-# CELLxGENE
-
-CELLxGENE is a database and a suite of tools that help scientists to find,
-download, explore, analyze, annotate, and publish single cell data. It includes
-several analytic and visualization tools to help you to discover single cell
-data patterns. To see the list of tools, browse to
-https://cellxgene.cziscience.com/.
-
-# CELLxGENE | Census
-
-The Census provides efficient computational tooling to access, query, and
-analyze all single-cell RNA data from CZ CELLxGENE Discover. Using a new access
-paradigm of cell-based slicing and querying, you can interact with the data
-through TileDB-SOMA, or get slices in AnnData or Seurat objects, thus
-accelerating your research by significantly minimizing data harmonization at
-https://chanzuckerberg.github.io/cellxgene-census/.
-
-# The CuratedAtlasQueryR Project
-
-To systematically characterize the immune system across tissues, demographics
-and multiple studies, single cell transcriptomics data was harmonized from the
-CELLxGENE database. Data from 28,975,366 cells that cover 156 tissues (excluding
-cell cultures), 12,981 samples, and 324 studies were collected. The metadata was
-standardized, including sample identifiers, tissue labels (based on anatomy) and
-age. Also, the gene-transcript abundance of all samples was harmonized by
-putting values on the positive natural scale (i.e. non-logarithmic).
-
-To model the immune system across studies, we adopted a consistent immune
-cell-type ontology appropriate for lymphoid and non-lymphoid tissues. We applied
-a consensus cell labeling strategy between the Seurat blueprint and Monaco
-[-@Monaco2019] to minimize biases in immune cell classification from
-study-specific standards.
-
-`CuratedAtlasQueryR` supports data access and programmatic exploration of the
-harmonized atlas. Cells of interest can be selected based on ontology, tissue of
-origin, demographics, and disease. For example, the user can select CD4 T helper
-cells across healthy and diseased lymphoid tissue. The data for the selected
-cells can be downloaded locally into popular single-cell data containers. Pseudo
-bulk counts are also available to facilitate large-scale, summary analyses of
-transcriptional profiles. This platform offers a standardized workflow for
-accessing atlas-level datasets programmatically and reproducibly.
-
-```{r,echo=FALSE}
-knitr::include_graphics(
-    "figures/HCA_sccomp_SUPPLEMENTARY_technical_cartoon_curatedAtlasQuery.png"
-)
-```
-
-# Data Sources in R / Bioconductor
-
-There are a few options to access single cell data with R / Bioconductor.
-
-| Package | Target | Description |
-|---------|-------------|---------|
-| [hca](https://bioconductor.org/packages/hca) | [HCA Data Portal API](https://www.humancellatlas.org/data-portal/) | Project, Sample, and File level HCA data |
-| [cellxgenedp](https://bioconductor.org/packages/cellxgenedp) | [CellxGene](https://cellxgene.cziscience.com/) | Human and mouse SC data including HCA |
-| [CuratedAtlasQueryR](https://stemangiola.github.io/CuratedAtlasQueryR/) | [CellxGene](https://cellxgene.cziscience.com/) | fine-grained query capable CELLxGENE data including HCA |
-
-# Installation
-
-```{r,eval=FALSE}
-if (!requireNamespace("BiocManager", quietly = TRUE))
-    install.packages("BiocManager")
-
-BiocManager::install("stemangiola/CuratedAtlasQueryR")
-```
-
-# Package load
-
-```{r,include=TRUE,results="hide",message=FALSE,warning=FALSE}
-library(CuratedAtlasQueryR)
-library(dplyr)
-```
-
-# HCA Metadata
-
-The metadata allows the user to get a lay of the land of what is available
-via the package. In this example, we are using the sample database URL which
-allows us to get a small and quick subset of the available metadata.
-
-```{r}
-metadata <- get_metadata(remote_url = CuratedAtlasQueryR::SAMPLE_DATABASE_URL)
-```
-
-Get a view of the first 10 columns in the metadata with `glimpse`
-
-```{r}
-metadata |>
-    select(1:10) |>
-    glimpse()
-```
-
-# A note on the piping operator
-
-The vignette materials provided by `CuratedAtlasQueryR` show the use of the
-'native' R pipe (implemented after R version `4.1.0`). For those not familiar
-with the pipe operator (`|>`), it allows you to chain functions by passing the
-left-hand side (LHS) to the first input (typically) on the right-hand side
-(RHS).
-
-In this example, we are extracting the `iris` data set from the `datasets`
-package and 'then' taking a subset where the sepal lengths are greater than 5
-and 'then' summarizing the data for each level in the `Species` variable with a
-`mean`. The pipe operator can be read as 'then'.
-
-```{r}
-data("iris", package = "datasets")
-
-iris |>
-    subset(Sepal.Length > 5) |>
-    aggregate(. ~ Species, data = _, mean)
-```
-
-# Summarizing the metadata
-
-For each distinct tissue and dataset combination, count the number of datasets
-by tissue type.
-
-```{r}
-metadata |>
-    distinct(tissue, dataset_id) |>
-    count(tissue)
-```
-
-# Columns available in the metadata
-
-```{r}
-head(names(metadata), 10)
-```
-
-# Available assays
-
-```{r}
-metadata |>
-    distinct(assay, dataset_id) |>
-    count(assay)
-```
-
-# Available organisms
-
-```{r}
-metadata |>
-    distinct(organism, dataset_id) |>
-    count(organism)
-```
-
-## Download single-cell RNA sequencing counts
-
-The data can be provided as either "counts" or counts per million "cpm" as given
-by the `assays` argument in the `get_single_cell_experiment()` function. By
-default, the `SingleCellExperiment` provided will contain only the 'counts'
-data.
-
-### Query raw counts
-
-```{r}
-single_cell_counts <-
-    metadata |>
-    dplyr::filter(
-        ethnicity == "African" &
-        stringr::str_like(assay, "%10x%") &
-        tissue == "lung parenchyma" &
-        stringr::str_like(cell_type, "%CD4%")
-    ) |>
-    get_single_cell_experiment()
-
-single_cell_counts
-```
-
-### Query counts scaled per million
-
-This is helpful if just few genes are of interest, as they can be compared
-across samples.
-
-```{r}
-metadata |>
-    dplyr::filter(
-        ethnicity == "African" &
-        stringr::str_like(assay, "%10x%") &
-        tissue == "lung parenchyma" &
-        stringr::str_like(cell_type, "%CD4%")
-    ) |>
-    get_single_cell_experiment(assays = "cpm")
-```
-
-### Extract only a subset of genes
-
-```{r}
-single_cell_counts <-
-    metadata |>
-    dplyr::filter(
-        ethnicity == "African" &
-        stringr::str_like(assay, "%10x%") &
-        tissue == "lung parenchyma" &
-        stringr::str_like(cell_type, "%CD4%")
-    ) |>
-    get_single_cell_experiment(assays = "cpm", features = "PUM1")
-
-single_cell_counts
-```
-
-### Extracting counts as a Seurat object
-
-If needed, the H5 `SingleCellExperiment` can be converted into a Seurat object.
-Note that it may take a long time and use a lot of memory depending on how many
-cells you are requesting.
-
-```{r,eval=FALSE}
-single_cell_counts <-
-    metadata |>
-    dplyr::filter(
-        ethnicity == "African" &
-        stringr::str_like(assay, "%10x%") &
-        tissue == "lung parenchyma" &
-        stringr::str_like(cell_type, "%CD4%")
-    ) |>
-    get_seurat()
-
-single_cell_counts
-```
-
-## Save your `SingleCellExperiment`
-
-### Saving as HDF5
-
-The recommended way of saving these `SingleCellExperiment` objects, if
-necessary, is to use `saveHDF5SummarizedExperiment` from the `HDF5Array`
-package.
-
-```{r, eval=FALSE}
-single_cell_counts |> saveHDF5SummarizedExperiment("single_cell_counts")
-```
-
-# Exercises
-
-
-
-
-
-
-
-
-:::::::::::::::::::::::::::::::::: challenge
-
-#### Exercise 1
-
-Use `count` and `arrange` to get the number of cells per tissue in descending
-order.
-
-:::::::::::::: solution
-
-```{r,eval=FALSE}
-metadata |>
-    count(tissue) |>
-    arrange(-n)
-```
-
-:::::::::::::::::::::::
-
-:::::::::::::::::::::::::::::::::::::::::::::
-
-:::::::::::::::::::::::::::::::::: challenge
-
-#### Exercise 2
-
-Use `dplyr`-isms to group by `tissue` and `cell_type` and get a tally of the
-highest number of cell types per tissue combination. What tissue has the most
-numerous type of cells?
-
-:::::::::::::: solution
-
-```{r,eval=FALSE}
-metadata |>
-    group_by(tissue, cell_type) |>
-    count() |>
-    arrange(-n)
-```
-:::::::::::::::::::::::
-
-:::::::::::::::::::::::::::::::::::::::::::::
-
-:::::::::::::::::::::::::::::::::: challenge
-
-#### Exercise 3
-
-Spot some differences between the `tissue` and `tissue_harmonised` columns.
-Use `count` to summarise.
-
-:::::::::::::: solution
-
-```{r}
-metadata |>
-    count(tissue) |>
-    arrange(-n)
-
-metadata |>
-    count(tissue_harmonised) |>
-    arrange(-n)
-```
-
-:::::::::::::::::::::::
-
-:::::::::::::::::::::::
-
-::: callout
-
-To see the full list of curated columns in the metadata, see the Details section
-in the `?get_metadata` documentation page.
-
-:::
-
-
-:::::::::::::::::::::::::::::::::: challenge
-
-#### Exercise 4
-
-Spot some differences between the `tissue` and `tissue_harmonised` columns.
-Now that we are a little familiar with navigating the metadata, let's obtain
-a `SingleCellExperiment` of 10X scRNA-seq counts of `cd8 tem` `lung` cells for
-females older than `80` with `COVID-19`. Note: Use the harmonized columns, where
-possible.
-
-:::::::::::::: solution
-
-```{r}
-metadata |>
-    dplyr::filter(
-        sex == "female" &
-        age_days > 80 * 365 &
-        stringr::str_like(assay, "%10x%") &
-        disease == "COVID-19" &
-        tissue_harmonised == "lung" &
-        cell_type_harmonised == "cd8 tem"
-    ) |>
-    get_single_cell_experiment()
-```
-
-:::::::::::::::::::::::
-
-:::::::::::::::::::::::
-
-
-::::::::::::::::::::::::::::::::::::: keypoints
-
-- TODO
-
-::::::::::::::::::::::::::::::::::::::::::::::::
-
-# Session Info
-
-```{r}
-sessionInfo()
-```
-
-# Acknowledgements
-
-Thank you to [Stefano Mangiola](https://github.com/stemangiola) and his team for
-developing
-[CuratedAtlasQueryR](https://github.com/stemangiola/CuratedAtlasQueryR) and
-graciously providing the content from their vignette. Make sure to keep an eye
-out for their publication for proper citation. Their bioRxiv paper can be found
-at .
-
-# References