diff --git a/.github/workflows/publish.yml b/.github/workflows/publish_both.yml similarity index 82% rename from .github/workflows/publish.yml rename to .github/workflows/publish_both.yml index 6062e05..bcf0e73 100644 --- a/.github/workflows/publish.yml +++ b/.github/workflows/publish_both.yml @@ -1,10 +1,7 @@ name: Quarto Publish on: - push: - branches: [ master ] - pull_request: - branches: [ master ] + workflow_dispatch: jobs: build-deploy: @@ -19,11 +16,15 @@ jobs: - name: Check out repository uses: actions/checkout@v4 - - name: Install rsync + - name: Install rsync and upgrade run: | - sudo apt-get update + sudo apt-get -y update sudo apt-get install -y rsync + # - name: Install alternatives for Python + # run: | + # sudo update-alternatives --install /usr/bin/python python /usr/bin/python3 100 + # - name: Install curl # run: | # sudo apt-get install libcurl4-openssl-dev @@ -43,13 +44,20 @@ jobs: run: quarto install tinytex # - name: Setup Python - # uses: actions/setup-python@v4 + # uses: actions/setup-python@v5 # with: - # python-version: '3.9' + # python-version: '3.10' # cache: 'pip' + - name: Test python version + run: | + python3 --version + - name: Install Python dependencies - run: pip install -r requirements.txt + run: pip3 install -r requirements.txt + + - name: install geniml + run: pip3 install geniml # - name: Setup R # uses: r-lib/actions/setup-r@v2 diff --git a/.github/workflows/publish_python.yml b/.github/workflows/publish_python.yml new file mode 100644 index 0000000..d0dbe4c --- /dev/null +++ b/.github/workflows/publish_python.yml @@ -0,0 +1,57 @@ +name: Quarto Publish + +on: + push: + branches: [ master ] + pull_request: + branches: [ master ] + +jobs: + build-deploy: + runs-on: ubuntu-latest + permissions: + contents: write + steps: + - name: Check out repository + uses: actions/checkout@v4 + + - name: Set up Quarto + uses: quarto-dev/quarto-actions/setup@v2 + with: + tinytex: true + + - name: Install Python and Dependencies + uses: actions/setup-python@v5 + with: + python-version: '3.11' + cache: 'pip' + - run: pip install -r requirements.txt + + # build SQLite from source, because I need 3.35<= + - name: Download SQLite3 + run: | + wget https://www.sqlite.org/2024/sqlite-autoconf-3450300.tar.gz + tar -xvf sqlite-autoconf-3450300.tar.gz + + - name: Install SQLite3 + run: | + cd sqlite-autoconf-3450300 + ./configure + make + sudo make install + export PATH="/usr/local/lib:$PATH" + cd .. + + - name: Render + uses: quarto-dev/quarto-actions/render@v2 + with: + to: html + env: + LD_LIBRARY_PATH: /usr/local/lib + + - name: Publish to GH Pages + if: github.ref == 'refs/heads/master' + uses: quarto-dev/quarto-actions/publish@v2 + with: + target: gh-pages # The branch the action should deploy to. + render: false diff --git a/.gitignore b/.gitignore index 78ec4c4..324eab1 100644 --- a/.gitignore +++ b/.gitignore @@ -8,4 +8,5 @@ chapters/zilinois_lung_with_celltypist/ *whee.h5 *.tiledb *_cache -*_files \ No newline at end of file +*_files +tutorials/cache/*`` \ No newline at end of file diff --git a/README.md b/README.md index b573793..ec4d43f 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,4 @@ -## BiocPy tutorial +## Bioc 2024 workshop tutorial This is a reproducible Quarto book with reusable snippets. If you're new to Quarto books, you can explore more about them [here](https://quarto.org/docs/books). @@ -6,8 +6,18 @@ This is a reproducible Quarto book with reusable snippets. If you're new to Quar To get started locally, follow these steps: -- Install quarto-cli. -- Install the necessary packages listed in requirements.txt. +- Install [quarto-cli](https://quarto.org/docs/get-started/). +- Install the necessary packages listed in [requirements.txt](./requirements.txt) and [rpackages.R](rpackages.R). + +```shell +pip install -r requirements.txt +Rscript rpackages.R +``` + - Run quarto preview to view the HTML version of the site. +```shell +qurto preview +``` + Take advantage of GitHub actions, which are available to automatically publish the tutorial book whenever changes are made on the **master** branch. \ No newline at end of file diff --git a/_quarto.yml b/_quarto.yml index 4d45764..554c8e4 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -14,8 +14,9 @@ website: - index.qmd - section: "Tutorials" contents: + - tutorials/genomic_ranges.qmd - tutorials/annotate_cell_types.qmd - - tutorials/sessioninfo + - tutorials/sessioninfo.qmd # navbar: # left: diff --git a/assets/single-cell-space.jpg b/assets/single-cell-space.jpg new file mode 100644 index 0000000..3a0e3a5 Binary files /dev/null and b/assets/single-cell-space.jpg differ diff --git a/index.qmd b/index.qmd index 9c2953f..2f978db 100644 --- a/index.qmd +++ b/index.qmd @@ -1,13 +1,13 @@ # Welcome Welcome to our workshop on exploring the data structures and packages -available in [BiocPy](https://github.com/biocpy), a project that brings -the power of Bioconductor to Python. +available in [BiocPy](https://github.com/biocpy), a project that aims +to facilitate Bioconductor workflows in Python. In this workshop, we will focus on interoperability between R and Python, covering two main topics: - Reading an RDS file containing `GenomicRanges` and performing downstream range-based analyses. -- Annotating single-cell RNA-seq data analysis using the [scrnaseq](https://github.com/biocpy/scrnaseq) package. +- Annotating single-cell RNA-seq data using the [scrnaseq](https://github.com/biocpy/scrnaseq) package. Attendees will learn how to represent and manipulate their datasets in Python in the same manner as in R/Bioconductor. @@ -20,6 +20,8 @@ on [GitHub](https://github.com/BiocPy). - [Jayaram Kancherla](https://github.com/jkanche) - [Aaron Lun](https://github.com/LTLA) +Always looking for more contributions from the community to improve our packages! Checkout the issues or discussion in our GitHub organization. + ---- ## Other resources @@ -29,4 +31,4 @@ on [GitHub](https://github.com/BiocPy). ## Developer notes -This is a reproducible Quarto book with reusable snippets. To learn more about Quarto books visit . Check out [Reproduce me](./chapters/sessioninfo.qmd) for more information. \ No newline at end of file +This is a reproducible Quarto book with reusable snippets. To learn more about Quarto books visit . Check out [Session Info](./chapters/sessioninfo.qmd) for more information. \ No newline at end of file diff --git a/requirements.txt b/requirements.txt index 2f00338..f248bea 100644 --- a/requirements.txt +++ b/requirements.txt @@ -2,9 +2,9 @@ biocframe>=0.5.0 summarizedexperiment>=0.4.3 singlecellexperiment>=0.4.2 multiassayexperiment>=0.4.2 -genomicranges +genomicranges>=0.4.28 biocutils>=0.1.5 -rds2py>=0.4.0 +rds2py>=0.4.3 scranpy singler>=0.2.0 numpy @@ -20,4 +20,4 @@ celldex scrnaseq anndata matplotlib -scanpy \ No newline at end of file +geniml diff --git a/rpackages.R b/rpackages.R index 1c751eb..bbab4b5 100644 --- a/rpackages.R +++ b/rpackages.R @@ -1,3 +1,6 @@ install.packages(c("BiocManager"), repos='http://cran.us.r-project.org') library(BiocManager) -BiocManager::install(c("scRNAseq", "celldex", "SingleR", "scuttle", "reticulate", "rmarkdown", "knitr", "downlit", "xml2", "ggplot2", "edgeR")) \ No newline at end of file +BiocManager::install( + c("scRNAseq", "celldex", "SingleR", "scuttle", "reticulate", + "rmarkdown", "knitr", "downlit", "xml2", "ggplot2", "edgeR", + "AnnotationHub", "TxDb.Hsapiens.UCSC.hg38.refGene")) \ No newline at end of file diff --git a/tutorials/annotate_cell_types.qmd b/tutorials/annotate_cell_types.qmd index 72e1afc..7e1a822 100644 --- a/tutorials/annotate_cell_types.qmd +++ b/tutorials/annotate_cell_types.qmd @@ -1,25 +1,22 @@ # Tutorial 2: Access single-cell datasets from `scRNAseq` collection and annotate cell types -Welcome to this tutorial on annotating single-cell datasets with reference collections. The **scRNAseq** ([R/Bioc](https://bioconductor.org/packages/devel/data/experiment/html/scRNAseq.html), [Python](https://github.com/BiocPy/scrnaseq)) package provides access to public single-cell RNA-seq datasets for use by other Bioconductor/BiocPy packages and workflows. These datasets are stored in language agnostic representations described in [ArtifactDB](https://github.com/artifactdb), enabling easy access to datasets and analysis results across multiple programming languages such as R and Python. We will showcase how to integrate and process single-cell datasets across languages, such as R and Python, and how to annotate cell types using reference datasets. +Welcome to this tutorial on annotating single-cell datasets with reference collections. The **scRNAseq** ([R/Bioc](https://bioconductor.org/packages/devel/data/experiment/html/scRNAseq.html), [Python](https://github.com/BiocPy/scrnaseq)) package provides access to public single-cell RNA-seq datasets for use by other Bioconductor/BiocPy packages and workflows. These datasets are stored in language-agnostic representations described in [ArtifactDB](https://github.com/artifactdb), enabling easy access to datasets and analysis results across multiple programming languages such as R and Python. We will showcase how to integrate and process single-cell datasets across languages, such as R and Python, and how to annotate cell types using reference datasets. ## Outline In this tutorial, you will learn how to: -- Install and set up BiocPy packages in your Python environment. -- Explore the `scrnaseq` package and access public single-cell RNA-seq datasets. -- Perform basic operations on `SingleCellExperiment` objects, the core data structure for single-cell data analysis. -- Annotate cell types using reference datasets from the `celldex` package. -- Understand the design principles behind BiocPy. - -Let's dive into the process! +1. Install and set up BiocPy packages in your Python environment. +2. Explore the `scrnaseq` package and access public single-cell RNA-seq datasets. +3. Perform basic operations on `SingleCellExperiment` objects, the core data structure for single-cell data. +4. Annotate cell types using reference datasets from the `celldex` package. ## Prerequisites Before we begin, please ensure that you have the following prerequisites installed: -- Python 3.8 or later with dependencies listed [here]([../requirements.txt](https://github.com/BiocPy/BiocWorkshop2024/blob/master/requirements.txt)). -- R 4.4.0 and Bioconductor packages listed [here]([../rpackages.R](https://github.com/BiocPy/BiocWorkshop2024/blob/master/rpackages.R)). +- Python 3.8 or later with dependencies listed [here](https://github.com/BiocPy/BiocWorkshop2024/blob/master/requirements.txt). +- R 4.4.0 and Bioconductor packages listed [here](https://github.com/BiocPy/BiocWorkshop2024/blob/master/rpackages.R). ## Installation @@ -27,15 +24,6 @@ Let's start by installing the required packages. ::: {.panel-tabset} -## R -```r -BiocManager::install(c("scRNAseq", "celldex", "SingleR"), - repos='http://cran.us.r-project.org') -``` - -This will install the `scRNAseq`, `celldex`, `SingleR`, packages from Bioconductor. - - ## Python (shell) ```sh @@ -44,25 +32,26 @@ pip install scrnaseq celldex singler This will install the `scrnaseq`, `celldex`, `singler` packages from PyPI. +## R +```r +BiocManager::install(c("scRNAseq", "celldex", "SingleR"), + repos='http://cran.us.r-project.org') +``` + +This will install the `scRNAseq`, `celldex`, `SingleR`, packages from Bioconductor. + ::: ## Accessing and Exploring Single-Cell Datasets -Now that we have the necessary packages installed, let's explore the `scrnaseq` package and learn how to access public single-cell RNA-seq datasets. Dataset published to the `scrnaseq` package is decorated with metadata such as the study title, species, number of cells, etc., to facilitate discovery. Let's see how we can list and search for datasets. +Now that we have the necessary packages installed, let's explore the `scrnaseq` package and learn how to access public single-cell RNA-seq datasets. Datasets published to the `scrnaseq` package are decorated with metadata such as the study title, species, number of cells, etc., to facilitate discovery. Let's see how we can list and search for datasets. ### List All Datasets -The `list_datasets()` function in Python or `surveyDatasets()` in R will display all available datasets published to the `scRNAseq` repository along with their metadata. +The `list_datasets()` function in Python or `surveyDatasets()` in R will display all available datasets published to the `scRNAseq` collection along with their metadata. ::: {.panel-tabset} -## R -```{r} -suppressMessages(library(scRNAseq)) -all_ds <- surveyDatasets() -head(all_ds[, c("name", "title", "version")], 3) -``` - ## Python ```{python} import scrnaseq @@ -70,9 +59,16 @@ datasets = scrnaseq.list_datasets() datasets[["name", "title", "version"]].head(3) ``` +## R +```r +suppressMessages(library(scRNAseq)) +all_ds <- surveyDatasets() +head(all_ds[, c("name", "title", "version")], 3) +``` + ::: -This R|Python code lists all available datasets in the scrnaseq package and displays their names, titles, and versions. +This R|Python code lists all available datasets in the `scrnaseq` package and displays their names, titles, and versions. ### Search for Datasets @@ -80,12 +76,6 @@ You can also search for datasets based on metadata using `search_datasets()` in ::: {.panel-tabset} -## R -```{r} -pancreas_ds <- searchDatasets("pancreas") -head(pancreas_ds[, c("name", "title", "version")], 3) -``` - ## Python ```{python} import scrnaseq @@ -94,31 +84,26 @@ pancreas_datasets = scrnaseq.search_datasets("pancreas") pancreas_datasets[["name", "title", "version"]].head(3) ``` +## R +```r +pancreas_ds <- searchDatasets("pancreas") +head(pancreas_ds[, c("name", "title", "version")], 3) +``` + ::: This R|Python code searches for datasets containing the term "pancreas" and displays their names, titles, and versions. #### Advanced Searches -For more complex searches involving boolean operations, use `define_text_query()` in Python or `defineTextQuery()` in R. Here’s an example to find datasets using the mouse reference genome (`GRCm38`) and containing the words `neuro` or `pancrea`. +For more complex searches involving boolean operations, use `define_text_query()` in Python or `defineTextQuery()` in R. Here's an example to find datasets using the mouse reference genome (`GRCm38`) and containing the words `neuro` or `pancrea`. ::: {.callout-tip} -The `define_text_query()` function in Python or its equivalent `defineTextQuery()` in R helps us define complex queries. Check out the reference manual for more details. +Check out the reference manual for more details and usage of these functions. ::: ::: {.panel-tabset} -## R -```{r} -suppressWarnings(library(gypsum)) -res <- searchDatasets( - defineTextQuery("GRCm38", field="genome") & - (defineTextQuery("neuro%", partial=TRUE) | - defineTextQuery("pancrea%", partial=TRUE)) -) -head(res[,c("name", "title", "version")], 3) -``` - ## Python ```{python} from gypsum_client import define_text_query @@ -134,9 +119,19 @@ res = scrnaseq.search_datasets( res[["name", "title", "version"]].head(3) ``` +## R +```r +suppressWarnings(library(gypsum)) +res <- searchDatasets( + defineTextQuery("GRCm38", field="genome") & + (defineTextQuery("neuro%", partial=TRUE) | + defineTextQuery("pancrea%", partial=TRUE)) +) +head(res[,c("name", "title", "version")], 3) +``` ::: -This R|Python code performs a complex search to find datasets using the mouse reference genome and containing the words "neuro" or "pancrea". +This R|Python code performs a complex search to find datasets tagged as "mouse" in the reference genome field and containing the keywords "neuro" or "pancrea". ::: {.callout-important} Once a dataset is identified, always list the name and version of the dataset in your scripts for reproducibility. @@ -154,107 +149,120 @@ For this tutorial, let's download the `zeisel-brain` dataset: ::: {.panel-tabset} -## R -```{r} -sce <- fetchDataset("zeisel-brain-2015", "2023-12-14", realize.assays=TRUE) -sce -``` - ## Python ```{python} import scrnaseq -sce = scrnaseq.fetch_dataset("zeisel-brain-2015", "2023-12-14", realize_assays=True) +sce = scrnaseq.fetch_dataset("zeisel-brain-2015", "2023-12-14") print(sce) ``` + +## R +```r +sce <- fetchDataset("zeisel-brain-2015", "2023-12-14") +sce +``` + ::: -### Side-quest on SingleCellExperiment in Python +### Side-quest on `SingleCellExperiment` in Python -The Python implementation of the `SingleCellExperiment` class adheres to Bioconductor's specification and offers similar interface and methods. Our goal is to make it simple for analysts to switch between R and Python. Key differences include a shift from functional to object-oriented paradigms. +The Python implementation of the `SingleCellExperiment` class adheres to Bioconductor's specification and offers similar interface and methods. Our goal is to make it simple for analysts to switch between R and Python. A key difference is the shift from functional to an object-oriented paradigm. ::: {.callout-note} For more details on the design, refer to the [BiocPy developer guide](https://github.com/BiocPy/developer_guide) or the [singlecellexperiment](https://github.com/BiocPy/SingleCellExperiment) documentation. ::: -This Python code demonstrates basic operations on a `SingleCellExperiment` object, including retrieving assay names, column names, column metadata, accessing counts, and coercing to an `AnnData` object. +This Python code demonstrates basic operations on a `SingleCellExperiment` object, including retrieving assay names, column names, column metadata, accessing counts, and coercing to an `AnnData` object for interoperability with existing analysis ready eco-systems in Python. -```{python} -## repeated because quarto's build does not keep state of python snippets across the notebook. -import scrnaseq -sce = scrnaseq.fetch_dataset("zeisel-brain-2015", "2023-12-14") +::: {.panel-tabset} + +## Python +```{python} print("assays: ", sce.get_assay_names()) # or sce.assay_names print("column names: ", sce.get_column_names()) # or sce.column_names -print("column metadata:", sce.get_column_data()) # or sce.column_data +print("column metadata: ", sce.get_column_data()) # or sce.column_data + +print("access counts: ", sce.assays["counts"]) # or # sce.assay("counts") +``` + +:::{.callout-note} +The package uses [delayed arrays](https://github.com/biocpy/delayedarray), to load file-backed arrays and matrices. This reduces memory usage when loading large datasets. This package provides similar functionality to the R/Bioconductor's [DelayedArray](https://www.bioconductor.org/packages/release/bioc/html/DelayedArray.html) eco-system. +::: + +```{python} +from delayedarray import to_scipy_sparse_matrix +print("counts as csr: ") +print(repr(to_scipy_sparse_matrix(sce.assays["counts"], "csc"))) +``` + +or realize the entire matrix when loaded from disk, -print("access counts ", sce.assays["counts"]) # or # sce.assay("counts") +```{python} +sce = scrnaseq.fetch_dataset( + "zeisel-brain-2015", "2023-12-14", + realize_assays=True) +print(sce) +``` -print("coerce to AnnData", sce.to_anndata()) +We also provide coercions to various package to take advantage of methods in the Python ecosystem, e.g. scverse and AnnData + +```{python} +print("coerce to AnnData: ", sce.to_anndata()) ``` +::: + ## Annotate Cell Types We can now annotate cell types by using reference datasets and matching cells based on their expression profiles. In this tutorial, we will use [singleR](https://github.com/SingleR-inc/SingleR) in R or its Python equivalent [singler](https://github.com/BiocPy/singler). -Before running the `singler` algorithm, we need to fetch reference datasets from the `celldex` package. +Before running the `singler` algorithm, we need to download an appropriate reference dataset from the `celldex` package. ### Access Reference Datasets from `celldex` -Similar to the `scRNAseq` package, the `celldex` package provides access to reference datasets in language-agnostic representations for use in downstream analyses. +Similar to the `scRNAseq` package, the `celldex` package provides access to the collection of reference expression datasets with curated cell type labels, for use in procedures like automated annotation of single-cell data or deconvolution of bulk RNA-seq to reference datasets. These datasets are also stored in language-agnostic representations for use in downstream analyses. ::: {.callout-note} The `celldex` package is available on [R/Bioconductor](https://bioconductor.org/packages/devel/data/experiment/html/celldex.html) and [PyPI](https://github.com/BiocPy/celldex). ::: -For this tutorial, let's download the [Immunological Genome Project](https://www.immgen.org/) reference from `celldex` using `fetch_reference()` in Python or `fetchReference()` in R. +For this tutorial, let's download the [Mouse RNA-seq](https://www.immgen.org/) reference from `celldex` using `fetch_reference()` in Python or `fetchReference()` in R. This reference consists of a collection of mouse bulk RNA-seq data sets downloaded from the gene expression omnibus ([Benayoun et al. 2019](https://doi.org/10.1101/gr.240093.118)). A variety of cell types are available, again mostly from blood but also covering several other tissues. ::: {.panel-tabset} -## R -```{r} -suppressWarnings(library(celldex)) -immgen_ref <- fetchReference("immgen", "2024-02-26", realize.assays=TRUE) -immgen_ref -``` - ## Python ```{python} import celldex -immgen_ref = celldex.fetch_reference("immgen", "2024-02-26", realize_assays=True) -print(immgen_ref) +mouse_rnaseq_ref = celldex.fetch_reference( + "mouse_rnaseq", "2024-02-26", + realize_assays=True) +print(mouse_rnaseq_ref) ``` +## R +```r +suppressWarnings(library(celldex)) +mouse_rnaseq_ref <- fetchReference("mouse_rnaseq", "2024-02-26", realize.assays=TRUE) +mouse_rnaseq_ref +``` ::: -Now, let's identify cells from the `zeisel-brain` dataset using the `immgen` reference dataset. +Now, let's identify cells from the `zeisel-brain` dataset using the `mouse_rnaseq` reference dataset. ::: {.panel-tabset} -## R -```{r} -suppressWarnings(library(SingleR)) -cell_labels <- SingleR(test = assay(sce, "counts"), ref = immgen_ref, labels = immgen_ref$label.main) - -table(cell_labels$labels) -``` - ## Python ```{python} import singler -import scrnaseq -sce = scrnaseq.fetch_dataset("zeisel-brain-2015", "2023-12-14", realize_assays=True) - -import celldex -immgen_ref = celldex.fetch_reference("immgen", "2024-02-26", realize_assays=True) - matches = singler.annotate_single( test_data=sce, - ref_data = immgen_ref, + ref_data = mouse_rnaseq_ref, ref_labels = "label.main" ) @@ -263,103 +271,81 @@ import pandas as pd pd.Series(matches["best"]).value_counts() ``` -Note: Since the python snippets use reticulate when built through Quarto, it does keep the history. Hence the code chunk -is longer. +## R +```r +suppressWarnings(library(SingleR)) +cell_labels <- SingleR(test = assay(sce, "counts"), ref = mouse_rnaseq_ref, labels = mouse_rnaseq_ref$label.main) + +table(cell_labels$labels) +``` + ::: -## Visualizing Single-Cell Data +## Analyze Single-cell RNA-seq datasets -I can't have a tutorial without a section on visualization or figures. +![single-cell-methods](../assets/single-cell-space.jpg) -TODO: generate embeddings and then visualize clusters +Aaron has implemented the single-cell methods from scran in C++. This allows us to reuse the same implementation in JS and develop applications for analyzing single-cell data ([Kana](https://github.com/kanaverse/kana)), or in Python through the [scranpy](https://github.com/BiocPy/scranpy) package. This avoids different interpretations of the analysis results by switching programming languages (Pachter et al, [The impact of package selection and versioning on single-cell RNA-seq analysis | bioRxiv](https://www.biorxiv.org/content/10.1101/2024.04.04.588111v1) ) -::: {.panel-tabset} +To analyze the dataset, -## R +```{python} +import scranpy -We will use the ggplot2 package in R to create visualizations. First, let's visualize the cell type annotations. +results = scranpy.analyze_sce(sce) -```{r} -suppressWarnings(library(SingleR)) -suppressWarnings(library(ggplot2)) -cell_labels <- SingleR(test = assay(sce, "counts"), ref = immgen_ref, labels = immgen_ref$label.main) -sce$labels <- cell_labels$labels - -ggplot(as.data.frame(colData(sce)), aes(x = labels)) + - geom_bar() + - theme(axis.text.x = element_text(angle = 45, hjust = 1)) + - labs(title = "Cell Type Annotations", x = "Cell Type", y = "Count") +# results is a complex object, lets explore the umap and tsne dimensions +print(results.tsne) ``` +### Seems like magic? -## Python +Running the `analyze_sce()` function uses the default parameters to run the single-cell workflow. If you want to customize or want to have fine-grained control on the analysis steps, set the parameter `dry_run=True`. -We will use the seaborn and matplotlib packages in Python to create visualizations. First, let's visualize the cell type annotations. +:::{.callout-note} +This prints out the exact series of steps the function does under the hood to perform the analysis. You can then use this to customize the analysis to your specific dataset or use case. +::: ```{python} -import seaborn as sns -import matplotlib.pyplot as plt -import pandas as pd -import singler - -import scrnaseq -sce = scrnaseq.fetch_dataset("zeisel-brain-2015", "2023-12-14", realize_assays=True) - -import celldex -immgen_ref = celldex.fetch_reference("immgen", "2024-02-26", realize_assays=True) - -matches = singler.annotate_single( - test_data=sce, - ref_data=immgen_ref, - ref_labels="label.main" -) - -cell_labels = pd.Series(matches["best"]).value_counts() - -sns.barplot(x=cell_labels.index, y=cell_labels.values) -plt.xticks(rotation=45, ha='right') -plt.title("Cell Type Annotations") -plt.xlabel("Cell Type") -plt.ylabel("Count") -plt.show() +print(scranpy.analyze_sce(sce, dry_run=True)) ``` +:::{.callout-tip} +Users can also run individual steps from the analysis without having to perform the full analysis, e.g. compute log normalized counts or find markers, etc. ::: -## Homework: Performing Differential Expression Analysis +## Visualize Results -Differential expression analysis helps identify genes that are differentially expressed between different cell types or conditions. Let's explore how to identify markers for various cell types. +I can't have a tutorial without a section on visualization or figures. -### Differential Expression Analysis in Python +We will use the seaborn and matplotlib packages in Python to create visualizations. We'll plot the t-SNE embedding and color the cells by their cluster assignments. -We will use the scanpy package in Python to perform differential expression analysis. +```{python} +import seaborn as sns +sns.scatterplot( + x=results.tsne.x, y=results.tsne.y, + hue=results.clusters, palette="deep" +) +``` -```python -import scanpy as sc +or the UMAP embedding with the cell types we identified from `celldex` -import scrnaseq -sce = scrnaseq.fetch_dataset("zeisel-brain-2015", "2023-12-14", realize_assays=True) -import celldex -immgen_ref = celldex.fetch_reference("immgen", "2024-02-26", realize_assays=True) - -import singler -matches = singler.annotate_single( - test_data=sce, - ref_data=immgen_ref, - ref_labels="label.main" +```{python} +import seaborn as sns +sns.scatterplot( + x=results.umap.x, y=results.umap.y, + hue=matches["best"][:3002], palette="deep" ) +``` -# Prepare the data -adata = sce.to_anndata() -adata.obs['labels'] = matches["best"] +:::{.callout-caution} +During the QC step, some cells were filtered, hence we filter the matches and this is incorrect (since we don't know which cells were filtered). -# Perform differential expression analysis -sc.tl.rank_genes_groups(adata, groupby='labels', method='t-test') -sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False) -``` +We'll leave this as an exercise for the reader to change the order of steps: 1) run the dataset through the QC step 2) filter cells, and then 3) annotate using singleR. +::: Congratulations! You have now completed the tutorial on accessing single-cell datasets using `scRNAseq` and `ArtifactDB`, and annotating cell types using reference datasets from `celldex`. For more detailed usage and advanced analyses, refer to the respective documentation of these packages. -By integrating R and Python workflows, you can leverage the strengths of both languages and perform comprehensive single-cell analysis. Keep exploring and happy analyzing! +By integrating R and Python workflows, you can leverage the strengths of both languages and perform comprehensive single-cell analysis. Keep exploring and happy analyzing! \ No newline at end of file diff --git a/tutorials/genomic_ranges.qmd b/tutorials/genomic_ranges.qmd new file mode 100644 index 0000000..27975ee --- /dev/null +++ b/tutorials/genomic_ranges.qmd @@ -0,0 +1,424 @@ +# Tutorial 1: `GenomicRanges` and range-based analyses + +Genomic range operations are fundamental to many bioinformatics analyses. They allow us to work with intervals of genomic coordinates, which is crucial for understanding the relationships between different genomic features such as genes, regulatory elements, and experimental data like ChIP-seq peaks. In this tutorial, we'll explore how to work with genomic interval data using BiocPy's [GenomicRanges](https://github.com/BiocPy/GenomicRanges/) package, which provides Python implementations similar to the R/Bioconductor [GenomicRanges package](https://github.com/Bioconductor/GenomicRanges). + +## Outline + +In this workshop, we'll walk through several key aspects of working with genomic ranges in Python: + +1. Reading Genomic Data: We'll start by reading in genomic data from RDS files, including exon positions grouped by transcripts. +2. Basic Genomic Operations: We'll cover fundamental operations like finding transcription start sites (TSS) and promoter regions. +3. Overlap Analysis: We'll learn how to find overlaps between different genomic features, a common task in many analyses. +4. Advanced Operations: We'll explore more complex operations like finding peaks within specific regions and resizing genomic intervals. + +## Prerequisites + +Before we begin, please ensure that you have the following packages installed: + +### Installation + +Let's start by installing the required packages for R and Python. + +::: {.panel-tabset} + +## Python (shell) + +You can install the Python packages using pip: + +```bash +pip install -U biocutils genomicranges rds2py numpy pandas geniml +``` + +## R + +```r +BiocManager::install(c("AnnotationHub"), + repos='http://cran.us.r-project.org') +``` + +::: + +## 1. Save Annotations as RDS + +Let's download the human reference genome and save the exon positions grouped by transcripts. +We need to do a bit of pre-processing to get this information. + +For the purpose of the tutorial, we'll limit the exons to chromosome 22. + +::: {.panel-tabset} + +## R + +```r +suppressMessages(library(AnnotationHub)) + +ah <- AnnotationHub() +ensdb <- query(ah, "Ensembl 112 EnsDb for Homo sapiens")[[1]] +exons_by_tx <- exonsBy(ensdb, + by = "tx", filter = SeqNameFilter(c("22")), + columns= c("exon_id", "tx_name", "tx_id", "gene_name", "gene_id")) +saveRDS(exons_by_tx, "hg38_exons_by_tx.rds") +``` + +::: + +## 2. Reading RDS files in Python + +The [rds2py](https://github.com/biocpy/rds2py) Python package allows us to read RDS files and create equivalent Python representations of R objects. Key features include: + +- Parsing common R objects into Python equivalents (e.g., matrices to NumPy arrays, data frames to Pandas DataFrames). +- Ability to read S4 classes, enabling direct parsing of Bioconductor data types from R to Python. + +Reading an RDS file with rds2py involves two steps: + +1. Parse the RDS file into a Python dictionary containing data, its attributes, and associated metadata. +2. Convert this dictionary into a suitable Python object using specific parser functions. + +This process allows a seamless transition between R and Python for bioinformatics analyses. + +::: {.panel-tabset} + +## Python +```{python} +from rds2py import read_rds +hg38_robject = read_rds("./hg38_exons_by_tx.rds") + +from rds2py.granges import as_granges_list +by_tx = as_granges_list(hg38_robject) + +print("Exons by transcript:") +print(by_tx) +``` + +::: + +## 2. Basic Genomic Operations + +Now, let's perform some basic operations like finding Transcription Start Sites (TSS) and promoter regions. These operations are fundamental in genomic analysis as they help us identify key regulatory regions of the genome. + +### 2.1 Create a `GenomicRangesList` by gene + +To identify TSS or define promoter regions, let's first reprocess the input to create a `GenomicRangesList` by gene symbols. + +::: {.panel-tabset} + +## Python + +To achieve this, we unlist the `GenomicRangesList` object. This is accomplished in Python using the `as_genomic_ranges()` method. + +```{python} +all_ranges = by_tx.as_genomic_ranges() +``` + +Then we split the object using the `gene_name` metadata column in `mcols()`. + +:::{.callout-important} +We provide accessors to get or set attributes of the class. Most folks in Python may be familiar with direct access to class members (via properties or @property), but this should generally be avoided, as it is too easy to perform modifications via one-liners with the class.property on the left-hand side of an assignment. + +For more information, please refer to our [developer guide](https://github.com/BiocPy/developer_guide). +::: + +:::{.callout-note} +While gene IDs are unique, gene symbols are not. In addition, this list has genes with no symbols. +::: + +```{python} +by_gene = all_ranges.split( + groups=all_ranges.get_mcols().get_column("gene_name") +) + +print("Exons by gene:") +print(by_gene) +``` +::: + +### 2.2 Finding Transcription Start Sites (TSS) + +Transcription Start Sites (TSS) are the locations where transcription of a gene begins. Identifying TSS is crucial for understanding gene regulation, as many regulatory elements are located near the TSS. + +::: {.panel-tabset} + +Let's use the `range()` method to get the full extent of each gene. + +## Python + +```{python} +ranges_by_gene = by_gene.range() + +print("Gene ranges:") +print(ranges_by_gene) +``` + +We convert the list to a `GenomicRanges` object. + +```{python} +gr_by_gene = ranges_by_gene.as_genomic_ranges() + +print("as GenomicRanges:") +print(gr_by_gene) +``` + +Then we resize to a width of 1 base pair at the start of each range to pinpoint the TSS. + +```{python} +tss = gr_by_gene.resize(width=1, fix="start") + +print("Transcript Start Sites:") +print(gr_by_gene) +``` +::: + +### 2.3 Defining Promoter Regions + +Here, we're defining promoters as the region 2000 base pairs upstream to 200 base pairs downstream of each TSS. This definition can vary depending on the specific analysis, but this range often captures important regulatory elements. + +::: {.panel-tabset} + +## Python +```{python} +promoters = tss.promoters(upstream=2000, downstream=200) + +print("Promoter Regions:") +print(promoters) +``` + +:::{.callout-note} +Please be aware that because gene symbols may not be unique, this GenomicRanges object might contain duplicates. You might want to resolve duplicate symbols by making the symbols unique. We will leave this as an exercise for the reader. +::: + +::: + +## 3. Overlap Analysis + +A common task in genomic analysis is finding overlaps between different genomic features. This helps us understand the relationships between various elements in the genome and can provide insights into gene regulation and function. + +### 3.1 Reading ChIP-seq Peaks + +ChIP-seq (Chromatin Immunoprecipitation followed by sequencing) is a method used to identify binding sites of DNA-associated proteins. The peaks represent regions where a protein of interest is likely bound to the DNA. We're focusing on chromosome 22 for this example to keep the dataset manageable. + +For the purpose of this tutorial, let's download a bed file containing peaks from a ChIP-seq experiment on "Human B cells" to identify "EZH2" binding sites (from ENCODE) and cataloged in [bedbase.org](https://bedbase.org/bed/be4054acf6e3feeb4dc490e6430e358e). + +::: {.panel-tabset} + +## Python +```{python} +from geniml.bbclient import BBClient + +bbclient = BBClient(cache_folder="cache", bedbase_api="https://api.bedbase.org") +bedfile_id = "be4054acf6e3feeb4dc490e6430e358e" +bedfile = bbclient.load_bed(bedfile_id) +peaks = bedfile.to_granges() + +filter_chr22 = [x == "chr22" for x in peaks.get_seqnames()] +peaks_chr22 = peaks[filter_chr22] + +print(peaks_chr22) +``` + +::: + +### 3.2 Finding Overlaps with TSS + +Here, we're identifying ChIP-seq peaks that overlap with TSS. This analysis can help us understand if the protein of interest tends to bind near the start of genes, which could suggest a role in transcription initiation. + +::: {.panel-tabset} + +## Python + +```{python} +overlaps = peaks_chr22.find_overlaps(tss) + +print("Peak indices that overlap with first 10 TSS:") +print(overlaps[:10]) +``` + +:::{.callout-note} +`find_overlaps` returns a `list` with the same length as TSS, indicating which indices from peaks overlap with each of the TSS. Ideally, we would want to return a `Hits` object similar to the Bioconductor implementation. + +**TODO: Future plans to convert this into a `Hits` object.** +::: + +Let's identify the peaks that overlap with TSS. + +```{python} +import itertools + +all_indices = list(set(itertools.chain.from_iterable(overlaps))) +peaks_by_tss = peaks_chr22[all_indices] +print(peaks_by_tss) +``` + +Instead, one can subset peaks that overlap with TSS using the `subset_by_overlaps` method: + +```{python} +peaks_by_tss2 = peaks_chr22.subset_by_overlaps(tss) +print(peaks_by_tss2) +``` + +Additionally, in some cases, we may want to ignore strand information (`ignore_strand=True`) when finding overlaps. + +```{python} +peaks_by_tss_ignoring_strand = peaks_chr22.subset_by_overlaps(tss, ignore_strand=True) +print(peaks_by_tss_ignoring_strand) +``` + +:::{.callout-note} +This yields the same results for this particular scenario, but may not if the 'peaks' contain strand information. +::: + +::: + +### 3.3 Finding Overlaps with Promoters + +This operation finds ChIP-seq peaks that overlap with our defined promoter regions. If a significant number of peaks fall within promoters, it might suggest that the protein plays a role in gene regulation through promoter binding. This kind of analysis is often used to characterize the binding patterns of transcription factors or other regulatory proteins. + +::: {.panel-tabset} + +## Python + +```{python} +peaks_by_promoters = peaks_chr22.subset_by_overlaps(promoters) + +print("Peaks Overlapping with Promoters:") +print(peaks_by_promoters) +``` + +::: + +### 3.4 Finding Overlaps with Exons + +Another analysis is to look at overlaps with all exons. This can help identify potential roles of the ChIP-seq peaks in splicing. Let's modify our analysis to look at all exons: + +::: {.panel-tabset} + +## Python + +```{python} +# Combine all exons into a single GenomicRanges object +all_exons = by_gene.as_granges() + +# Find peaks overlapping with any exon +peaks_by_exons = peaks_chr22.subset_by_overlaps(all_exons) + +print("Peaks overlapping with exons:") +print(peaks_by_exons) + +# Calculate the percentage of peaks that overlap with exons +percent_overlapping = (len(peaks_by_exons) / len(peaks_chr22)) * 100 + +print(f"Percentage of peaks overlapping with exons: {percent_overlapping:.2f}%") +``` + +::: + +This analysis can provide insights into whether the protein of interest (captured by the ChIP-seq) tends to bind within gene bodies, potentially influencing gene expression, splicing, or other co-transcriptional processes. + +## 4. Advanced Operations + +Let's explore some more complex operations that are often used in genomic analyses. + +### 4.1 Comparing Exonic vs. Intronic Binding + +Let's first identify intron regions. We will use the `by_gene` object we created that contains a `GenomicRangesList` split by gene. + +::: {.panel-tabset} + +## Python + +```{python} +# Create intronic regions (regions within genes but not in exons) +gene_ranges = by_gene.range().as_genomic_ranges() # Get the full extent of each gene +introns = gene_ranges.subtract(all_exons).as_granges() + +print("Intron regions:") +print(introns) +``` + +To gain further insight, we can compare the proportion of peaks overlapping with exons to those overlapping with introns: + +```{python} +# Find peaks overlapping with introns +peaks_by_introns = peaks_chr22.subset_by_overlaps(introns) + +print("Peaks overlapping with introns:") +print(peaks_by_introns) + +# Calculate percentages +percent_exonic = (len(peaks_by_exons) / len(peaks_chr22)) * 100 +percent_intronic = (len(peaks_by_introns) / len(peaks_chr22)) * 100 + +print(f"Percentage of peaks overlapping with exons: {percent_exonic:.2f}%") +print(f"Percentage of peaks overlapping with introns: {percent_intronic:.2f}%") +``` + +::: + +:::{.callout-note} +These percentages add up to over 100% because some peaks overlap both introns and exons, depending on how wide the peaks are. Ideally, you may want to filter the peaks based on preference as you annotate them with TSS, promoters, etc. +::: + +This comparison can help determine if the protein of interest shows a preference for binding in exonic or intronic regions, which could suggest different functional roles (e.g., splicing regulation for exonic binding vs. potential enhancer activity for intronic binding). + +### 4.2 Finding Overlaps with the first exon + +:::{.callout-note} +- This analysis is performed by transcript. +- The rationale for this analysis may vary, but we are mostly showcasing complex genomic operations that are possible with the package. +::: + +Let's first put together a `GenomicRanges` object containing the first exon for each transcript. + +```{python} +all_first = [] +for txid, grl in by_tx: + strand = grl.get_strand(as_type = "list")[0] + if strand == "-": + all_first.append(grl.sort()[-1]) + else: + all_first.append(grl.sort()[0]) +``` + +Then we combine all the individual genomic elements. The [biocutils](https://github.com/BiocPy/BiocUtils) package provides utilities for convenient aspects of R that aren't provided by base Python and generics. One of these generics is the `'combine'` operation that merges or concatenates various Bioconductor classes. + +```{python} +from biocutils import combine_sequences +first_exons = combine_sequences(*all_first) +``` + +We can now subset peaks that overlap with the first exon + +```{python} +peaks_with_first_exons = peaks_chr22.subset_by_overlaps(first_exons) +print(peaks_with_first_exons) +``` + +### 4.3 Resizing and Shifting Peaks + +```{python} +narrow_peaks = peaks_chr22.narrow(start=10, width=100) +shifted_peaks = narrow_peaks.shift(10) + +print("Narrowed and Shifted Peaks:") +print(shifted_peaks) +``` + +Resizing and shifting genomic ranges can be useful in various contexts. For example: + +- Narrowing peaks might help focus on the center of ChIP-seq binding sites. +- Shifting ranges can be used to look at regions adjacent to your features of interest. e.g. defining the predicted CRISPR cleavage site based on the position of the CRISPR gRNA sequence. + +These operations demonstrate the flexibility of genomic range manipulations, which can be useful for fine-tuning analyses or testing hypotheses about the spatial relationships between genomic features. + +## 5. Exercises + +1. Calculate the average width of the ChIP-seq peaks on chromosome 22. +2. Determine how many peaks overlap with CpG islands. +3. Compute the percentage of promoter regions that have at least one overlapping ChIP-seq peak. + +## Conclusion + +In this tutorial, we've explored how to use BiocPy's genomic ranges functionality to perform various genomic analyses. These tools and techniques provide a powerful way to work with genomic interval data in Python, mirroring the capabilities from Bioconductor. They form the foundation for many more complex genomic analyses and can be applied to a wide range of biological questions. + +:::{.callout-note} +Refer to the [BiocPy documentation](https://biocpy.github.io/) for more detailed information on these packages and their functionalities. +::: \ No newline at end of file diff --git a/tutorials/hg38_exons_by_tx.rds b/tutorials/hg38_exons_by_tx.rds new file mode 100644 index 0000000..16ff55c Binary files /dev/null and b/tutorials/hg38_exons_by_tx.rds differ diff --git a/tutorials/sessioninfo.qmd b/tutorials/sessioninfo.qmd index 5a03a28..18d6e2c 100644 --- a/tutorials/sessioninfo.qmd +++ b/tutorials/sessioninfo.qmd @@ -1,4 +1,4 @@ -# Reproduce me! {.unnumbered} +# Session Info! {.unnumbered} The code base for this repository is available at [https://github.com/BiocPy/tutorial](https://github.com/BiocPy/tutorial). @@ -16,25 +16,26 @@ from rich import print ## Python -Lets make sure we have all packages we need for this section +The Python version on the GitHub runner ```{python} -sys.version_info +print(sys.version_info) ``` ## Packages +Versions of packages installed for generating this build ```{python} import math -import biocframe -import biocutils import genomicranges import summarizedexperiment import singlecellexperiment import multiassayexperiment import rds2py +import celldex +import scrnaseq import session_info