Update documentation and README (#6)

jkanche authored May 27, 2024
1 parent 5d2ebd2 commit 8ba06ad
Showing 8 changed files with 165 additions and 20 deletions.
135 changes: 133 additions & 2 deletions README.md
@@ -11,17 +11,148 @@
-->

[![Project generated with PyScaffold](https://img.shields.io/badge/-PyScaffold-005CA0?logo=pyscaffold)](https://pyscaffold.org/)
[![PyPI-Server](https://img.shields.io/pypi/v/scrnaseq.svg)](https://pypi.org/project/scrnaseq/)

# scrnaseq

> Add a short description here!
The `scRNAseq` package provides convenient access to several publicly available single-cell datasets in the form of [SingleCellExperiment](https://github.com/biocpy/singlecellexperiment) objects. Users can obtain a `SingleCellExperiment` and transform it into analysis-ready representations for immediate use.

To enable discovery, each dataset is decorated with metadata such as the study title/abstract, the species, the number of cells, etc. Users can also contribute their own published datasets to enable re-use by the wider Bioconductor/BiocPy community.

**Also check out the R version of this package, [scRNAseq](https://bioconductor.org/packages/devel/data/experiment/html/scRNAseq.html), published on Bioconductor.**

## Find Datasets

The `list_datasets()` function returns all available datasets along with their metadata. This can be used to discover interesting datasets for further analysis.

```python
import scrnaseq
datasets = scrnaseq.list_datasets()
```

This returns a pandas `DataFrame`, making it easy to filter for and then download datasets of interest.
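For instance, standard pandas operations can be used to narrow the table down. The sketch below runs on a mocked-up frame, since the exact column names here are assumptions for illustration (inspect the real `list_datasets()` output for the actual ones):

```python
import pandas as pd

# Mocked-up stand-in for the frame returned by list_datasets();
# the column names used here are assumptions for illustration.
datasets = pd.DataFrame({
    "name": ["zeisel-brain-2015", "baron-pancreas-2016"],
    "taxonomy_id": ["10090", "9606"],
})

# Keep only mouse datasets (NCBI taxonomy ID 10090).
mouse = datasets[datasets["taxonomy_id"] == "10090"]
print(list(mouse["name"]))
```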

## Fetch Datasets

The `fetch_dataset()` function downloads a particular dataset as a `SingleCellExperiment`:

```python
sce = scrnaseq.fetch_dataset("zeisel-brain-2015", "2023-12-14")
print(sce)
```

For studies that generate multiple datasets, the dataset of interest must be explicitly requested via the `path` argument:

```python
sce = scrnaseq.fetch_dataset("baron-pancreas-2016", "2023-12-14", path="human")
print(sce)
```

By default, array data is loaded as a file-backed `DelayedArray` from the [HDF5Array](https://github.com/BiocPy/HDF5Array) package. Setting `realize_assays=True` and/or `realize_reduced_dims=True` will coerce file-backed arrays into in-memory numpy arrays or scipy sparse (csr/csc) matrices.

```python
sce = scrnaseq.fetch_dataset("baron-pancreas-2016", "2023-12-14", path="human", realize_assays=True)
print(sce)
```

Users can also fetch the metadata associated with each dataset:

```python
meta = scrnaseq.fetch_metadata("zeisel-brain-2015", "2023-12-14")
```

## Adding New Datasets

Want to contribute your own dataset to this package? It's easy! Just follow these simple steps:

1. Format your dataset as a `SummarizedExperiment` or `SingleCellExperiment`. Let's mock a dataset:

```python
import numpy as np
from singlecellexperiment import SingleCellExperiment
from biocframe import BiocFrame

mat = np.random.poisson(1, (100, 10))
row_names = [f"GENE_{i}" for i in range(mat.shape[0])]
col_names = list("ABCDEFGHIJ")
sce = SingleCellExperiment(
    assays={"counts": mat},
    row_data=BiocFrame(row_names=row_names),
    column_data=BiocFrame(row_names=col_names),
)
```

2. Assemble the metadata for your dataset. This should be a dictionary as specified in the [Bioconductor metadata schema](https://github.com/ArtifactDB/bioconductor-metadata-index). Check out some examples from `fetch_metadata()`. Note that the `applications.takane` property will be automatically added later, and so can be omitted from the metadata that you create.

```python
meta = {
    "title": "My dataset forked from the Zeisel brain dataset",
    "description": "This is a copy of the Zeisel brain dataset.",
    "taxonomy_id": ["10090"],  # NCBI taxonomy ID (10090 = mouse)
    "genome": ["GRCm38"],  # genome build
    "sources": [{"provider": "GEO", "id": "GSE12345"}],
    "maintainer_name": "Shizuka Mogami",
    "maintainer_email": "[email protected]",
}
```
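Before saving, a quick local sanity check can confirm that the expected top-level fields are present. The required-field set below is an assumption for illustration only; the authoritative requirements come from the schema returned by `gypsum_client`'s `fetch_metadata_schema()`:

```python
# Hypothetical check; the hard-coded field set below is NOT the real schema.
meta = {
    "title": "My dataset forked from the Zeisel brain dataset",
    "description": "This is a copy of the Zeisel brain dataset.",
    "taxonomy_id": ["10090"],
    "genome": ["GRCm38"],
    "sources": [{"provider": "GEO", "id": "GSE12345"}],
    "maintainer_name": "Shizuka Mogami",
    "maintainer_email": "mogami@example.com",  # hypothetical address
}

required = {
    "title", "description", "taxonomy_id", "genome",
    "sources", "maintainer_name", "maintainer_email",
}
missing = required - meta.keys()
assert not missing, f"missing metadata fields: {missing}"
```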

3. Save your `SummarizedExperiment` or `SingleCellExperiment` object to disk with `save_dataset()`. This saves the dataset into a "staging directory" using language-agnostic file formats; check out the [ArtifactDB](https://github.com/artifactdb) framework for more details. In more complex cases involving multiple datasets, users may save each dataset into a subdirectory of the staging directory.

```python
import tempfile
from scrnaseq import save_dataset

# in practice, use a persistent staging directory instead of a temporary one
staging_dir = tempfile.mkdtemp()
save_dataset(sce, staging_dir, meta)
```

You can check that everything was correctly saved by reloading the on-disk data for inspection:

```python
import dolomite_base as dl

dl.read_object(staging_dir)
```
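Alternatively, a plain directory walk gives a quick look at what was written. The snippet below fabricates a stand-in staging directory purely for illustration; the `OBJECT` file name mimics the takane on-disk layout but should be treated as an assumption here:

```python
import os
import pathlib
import tempfile

# Stand-in for a directory that save_dataset() would populate.
staging_dir = tempfile.mkdtemp()
pathlib.Path(staging_dir, "OBJECT").write_text('{"type": "single_cell_experiment"}')

# List every file relative to the staging directory root.
contents = []
for root, _dirs, files in os.walk(staging_dir):
    for name in files:
        contents.append(os.path.relpath(os.path.join(root, name), staging_dir))
print(sorted(contents))  # → ['OBJECT']
```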

4. Open a [pull request (PR)](https://github.com/BiocPy/scRNAseq/pulls) for the addition of a new dataset. You will need to provide a few things here:
- The name of your dataset. This typically follows the format of `{NAME}-{SYSTEM}-{YEAR}`, where `NAME` is the last name of the first author of the study, `SYSTEM` is the biological system (e.g., tissue, cell types) being studied, and `YEAR` is the year of publication for the dataset.
- The version of your dataset. This is usually just the current date or whenever you started putting together the dataset for upload. The exact date doesn't really matter as long as we can establish a timeline for later versions.
- A Python file containing the code used to assemble the dataset. This should be added to the [`scripts/`](https://github.com/BiocPy/scRNAseq/tree/master/scripts) directory of this package, to provide a record of how the dataset was created.

5. Wait for us to grant temporary upload permissions to your GitHub account.

6. Upload your staging directory to the [**gypsum** backend](https://github.com/ArtifactDB/gypsum-worker) with `upload_dataset()`. On the first call to this function, it will automatically prompt you to log into GitHub so that the backend can authenticate you. If you are on a system without browser access (e.g., most computing clusters), a [token](https://github.com/settings/tokens) can be manually supplied via `set_access_token()`.

```python
from scrnaseq import upload_dataset

upload_dataset(staging_dir, "my_dataset_name", "my_version")
```

You can check that everything was successfully uploaded by calling `fetch_dataset()` with the same name and version:

```python
from scrnaseq import fetch_dataset

sce = fetch_dataset("my_dataset_name", "my_version")
```

If you realize you made a mistake, no worries. Use the following call to clear the erroneous dataset, and try again:

```python
from gypsum_client import reject_probation

reject_probation("scRNAseq", "my_dataset_name", "my_version")
```

7. Comment on the PR to notify us that the dataset has finished uploading and you're happy with it. We'll review it and make sure everything's in order. If some fixes are required, we'll just clear the dataset so that you can upload a new version with the necessary changes. Otherwise, we'll approve the dataset. Note that once a version of a dataset is approved, no further changes can be made to that version; you'll have to upload a new version if you want to modify something.
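As a tiny illustration of the naming convention from step 4:

```python
# Compose a dataset name following the {NAME}-{SYSTEM}-{YEAR} convention.
first_author, system, year = "zeisel", "brain", 2015
dataset_name = f"{first_author}-{system}-{year}"
print(dataset_name)  # → zeisel-brain-2015
```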

<!-- pyscaffold-notes -->

## Note

Upload tests are skipped by default. To run them, provide a GitHub token via the `gh_token` environment variable.

This project has been set up using PyScaffold 4.5. For details and usage
information on PyScaffold see https://pyscaffold.org/.
19 changes: 18 additions & 1 deletion docs/conf.py
@@ -171,7 +171,7 @@

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
html_theme = "alabaster"
html_theme = "furo"

# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
@@ -249,6 +249,16 @@
# Output file base name for HTML help builder.
htmlhelp_basename = "scrnaseq-doc"

autodoc_default_options = {
# 'members': 'var1, var2',
# 'member-order': 'bysource',
"special-members": True,
"undoc-members": True,
"exclude-members": "__weakref__, __dict__, __str__, __module__",
}

autosummary_generate = True
autosummary_imported_members = True

# -- Options for LaTeX output ------------------------------------------------

@@ -299,6 +309,13 @@
"scipy": ("https://docs.scipy.org/doc/scipy/reference", None),
"setuptools": ("https://setuptools.pypa.io/en/stable/", None),
"pyscaffold": ("https://pyscaffold.org/en/stable", None),
"biocframe": ("https://biocpy.github.io/BiocFrame", None),
"genomicranges": ("https://biocpy.github.io/GenomicRanges", None),
"singlecellexperiment": ("https://biocpy.github.io/SingleCellExperiment", None),
"summarizedexperiment": ("https://biocpy.github.io/SummarizedExperiment", None),
"gypsum_client": ("https://artifactdb.github.io/gypsum-py", None),
"delayedarray": ("https://biocpy.github.io/DelayedArray", None),
"dolomite_base": ("https://artifactdb.github.io/dolomite-base", None),
}

print(f"loading configurations for {project} {version} ...", file=sys.stderr)
15 changes: 4 additions & 11 deletions docs/index.md
@@ -1,17 +1,10 @@
# scrnaseq

Add a short description here!
The `scRNAseq` package provides convenient access to several publicly available single-cell datasets in the form of [SingleCellExperiment](https://github.com/biocpy/singlecellexperiment) objects. Users can obtain a `SingleCellExperiment` and transform it into analysis-ready representations for immediate use.

To enable discovery, each dataset is decorated with metadata such as the study title/abstract, the species, the number of cells, etc. Users can also contribute their own published datasets to enable re-use by the wider Bioconductor/BiocPy community.

## Note

> This is the main page of your project's [Sphinx] documentation. It is
> formatted in [Markdown]. Add additional pages by creating md-files in
> `docs` or rst-files (formatted in [reStructuredText]) and adding links to
> them in the `Contents` section below.
>
> Please check [Sphinx] and [MyST] for more information
> about how to document your project and how to configure your preferences.
**Also check out the R version of this package, [scRNAseq](https://bioconductor.org/packages/devel/data/experiment/html/scRNAseq.html), published on Bioconductor.**


## Contents
@@ -20,11 +13,11 @@
:maxdepth: 2
Overview <readme>
Module Reference <api/modules>
Contributions & Help <contributing>
License <license>
Authors <authors>
Changelog <changelog>
Module Reference <api/modules>
```

## Indices and tables
2 changes: 2 additions & 0 deletions docs/requirements.txt
@@ -4,3 +4,5 @@
# sphinx_rtd_theme
myst-parser[linkify]
sphinx>=3.2.1
furo
sphinx-autodoc-typehints
2 changes: 1 addition & 1 deletion src/scrnaseq/fetch_dataset.py
@@ -34,7 +34,7 @@ def fetch_dataset(
:py:func:`~gypsum_client.upload_file_operations.upload_directory`,
to save and upload a dataset.
:py:func:`~scrnaseq.survey_datasets.survey_datasets` and :py:func:`~scrnaseq.list_versions.list_versions`,
:py:func:`~scrnaseq.list_datasets.list_datasets` and :py:func:`~scrnaseq.list_versions.list_versions`,
to get possible values for `name` and `version`.
Example:
2 changes: 1 addition & 1 deletion src/scrnaseq/list_datasets.py
@@ -38,7 +38,7 @@ def list_datasets(
Defaults to True.
Returns:
A pandas DataFrame where each row corresponds to a dataset.
A :py:class:`~pandas.DataFrame` where each row corresponds to a dataset.
Each row contains title and description for each dataset,
the number of rows and columns, the organisms and genome builds involved,
whether the dataset has any pre-computed reduced dimensions, and so on.
5 changes: 3 additions & 2 deletions src/scrnaseq/save_dataset.py
@@ -29,7 +29,8 @@ def save_dataset(x: Any, path, metadata):
metadata:
Dictionary containing the metadata for this dataset.
see the schema returned by :py:func:`~gypsum.fetch_metadata_schema`.
see the schema returned by
:py:func:`~gypsum_client.fetch_metadata_schema.fetch_metadata_schema`.
Note that the ``applications.takane`` property will be automatically
added by this function and does not have to be supplied.
@@ -41,7 +42,7 @@ def save_dataset(x: Any, path, metadata):
:py:func:`~scrnaseq.polish_dataset.polish_dataset`,
to polish ``x`` before saving it.
:py:func:`~gypsum.upload_directory`, to upload the saved contents.
:py:func:`~scrnaseq.upload_dataset.upload_dataset`, to upload the saved contents.
Example:
5 changes: 3 additions & 2 deletions src/scrnaseq/utils.py
@@ -94,10 +94,11 @@ def format_object_metadata(x) -> dict:
"""Format object related metadata.
Create object-related metadata to validate against the default
schema from :py:func:`~gypsum.fetch_metadata_schema`.
schema from
:py:func:`~gypsum_client.fetch_metadata_schema.fetch_metadata_schema`.
This is intended for downstream package developers who are
auto-generating metadata documents to be validated by
:py:func:`~gypsum.validate_metadata`.
:py:func:`~gypsum_client.validate_metadata.validate_metadata`.
Args:
x:
