Update documentation and README (#6)

jkanche authored May 27, 2024
1 parent 5d2ebd2 commit 8ba06ad
Showing 8 changed files with 165 additions and 20 deletions.
135 changes: 133 additions & 2 deletions README.md
@@ -11,17 +11,148 @@
-->

[![Project generated with PyScaffold](https://img.shields.io/badge/-PyScaffold-005CA0?logo=pyscaffold)](https://pyscaffold.org/)
[![PyPI-Server](https://img.shields.io/pypi/v/scrnaseq.svg)](https://pypi.org/project/scrnaseq/)

# scrnaseq

> Add a short description here!
The `scRNAseq` package provides convenient access to several publicly available single-cell datasets in the form of [SingleCellExperiment](https://github.com/biocpy/singlecellexperiment) objects. Users can obtain a `SingleCellExperiment` and transform it into analysis-ready representations for immediate use.

To enable discovery, each dataset is decorated with metadata such as the study title/abstract, the species, the number of cells, etc. Users can also contribute their own published datasets to enable re-use by the wider Bioconductor/BiocPy community.

**Also check out the R version of this package, [scRNAseq](https://bioconductor.org/packages/devel/data/experiment/html/scRNAseq.html), published on Bioconductor.**

## Find Datasets

The `list_datasets()` function returns all available datasets along with their metadata. This can be used to discover interesting datasets for further analysis.

```python
import scrnaseq
datasets = scrnaseq.list_datasets()
```

This returns a pandas `DataFrame`, making it easy to filter for and then download datasets of interest.
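For instance, standard pandas operations can be used to narrow the table down. The sketch below runs on a mocked-up frame, since the exact column names here are assumptions for illustration (inspect the real `list_datasets()` output for the actual ones):

```python
import pandas as pd

# Mocked-up stand-in for the frame returned by list_datasets();
# the column names used here are assumptions for illustration.
datasets = pd.DataFrame({
    "name": ["zeisel-brain-2015", "baron-pancreas-2016"],
    "taxonomy_id": ["10090", "9606"],
})

# Keep only mouse datasets (NCBI taxonomy ID 10090).
mouse = datasets[datasets["taxonomy_id"] == "10090"]
print(list(mouse["name"]))
```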

## Fetch Datasets

The `fetch_dataset()` function downloads a particular dataset as a `SingleCellExperiment`:

```python
sce = scrnaseq.fetch_dataset("zeisel-brain-2015", "2023-12-14")
print(sce)
```

For studies that generate multiple datasets, the dataset of interest must be explicitly requested via the `path` argument:

```python
sce = scrnaseq.fetch_dataset("baron-pancreas-2016", "2023-12-14", path="human")
print(sce)
```

By default, array data is loaded as a file-backed `DelayedArray` from the [HDF5Array](https://github.com/BiocPy/HDF5Array) package. Setting `realize_assays=True` and/or `realize_reduced_dims=True` will coerce file-backed arrays into in-memory numpy arrays or scipy sparse (csr/csc) matrices.

```python
sce = scrnaseq.fetch_dataset("baron-pancreas-2016", "2023-12-14", path="human", realize_assays=True)
print(sce)
```

Users can also fetch the metadata associated with each dataset:

```python
meta = scrnaseq.fetch_metadata("zeisel-brain-2015", "2023-12-14")
```

## Adding New Datasets

Want to contribute your own dataset to this package? It's easy! Just follow these simple steps:

1. Format your dataset as a `SummarizedExperiment` or `SingleCellExperiment`. Let's mock a dataset:

```python
import numpy as np
from singlecellexperiment import SingleCellExperiment
from biocframe import BiocFrame

mat = np.random.poisson(1, (100, 10))
row_names = [f"GENE_{i}" for i in range(mat.shape[0])]
col_names = list("ABCDEFGHIJ")
sce = SingleCellExperiment(
    assays={"counts": mat},
    row_data=BiocFrame(row_names=row_names),
    column_data=BiocFrame(row_names=col_names),
)
```

2. Assemble the metadata for your dataset. This should be a dictionary as specified in the [Bioconductor metadata schema](https://github.com/ArtifactDB/bioconductor-metadata-index). Check out some examples from `fetch_metadata()`. Note that the `applications.takane` property will be automatically added later, and so can be omitted from the metadata that you create.

```python
meta = {
    "title": "My dataset forked from the Zeisel brain dataset",
    "description": "This is a copy of the Zeisel brain dataset.",
    "taxonomy_id": ["10090"],  # NCBI taxonomy ID (10090 = mouse)
    "genome": ["GRCm38"],  # genome build
    "sources": [{"provider": "GEO", "id": "GSE12345"}],
    "maintainer_name": "Shizuka Mogami",
    "maintainer_email": "[email protected]",
}
```
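Before saving, a quick local sanity check can confirm that the expected top-level fields are present. The required-field set below is an assumption for illustration only; the authoritative requirements come from the schema returned by `gypsum_client`'s `fetch_metadata_schema()`:

```python
# Hypothetical check; the hard-coded field set below is NOT the real schema.
meta = {
    "title": "My dataset forked from the Zeisel brain dataset",
    "description": "This is a copy of the Zeisel brain dataset.",
    "taxonomy_id": ["10090"],
    "genome": ["GRCm38"],
    "sources": [{"provider": "GEO", "id": "GSE12345"}],
    "maintainer_name": "Shizuka Mogami",
    "maintainer_email": "mogami@example.com",  # hypothetical address
}

required = {
    "title", "description", "taxonomy_id", "genome",
    "sources", "maintainer_name", "maintainer_email",
}
missing = required - meta.keys()
assert not missing, f"missing metadata fields: {missing}"
```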

3. Save your `SummarizedExperiment` or `SingleCellExperiment` object to disk with `save_dataset()`. This saves the dataset into a "staging directory" using language-agnostic file formats; check out the [ArtifactDB](https://github.com/artifactdb) framework for more details. In more complex cases involving multiple datasets, users may save each dataset into a subdirectory of the staging directory.

```python
import tempfile
from scrnaseq import save_dataset

# in practice, use a persistent staging directory instead of a temporary one
staging_dir = tempfile.mkdtemp()
save_dataset(sce, staging_dir, meta)
```

You can check that everything was correctly saved by reloading the on-disk data for inspection:

```python
import dolomite_base as dl

dl.read_object(staging_dir)
```
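Alternatively, a plain directory walk gives a quick look at what was written. The snippet below fabricates a stand-in staging directory purely for illustration; the `OBJECT` file name mimics the takane on-disk layout but should be treated as an assumption here:

```python
import os
import pathlib
import tempfile

# Stand-in for a directory that save_dataset() would populate.
staging_dir = tempfile.mkdtemp()
pathlib.Path(staging_dir, "OBJECT").write_text('{"type": "single_cell_experiment"}')

# List every file relative to the staging directory root.
contents = []
for root, _dirs, files in os.walk(staging_dir):
    for name in files:
        contents.append(os.path.relpath(os.path.join(root, name), staging_dir))
print(sorted(contents))  # → ['OBJECT']
```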

4. Open a [pull request (PR)](https://github.com/BiocPy/scRNAseq/pulls) for the addition of a new dataset. You will need to provide a few things here:
- The name of your dataset. This typically follows the format of `{NAME}-{SYSTEM}-{YEAR}`, where `NAME` is the last name of the first author of the study, `SYSTEM` is the biological system (e.g., tissue, cell types) being studied, and `YEAR` is the year of publication for the dataset.
- The version of your dataset. This is usually just the current date or whenever you started putting together the dataset for upload. The exact date doesn't really matter as long as we can establish a timeline for later versions.
- A Python file containing the code used to assemble the dataset. This should be added to the [`scripts/`](https://github.com/BiocPy/scRNAseq/tree/master/scripts) directory of this package, to provide a record of how the dataset was created.

5. Wait for us to grant temporary upload permissions to your GitHub account.

6. Upload your staging directory to the [**gypsum** backend](https://github.com/ArtifactDB/gypsum-worker) with `upload_dataset()`. On the first call to this function, it will automatically prompt you to log into GitHub so that the backend can authenticate you. If you are on a system without browser access (e.g., most computing clusters), a [token](https://github.com/settings/tokens) can be manually supplied via `set_access_token()`.

```python
from scrnaseq import upload_dataset

upload_dataset(staging_dir, "my_dataset_name", "my_version")
```

You can check that everything was successfully uploaded by calling `fetch_dataset()` with the same name and version:

```python
from scrnaseq import fetch_dataset

sce = fetch_dataset("my_dataset_name", "my_version")
```

If you realize you made a mistake, no worries. Use the following call to clear the erroneous dataset, and try again:

```python
from gypsum_client import reject_probation

reject_probation("scRNAseq", "my_dataset_name", "my_version")
```

7. Comment on the PR to notify us that the dataset has finished uploading and you're happy with it. We'll review it and make sure everything's in order. If some fixes are required, we'll just clear the dataset so that you can upload a new version with the necessary changes. Otherwise, we'll approve the dataset. Note that once a version of a dataset is approved, no further changes can be made to that version; you'll have to upload a new version if you want to modify something.
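As a tiny illustration of the naming convention from step 4:

```python
# Compose a dataset name following the {NAME}-{SYSTEM}-{YEAR} convention.
first_author, system, year = "zeisel", "brain", 2015
dataset_name = f"{first_author}-{system}-{year}"
print(dataset_name)  # → zeisel-brain-2015
```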

<!-- pyscaffold-notes -->

## Note

Upload tests are skipped by default. To run them, provide a GitHub token via the `gh_token` environment variable.

This project has been set up using PyScaffold 4.5. For details and usage
information on PyScaffold see https://pyscaffold.org/.
19 changes: 18 additions & 1 deletion docs/conf.py
@@ -171,7 +171,7 @@

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
html_theme = "alabaster"
html_theme = "furo"

# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
@@ -249,6 +249,16 @@
# Output file base name for HTML help builder.
htmlhelp_basename = "scrnaseq-doc"

autodoc_default_options = {
# 'members': 'var1, var2',
# 'member-order': 'bysource',
"special-members": True,
"undoc-members": True,
"exclude-members": "__weakref__, __dict__, __str__, __module__",
}

autosummary_generate = True
autosummary_imported_members = True

# -- Options for LaTeX output ------------------------------------------------

@@ -299,6 +309,13 @@
"scipy": ("https://docs.scipy.org/doc/scipy/reference", None),
"setuptools": ("https://setuptools.pypa.io/en/stable/", None),
"pyscaffold": ("https://pyscaffold.org/en/stable", None),
"biocframe": ("https://biocpy.github.io/BiocFrame", None),
"genomicranges": ("https://biocpy.github.io/GenomicRanges", None),
"singlecellexperiment": ("https://biocpy.github.io/SingleCellExperiment", None),
"summarizedexperiment": ("https://biocpy.github.io/SummarizedExperiment", None),
"gypsum_client": ("https://artifactdb.github.io/gypsum-py", None),
"delayedarray": ("https://biocpy.github.io/DelayedArray", None),
"dolomite_base": ("https://artifactdb.github.io/dolomite-base", None),
}

print(f"loading configurations for {project} {version} ...", file=sys.stderr)
15 changes: 4 additions & 11 deletions docs/index.md
@@ -1,17 +1,10 @@
# scrnaseq

Add a short description here!
The `scRNAseq` package provides convenient access to several publicly available single-cell datasets in the form of [SingleCellExperiment](https://github.com/biocpy/singlecellexperiment) objects. Users can obtain a `SingleCellExperiment` and transform it into analysis-ready representations for immediate use.

To enable discovery, each dataset is decorated with metadata such as the study title/abstract, the species, the number of cells, etc. Users can also contribute their own published datasets to enable re-use by the wider Bioconductor/BiocPy community.

## Note

> This is the main page of your project's [Sphinx] documentation. It is
> formatted in [Markdown]. Add additional pages by creating md-files in
> `docs` or rst-files (formatted in [reStructuredText]) and adding links to
> them in the `Contents` section below.
>
> Please check [Sphinx] and [MyST] for more information
> about how to document your project and how to configure your preferences.
**Also check out the R version of this package, [scRNAseq](https://bioconductor.org/packages/devel/data/experiment/html/scRNAseq.html), published on Bioconductor.**


## Contents
@@ -20,11 +13,11 @@
:maxdepth: 2
Overview <readme>
Module Reference <api/modules>
Contributions & Help <contributing>
License <license>
Authors <authors>
Changelog <changelog>
Module Reference <api/modules>
```

## Indices and tables
2 changes: 2 additions & 0 deletions docs/requirements.txt
@@ -4,3 +4,5 @@
# sphinx_rtd_theme
myst-parser[linkify]
sphinx>=3.2.1
furo
sphinx-autodoc-typehints
2 changes: 1 addition & 1 deletion src/scrnaseq/fetch_dataset.py
@@ -34,7 +34,7 @@ def fetch_dataset(
:py:func:`~gypsum_client.upload_file_operations.upload_directory`,
to save and upload a dataset.
:py:func:`~scrnaseq.survey_datasets.survey_datasets` and :py:func:`~scrnaseq.list_versions.list_versions`,
:py:func:`~scrnaseq.list_datasets.list_datasets` and :py:func:`~scrnaseq.list_versions.list_versions`,
to get possible values for `name` and `version`.
Example:
2 changes: 1 addition & 1 deletion src/scrnaseq/list_datasets.py
@@ -38,7 +38,7 @@ def list_datasets(
Defaults to True.
Returns:
A pandas DataFrame where each row corresponds to a dataset.
A :py:class:`~pandas.DataFrame` where each row corresponds to a dataset.
Each row contains title and description for each dataset,
the number of rows and columns, the organisms and genome builds involved,
whether the dataset has any pre-computed reduced dimensions, and so on.
5 changes: 3 additions & 2 deletions src/scrnaseq/save_dataset.py
@@ -29,7 +29,8 @@ def save_dataset(x: Any, path, metadata):
metadata:
Dictionary containing the metadata for this dataset.
see the schema returned by :py:func:`~gypsum.fetch_metadata_schema`.
see the schema returned by
:py:func:`~gypsum_client.fetch_metadata_schema.fetch_metadata_schema`.
Note that the ``applications.takane`` property will be automatically
added by this function and does not have to be supplied.
@@ -41,7 +42,7 @@ def save_dataset(x: Any, path, metadata):
:py:func:`~scrnaseq.polish_dataset.polish_dataset`,
to polish ``x`` before saving it.
:py:func:`~gypsum.upload_directory`, to upload the saved contents.
:py:func:`~scrnaseq.upload_dataset.upload_dataset`, to upload the saved contents.
Example:
5 changes: 3 additions & 2 deletions src/scrnaseq/utils.py
@@ -94,10 +94,11 @@ def format_object_metadata(x) -> dict:
"""Format object related metadata.
Create object-related metadata to validate against the default
schema from :py:func:`~gypsum.fetch_metadata_schema`.
schema from
:py:func:`~gypsum_client.fetch_metadata_schema.fetch_metadata_schema`.
This is intended for downstream package developers who are
auto-generating metadata documents to be validated by
:py:func:`~gypsum.validate_metadata`.
:py:func:`~gypsum_client.validate_metadata.validate_metadata`.
Args:
x:
