Minor edits to most of the tutorial. Re-organize various sections (#8)
jkanche authored Feb 25, 2024
1 parent 5ab8303 commit 64f87cb
Showing 19 changed files with 220 additions and 81 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -4,4 +4,5 @@ docs
_freeze
.jupyter_cache/

chapters/zilinoislung_with_celltypist/
chapters/zilinois_lung_with_celltypist/
*whee.h5
16 changes: 8 additions & 8 deletions _quarto.yml
@@ -7,7 +7,7 @@ execute:
cache: true

book:
title: "BiocPy: Enabling Bioconductor workflows in Python"
title: "BiocPy: Facilitate Bioconductor Workflows in Python"
author: "[Jayaram Kancherla](mailto:[email protected])"
contributor: "[Aaron Lun](mailto:[email protected])"
favicon: ./assets/short.png
@@ -31,17 +31,17 @@ book:
- index.qmd
- part: chapters/representations/index.qmd
chapters:
- chapters/representations/atomics.qmd
- chapters/representations/biocframe.qmd
- chapters/representations/genomicranges.qmd
- chapters/representations/delayedarrays.qmd
- chapters/representations/filebackedarrays.qmd
- chapters/representations/genomic_ranges.qmd
- chapters/representations/delayed_arrays.qmd
- chapters/representations/file_backed_arrays.qmd
- chapters/representations/atomics.qmd
- part: chapters/experiments/index.qmd
chapters:
- chapters/experiments/summarized_expt.qmd
- chapters/experiments/singlecell_expt.qmd
- chapters/experiments/summarized_experiment.qmd
- chapters/experiments/single_cell_experiment.qmd
- chapters/experiments/extending_se.qmd
- chapters/experiments/multiassay_expt.qmd
- chapters/experiments/multi_assay_experiment.qmd
- chapters/interop.qmd
- chapters/language_agnostic.qmd
- chapters/workflow.qmd
24 changes: 23 additions & 1 deletion chapters/experiments/index.qmd
@@ -4,4 +4,26 @@ BiocPy provides (currently) three classes to represented experimental data. This

- `SummarizedExperiment` ([GitHub](https://github.com/BiocPy/SummarizedExperiment), [Docs](https://biocpy.github.io/SummarizedExperiment/), [BioC](https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html)): Container class to represent genomic experiments, following Bioconductor's [SummarizedExperiment](https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html).
- `SingleCellExperiment` ([GitHub](https://github.com/BiocPy/SingleCellExperiment), [Docs](https://biocpy.github.io/SingleCellExperiment/), [BioC](https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html)): Container class to represent single-cell experiments; follows Bioconductor’s [SingleCellExperiment](https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html).
- `MultiAssayExperiment` ([GitHub](https://github.com/BiocPy/MultiAssayExperiment), [Docs](https://biocpy.github.io/MultiAssayExperiment/), [BioC](https://bioconductor.org/packages/release/bioc/html/MultiAssayExperiment.html)): Container class to represent multiple experiments and assays performed over a set of samples. follows Bioconductor's [MAE R/Bioc Package](https://bioconductor.org/packages/release/bioc/html/MultiAssayExperiment.html).
- `MultiAssayExperiment` ([GitHub](https://github.com/BiocPy/MultiAssayExperiment), [Docs](https://biocpy.github.io/MultiAssayExperiment/), [BioC](https://bioconductor.org/packages/release/bioc/html/MultiAssayExperiment.html)): Container class to represent multiple experiments and assays performed over a set of samples; follows Bioconductor's [MAE R/Bioc Package](https://bioconductor.org/packages/release/bioc/html/MultiAssayExperiment.html).

## Install packages

The [biocpy](https://github.com/BiocPy/BiocPy) package serves as a convenient wrapper that installs all the core packages within the ecosystem.

```bash
pip install biocpy
```

Alternatively, you can install specific packages as required. For example:

```bash
pip install summarizedexperiment # <package-name>
```

## Update packages

To update packages, use the following command:

```bash
pip install -U summarizedexperiment # or <package-name>
```
Original file line number Diff line number Diff line change
@@ -2,6 +2,10 @@

`MultiAssayExperiment` (MAE) simplifies the management of multiple experimental assays conducted on a shared set of specimens.

:::{.callout-note}
These classes follow a functional paradigm for accessing or setting properties, with further details discussed in the [functional paradigm](../philosophy.qmd#functional-discipline) section.
:::
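
As a rough illustration of what a functional style for getters and setters can look like, the sketch below uses a hypothetical `Container` class (not part of any BiocPy package). It assumes that a setter returns a modified copy by default and only mutates the object when an `in_place` flag is set; treat both the class and the flag as assumptions made for this example.

```python
from copy import copy


class Container:
    """Hypothetical stand-in for a BiocPy-style class; for illustration only."""

    def __init__(self, names):
        self._names = list(names)

    def get_names(self):
        # Getter: hand back a copy so callers cannot alter internal state.
        return list(self._names)

    def set_names(self, names, in_place=False):
        # Setter: by default, modify a shallow copy and return it,
        # leaving the original object untouched.
        target = self if in_place else copy(self)
        target._names = list(names)
        return target


original = Container(["a", "b"])
renamed = original.set_names(["x", "y"])

print(original.get_names())  # the original keeps its names
print(renamed.get_names())   # the returned object carries the new names
```

The benefit of this design is that an object passed to another function cannot be changed behind the caller's back unless in-place modification is requested explicitly.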

## Installation

To get started, install the package from [PyPI](https://pypi.org/project/multiassayexperiment/)
@@ -117,7 +121,11 @@ print(mae)

If both `column_data` and `sample_map` are `None`, the constructor naively creates a sample mapping, with each `experiment` considered to be an independent `sample`. A sample named ``unknown_sample_{experiment_name}`` is added to `column_data` for each experiment.

All cells from the each experiment are considered to be from the same sample and is reflected in `sample_map`. ***This is not a recommended approach, but if you don’t have sample mapping, then it doesn’t matter***.
All cells from each experiment are considered to be from the same sample, and this is reflected in `sample_map`.

:::{.callout-important}
***This is not a recommended approach, but if you don’t have sample mapping, then it doesn’t matter***.
:::
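
The fallback described above can be sketched in plain Python. This is only an approximation of the idea, not the package's implementation: the helper `naive_sample_map` is hypothetical, and the record keys (`assay`, `primary`, `colname`) are assumed here to follow the column names used by the R MAE sample map.

```python
def naive_sample_map(experiments):
    """Build a placeholder sample mapping when none is supplied.

    `experiments` maps an experiment name to its list of column (cell) names.
    Returns (samples, sample_map), where sample_map holds one record per column.
    """
    samples = []
    sample_map = []
    for expt_name, column_names in experiments.items():
        # One placeholder sample per experiment.
        sample = f"unknown_sample_{expt_name}"
        samples.append(sample)
        # Every column of the experiment is assigned to that single sample.
        for col in column_names:
            sample_map.append({"assay": expt_name, "primary": sample, "colname": col})
    return samples, sample_map


samples, sample_map = naive_sample_map({"sce": ["cell_1", "cell_2"], "se": ["cell_3"]})
print(samples)
for record in sample_map:
    print(record)
```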

```{python}
mae = MultiAssayExperiment(
@@ -211,15 +219,19 @@ One can access an experiment by name:
print(mae.experiment("se"))
```

Additionally you may access an experiment with the sample information included in the column data of the experiment. Note, this creates a copy of the experiment:
Additionally, you may access an experiment with the sample information included in its column data:

:::{.callout-note}
This creates a copy of the experiment.
:::

```{python}
expt_with_sample_info = mae.experiment("se", with_sample_data=True)
print(expt_with_sample_info)
```

:::{.callout-note}
For consistency with the R MAE's interface, we also provide `get_with_col_data` method, that performs the same operation.
For consistency with the R MAE's interface, we also provide the `get_with_column_data` method, which performs the same operation.
:::

### Setters
Original file line number Diff line number Diff line change
@@ -6,6 +6,10 @@ This package provides container class to represent single-cell experimental data
The design of the `SingleCellExperiment` class and its derivatives adheres to the R/Bioconductor specification, where rows correspond to features, and columns represent cells.
:::

:::{.callout-note}
These classes follow a functional paradigm for accessing or setting properties, with further details discussed in the [functional paradigm](../philosophy.qmd#functional-discipline) section.
:::

## Installation

To get started, install the package from [PyPI](https://pypi.org/project/singlecellexperiment/)
@@ -16,17 +20,19 @@ pip install singlecellexperiment

## Construction

The `SingleCellExperiment` extends `RangeSummarizedExperiment` and contains few additional attributes:
The `SingleCellExperiment` extends `RangedSummarizedExperiment` and contains additional attributes:

- `reduced_dims`: Slot for low-dimensionality embeddings for each cell.
- `alternative_experiments`: Manages multi-modal experiments performed on the same sample or set of cells.
- `row_pairs` or `column_pairs`: Stores relationships between features or cells.

:::{.callout-note}
In contrast to R, matrices in Python are unnamed and do not contain row or column names. Hence, these matrices cannot be directly used as values in assays or alternative experiments. We strictly enforce type checks in these cases. To relax these restrictions for alternative experiments, set `type_check_alternative_experiments` to `False`.
:::

:::{.callout-important}
If you are using the `alternative_experiments` slot, the number of cells must match the parent experiment. Otherwise, the expectation is that the cells do not share the same sample or annotations and cannot be set in alternative experiments!

Note: Validation checks do not apply to ``row_pairs`` or ``col_pairs``.
:::

Before we construct a `SingleCellExperiment` object, let's generate information about rows and columns, along with mock experimental data from single-cell RNA-seq experiments:

@@ -89,6 +95,12 @@ sce = SingleCellExperiment(
print(sce)
```


:::{.callout-tip}
You can also use delayed or file-backed arrays to represent experimental data; see [this section](./summarized_experiment.qmd#delayed-or-file-backed-arrays) of the `SummarizedExperiment` chapter.
:::


### Interop with `anndata`

We provide convenient methods for loading an `AnnData` or `h5ad` file into `SingleCellExperiment` objects.
@@ -127,9 +139,9 @@ sce_h5 = read_h5ad("../../assets/data/adata.h5ad")
print(sce_h5)
```

### from tenx formats
### From 10X formats

In addition, we also provide convenient methods to load a 10X H5 file. We currently only support version 3 of the 10X H5 format.
In addition, we also provide convenient methods to load a [10X Genomics HDF5 Feature-Barcode Matrix Format](https://www.10xgenomics.com/support/software/cell-ranger/latest/analysis/outputs/cr-outputs-h5-matrices) file.

```{python}
from singlecellexperiment import read_tenx_h5
@@ -145,7 +157,7 @@ Methods are also available to read a 10x matrix market directory using the `read

Getters are available to access various attributes using either the property notation or functional style.

Since `SingleCellExperiment` extends `RangedSummarizedExperiment`, all getters and setters from the base class are accessible here; more details [here](./summarized_expt.qmd).
Since `SingleCellExperiment` extends `RangedSummarizedExperiment`, all getters and setters from the base class are accessible here; more details [here](./summarized_experiment.qmd).

```{python}
# access assay names
@@ -176,14 +188,14 @@ print(subset_sce)
```


## Combining experiments {#sec-sce-combine}
## Combining experiments

`SingleCellExperiment` implements methods for the various `combine` generics from [**BiocUtils**](https://github.com/BiocPy/biocutils).
`SingleCellExperiment` implements methods for the `combine` generic from [**BiocUtils**](https://github.com/BiocPy/biocutils).

These methods enable the merging or combining of multiple `SingleCellExperiment` objects, allowing users to aggregate data from different experiments or conditions. Note: `row_pairs` and `column_pairs` are not ignored as part of this operation.


To demonstrate, let's create multiple `SingleCellExperiment` objects (read more about this in [combine section from `SummarizedExperiment`](./summarized_expt.qmd#combining-experiments)).
To demonstrate, let's create multiple `SingleCellExperiment` objects (read more about this in [combine section from `SummarizedExperiment`](./summarized_experiment.qmd#combining-experiments)).

```{python}
#| code-fold: true
@@ -289,7 +301,9 @@ sce_combined = combine_columns(sce2, sce1)
print(sce_combined)
```

:::{.callout-note}
You can use `relaxed_combine_columns` or `relaxed_combine_rows` when there's a mismatch in the number of features or samples. Missing rows or columns in any object are filled in with appropriate placeholder values before combining, e.g. missing assays are replaced with a masked numpy array.
:::

```{python}
# sce_alts1 contains an additional assay not present in sce_alts2
@@ -298,7 +312,7 @@ print(sce_relaxed_combine)
```


## Export as `MuData`
## Export as `AnnData` or `MuData`

The package also provides methods to convert a `SingleCellExperiment` object into a `MuData` representation:

@@ -307,6 +321,14 @@ mdata = sce.to_mudata()
mdata
```

or coerce to an `AnnData` object:

```{python}
adata, alts = sce.to_anndata()
print("main experiment: ", adata)
print("alternative experiments: ", alts)
```

----

## Further reading
Original file line number Diff line number Diff line change
@@ -2,12 +2,16 @@

This package provides containers to represent genomic experimental data as 2-dimensional matrices. In these matrices, the rows typically denote features or genomic regions of interest, while columns represent samples or cells.

The package currently includes representations for both `SummarizedExperiment` and `RangedSummarizedExperiment`. A distinction lies in the fact that the rows of a `RangedSummarizedExperiment` object are expected to be `GenomicRanges` (tutorial [here](../representations/genomicranges.qmd)), representing genomic regions of interest.
The package currently includes representations for both `SummarizedExperiment` and `RangedSummarizedExperiment`. A distinction lies in the fact that the rows of a `RangedSummarizedExperiment` object are expected to be `GenomicRanges` (tutorial [here](../representations/genomic_ranges.qmd)), representing genomic regions of interest.

:::{.callout-important}
The design of the `SummarizedExperiment` class and its derivatives adheres to the R/Bioconductor specification, where rows correspond to features, and columns represent samples or cells.
:::

:::{.callout-note}
These classes follow a functional paradigm for accessing or setting properties, with further details discussed in the [functional paradigm](../philosophy.qmd#functional-discipline) section.
:::

## Installation

To get started, install the package from [PyPI](https://pypi.org/project/summarizedexperiment/)
@@ -24,14 +28,18 @@ A `SummarizedExperiment` contains three key attributes,
- `row_data`: Feature information e.g. genes, transcripts, exons, etc.
- `column_data`: Sample information about the columns of the matrices.

:::{.callout-important}
Both `row_data` and `column_data` are expected to be [BiocFrame](../representations/biocframe.qmd) objects and will be coerced to a `BiocFrame` for consistent downstream operations.
:::

In addition, these classes can optionally accept `row_names` and `column_names`. Since `row_data` and `column_data` may also contain names, the following rules are used in the implementation:

- On **construction**, if `row_names` or `column_names` are not provided, these are automatically inferred from `row_data` and `column_data` objects.
- On **extraction** of these objects, the `row_names` in `row_data` and `column_data` are replaced by the equivalents from the SE level.
- On **accessors** of these objects, the `row_names` in `row_data` and `column_data` are replaced by the equivalents from the SE level.
- On **setters** for these attributes, especially with the functional style (`set_row_data` and `set_column_data` methods), additional options are available to replace the names in the SE object.

:::{.callout-note}
This avoids unexpected mdifications in names, when either `row_data` or `column_data` objects are modified.
:::{.callout-caution}
These rules help avoid unexpected modifications in names when either the `row_data` or `column_data` object is modified.
:::
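
To make the naming rules above more concrete, here is a small sketch of the construction and accessor cases. The `Frame` and `Experiment` classes are hypothetical stand-ins written for this example (they are not the `BiocFrame` or `SummarizedExperiment` implementations), and the sketch assumes names are simple lists of strings.

```python
class Frame:
    """Hypothetical minimal table with optional row names (stand-in for a BiocFrame)."""

    def __init__(self, n_rows, row_names=None):
        self.n_rows = n_rows
        self.row_names = None if row_names is None else list(row_names)


class Experiment:
    """Hypothetical container that only models how row names are resolved."""

    def __init__(self, row_data, row_names=None):
        # Construction: if no names are given, infer them from row_data.
        if row_names is None:
            row_names = row_data.row_names
        self._row_names = None if row_names is None else list(row_names)
        self._row_data = row_data

    def get_row_data(self):
        # Accessor: return a frame whose names come from the experiment level.
        return Frame(self._row_data.n_rows, self._row_names)


features = Frame(2, row_names=["gene_a", "gene_b"])
inferred = Experiment(features)
explicit = Experiment(features, row_names=["f1", "f2"])

print(inferred.get_row_data().row_names)  # names inferred from row_data
print(explicit.get_row_data().row_names)  # experiment-level names take precedence

# Editing the original frame afterwards does not change the experiment's names.
features.row_names[0] = "changed"
print(inferred.get_row_data().row_names)
```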

To construct a `SummarizedExperiment`, we'll first generate a matrix of read counts, representing the read counts from a series of RNA-seq experiments. Following that, we'll create a `BiocFrame` object to denote feature information and a table for column annotations. This table may include the names for the columns and any other values we wish to represent.
@@ -80,7 +88,7 @@ col_data = pd.DataFrame(
```

:::{.callout-note}
The inputs `row_data` and `column_data` are expected to be `BiocFrame` objects and will be coerced to a `BiocFrame` if a pandas `DataFrame` is supplied.
The inputs `row_data` and `column_data` are expected to be `BiocFrame` objects and will be coerced to a `BiocFrame` if a pandas `DataFrame` is supplied.
:::

Now, we can construct a `SummarizedExperiment` from this information.
@@ -110,9 +118,9 @@ rse = RangedSummarizedExperiment(
print(rse)
```

## Delayed arrays
## Delayed or file-backed arrays

The general idea is that `DelayedArray`s are a drop-in replacement for NumPy arrays, at least for [BiocPy](https://github.com/BiocPy) applications. Learn more about [delayed arrays here](../extras/delayedarrays.qmd).
The general idea is that `DelayedArray` objects are a drop-in replacement for NumPy arrays, at least for [BiocPy](https://github.com/BiocPy) applications. Learn more about [delayed arrays here](../representations/delayed_arrays.qmd).

For example, we can use the `DelayedArray` inside a `SummarizedExperiment`:

@@ -145,7 +153,7 @@ print(adata)
```

:::{.callout-tip}
To convert an `AnnData` object to a BiocPy representation, utilize the `from_anndata` method in the [SingleCellExperiment](./singlecell_expt.qmd) class. This minimizes the loss of information when converting between these two representations.
To convert an `AnnData` object to a BiocPy representation, utilize the `from_anndata` method in the [SingleCellExperiment](./single_cell_experiment.qmd) class. This minimizes the loss of information when converting between these two representations.
:::

## Getters/Setters
@@ -246,7 +254,11 @@ An `Exception` is raised if a names does not exist.

### Subset by boolean vector

Similarly, you can also slice by a boolean array. Note that the boolean vectors should contain the same number of features for the row slice and the same number of samples for the column slice.
Similarly, you can also slice by a boolean array.

:::{.callout-important}
The boolean vector for the row slice must contain one entry per feature, and the vector for the column slice must contain one entry per sample.
:::
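
The length requirement can be illustrated with a short, library-independent sketch. The helper `indices_from_mask` is hypothetical and only shows how a boolean vector might be validated and translated into positional indices; it is not taken from the package.

```python
def indices_from_mask(mask, expected_length):
    """Translate a boolean vector into positions, checking its length first."""
    if len(mask) != expected_length:
        raise ValueError(
            f"boolean vector has {len(mask)} entries, expected {expected_length}"
        )
    return [i for i, keep in enumerate(mask) if keep]


# Three features and three samples, matching the masks used in this section.
row_indices = indices_from_mask([True, True, False], 3)
column_indices = indices_from_mask([True, False, True], 3)
print(row_indices, column_indices)
```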

```{python}
subset_se_with_bools = se_with_names[[True, True, False], [True, False, True]]
@@ -257,6 +269,10 @@ print(subset_se_with_bools)

This is a feature not a bug :), you can specify an empty list to completely remove all rows or samples.

:::{.callout-warning}
An empty array (`[]`) is not the same as an empty slice (`:`). This helps us avoid unintended operations.
:::
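
The distinction mirrors ordinary Python indexing semantics. The sketch below uses a hypothetical helper, `resolve_index`, to show how an empty slice and an empty list resolve to very different sets of positions; it makes no claim about how the package resolves indices internally.

```python
def resolve_index(index, length):
    """Turn a slice or a list of positions into an explicit list of positions."""
    if isinstance(index, slice):
        # A slice is resolved against the full range of available positions.
        return list(range(length))[index]
    # A list is taken literally, so an empty list selects nothing.
    return list(index)


everything = resolve_index(slice(None), 3)   # equivalent to `:`
first_two = resolve_index(slice(None, 2), 3) # equivalent to `:2`
nothing = resolve_index([], 3)               # equivalent to `[]`
print(everything, first_two, nothing)
```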

```{python}
subset = se_with_names[:2, []]
print(subset)
@@ -265,9 +281,9 @@ print(subset)

## Range-based operations

Additionally, since `RangeSummarizedExperiment` contain `row_ranges`, this allows us to perform a number of range based operations that are possible on a `GenomicRanges` object.
Additionally, since `RangedSummarizedExperiment` contains `row_ranges`, this allows us to perform a number of range-based operations that are possible on a `GenomicRanges` object.

For example, to subset `RangeSummarizedExperiment` with a query set of regions:
For example, to subset `RangedSummarizedExperiment` with a **query** set of regions:

```{python}
from iranges import IRanges
@@ -279,9 +295,9 @@ print(result)

Additionally, RSE supports many other interval-based operations. Check out the [documentation](https://biocpy.github.io/SummarizedExperiment/api/modules.html) for more details.

## Combining experiments {#sec-se-combine}
## Combining experiments

`SummarizedExperiment` implements methods for the various `combine` generics from [**BiocUtils**](https://github.com/BiocPy/biocutils).
`SummarizedExperiment` implements methods for the `combine` generics from [**BiocUtils**](https://github.com/BiocPy/biocutils).

These methods enable the merging or combining of multiple `SummarizedExperiment` objects, allowing users to aggregate data from different experiments or conditions. To demonstrate, let's create multiple `SummarizedExperiment` objects.

@@ -371,7 +387,11 @@ print(se2)
print(se3)
```

The `combine_rows` or `combine_columns` operations, expect all experiments to contain the same assay names. To combine experiments by row:
:::{.callout-important}
The `combine_rows` and `combine_columns` operations expect all experiments to contain the same assay names.
:::

To combine experiments by row:

```{python}
from biocutils import relaxed_combine_columns, combine_columns, combine_rows, relaxed_combine_rows
@@ -386,7 +406,9 @@ se_combined = combine_columns(se2, se1)
print(se_combined)
```

:::{.callout-important}
You can use `relaxed_combine_columns` or `relaxed_combine_rows` when there's a mismatch in the number of features or samples. Missing rows or columns in any object are filled in with appropriate placeholder values before combining, e.g. missing assays are replaced with a masked numpy array.
:::
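
The difference between the strict and relaxed variants can be approximated with plain Python data structures. In this sketch each experiment is modelled as a dictionary that maps an assay name to a list of rows, both experiments are assumed to share the same features, and `None` stands in for the masked values mentioned above. The function names are hypothetical and are not the `biocutils` implementations.

```python
def combine_columns_strict(first, second):
    """Combine by column; both experiments must have the same assay names."""
    if set(first) != set(second):
        raise ValueError("experiments must contain the same assay names")
    return {
        name: [r1 + r2 for r1, r2 in zip(first[name], second[name])]
        for name in first
    }


def shape_of(experiment):
    # All assays in one experiment share a shape; read it from any of them.
    matrix = next(iter(experiment.values()))
    return len(matrix), len(matrix[0])


def placeholder(n_rows, n_cols):
    # Stand-in for a fully masked matrix.
    return [[None] * n_cols for _ in range(n_rows)]


def combine_columns_relaxed(first, second):
    """Combine by column, filling assays missing from either side."""
    n_rows, n_cols_first = shape_of(first)
    _, n_cols_second = shape_of(second)
    combined = {}
    for name in sorted(set(first) | set(second)):
        left = first.get(name, placeholder(n_rows, n_cols_first))
        right = second.get(name, placeholder(n_rows, n_cols_second))
        combined[name] = [r1 + r2 for r1, r2 in zip(left, right)]
    return combined


a = {"counts": [[1, 2], [3, 4]]}
b = {"counts": [[5], [6]], "lognorm": [[0.5], [0.6]]}

try:
    combine_columns_strict(a, b)
except ValueError as err:
    print("strict combine failed:", err)

relaxed = combine_columns_relaxed(a, b)
print(relaxed["counts"])
print(relaxed["lognorm"])
```

In the relaxed result, the columns contributed by `a` to the `lognorm` assay are filled with placeholders, since `a` never contained that assay.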

```{python}
# se3 contains an additional assay not present in se1