Minor edits to most of the tutorial. Re-organize various sections (#8)

BiocPy · Feb 25, 2024 · 64f87cb
1 parent 5ab8303
commit 64f87cb
Showing 19 changed files with 220 additions and 81 deletions.
diff --git a/.gitignore b/.gitignore
@@ -4,4 +4,5 @@ docs
 _freeze
 .jupyter_cache/
 
-chapters/zilinoislung_with_celltypist/
+chapters/zilinois_lung_with_celltypist/
+*whee.h5
diff --git a/_quarto.yml b/_quarto.yml
@@ -7,7 +7,7 @@ execute:
   cache: true
 
 book:
-  title: "BiocPy: Enabling Bioconductor workflows in Python"
+  title: "BiocPy: Facilitate Bioconductor Workflows in Python"
   author: "[Jayaram Kancherla](mailto:[email protected])"
   contributor: "[Aaron Lun](mailto:[email protected])"
   favicon: ./assets/short.png
@@ -31,17 +31,17 @@ book:
     - index.qmd
     - part: chapters/representations/index.qmd
       chapters:
-        - chapters/representations/atomics.qmd
         - chapters/representations/biocframe.qmd
-        - chapters/representations/genomicranges.qmd
-        - chapters/representations/delayedarrays.qmd
-        - chapters/representations/filebackedarrays.qmd
+        - chapters/representations/genomic_ranges.qmd
+        - chapters/representations/delayed_arrays.qmd
+        - chapters/representations/file_backed_arrays.qmd
+        - chapters/representations/atomics.qmd
     - part: chapters/experiments/index.qmd
       chapters:
-        - chapters/experiments/summarized_expt.qmd
-        - chapters/experiments/singlecell_expt.qmd
+        - chapters/experiments/summarized_experiment.qmd
+        - chapters/experiments/single_cell_experiment.qmd
         - chapters/experiments/extending_se.qmd
-        - chapters/experiments/multiassay_expt.qmd
+        - chapters/experiments/multi_assay_experiment.qmd
     - chapters/interop.qmd
     - chapters/language_agnostic.qmd
     - chapters/workflow.qmd

diff --git a/chapters/experiments/index.qmd b/chapters/experiments/index.qmd
@@ -4,4 +4,26 @@ BiocPy provides (currently) three classes to represented experimental data. This
 
 - `SummarizedExperiment` ([GitHub](https://github.com/BiocPy/SummarizedExperiment), [Docs](https://biocpy.github.io/SummarizedExperiment/), [BioC](https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html)): Container class to represent genomic experiments, following Bioconductor's [SummarizedExperiment](https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html).
 - `SingleCellExperiment` ([GitHub](https://github.com/BiocPy/SingleCellExperiment), [Docs](https://biocpy.github.io/SingleCellExperiment/), [BioC](https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html)): Container class to represent single-cell experiments; follows Bioconductor’s [SingleCellExperiment](https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html).
-- `MultiAssayExperiment` ([GitHub](https://github.com/BiocPy/MultiAssayExperiment), [Docs](https://biocpy.github.io/MultiAssayExperiment/), [BioC](https://bioconductor.org/packages/release/bioc/html/MultiAssayExperiment.html)): Container class to represent multiple experiments and assays performed over a set of samples. follows Bioconductor's [MAE R/Bioc Package](https://bioconductor.org/packages/release/bioc/html/MultiAssayExperiment.html).
+- `MultiAssayExperiment` ([GitHub](https://github.com/BiocPy/MultiAssayExperiment), [Docs](https://biocpy.github.io/MultiAssayExperiment/), [BioC](https://bioconductor.org/packages/release/bioc/html/MultiAssayExperiment.html)): Container class to represent multiple experiments and assays performed over a set of samples. follows Bioconductor's [MAE R/Bioc Package](https://bioconductor.org/packages/release/bioc/html/MultiAssayExperiment.html).
+
+## Install packages
+
+The [biocpy](https://github.com/BiocPy/BiocPy) package serves as a convenient wrapper that installs all the core packages within the ecosystem.
+
+```bash
+pip install biocpy
+```
+
+Alternatively, you can install specific packages as required. For example:
+
+```bash
+pip install summarizedexperiment # <package-name>
+```
+
+## Update packages
+
+To update packages, use the following command:
+
+```bash
+pip install -U summarizedexperiment # or <package-name>
+```
diff --git a/chapters/experiments/multiassay_expt.qmd → ...rs/experiments/multi_assay_experiment.qmd b/chapters/experiments/multiassay_expt.qmd → ...rs/experiments/multi_assay_experiment.qmd
@@ -2,6 +2,10 @@
 
 `MultiAssayExperiment` (MAE) simplifies the management of multiple experimental assays conducted on a shared set of specimens. 
 
+:::{.callout-note}
+These classes follow a functional paradigm for accessing or setting properties, with further details discussed in [functional paradigm](../philosophy.qmd#functional-discipline) section.
+:::
+
 ## Installation
 
 To get started, install the package from [PyPI](https://pypi.org/project/multiassayexperiment/)
@@ -117,7 +121,11 @@ print(mae)
 
 If both `column_data` and `sample_map` are `None`, the constructor naively creates sample mapping, with each `experiment` considered to be a independent `sample`. We add a sample to `column_data` in this pattern - ``unknown_sample_{experiment_name}``. 
 
-All cells from the each experiment are considered to be from the same sample and is reflected in `sample_map`. ***This is not a recommended approach, but if you don’t have sample mapping, then it doesn’t matter***.
+All cells from the each experiment are considered to be from the same sample and is reflected in `sample_map`. 
+
+:::{.callout-important}
+***This is not a recommended approach, but if you don’t have sample mapping, then it doesn’t matter***.
+:::
 
 ```{python}
 mae = MultiAssayExperiment(
@@ -211,15 +219,19 @@ One can access an experiment by name:
 print(mae.experiment("se"))
 ```
 
-Additionally you may access an experiment with the sample information included in the column data of the experiment. Note, this creates a copy of the experiment:
+Additionally you may access an experiment with the sample information included in the column data of the experiment:
+
+:::{.callout-note}
+This creates a copy of the experiment.
+:::
 
 ```{python}
 expt_with_sample_info = mae.experiment("se", with_sample_data=True)
 print(expt_with_sample_info)
 ```
 
 :::{.callout-note}
-For consistency with the R MAE's interface, we also provide `get_with_col_data` method, that performs the same operation.
+For consistency with the R MAE's interface, we also provide `get_with_column_data` method, that performs the same operation.
 :::
 
 ### Setters

diff --git a/chapters/experiments/singlecell_expt.qmd → ...rs/experiments/single_cell_experiment.qmd b/chapters/experiments/singlecell_expt.qmd → ...rs/experiments/single_cell_experiment.qmd
@@ -6,6 +6,10 @@ This package provides container class to represent single-cell experimental data
 The design of `SingleCellExperiment` class and its derivates adheres to the R/Bioconductor specification, where rows correspond to features, and columns represent cells.
 :::
 
+:::{.callout-note}
+These classes follow a functional paradigm for accessing or setting properties, with further details discussed in [functional paradigm](../philosophy.qmd#functional-discipline) section.
+:::
+
 ## Installation
 
 To get started, install the package from [PyPI](https://pypi.org/project/singlecellexperiment/)
@@ -16,17 +20,19 @@ pip install singlecellexperiment
 
 ## Construction
 
-The `SingleCellExperiment` extends `RangeSummarizedExperiment` and contains few additional attributes:
+The `SingleCellExperiment` extends `RangeSummarizedExperiment` and contains additional attributes:
 
 - `reduced_dims`: Slot for low-dimensionality embeddings for each cell.
 - `alternative_experiments`: Manages multi-modal experiments performed on the same sample or set of cells.
 - `row_pairs` or `column_pairs`: Stores relationships between features or cells.
 
+:::{.callout-note}
 In contrast to R, matrices in Python are unnamed and do not contain row or column names. Hence, these matrices cannot be directly used as values in assays or alternative experiments. We strictly enforce type checks in these cases. To relax these restrictions for alternative experiments, set `type_check_alternative_experiments` to `False`.
+:::
 
+:::{.callout-important}
 If you are using the `alternative_experiments` slot, the number of cells must match the parent experiment.  Otherwise, the expectation is that the cells do not share the same sample or annotations and cannot be set in alternative experiments!
-
-Note: Validation checks do not apply to ``row_pairs`` or ``col_pairs``.
+:::
 
 Before we construct a `SingleCellExperiment` object, lets generate information about rows, columns and a mock experimental data from single-cell rna-seq experiments:
 
@@ -89,6 +95,12 @@ sce = SingleCellExperiment(
 print(sce)
 ```
 
+
+:::{.callout-tip}
+You can also use delayed or file-backed arrays for representing experimental data, check out [this section](./summarized_experiment.qmd#delayed-or-file-backed-arrays) from summarized experiment.
+:::
+
+
 ### Interop with `anndata`
 
 We provide convenient methods for loading an `AnnData` or `h5ad` file into `SingleCellExperiment` objects.
@@ -127,9 +139,9 @@ sce_h5 = read_h5ad("../../assets/data/adata.h5ad")
 print(sce_h5)
 ```
 
-### from tenx formats
+### From 10X formats
 
-In addition, we also provide convenient methods to load a 10X H5 file. We currently only support version 3 of the 10X H5 format.
+In addition, we also provide convenient methods to load a [10X Genomics HDF5 Feature-Barcode Matrix Format](https://www.10xgenomics.com/support/software/cell-ranger/latest/analysis/outputs/cr-outputs-h5-matrices) file.
 
 ```{python}
 from singlecellexperiment import read_tenx_h5
@@ -145,7 +157,7 @@ Methods are also available to read a 10x matrix market directory using the `read
 
 Getters are available to access various attributes using either the property notation or functional style. 
 
-Since `SingleCellExperiment` extends `RangedSummarizedExperiment`, all getters and setters from the base class are accessible here; more details [here](./summarized_expt.qmd).
+Since `SingleCellExperiment` extends `RangedSummarizedExperiment`, all getters and setters from the base class are accessible here; more details [here](./summarized_experiment.qmd).
 
 ```{python}
 # access assay names
@@ -176,14 +188,14 @@ print(subset_sce)
 ```
 
 
-## Combining experiments {#sec-sce-combine}
+## Combining experiments
 
-`SingleCellExperiment` implements methods for the various `combine` generics from [**BiocUtils**](https://github.com/BiocPy/biocutils).
+`SingleCellExperiment` implements methods for the `combine` generic from [**BiocUtils**](https://github.com/BiocPy/biocutils).
 
 These methods enable the merging or combining of multiple `SingleCellExperiment` objects, allowing users to aggregate data from different experiments or conditions. Note: `row_pairs` and `column_pairs` are not ignored as part of this operation.
 
 
-To demonstrate, let's create multiple `SingleCellExperiment` objects (read more about this in [combine section from `SummarizedExperiment`](./summarized_expt.qmd#combining-experiments)).
+To demonstrate, let's create multiple `SingleCellExperiment` objects (read more about this in [combine section from `SummarizedExperiment`](./summarized_experiment.qmd#combining-experiments)).
 
 ```{python}
 #| code-fold: true
@@ -289,7 +301,9 @@ sce_combined = combine_columns(sce2, sce1)
 print(sce_combined)
 ```
 
+:::{.callout-note}
 You can use `relaxed_combine_columns` or `relaxed_combined_rows` when there's mismatch in the number of features or samples. Missing rows or columns in any object are filled in with appropriate placeholder values before combining, e.g. missing assay's are replaced with a masked numpy array.
+:::
 
 ```{python}
 # sce_alts1 contains an additional assay not present in sce_alts2
@@ -298,7 +312,7 @@ print(sce_relaxed_combine)
 ```
 
 
-## Export as `MuData`
+## Export as `AnnData` or `MuData`
 
 The package also provides methods to convert a `SingleCellExperiment` object into a `MuData` representation:
 
@@ -307,6 +321,14 @@ mdata = sce.to_mudata()
 mdata
 ```
 
+or coerce to an `AnnData` object:
+
+```{python}
+adata, alts = sce.to_anndata()
+print("main experiment: ", adata)
+print("alternative experiments: ", alts)
+```
+
 ----
 
 ## Further reading

diff --git a/chapters/experiments/summarized_expt.qmd → ...ers/experiments/summarized_experiment.qmd b/chapters/experiments/summarized_expt.qmd → ...ers/experiments/summarized_experiment.qmd
@@ -2,12 +2,16 @@
 
 This package provides containers to represent genomic experimental data as 2-dimensional matrices. In these matrices, the rows typically denote features or genomic regions of interest, while columns represent samples or cells.
 
-The package currently includes representations for both `SummarizedExperiment` and `RangedSummarizedExperiment`. A distinction lies in the fact that the rows of a `RangedSummarizedExperiment` object are expected to be `GenomicRanges` (tutorial [here](../representations/genomicranges.qmd)), representing genomic regions of interest.
+The package currently includes representations for both `SummarizedExperiment` and `RangedSummarizedExperiment`. A distinction lies in the fact that the rows of a `RangedSummarizedExperiment` object are expected to be `GenomicRanges` (tutorial [here](../representations/genomic_ranges.qmd)), representing genomic regions of interest.
 
 :::{.callout-important}
 The design of `SummarizedExperiment` class and its derivates adheres to the R/Bioconductor specification, where rows correspond to features, and columns represent samples or cells.
 :::
 
+:::{.callout-note}
+These classes follow a functional paradigm for accessing or setting properties, with further details discussed in [functional paradigm](../philosophy.qmd#functional-discipline) section.
+:::
+
 ## Installation
 
 To get started, install the package from [PyPI](https://pypi.org/project/summarizedexperiment/)
@@ -24,14 +28,18 @@ A `SummarizedExperiment` contains three key attributes,
 - `row_data`: Feature information e.g. genes, transcripts, exons, etc.
 - `column_data`: Sample information about the columns of the matrices.
 
+:::{.callout-important}
+Both `row_data` and `column_data` are expected to be [BiocFrame](../representations/biocframe.qmd) objects and will be coerced to a `BiocFrame` for consistent downstream operations.
+:::
+
 In addition, these classes can optionally accept `row_names` and `column_names`. Since `row_data` and `column_data` may also contain names, the following rules are used in the implementation:
 
 - On **construction**, if `row_names` or `column_names` are not provided, these are automatically inferred from `row_data` and `column_data` objects.
-- On **extraction** of these objects, the `row_names` in `row_data` and `column_data` are replaced by the equivalents from the SE level.
+- On **accessors** of these objects, the `row_names` in `row_data` and `column_data` are replaced by the equivalents from the SE level.
 - On **setters** for these attributes, especially with the functional style (`set_row_data` and `set_column_data` methods), additional options are available to replace the names in the SE object.
 
-:::{.callout-note}
-This avoids unexpected mdifications in names, when either `row_data` or `column_data` objects are modified.
+:::{.callout-caution}
+These rules help avoid unexpected mdifications in names, when either `row_data` or `column_data` objects are modified.
 :::
 
 To construct a `SummarizedExperiment`, we'll first generate a matrix of read counts, representing the read counts from a series of RNA-seq experiments. Following that, we'll create a `BiocFrame` object to denote feature information and a table for column annotations. This table may include the names for the columns and any other values we wish to represent.
@@ -80,7 +88,7 @@ col_data = pd.DataFrame(
 ```
 
 :::{.callout-note}
-The inputs `row_data` and `column_data` are expected to be `BiocFrame` objects and will be coerced to a `BiocFrame` if a pandas `DataFrame` is supplied.  
+The inputs `row_data` and `column_data` are expected to be `BiocFrame` objects and will be coerced to a `BiocFrame` if a pandas `DataFrame` is supplied.
 :::
 
 Now, we can construct a `SummarizedExperiment` from this information.
@@ -110,9 +118,9 @@ rse = RangedSummarizedExperiment(
 print(rse)
 ```
 
-## Delayed arrays
+## Delayed or file-backed arrays
 
-The general idea is that `DelayedArray`s are a drop-in replacement for NumPy arrays, at least for [BiocPy](https://github.com/BiocPy) applications. Learn more about [delayed arrays here](../extras/delayedarrays.qmd).
+The general idea is that `DelayedArray`'s are a drop-in replacement for NumPy arrays, at least for [BiocPy](https://github.com/BiocPy) applications. Learn more about [delayed arrays here](../representations/delayed_arrays.qmd).
 
 For example, we can use the `DelayedArray` inside a `SummarizedExperiment`:
 
@@ -145,7 +153,7 @@ print(adata)
 ```
 
 :::{.callout-tip}
-To convert an `AnnData` object to a BiocPy representation, utilize the `from_anndata` method in the [SingleCellExperiment](./singlecell_expt.qmd) class. This minimizes the loss of information when converting between these two representations.
+To convert an `AnnData` object to a BiocPy representation, utilize the `from_anndata` method in the [SingleCellExperiment](./single_cell_experiment.qmd) class. This minimizes the loss of information when converting between these two representations.
 :::
 
 ## Getters/Setters
@@ -246,7 +254,11 @@ An `Exception` is raised if a names does not exist.
 
 ### Subset by boolean vector
 
-Similarly, you can also slice by a boolean array. Note that the boolean vectors should contain the same number of features for the row slice and the same number of samples for the column slice.
+Similarly, you can also slice by a boolean array. 
+
+:::{.callout-important}
+Note that the boolean vectors should contain the same number of features for the row slice and the same number of samples for the column slice.
+:::
 
 ```{python}
 subset_se_with_bools = se_with_names[[True, True, False], [True, False, True]]
@@ -257,6 +269,10 @@ print(subset_se_with_bools)
 
 This is a feature not a bug :), you can specify an empty list to completely remove all rows or samples.
 
+:::{.callout-warning}
+An empty array (`[]`) is not the same as an empty slice (`:`). This helps us avoid unintented operations.
+:::
+
 ```{python}
 subset = se_with_names[:2, []]
 print(subset)
@@ -265,9 +281,9 @@ print(subset)
 
 ## Range-based operations
 
-Additionally, since `RangeSummarizedExperiment` contain `row_ranges`, this allows us to perform a number of range based operations that are possible on a `GenomicRanges` object.
+Additionally, since `RangeSummarizedExperiment` contains `row_ranges`, this allows us to perform a number of range-based operations that are possible on a `GenomicRanges` object.
 
-For example, to subset `RangeSummarizedExperiment` with a query set of regions:
+For example, to subset `RangeSummarizedExperiment` with a **query** set of regions:
 
 ```{python}
 from iranges import IRanges
@@ -279,9 +295,9 @@ print(result)
 
 Additionally, RSE supports many other interval based operations. Checkout the [documentation](https://biocpy.github.io/SummarizedExperiment/api/modules.html) for more details.
 
-## Combining experiments {#sec-se-combine}
+## Combining experiments
 
-`SummarizedExperiment` implements methods for the various `combine` generics from [**BiocUtils**](https://github.com/BiocPy/biocutils).
+`SummarizedExperiment` implements methods for the `combine` generics from [**BiocUtils**](https://github.com/BiocPy/biocutils).
 
 These methods enable the merging or combining of multiple `SummarizedExperiment` objects, allowing users to aggregate data from different experiments or conditions. To demonstrate, let's create multiple `SummarizedExperiment` objects.
 
@@ -371,7 +387,11 @@ print(se2)
 print(se3)
 ```
 
-The `combine_rows` or `combine_columns` operations, expect all experiments to contain the same assay names. To combine experiments by row:
+:::{.callout-important}
+The `combine_rows` or `combine_columns` operations, expect all experiments to contain the same assay names. 
+:::
+
+To combine experiments by row:
 
 ```{python}
 from biocutils import relaxed_combine_columns, combine_columns, combine_rows, relaxed_combine_rows
@@ -386,7 +406,9 @@ se_combined = combine_columns(se2, se1)
 print(se_combined)
 ```
 
+:::{.callout-important}
 You can use `relaxed_combine_columns` or `relaxed_combined_rows` when there's mismatch in the number of features or samples. Missing rows or columns in any object are filled in with appropriate placeholder values before combining, e.g. missing assay's are replaced with a masked numpy array.
+:::
 
 ```{python}
 # se3 contains an additional assay not present in se1