# Save and load Bioconductor objects in Python

The **dolomite-base** package is the Python counterpart to the [**alabaster.base**](https://github.com/ArtifactDB/alabaster.base) R package
for language-agnostic reading and writing of Bioconductor objects (see the [**BiocPy**](https://github.com/BiocPy) project).
This is a more robust and portable alternative to the typical approach of pickling Python objects to save them to disk.

- By separating the on-disk representation from the in-memory object structure, we can more easily adapt to changes in class definitions.
This improves robustness to Python environment updates.
- By using standard file formats like HDF5 and CSV, we ensure that the objects can be easily read from other languages like R and JavaScript.
This improves interoperability between application ecosystems.
- By breaking up complex Bioconductor objects into their components, we enable modular reads and writes to the backing store.
We can easily read or update part of an object without having to consider the other parts.

The **dolomite-base** package defines the base generics to read and write the file structures along with the associated metadata.
Implementations of these methods for various Bioconductor classes can be found in the other **dolomite** packages like
[**dolomite-ranges**](https://github.com/ArtifactDB/dolomite-ranges) and [**dolomite-se**](https://github.com/ArtifactDB/dolomite-se).

## Quick start

First, we'll install the **dolomite-base** package.
This package is available from [PyPI](https://pypi.org/project/dolomite-base) so we can use the standard installation process:

```sh
pip install dolomite-base
```

The simplest example involves saving a [`BiocFrame`](https://github.com/BiocPy/BiocFrame) inside a staging directory.
Let's mock one up:

```python
import biocframe
df = biocframe.BiocFrame({
"X": list(range(0, 10)),
"Y": [ "a", "b", "c", "d", "e", "f", "g", "h", "i", "j" ]
})
print(df)
## BiocFrame with 10 rows and 2 columns
## X Y
## <range> <list>
## [0] 0 a
## [1] 1 b
## [2] 2 c
## [3] 3 d
## [4] 4 e
## [5] 5 f
## [6] 6 g
## [7] 7 h
## [8] 8 i
## [9] 9 j
```

We save our `BiocFrame` to a user-specified directory with the `save_object()` function.
This function saves its input object to file according to the relevant [specification](https://github.com/ArtifactDB/takane).

```python
import tempfile
import os
tmp = tempfile.mkdtemp()

import dolomite_base
path = os.path.join(tmp, "my_df")
dolomite_base.save_object(df, path)

os.listdir(path)
## ['basic_columns.h5', 'OBJECT']
```

We load the contents of the directory back into a Python session by using the `read_object()` function.
Note that the exact Python types for the `BiocFrame` columns may not be preserved by the round trip,
though the contents of the columns will be unchanged.

```python
out = dolomite_base.read_object(path)
print(out)
## BiocFrame with 10 rows and 2 columns
## X Y
## <ndarray[int32]> <StringList>
## [0] 0 a
## [1] 1 b
## [2] 2 c
## [3] 3 d
## [4] 4 e
## [5] 5 f
## [6] 6 g
## [7] 7 h
## [8] 8 i
## [9] 9 j
```

Check out the [API reference](https://artifactdb.github.io/dolomite-base/api/modules.html) for more details.

## Supported classes

The saving/reading process can be applied to a range of [**BiocPy**](https://github.com/BiocPy) data structures,
provided the appropriate **dolomite** package is installed.
Each package implements a saving and reading function for its associated classes,
which are automatically used by **dolomite-base**'s `save_object()` and `read_object()` functions, respectively.
(That is, there is no need to explicitly `import` the package when calling `save_object()` or `read_object()` for its classes.)
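
For example, assuming [**dolomite-matrix**](https://github.com/ArtifactDB/dolomite-matrix) is installed, a minimal sketch of a NumPy round trip looks just like the `BiocFrame` example above:

```python
import numpy
import os
import tempfile
import dolomite_base  # no explicit "import dolomite_matrix" is required

mat = numpy.random.rand(10, 5)
path = os.path.join(tempfile.mkdtemp(), "my_matrix")
dolomite_base.save_object(mat, path)  # dispatches to the dolomite-matrix method
arr = dolomite_base.read_object(path)  # may be a file-backed array rather than a plain ndarray
```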

| Package | Object types | PyPI |
|-----|-----|----|
| [**dolomite-base**](https://github.com/ArtifactDB/dolomite-base) | [`BiocFrame`](https://github.com/BiocPy/BiocFrame), `list`, `dict`, [`NamedList`](https://github.com/BiocPy/BiocUtils) | [![](https://img.shields.io/pypi/v/dolomite-base.svg)](https://pypi.org/project/dolomite-base/) |
| [**dolomite-matrix**](https://github.com/ArtifactDB/dolomite-matrix) | `numpy.ndarray`, `scipy.sparse.spmatrix`, [`DelayedArray`](https://github.com/BiocPy/DelayedArray) | [![](https://img.shields.io/pypi/v/dolomite-matrix.svg)](https://pypi.org/project/dolomite-matrix/) |
| [**dolomite-ranges**](https://github.com/ArtifactDB/dolomite-ranges) | [`GenomicRanges`](https://github.com/BiocPy/GenomicRanges), `GenomicRangesList` | [![](https://img.shields.io/pypi/v/dolomite-ranges.svg)](https://pypi.org/project/dolomite-ranges/) |
| [**dolomite-se**](https://github.com/ArtifactDB/dolomite-se) | [`SummarizedExperiment`](https://github.com/BiocPy/SummarizedExperiment), `RangedSummarizedExperiment` | [![](https://img.shields.io/pypi/v/dolomite-se.svg)](https://pypi.org/project/dolomite-se/) |
| [**dolomite-sce**](https://github.com/ArtifactDB/dolomite-sce) | [`SingleCellExperiment`](https://github.com/BiocPy/SingleCellExperiment) | [![](https://img.shields.io/pypi/v/dolomite-sce.svg)](https://pypi.org/project/dolomite-sce/) |
| [**dolomite-mae**](https://github.com/ArtifactDB/dolomite-mae) | [`MultiAssayExperiment`](https://bioconductor.org/packages/MultiAssayExperiment) | [![](https://img.shields.io/pypi/v/dolomite-mae.svg)](https://pypi.org/project/dolomite-mae/) |

Each class's on-disk representation is determined by the associated [**takane** specification](https://github.com/ArtifactDB/takane).
For more complex objects, the on-disk representation may consist of multiple files, or even subdirectories containing "child" objects from internal `save_object()` calls.
Each call to `save_object()` will automatically enforce the relevant specification by validating the directory contents with **dolomite-base**'s `validate_object()` function.
This provides some guarantees on the file structure within the directory, allowing developers to reliably implement readers in a variety of frameworks -
for example, the [**alabaster**](https://github.com/ArtifactDB/alabaster.base) framework runs the same validators on its directory contents to guarantee interoperability.
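
Users can also run the validator by hand on any saved directory. A minimal sketch, reusing `path` from the quick start above and assuming `validate_object()` accepts the directory path directly:

```python
# This should raise an error if the directory contents do not follow the
# relevant takane specification.
dolomite_base.validate_object(path)
```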

All of the listed packages are available from PyPI and can be installed with the usual `pip install` procedure.
Alternatively, to install all packages in one go, users can install the [**dolomite**](https://pypi.org/project/dolomite) umbrella package.
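
For example, to pull in everything at once:

```sh
# Installs dolomite-base along with the other dolomite packages listed above.
pip install dolomite
```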

## Operating on directories

Users can freely rename or relocate directories and the `read_object()` function will still work.
For example, we can easily copy the entire directory to a new file system and everything will still be correctly referenced within the directory.
The simplest way to share objects is to just `zip` or `tar` the staging directory for _ad hoc_ distribution.
For more serious applications, **dolomite-base** can be used in conjunction with storage systems like AWS S3 for easier distribution.

```python
# Mocking up an object:
import biocframe
df = biocframe.BiocFrame({
"X": list(range(0, 10)),
"Y": [ "a", "b", "c", "d", "e", "f", "g", "h", "i", "j" ]
})

# Saving to one location:
import tempfile
import os
import dolomite_base
tmp = tempfile.mkdtemp()
path = os.path.join(tmp, "my_df")
dolomite_base.save_object(df, path)

# Reading from another location:
alt_path = os.path.join(tmp, "foobar")
os.rename(path, alt_path)
alt_out = dolomite_base.read_object(alt_path)
```
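
To share the object, we can archive the whole directory as suggested above. A minimal sketch using Python's standard `shutil` module (the `my_df_bundle` and `unpacked` names are purely illustrative):

```python
import shutil

# Bundle the staging directory into a zip file for ad hoc distribution.
archive = shutil.make_archive(os.path.join(tmp, "my_df_bundle"), "zip", root_dir=alt_path)

# The recipient unpacks it anywhere and reads the object as usual.
unpacked = os.path.join(tmp, "unpacked")
shutil.unpack_archive(archive, unpacked)
df_again = dolomite_base.read_object(unpacked)
```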

That said, it is unwise to manipulate the files inside the directory created by `save_object()`.
Reading functions will usually depend on specific file names or subdirectory structures within the directory, and fiddling with them may cause unexpected results.
However, advanced users can exploit this predictable structure by loading components from subdirectories if the full object is not required:

```python
# Creating a nested DF:
nested = biocframe.BiocFrame({ "A": df })
nest_path = os.path.join(tmp, "nesting")
dolomite_base.save_object(nested, nest_path)

# Now reading in the nested DF:
redf = dolomite_base.read_object(os.path.join(nest_path, "other_columns", "0"))
```

## Extending to new classes

The _dolomite_ framework is easily extended to new classes by:

1. Writing a method for `save_object()`.
This should accept an instance of the object and a path to a directory, and save the contents of the object inside the directory.
It should also produce an `OBJECT` file that specifies the type of the object, e.g., `data_frame`, `hdf5_sparse_matrix`.
2. Writing a function for `read_object()` and registering it in the `read_object_registry`.
This should accept a path to a directory and read its contents to reconstruct the object.
The registered type should be the same as that used in the `OBJECT` file.
3. Writing a function for `validate_object()` and registering it in the `validate_object_registry`.
This should accept a path to a directory and read its contents to determine if it is a valid on-disk representation.
The registered type should be the same as that used in the `OBJECT` file.
- (optional) Developers can alternatively formalize the on-disk representation by adding a specification to the [**takane**](https://github.com/ArtifactDB/takane) repository.
This aims to provide C++-based validators for each representation, allowing us to enforce consistency across multiple languages (e.g., R).
Any **takane** validator is automatically used by `validate_object()` so no registration is required.

To illustrate, let's extend _dolomite_ to a new custom class:

```python
class Coffee:
    def __init__(self, beans: str, milk: bool):
        self.beans = beans
        self.milk = milk
```

First we implement the saving method.
Note that we add a `@validate_saves` decorator to instruct `save_object()` to automatically run `validate_object()` on the generated directory, to confirm that the output is valid.

```python
import dolomite_base
import os
import json

@dolomite_base.save_object.register
@dolomite_base.validate_saves
def save_object_for_Coffee(x: Coffee, path: str, **kwargs):
    os.mkdir(path)
    with open(os.path.join(path, "bean_type"), "w") as handle:
        handle.write(x.beans)
    with open(os.path.join(path, "has_milk"), "w") as handle:
        handle.write("true" if x.milk else "false")
    with open(os.path.join(path, "OBJECT"), "w") as handle:
        json.dump({ "type": "coffee", "coffee": { "version": "0.1" } }, handle)
```

Then the reading method:

```python
from typing import Dict

def read_Coffee(path: str, metadata: Dict, **kwargs) -> Coffee:
metadata["coffee"]["version"] # possibly do something different based on version
with open(os.path.join(path, "bean_type"), "r") as handle:
beans = handle.read()
with open(os.path.join(path, "has_milk"), "r") as handle:
milk = (handle.read() == "true")
return Coffee(beans, milk)

dolomite_base.read_object_registry["coffee"] = read_Coffee
```

And finally, the validation method:

```python
def validate_Coffee(path: str, metadata: Dict):
metadata["coffee"]["version"] # possibly do something different based on version
with open(os.path.join(path, "bean_type"), "r") as handle:
beans = handle.read()
if not beans in [ "arabica", "robusta", "excelsa", "liberica" ]:
raise ValueError("wrong bean type '" + beans + "'")
with open(os.path.join(path, "has_milk"), "r") as handle:
milk = handle.read()
if not milk in [ "true", "false" ]:
raise ValueError("invalid milk '" + milk + "'")

dolomite_base.validate_object_registry["coffee"] = validate_Coffee
```

Let's run them and see how it works:

```python
cup = Coffee("arabica", milk=False)

import tempfile
tmp = tempfile.mkdtemp()
path = os.path.join(tmp, "stuff")
dolomite_base.save_object(cup, path)

cup2 = dolomite_base.read_object(path)
print(cup2.beans)
## arabica
```

For more complex objects that are composed of multiple smaller "child" objects, developers should consider saving each of their children in subdirectories of `path`.
This can be achieved by calling `alt_save_object()` and `alt_read_object()` in the saving and loading functions, respectively.
(We use the `alt_*` versions of these functions to respect application overrides, see below.)
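
For instance, a hypothetical `CoffeeOrder` class composed of two `Coffee` children might be handled as follows (a sketch only; the `coffee_order` type name and file layout are made up for illustration):

```python
class CoffeeOrder:
    def __init__(self, first: Coffee, second: Coffee):
        self.first = first
        self.second = second

@dolomite_base.save_object.register
def save_object_for_CoffeeOrder(x: CoffeeOrder, path: str, **kwargs):
    os.mkdir(path)
    # Each child goes into its own subdirectory via alt_save_object(),
    # so any application overrides are respected.
    dolomite_base.alt_save_object(x.first, os.path.join(path, "first"), **kwargs)
    dolomite_base.alt_save_object(x.second, os.path.join(path, "second"), **kwargs)
    with open(os.path.join(path, "OBJECT"), "w") as handle:
        json.dump({ "type": "coffee_order", "coffee_order": { "version": "0.1" } }, handle)

def read_CoffeeOrder(path: str, metadata: Dict, **kwargs) -> CoffeeOrder:
    first = dolomite_base.alt_read_object(os.path.join(path, "first"), **kwargs)
    second = dolomite_base.alt_read_object(os.path.join(path, "second"), **kwargs)
    return CoffeeOrder(first, second)

dolomite_base.read_object_registry["coffee_order"] = read_CoffeeOrder
```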

## Creating applications

Developers can also create applications that customize the machinery of the _dolomite_ framework for specific needs.
In most cases, this involves storing more metadata to describe the object in more detail.
For example, we might want to remember the identity of the author for each object.
This is achieved by creating an application-specific saving generic with the same signature as `save_object()`:

```python
from functools import singledispatch
from typing import Any, Dict, Optional
import dolomite_base
import json
import os
import getpass
import biocframe

def dump_extra_metadata(path: str, extra: Dict):
    user_id = getpass.getuser()
    # File names with leading underscores are reserved for application-specific
    # use, so they won't clash with anything produced by save_object().
    metapath = os.path.join(path, "_metadata.json")
    with open(metapath, "w") as handle:
        json.dump({ **extra, "author": user_id }, handle)

@singledispatch
def app_save_object(x: Any, path: str, **kwargs):
    dolomite_base.save_object(x, path, **kwargs) # does the real work
    dump_extra_metadata(path, {}) # adding some application-specific metadata

@app_save_object.register
def app_save_object_for_BiocFrame(x: biocframe.BiocFrame, path: str, **kwargs):
    dolomite_base.save_object(x, path, **kwargs) # does the real work
    # We can also override specific methods to add object+application-specific metadata:
    dump_extra_metadata(path, { "columns": x.get_column_names().as_list() })
```

Applications should call `alt_save_object_function()` to instruct `alt_save_object()` to use this new generic.
This ensures that the customizations are applied to all child objects, such as the nested `BiocFrame` below.

```python
# Create a friendly user-visible function to handle the generic override; this
# is reversed on function exit to avoid interfering with other applications.
def save_for_application(x, path: str, **kwargs):
    old = dolomite_base.alt_save_object_function(app_save_object)
    try:
        dolomite_base.alt_save_object(x, path, **kwargs)
    finally:
        dolomite_base.alt_save_object_function(old)

# Saving our nested BiocFrames with our overrides active.
import biocframe
df = biocframe.BiocFrame({
"A": [1, 2, 3, 4],
"B": biocframe.BiocFrame({
"C": ["a", "b", "c", "d"]
})
})

import tempfile
tmp = tempfile.mkdtemp()
path = os.path.join(tmp, "foobar")
save_for_application(df, path)

# Both the parent and child BiocFrames have new metadata.
with open(os.path.join(path, "_metadata.json"), "r") as handle:
    print(handle.read())
## {"columns": ["A", "B"], "author": "aaron"}

with open(os.path.join(path, "other_columns", "1", "_metadata.json"), "r") as handle:
    print(handle.read())
## {"columns": ["C"], "author": "aaron"}
```

The reading function can be similarly overridden by setting `alt_read_object_function()` to instruct all `alt_read_object()` calls to use the override.
This allows applications to, e.g., do something with the metadata that we just added.

```python
def app_read_object(path: str, metadata: Optional[Dict] = None, **kwargs):
    if metadata is None:
        with open(os.path.join(path, "OBJECT"), "r") as handle:
            metadata = json.load(handle)

    # Print custom message based on the type and application-specific metadata.
    with open(os.path.join(path, "_metadata.json"), "r") as handle:
        appmeta = json.load(handle)
    print("I am a " + metadata["type"] + " created by " + appmeta["author"])
    if metadata["type"] == "data_frame":
        print("I have the following columns: " + ", ".join(appmeta["columns"]))

    return dolomite_base.read_object(path, metadata=metadata, **kwargs)

# Creating a user-friendly function to set the override before the read.
def read_for_application(path: str, metadata: Optional[Dict] = None, **kwargs):
    old = dolomite_base.alt_read_object_function(app_read_object)
    try:
        return dolomite_base.alt_read_object(path, metadata=metadata, **kwargs)
    finally:
        dolomite_base.alt_read_object_function(old)

# This diverts to the override with printing of custom messages.
read_for_application(path)
## I am a data_frame created by aaron
## I have the following columns: A, B
## I am a data_frame created by aaron
## I have the following columns: C
```

By overriding the saving and reading process for one or more classes, each application can customize the behavior of _dolomite_ to its own needs.
In general, applications should avoid modifying the files created by `save_object()`, to avoid violating any **takane** format specifications
(unless the application maintainer really knows what they're doing).
Applications are free to write to any path starting with an underscore as this will not be used by any specification.