From 5b4cb4f49908738864d63843879c8c15ef046931 Mon Sep 17 00:00:00 2001 From: Jayaram Kancherla Date: Mon, 10 Jun 2024 16:56:06 -0700 Subject: [PATCH] Execute code snippets in markdown (#109) Use myst-nb to execute code snippets. Add section on polars integration. --- docs/conf.py | 12 +- docs/index.md | 2 +- docs/tutorial.ipynb | 640 -------------------------------------------- docs/tutorial.md | 408 ++++++++++++++++++++++++++++ 4 files changed, 413 insertions(+), 649 deletions(-) delete mode 100644 docs/tutorial.ipynb create mode 100644 docs/tutorial.md diff --git a/docs/conf.py b/docs/conf.py index 1051346..6458dd1 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -72,7 +72,6 @@ "sphinx.ext.ifconfig", "sphinx.ext.mathjax", "sphinx.ext.napoleon", - "sphinx_autodoc_typehints", ] # Add any paths that contain templates here, relative to this directory. @@ -119,13 +118,10 @@ # If you don’t need the separation provided between version and release, # just set them both to the same value. -if "SPHINX_DOC_VERSION_MANUAL" in os.environ: - version = os.getenv("SPHINX_DOC_VERSION_MANUAL", None) -else: - try: - from biocframe import __version__ as version - except ImportError: - version = "" +try: + from biocframe import __version__ as version +except ImportError: + version = "" if not version or version.lower() == "unknown": version = os.getenv("READTHEDOCS_VERSION", "unknown") # automatically set by RTD diff --git a/docs/index.md b/docs/index.md index 6e56231..098d35b 100644 --- a/docs/index.md +++ b/docs/index.md @@ -5,7 +5,7 @@ ```{toctree} :maxdepth: 2 -Overview +Overview Module Reference Contributions & Help License diff --git a/docs/tutorial.ipynb b/docs/tutorial.ipynb deleted file mode 100644 index fe7b7f0..0000000 --- a/docs/tutorial.ipynb +++ /dev/null @@ -1,640 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# `BiocFrame` - Bioconductor-like data frames\n", - "\n", - "`BiocFrame` class is a Bioconductor-friendly data frame class. Its primary advantage lies in not making assumptions about the types of the columns - as long as an object has a length (`__len__`) and supports slicing methods (`__getitem__`), it can be used inside a `BiocFrame`. \n", - "\n", - "This flexibility allows us to accept arbitrarily complex objects as columns, which is often the case in Bioconductor objects. Also check out Bioconductor's [**S4Vectors**](https://bioconductor.org/packages/S4Vectors) package, which implements the `DFrame` class on which `BiocFrame` was based.\n", - "\n", - "These classes follow a functional paradigm for accessing or setting properties, with further details discussed in [functional paradigm](https://biocpy.github.io/tutorial/chapters/philosophy.html) section.\n", - "\n", - "## Advantages of `BiocFrame`\n", - "\n", - "One of the core principles guiding the implementation of the `BiocFrame` class is \"**_what you put is what you get_**\". Unlike Pandas `DataFrame`, `BiocFrame` makes no assumptions about the types of the columns provided as input. Some key differences to highlight the advantages of using `BiocFrame` are especially in terms of modifications to column types and handling nested dataframes.\n", - "\n", - "### Inadvertent modification of types\n", - "\n", - "As an example, Pandas `DataFrame` modifies the types of the input data. These assumptions may cause issues when interoperating between R and Python." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import pandas as pd\n", - "import numpy as np\n", - "from array import array\n", - "\n", - "df = pd.DataFrame({\n", - " \"numpy_vec\": np.zeros(10),\n", - " \"list_vec\": [1]* 10,\n", - " \"native_array_vec\": array('d', [3.14] * 10) # less used but native python arrays\n", - "})\n", - "\n", - "print(\"type of numpy_vector column:\", type(df[\"numpy_vec\"]), df[\"numpy_vec\"].dtype)\n", - "print(\"type of list_vector column:\", type(df[\"list_vec\"]), df[\"list_vec\"].dtype)\n", - "print(\"type of native_array_vector column:\", type(df[\"native_array_vec\"]), df[\"native_array_vec\"].dtype)\n", - "\n", - "print(df)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "With `BiocFrame`, no assumptions are made, and the input data is not cast into (un)expected types:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from biocframe import BiocFrame\n", - "import numpy as np\n", - "from array import array\n", - "\n", - "bframe_types = BiocFrame({\n", - " \"numpy_vec\": np.zeros(10),\n", - " \"list_vec\": [1]* 10,\n", - " \"native_array_vec\": array('d', [3.14] * 10)\n", - "})\n", - "\n", - "print(\"type of numpy_vector column:\", type(bframe_types[\"numpy_vec\"]))\n", - "print(\"type of list_vector column:\", type(bframe_types[\"list_vec\"]))\n", - "print(\"type of native_array_vector column:\", type(bframe_types[\"native_array_vec\"]))\n", - "print(bframe_types)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This behavior remains consistent when extracting, slicing, combining, or performing any supported operation on `BiocFrame` objects.\n", - "\n", - "### Handling complex nested frames\n", - "\n", - "Pandas `DataFrame` does not support nested structures; therefore, running the snippet below will result in an error:" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "```python\n", - "df = pd.DataFrame({\n", - " \"ensembl\": [\"ENS00001\", \"ENS00002\", \"ENS00002\"],\n", - " \"symbol\": [\"MAP1A\", \"BIN1\", \"ESR1\"],\n", - " \"ranges\": pd.DataFrame({\n", - " \"chr\": [\"chr1\", \"chr2\", \"chr3\"],\n", - " \"start\": [1000, 1100, 5000],\n", - " \"end\": [1100, 4000, 5500]\n", - " }),\n", - "})\n", - "print(df)\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "However, it is handled seamlessly with `BiocFrame`:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "bframe_nested = BiocFrame({\n", - " \"ensembl\": [\"ENS00001\", \"ENS00002\", \"ENS00002\"],\n", - " \"symbol\": [\"MAP1A\", \"BIN1\", \"ESR1\"],\n", - " \"ranges\": BiocFrame({\n", - " \"chr\": [\"chr1\", \"chr2\", \"chr3\"],\n", - " \"start\": [1000, 1100, 5000],\n", - " \"end\": [1100, 4000, 5500]\n", - " }),\n", - "})\n", - "\n", - "print(bframe_nested)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This behavior remains consistent when extracting, slicing, combining, or performing any other supported operations on `BiocFrame` objects.\n", - "\n", - "## Construction\n", - "\n", - "Creating a `BiocFrame` object is straightforward; just provide the `data` as a dictionary." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from biocframe import BiocFrame\n", - "\n", - "obj = {\n", - " \"ensembl\": [\"ENS00001\", \"ENS00002\", \"ENS00003\"],\n", - " \"symbol\": [\"MAP1A\", \"BIN1\", \"ESR1\"],\n", - "}\n", - "bframe = BiocFrame(obj)\n", - "print(bframe)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You can specify complex objects as columns, as long as they have some \"length\" equal to the number of rows.\n", - "\n", - "For example, we can embed a `BiocFrame` within another `BiocFrame`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "obj = {\n", - " \"ensembl\": [\"ENS00001\", \"ENS00002\", \"ENS00002\"],\n", - " \"symbol\": [\"MAP1A\", \"BIN1\", \"ESR1\"],\n", - " \"ranges\": BiocFrame({\n", - " \"chr\": [\"chr1\", \"chr2\", \"chr3\"],\n", - " \"start\": [1000, 1100, 5000],\n", - " \"end\": [1100, 4000, 5500]\n", - " }),\n", - "}\n", - "\n", - "bframe2 = BiocFrame(obj, row_names=[\"row1\", \"row2\", \"row3\"])\n", - "print(bframe2)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The `row_names` parameter is analogous to index in the pandas world and should not contain missing strings. Additionally, you may provide:\n", - "\n", - "- `column_data`: A `BiocFrame`object containing metadata about the columns. This must have the same number of rows as the numbers of columns.\n", - "- `metadata`: Additional metadata about the object, usually a dictionary. \n", - "- `column_names`: If different from the keys in the `data`. If not provided, this is automatically extracted from the keys in the `data`.\n", - "\n", - "### Interop with pandas\n", - "\n", - "`BiocFrame` is intended for accurate representation of Bioconductor objects for interoperability with R, many users may prefer working with **pandas** `DataFrame` objects for their actual analyses. This conversion is easily achieved:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from biocframe import BiocFrame\n", - "bframe3 = BiocFrame(\n", - " {\n", - " \"foo\": [\"A\", \"B\", \"C\", \"D\", \"E\"],\n", - " \"bar\": [True, False, True, False, True]\n", - " }\n", - ")\n", - "\n", - "df = bframe3.to_pandas()\n", - "print(type(df))\n", - "print(df)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Converting back to a `BiocFrame` is similarly straightforward:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "out = BiocFrame.from_pandas(df)\n", - "print(out)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Extracting data\n", - "\n", - "BiocPy classes follow a functional paradigm for accessing or setting properties, with further details discussed in [functional paradigm](../philosophy.qmd#functional-discipline) section.\n", - "\n", - "Properties can be directly accessed from the object:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(\"shape:\", bframe.shape)\n", - "print(\"column names (functional style):\", bframe.get_column_names())\n", - "print(\"column names (as property):\", bframe.column_names) # same as above" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can fetch individual columns:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(\"functional style:\", bframe.get_column(\"ensembl\"))\n", - "print(\"w/ accessor\", bframe[\"ensembl\"])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "And we can get individual rows as a dictionary:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "bframe.get_row(2)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "::: {.callout}\n", - "To retrieve a subset of the data in the `BiocFrame`, we use the subset (`[]`) operator.\n", - "This operator accepts different subsetting arguments, such as a boolean vector, a `slice` object, a sequence of indices, or row/column names.\n", - ":::" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sliced_with_bools = bframe[1:2, [True, False, False]]\n", - "print(\"Subset using booleans: \\n\", sliced_with_bools)\n", - "\n", - "sliced_with_names = bframe[[0,2], [\"symbol\", \"ensembl\"]]\n", - "print(\"\\nSubset using column names: \\n\", sliced_with_names)\n", - "\n", - "# Short-hand to get a single column:\n", - "print(\"\\nShort-hand to get a single column: \\n\", bframe[\"ensembl\"])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Setting data\n", - "\n", - "### Preferred approach\n", - "\n", - "For setting properties, we encourage a **functional style** of programming to avoid mutating the object directly. This helps prevent inadvertent modifications of `BiocFrame` instances within larger data structures.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "modified = bframe.set_column_names([\"column1\", \"column2\"])\n", - "print(modified)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now let's check the column names of the original object," - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Original is unchanged:\n", - "print(bframe.get_column_names())" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "To add new columns, or replace existing ones:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "modified = bframe.set_column(\"symbol\", [\"A\", \"B\", \"C\"])\n", - "print(modified)\n", - "\n", - "modified = bframe.set_column(\"new_col_name\", range(2, 5))\n", - "print(modified)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Change the row or column names:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "modified = bframe.\\\n", - " set_column_names([\"FOO\", \"BAR\"]).\\\n", - " set_row_names(['alpha', 'bravo', 'charlie'])\n", - "print(modified)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "***The functional style allows you to chain multiple operations.***\n", - "\n", - "We also support Bioconductor's metadata concepts, either along the columns or for the entire object:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "modified = bframe.\\\n", - " set_metadata({ \"author\": \"Jayaram Kancherla\" }).\\\n", - " set_column_data(BiocFrame({\"column_source\": [\"Ensembl\", \"HGNC\" ]}))\n", - "print(modified)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### The not-preferred-way\n", - "\n", - "Properties can also be set by direct assignment for in-place modification. We prefer not to do it this way as it can silently mutate ``BiocFrame`` instances inside other data structures.\n", - "Nonetheless:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "testframe = BiocFrame({ \"A\": [1,2,3], \"B\": [4,5,6] })\n", - "testframe.column_names = [\"column1\", \"column2\" ]\n", - "print(testframe)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Warnings are raised when properties are directly mutated. These assignments are the same as calling the corresponding `set_*()` methods with `in_place = True`.\n", - "It is best to do this only if the `BiocFrame` object is not being used anywhere else;\n", - "otherwise, it is safer to just create a (shallow) copy via the default `in_place = False`.\n", - "\n", - "Similarly, we could set or replace columns directly:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "testframe[\"column2\"] = [\"A\", \"B\", \"C\"]\n", - "testframe[1:3, [\"column1\",\"column2\"]] = BiocFrame({\"x\":[4, 5], \"y\":[\"E\", \"F\"]})" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Iterate over rows\n", - "\n", - "You can iterate over the rows of a `BiocFrame` object. `name` is `None` if the object does not contain any `row_names`. To iterate over the first two rows:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "for name, row in bframe[:2,]:\n", - " print(name, row)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Combining objects\n", - "\n", - "`BiocFrame` implements methods for the various `combine` generics from [**BiocUtils**](https://github.com/BiocPy/biocutils). For example, to combine by row:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import biocutils\n", - "\n", - "bframe1 = BiocFrame({\n", - " \"odd\": [1, 3, 5, 7, 9],\n", - " \"even\": [0, 2, 4, 6, 8],\n", - "})\n", - "\n", - "bframe2 = BiocFrame({\n", - " \"odd\": [11, 33, 55, 77, 99],\n", - " \"even\": [0, 22, 44, 66, 88],\n", - "})\n", - "\n", - "combined = biocutils.combine_rows(bframe1, bframe2)\n", - "print(combined)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Similarly, to combine by column:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "bframe3 = BiocFrame({\n", - " \"foo\": [\"A\", \"B\", \"C\", \"D\", \"E\"],\n", - " \"bar\": [True, False, True, False, True]\n", - "})\n", - "\n", - "combined = biocutils.combine_columns(bframe1, bframe3)\n", - "print(combined)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Relaxed combine operation\n", - "\n", - "By default, the combine methods assume that the number and identity of columns (for `combine_rows()`) or rows (for `combine_columns()`) are the same across objects. In situations where this is not the case, such as having different columns across objects, we can use `relaxed_combine_rows()` instead:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from biocframe import relaxed_combine_rows\n", - "\n", - "modified2 = bframe2.set_column(\"foo\", [\"A\", \"B\", \"C\", \"D\", \"E\"])\n", - "\n", - "combined = biocutils.relaxed_combine_rows(bframe1, modified2)\n", - "print(combined)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Sql-like join operation\n", - "\n", - "Similarly, if the rows are different, we can use `BiocFrame`'s `merge` function. This function uses the *row_names* as the index to perform this operation; you can specify an alternative set of keys through the `by` parameter." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from biocframe import merge\n", - "\n", - "modified1 = bframe1.set_row_names([\"A\", \"B\", \"C\", \"D\", \"E\"])\n", - "modified3 = bframe3.set_row_names([\"C\", \"D\", \"E\", \"F\", \"G\"])\n", - "\n", - "combined = merge([modified1, modified3], by=None, join=\"outer\")\n", - "print(combined)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Empty Frames\n", - "\n", - "We can create empty `BiocFrame` objects that only specify the number of rows. This is beneficial in scenarios where `BiocFrame` objects are incorporated into larger data structures but do not contain any data themselves." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "empty = BiocFrame(number_of_rows=100)\n", - "print(empty)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Most operations detailed in this document can be performed on an empty `BiocFrame` object." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(\"Column names:\", empty.column_names)\n", - "\n", - "subset_empty = empty[1:10,:]\n", - "print(\"\\nSubsetting an empty BiocFrame: \\n\", subset_empty)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Further reading\n", - "\n", - "- Explore more details [the reference documentation](https://biocpy.github.io/BiocFrame/).\n", - "- Additionally, take a look at Bioconductor's [**S4Vectors**](https://bioconductor.org/packages/S4Vectors) package, which implements the `DFrame` class upon which `BiocFrame` was built." - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "biocpy", - "language": "python", - "name": "python3" - }, - "language_info": { - "name": "python", - "version": "3.9.17" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/docs/tutorial.md b/docs/tutorial.md new file mode 100644 index 0000000..c91eb5c --- /dev/null +++ b/docs/tutorial.md @@ -0,0 +1,408 @@ +--- +file_format: mystnb +kernelspec: + name: python +--- + +# `BiocFrame` - Bioconductor-like data frames + +`BiocFrame` class is a Bioconductor-friendly data frame class. Its primary advantage lies in not making assumptions about the types of the columns - as long as an object has a length (`__len__`) and supports slicing methods (`__getitem__`), it can be used inside a `BiocFrame`. + +This flexibility allows us to accept arbitrarily complex objects as columns, which is often the case in Bioconductor objects. Also check out Bioconductor's [**S4Vectors**](https://bioconductor.org/packages/S4Vectors) package, which implements the `DFrame` class on which `BiocFrame` was based. + +These classes follow a functional paradigm for accessing or setting properties, with further details discussed in [functional paradigm](https://biocpy.github.io/tutorial/chapters/philosophy.html) section. + +# Advantages of `BiocFrame` + +One of the core principles guiding the implementation of the `BiocFrame` class is "**_what you put is what you get_**". Unlike Pandas `DataFrame`, `BiocFrame` makes no assumptions about the types of the columns provided as input. Some key differences to highlight the advantages of using `BiocFrame` are especially in terms of modifications to column types and handling nested dataframes. + +## Inadvertent modification of types + +As an example, Pandas `DataFrame` modifies the types of the input data. These assumptions may cause issues when interoperating between R and python. + + +```{code-cell} +import pandas as pd +import numpy as np +from array import array + +df = pd.DataFrame({ + "numpy_vec": np.zeros(10), + "list_vec": [1]* 10, + "native_array_vec": array('d', [3.14] * 10) # less used but native python arrays +}) + +print("type of numpy_vector column:", type(df["numpy_vec"]), df["numpy_vec"].dtype) +print("type of list_vector column:", type(df["list_vec"]), df["list_vec"].dtype) +print("type of native_array_vector column:", type(df["native_array_vec"]), df["native_array_vec"].dtype) + +print(df) +``` + +With `BiocFrame`, no assumptions are made, and the input data is not cast into (un)expected types: + + +```{code-cell} +from biocframe import BiocFrame +import numpy as np +from array import array + +bframe_types = BiocFrame({ + "numpy_vec": np.zeros(10), + "list_vec": [1]* 10, + "native_array_vec": array('d', [3.14] * 10) +}) + +print("type of numpy_vector column:", type(bframe_types["numpy_vec"])) +print("type of list_vector column:", type(bframe_types["list_vec"])) +print("type of native_array_vector column:", type(bframe_types["native_array_vec"])) +print(bframe_types) +``` + +This behavior remains consistent when extracting, slicing, combining, or performing any supported operation on `BiocFrame` objects. + +## Handling complex nested frames + +Pandas `DataFrame` does not support nested structures; therefore, running the snippet below will result in an error: + +```python +df = pd.DataFrame({ + "ensembl": ["ENS00001", "ENS00002", "ENS00002"], + "symbol": ["MAP1A", "BIN1", "ESR1"], + "ranges": pd.DataFrame({ + "chr": ["chr1", "chr2", "chr3"], + "start": [1000, 1100, 5000], + "end": [1100, 4000, 5500] + }), +}) +print(df) +``` + +However, it is handled seamlessly with `BiocFrame`: + + +```{code-cell} +bframe_nested = BiocFrame({ + "ensembl": ["ENS00001", "ENS00002", "ENS00002"], + "symbol": ["MAP1A", "BIN1", "ESR1"], + "ranges": BiocFrame({ + "chr": ["chr1", "chr2", "chr3"], + "start": [1000, 1100, 5000], + "end": [1100, 4000, 5500] + }), +}) + +print(bframe_nested) +``` + +This behavior remains consistent when extracting, slicing, combining, or performing any other supported operations on `BiocFrame` objects. + +# Construction + +Creating a `BiocFrame` object is straightforward; just provide the `data` as a dictionary. + + +```{code-cell} +from biocframe import BiocFrame + +obj = { + "ensembl": ["ENS00001", "ENS00002", "ENS00003"], + "symbol": ["MAP1A", "BIN1", "ESR1"], +} +bframe = BiocFrame(obj) +print(bframe) +``` + +You can specify complex objects as columns, as long as they have some "length" equal to the number of rows. + +For example, we can embed a `BiocFrame` within another `BiocFrame`. + + +```{code-cell} +obj = { + "ensembl": ["ENS00001", "ENS00002", "ENS00002"], + "symbol": ["MAP1A", "BIN1", "ESR1"], + "ranges": BiocFrame({ + "chr": ["chr1", "chr2", "chr3"], + "start": [1000, 1100, 5000], + "end": [1100, 4000, 5500] + }), +} + +bframe2 = BiocFrame(obj, row_names=["row1", "row2", "row3"]) +print(bframe2) +``` + +The `row_names` parameter is analogous to index in the pandas world and should not contain missing strings. Additionally, you may provide: + +- `column_data`: A `BiocFrame`object containing metadata about the columns. This must have the same number of rows as the numbers of columns. +- `metadata`: Additional metadata about the object, usually a dictionary. +- `column_names`: If different from the keys in the `data`. If not provided, this is automatically extracted from the keys in the `data`. + +# With other `DataFrame` libraries + +# Pandas +`BiocFrame` is intended for accurate representation of Bioconductor objects for interoperability with R, many users may prefer working with **pandas** `DataFrame` objects for their actual analyses. This conversion is easily achieved: + + +```{code-cell} +from biocframe import BiocFrame +bframe3 = BiocFrame( + { + "foo": ["A", "B", "C", "D", "E"], + "bar": [True, False, True, False, True] + } +) + +df = bframe3.to_pandas() +print(type(df)) +print(df) +``` + +Converting back to a `BiocFrame` is similarly straightforward: + + +```{code-cell} +out = BiocFrame.from_pandas(df) +print(out) +``` + +## Polars + +Similarly, you can easily go back and forth between `BiocFrame` and a polars `DataFrame`: + +```{code-cell} +from biocframe import BiocFrame +bframe3 = BiocFrame( + { + "foo": ["A", "B", "C", "D", "E"], + "bar": [True, False, True, False, True] + } +) + +pl = bframe3.to_polars() +print(pl) +``` + +# Extracting data + +BiocPy classes follow a functional paradigm for accessing or setting properties, with further details discussed in [functional paradigm](https://biocpy.github.io/tutorial/chapters/philosophy.html#functional-discipline) section. + +Properties can be directly accessed from the object: + + +```{code-cell} +print("shape:", bframe.shape) +print("column names (functional style):", bframe.get_column_names()) +print("column names (as property):", bframe.column_names) # same as above +``` + +We can fetch individual columns: + + +```{code-cell} +print("functional style:", bframe.get_column("ensembl")) +print("w/ accessor", bframe["ensembl"]) +``` + +And we can get individual rows as a dictionary: + + +```{code-cell} +bframe.get_row(2) +``` + +::: {.callout} +To retrieve a subset of the data in the `BiocFrame`, we use the subset (`[]`) operator. +This operator accepts different subsetting arguments, such as a boolean vector, a `slice` object, a sequence of indices, or row/column names. +::: + + +```{code-cell} +sliced_with_bools = bframe[1:2, [True, False, False]] +print("Subset using booleans: \n", sliced_with_bools) + +sliced_with_names = bframe[[0,2], ["symbol", "ensembl"]] +print("\nSubset using column names: \n", sliced_with_names) + +# Short-hand to get a single column: +print("\nShort-hand to get a single column: \n", bframe["ensembl"]) +``` + +# Setting data + +## Preferred approach + +For setting properties, we encourage a **functional style** of programming to avoid mutating the object directly. This helps prevent inadvertent modifications of `BiocFrame` instances within larger data structures. + + + +```{code-cell} +modified = bframe.set_column_names(["column1", "column2"]) +print(modified) +``` + +Now let's check the column names of the original object, + + +```{code-cell} +# Original is unchanged: +print(bframe.get_column_names()) +``` + +To add new columns, or replace existing ones: + + +```{code-cell} +modified = bframe.set_column("symbol", ["A", "B", "C"]) +print(modified) + +modified = bframe.set_column("new_col_name", range(2, 5)) +print(modified) +``` + +Change the row or column names: + + +```{code-cell} +modified = bframe.\ + set_column_names(["FOO", "BAR"]).\ + set_row_names(['alpha', 'bravo', 'charlie']) +print(modified) +``` + +***The functional style allows you to chain multiple operations.*** + +We also support Bioconductor's metadata concepts, either along the columns or for the entire object: + + +```{code-cell} +modified = bframe.\ + set_metadata({ "author": "Jayaram Kancherla" }).\ + set_column_data(BiocFrame({"column_source": ["Ensembl", "HGNC" ]})) +print(modified) +``` + +## The not-preferred-way + +Properties can also be set by direct assignment for in-place modification. We prefer not to do it this way as it can silently mutate ``BiocFrame`` instances inside other data structures. +Nonetheless: + + +```{code-cell} +testframe = BiocFrame({ "A": [1,2,3], "B": [4,5,6] }) +testframe.column_names = ["column1", "column2" ] +print(testframe) +``` + +Warnings are raised when properties are directly mutated. These assignments are the same as calling the corresponding `set_*()` methods with `in_place = True`. +It is best to do this only if the `BiocFrame` object is not being used anywhere else; +otherwise, it is safer to just create a (shallow) copy via the default `in_place = False`. + +Similarly, we could set or replace columns directly: + + +```{code-cell} +testframe["column2"] = ["A", "B", "C"] +testframe[1:3, ["column1","column2"]] = BiocFrame({"x":[4, 5], "y":["E", "F"]}) +``` + +# Iterate over rows + +You can iterate over the rows of a `BiocFrame` object. `name` is `None` if the object does not contain any `row_names`. To iterate over the first two rows: + + +```{code-cell} +for name, row in bframe[:2,]: + print(name, row) +``` + +# Combining objects + +`BiocFrame` implements methods for the various `combine` generics from [**BiocUtils**](https://github.com/BiocPy/biocutils). For example, to combine by row: + + + +```{code-cell} +import biocutils + +bframe1 = BiocFrame({ + "odd": [1, 3, 5, 7, 9], + "even": [0, 2, 4, 6, 8], +}) + +bframe2 = BiocFrame({ + "odd": [11, 33, 55, 77, 99], + "even": [0, 22, 44, 66, 88], +}) + +combined = biocutils.combine_rows(bframe1, bframe2) +print(combined) +``` + +Similarly, to combine by column: + + +```{code-cell} +bframe3 = BiocFrame({ + "foo": ["A", "B", "C", "D", "E"], + "bar": [True, False, True, False, True] +}) + +combined = biocutils.combine_columns(bframe1, bframe3) +print(combined) +``` + +## Relaxed combine operation + +By default, the combine methods assume that the number and identity of columns (for `combine_rows()`) or rows (for `combine_columns()`) are the same across objects. In situations where this is not the case, such as having different columns across objects, we can use `relaxed_combine_rows()` instead: + + +```{code-cell} +from biocframe import relaxed_combine_rows + +modified2 = bframe2.set_column("foo", ["A", "B", "C", "D", "E"]) + +combined = biocutils.relaxed_combine_rows(bframe1, modified2) +print(combined) +``` + +## Sql-like join operation + +Similarly, if the rows are different, we can use `BiocFrame`'s `merge` function. This function uses the *row_names* as the index to perform this operation; you can specify an alternative set of keys through the `by` parameter. + + +```{code-cell} +from biocframe import merge + +modified1 = bframe1.set_row_names(["A", "B", "C", "D", "E"]) +modified3 = bframe3.set_row_names(["C", "D", "E", "F", "G"]) + +combined = merge([modified1, modified3], by=None, join="outer") +print(combined) +``` + +# Empty Frames + +We can create empty `BiocFrame` objects that only specify the number of rows. This is beneficial in scenarios where `BiocFrame` objects are incorporated into larger data structures but do not contain any data themselves. + + +```{code-cell} +empty = BiocFrame(number_of_rows=100) +print(empty) +``` + +Most operations detailed in this document can be performed on an empty `BiocFrame` object. + + +```{code-cell} +print("Column names:", empty.column_names) + +subset_empty = empty[1:10,:] +print("\nSubsetting an empty BiocFrame: \n", subset_empty) +``` + +## Further reading + +- Explore more details [the reference documentation](https://biocpy.github.io/BiocFrame/). +- Additionally, take a look at Bioconductor's [**S4Vectors**](https://bioconductor.org/packages/S4Vectors) package, which implements the `DFrame` class upon which `BiocFrame` was built.