mostly finishing up biocframe
jkanche committed Jan 12, 2024
1 parent f67b813 commit c1f4a09
Showing 5 changed files with 147 additions and 47 deletions.
Empty file removed chapters/functional.qmd
Empty file.
4 changes: 3 additions & 1 deletion chapters/philosophy.qmd
@@ -40,7 +40,7 @@ programming paradigm:
gr.flank(width=2, start=False, both=True)
```

### Functional discipline
### Functional discipline {#sec-functional}

The existence of mutable types in Python introduces the potential danger of
modifying complex objects.
@@ -110,6 +110,8 @@ def find_element(arr: List[str], query: Union[int, str, slice]):
pass
```

-----
## Notes
Additionally, we provide recommendations on setting up the package, different
testing environments, documentation, and publishing workflows.
These details can be found in the [developer guide](https://github.com/BiocPy/developer_guide).
176 changes: 137 additions & 39 deletions chapters/representations/biocframe.qmd
@@ -1,8 +1,8 @@
# `BiocFrame` - Bioconductor-like data frames
# `BiocFrame` - Bioconductor-like data frames {.unnumbered}

`BiocFrame` class is a Bioconductor-friendly alternative to Pandas `DataFrame`. Its key advantage lies in not making assumptions on the types of the columns - as long as an object has a length (`__len__`) and supports slicing methods (`__getitem__`), it can be used inside a `BiocFrame`.
The `BiocFrame` class is a Bioconductor-friendly alternative to Pandas `DataFrame`. Its primary advantage lies in not making assumptions about the types of the columns - as long as an object has a length (`__len__`) and supports slicing methods (`__getitem__`), it can be used inside a `BiocFrame`.

This flexibility allows us to accept arbitrarily complex objects as columns, which is often the case in Bioconductor objects.
This flexibility allows us to accept arbitrarily complex objects as columns, which is often the case in Bioconductor objects. Also check out Bioconductor's [**S4Vectors**](https://bioconductor.org/packages/S4Vectors) package, which implements the `DFrame` class on which `BiocFrame` was based.
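
The column contract described above (a length plus `__getitem__`) can be illustrated without `BiocFrame` itself. The following is a minimal sketch: `is_column_like` and `GenomicIntervals` are hypothetical names invented for this illustration and are not part of the `biocframe` package, and the check is only an approximation of what a frame-like container might require.

```python
from array import array


def is_column_like(obj) -> bool:
    """Return True if obj has a length and supports indexing/slicing.

    Hypothetical helper for illustration; not part of biocframe.
    """
    return hasattr(obj, "__len__") and hasattr(obj, "__getitem__")


class GenomicIntervals:
    """A toy column type (assumption): parallel lists of starts and ends."""

    def __init__(self, starts, ends):
        if len(starts) != len(ends):
            raise ValueError("starts and ends must have the same length")
        self.starts = list(starts)
        self.ends = list(ends)

    def __len__(self):
        return len(self.starts)

    def __getitem__(self, index):
        # A slice returns a new GenomicIntervals; an integer returns one (start, end) pair.
        if isinstance(index, slice):
            return GenomicIntervals(self.starts[index], self.ends[index])
        return (self.starts[index], self.ends[index])


candidates = {
    "list": [1, 2, 3],
    "array": array("d", [3.14] * 3),
    "intervals": GenomicIntervals([1000, 1100, 5000], [1100, 4000, 5500]),
    "integer": 42,
}
checks = {name: is_column_like(obj) for name, obj in candidates.items()}
print(checks)
```

Only the bare integer fails the check, since it has neither a length nor item access; the custom class qualifies purely by implementing the two methods.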

## Installation

@@ -12,9 +12,97 @@ To get started, install the package from [PyPI](https://pypi.org/project/biocfra
pip install biocframe
```

## Advantage of using `BiocFrame`

One of the core principles guiding the implementation of the `BiocFrame` class is "what you put is what you get." Unlike Pandas `DataFrame`, `BiocFrame` makes no assumptions about the types of the columns provided as input. The key differences highlighted below concern modifications to column types and the handling of nested data frames.

### Inadvertent modification of types

As an example, Pandas `DataFrame` modifies the types of the input data:

```{python}
import pandas as pd
import numpy as np
from array import array
df = pd.DataFrame({
"numpy_vec": np.zeros(10),
"list_vec": [1]* 10,
"native_array_vec": array('d', [3.14] * 10)
})
print("type of numpy_vector column:", type(df["numpy_vec"]), df["numpy_vec"].dtype)
print("type of list_vector column:", type(df["list_vec"]), df["list_vec"].dtype)
print("type of native_array_vector column:", type(df["native_array_vec"]), df["native_array_vec"].dtype)
print(df)
```

With `BiocFrame`, no assumptions are made, and the input data is not cast into expected types:

```{python}
from biocframe import BiocFrame
import numpy as np
from array import array
bframe_types = BiocFrame({
"numpy_vec": np.zeros(10),
"list_vec": [1]* 10,
"native_array_vec": array('d', [3.14] * 10)
})
print("type of numpy_vector column:", type(bframe_types["numpy_vec"]))
print("type of list_vector column:", type(bframe_types["list_vec"]))
print("type of native_array_vector column:", type(bframe_types["native_array_vec"]))
print(bframe_types)
```

:::{.callout-note}
This behavior remains consistent when extracting, slicing, combining, or performing any other supported operations on `BiocFrame` objects.
:::

### Handling complex nested frames

Pandas `DataFrame` does not support nested structures; therefore, running the snippet below will result in an error:

```{python}
#| eval: false
df = pd.DataFrame({
"ensembl": ["ENS00001", "ENS00002", "ENS00002"],
"symbol": ["MAP1A", "BIN1", "ESR1"],
"ranges": pd.DataFrame({
"chr": ["chr1", "chr2", "chr3"],
"start": [1000, 1100, 5000],
"end": [1100, 4000, 5500]
}),
})
print(df)
```

However, it is handled seamlessly with `BiocFrame`:

```{python}
bframe_nested = BiocFrame({
"ensembl": ["ENS00001", "ENS00002", "ENS00002"],
"symbol": ["MAP1A", "BIN1", "ESR1"],
"ranges": BiocFrame({
"chr": ["chr1", "chr2", "chr3"],
"start": [1000, 1100, 5000],
"end": [1100, 4000, 5500]
}),
})
print(bframe_nested)
```

:::{.callout-note}
This behavior remains consistent when extracting, slicing, combining, or performing any other supported operations on `BiocFrame` objects.
:::

## Construction

To create a `BiocFrame` object, simply provide the data as a dictionary.
Creating a `BiocFrame` object is straightforward; just provide the `data` as a dictionary.

```{python}
from biocframe import BiocFrame
@@ -29,7 +117,7 @@ print(bframe)

::: {.callout-tip}
You can specify complex objects as columns, as long as they have some "length" equal to the number of rows.
For example, we can embed a `BiocFrame` within another `BiocFrame`:
For example, we can embed a `BiocFrame` within another `BiocFrame`.
:::


@@ -48,8 +136,42 @@ bframe2 = BiocFrame(obj, row_names=["row1", "row2", "row3"])
print(bframe2)
```

The `row_names` parameter is analogous to the index in pandas and should not contain missing strings. Additionally, you may provide:

- `column_data`: A `BiocFrame` object containing metadata about the columns. This must have the same number of rows as the number of columns.
- `metadata`: Additional metadata about the object, usually a dictionary.
- `column_names`: Column names, if different from the keys in the `data`. If not provided, these are automatically extracted from the keys in the `data`.

### Interop with pandas

While `BiocFrame` is intended for accurate representation of Bioconductor objects for interoperability with R, many users may prefer working with **pandas** `DataFrame` objects for their actual analyses. This conversion is easily achieved:

```{python}
from biocframe import BiocFrame
bframe3 = BiocFrame(
{
"foo": ["A", "B", "C", "D", "E"],
"bar": [True, False, True, False, True]
}
)
df = bframe3.to_pandas()
print(type(df))
print(df)
```

Converting back to a `BiocFrame` is similarly straightforward:

```{python}
out = BiocFrame.from_pandas(df)
print(out)
```


## Extracting data

BiocPy classes follow a functional paradigm for accessing or setting properties, with further details available in [@sec-functional].

Properties can be directly accessed from the object:

```{python}
@@ -91,21 +213,21 @@ print("\nShort-hand to get a single column: \n", bframe["ensembl"])

### Preferred approach

To set `BiocFrame` properties, we encourage a **functional style** of programming that avoids mutating the object. This avoids inadvertent modification of `BiocFrame` instances within larger data structures.
For setting properties, we encourage a **functional style** of programming to avoid mutating the object directly. This helps prevent inadvertent modifications of `BiocFrame` instances within larger data structures.

```{python}
modified = bframe.set_column_names(["column1", "column2"])
print(modified)
```

Now lets check the column names of the original object,
Now let's check the column names of the original object:

```{python}
# Original is unchanged:
print(bframe.get_column_names())
```
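
The functional style described above can be sketched with a toy class. This is an illustration under stated assumptions, not the `BiocFrame` implementation: `TinyFrame` is a hypothetical stand-in, and only the method names `set_column_names()`, `get_column_names()` and the `in_place` argument mirror those mentioned in this chapter.

```python
import copy
import warnings


class TinyFrame:
    """Toy stand-in for a frame-like object; not the BiocFrame implementation."""

    def __init__(self, data):
        self._data = dict(data)

    def get_column_names(self):
        return list(self._data.keys())

    def set_column_names(self, names, in_place=False):
        names = list(names)
        if len(names) != len(self._data):
            raise ValueError("number of names must match number of columns")
        # Default: work on a shallow copy so the caller's object is untouched.
        target = self if in_place else copy.copy(self)
        target._data = dict(zip(names, self._data.values()))
        return target

    @property
    def column_names(self):
        return self.get_column_names()

    @column_names.setter
    def column_names(self, names):
        # Direct assignment mutates in place, so warn the caller (assumed behavior).
        warnings.warn("setting 'column_names' mutates the object in place", UserWarning)
        self.set_column_names(names, in_place=True)


original = TinyFrame({"ensembl": ["ENS00001", "ENS00002"], "symbol": ["MAP1A", "BIN1"]})
modified = original.set_column_names(["column1", "column2"])
print(original.get_column_names())  # ['ensembl', 'symbol']
print(modified.get_column_names())  # ['column1', 'column2']
```

The copy is shallow: the renamed object shares the underlying column values with the original, so only the mapping from names to columns is duplicated.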

To add new columns, or replace existing columns:
To add new columns, or replace existing ones:

```{python}
modified = bframe.set_column("symbol", ["A", "B", "C"])
@@ -138,7 +260,7 @@ modified = bframe.\
print(modified)
```

### The other way
### The not-preferred-way

Properties can also be set by direct assignment for in-place modification. We prefer not to do it this way as it can silently mutate ``BiocFrame`` instances inside other data structures.
Nonetheless:
@@ -149,7 +271,7 @@ testframe.column_names = ["column1", "column2" ]
print(testframe)
```

::: {.callout-important}
::: {.callout-caution}
Warnings are raised when properties are directly mutated. These assignments are the same as calling the corresponding `set_*()` methods with `in_place = True`.
It is best to do this only if the `BiocFrame` object is not being used anywhere else;
otherwise, it is safer to just create a (shallow) copy via the default `in_place = False`.
@@ -199,6 +321,7 @@ print(combined)
By default, both methods above assume that the number and identity of columns (for `combine_rows()`) or rows (for `combine_columns()`) are the same across objects.
:::

### Relaxed combine operation
If this is not the case, e.g., with different columns across objects, we can use `relaxed_combine_rows()` instead:

```{python}
@@ -222,33 +345,10 @@ combined = merge([modified1, modified3], by=None, join="outer")
print(combined)
```
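
To picture what a relaxed row-wise combine does when columns differ across objects, here is a minimal sketch on plain dictionaries of lists. It is an illustration only: `relaxed_combine_rows_sketch` is a hypothetical function, and padding absent columns with `None` is an assumption made for this example rather than a statement about how `biocframe` represents missing values.

```python
def relaxed_combine_rows_sketch(*frames):
    """Row-bind dict-of-lists 'frames', padding absent columns with None.

    Illustrative only; this is not biocframe's implementation.
    """
    # Union of column names, in order of first appearance.
    columns = []
    for frame in frames:
        for name in frame:
            if name not in columns:
                columns.append(name)

    combined = {name: [] for name in columns}
    for frame in frames:
        # Number of rows is taken from any column (all are assumed equal in length).
        n_rows = len(next(iter(frame.values()))) if frame else 0
        for name in columns:
            combined[name].extend(frame.get(name, [None] * n_rows))
    return combined


first = {"foo": ["A", "B"], "bar": [True, False]}
second = {"foo": ["C"], "baz": [1.5]}
combined = relaxed_combine_rows_sketch(first, second)
print(combined)
# {'foo': ['A', 'B', 'C'], 'bar': [True, False, None], 'baz': [None, None, 1.5]}
```

Every output column ends up with the same length (the total number of rows), which is the property a strict combine would otherwise enforce by requiring identical columns up front.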

## Interop with pandas

`BiocFrame` is intended for accurate representation of Bioconductor objects for interoperability with R. Most users will probably prefer to work with **pandas** `DataFrame` objects for their actual analyses. This conversion is easily achieved:

```{python}
from biocframe import BiocFrame
bframe = BiocFrame(
{
"foo": ["A", "B", "C", "D", "E"],
"bar": [True, False, True, False, True]
}
)
pd = bframe.to_pandas()
print(pd)
```

Conversion back to a ``BiocFrame`` is similarly easy:

```{python}
out = BiocFrame.from_pandas(pd)
print(out)
```

## Empty Frames

We can create empty `BiocFrame` objects that only specify the number of rows. This proves beneficial in situations where `BiocFrame` objects are integrated into more extensive data structures but do not possess any data themselves.
We can create empty `BiocFrame` objects that only specify the number of rows. This is beneficial in situations where `BiocFrame` objects are integrated into more extensive data structures but do not contain any data themselves.

```{python}
empty = BiocFrame(number_of_rows=100)
@@ -264,12 +364,10 @@ subset_empty = empty[1:10,:]
print("\nSubsetting an empty BiocFrame: \n", subset_empty)
```

::: {.callout-tip}
Similarly one can create an empty `BiocFrame` with only row names.
:::
----

## Further reading
## Notes

Check out [the reference documentation](https://biocpy.github.io/BiocFrame/) for more details.

Also see check out Bioconductor's [**S4Vectors**](https://bioconductor.org/packages/S4Vectors) package, which implements the `DFrame` class on which `BiocFrame` was based.
Also check out Bioconductor's [**S4Vectors**](https://bioconductor.org/packages/S4Vectors) package, which implements the `DFrame` class on which `BiocFrame` was based.
10 changes: 5 additions & 5 deletions chapters/representations/index.qmd
@@ -1,24 +1,24 @@
# The basics

This chapter introduces the core representations and classes available through BiocPy.
All packages in the `BiocPy` ecosystem are released on Python's package registry - [PyPI](https://pypi.org/).

All packages in the `BiocPy` ecosystem are published to Python's Package Index - [PyPI](https://pypi.org/).


`biocpy` is a wrapper package that install all core packages in the ecosystem.
The [biocpy](https://github.com/BiocPy/BiocPy) package serves as a convenient wrapper that installs all the core packages within the ecosystem.

```bash
pip install biocpy
```

OR install packages as needed. e.g.
Alternatively, you can install specific packages as required. For example:

```bash
pip install summarizedexperiment # <package-name>
```

# Update packages

To update packages, use the following command:

```bash
pip install -U biocpy # or <package-name>
```
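
After installing or updating, it can be handy to confirm which versions are actually present. The sketch below uses only the standard library's `importlib.metadata`; `installed_version` is a hypothetical helper name, and the package names are the ones mentioned in this chapter.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


for name in ["biocpy", "biocframe", "summarizedexperiment"]:
    found = installed_version(name)
    print(f"{name}: {found if found is not None else 'not installed'}")
```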
4 changes: 2 additions & 2 deletions index.qmd
@@ -53,7 +53,7 @@ Index (PyPI).
For complete list of all packages, visit the
[GitHub:BiocPy](https://github.com/BiocPy) repository.

#### core representations:
#### Core representations:

- `BiocUtils` ([GitHub](https://github.com/BiocPy/BiocUtils), [Docs](https://biocpy.github.io/BiocUtils/)): Common utilities for use across packages, mostly to mimic convenient aspects of base R.
- `BiocFrame` ([GitHub](https://github.com/BiocPy/BiocFrame), [Docs](https://biocpy.github.io/BiocFrame/)): Bioconductor-like dataframes in Python.
@@ -77,6 +77,6 @@ For complete list of all packages, visit the
- `pyBiocFileCache` ([GitHub](https://github.com/BiocPy/pyBiocFileCache), [Docs](https://pypi.org/project/pyBiocFileCache/), [BioC](https://github.com/Bioconductor/BiocFileCache)): File system based cache for resources & metadata.

-----
#### Notes
## Notes

This is a reproducible Quarto book with ***reusable snippets***. To learn more about Quarto books visit <https://quarto.org/docs/books>.
