Switch to pybind11 and a complete rewrite of the package (#33)
* Complete overhaul of the codebase using pybind11
* Streamlined readers for R data types
* Updated API for all classes and methods
* Updated documentation and tests.
jkanche authored Oct 25, 2024
1 parent 7ce4931 commit 32f61ab
Showing 60 changed files with 1,772 additions and 1,364 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/build-docs.yml
@@ -29,7 +29,7 @@ jobs:
- name: Build docs
run: |
python setup.py build_ext --inplace
cp build/lib*/rds2py/rds_parser* src/rds2py/
cp build/lib*/rds2py/lib_rds_parser* src/rds2py/
tox -e docs
touch ./docs/_build/html/.nojekyll
2 changes: 1 addition & 1 deletion .github/workflows/publish-pypi.yml
@@ -28,7 +28,7 @@ jobs:

build_macosx_x86_64:
name: Build wheels for macosx x86_64
runs-on: macos-11
runs-on: macos-13
steps:
- name: Check out repository
uses: actions/checkout@v3
3 changes: 3 additions & 0 deletions .gitignore
@@ -52,3 +52,6 @@ MANIFEST
.venv*/
.conda*/
.python-version

extern/rds2cpp*
src/rds2py/lib/parser.cpp
12 changes: 7 additions & 5 deletions .pre-commit-config.yaml
@@ -25,18 +25,20 @@ repos:
args: [--in-place, --wrap-descriptions=120, --wrap-summaries=120]
# --config, ./pyproject.toml

- repo: https://github.com/psf/black
rev: 24.8.0
hooks:
- id: black
language_version: python3
# - repo: https://github.com/psf/black
# rev: 24.8.0
# hooks:
# - id: black
# language_version: python3

- repo: https://github.com/astral-sh/ruff-pre-commit
# Ruff version.
rev: v0.6.8
hooks:
- id: ruff
args: [--fix, --exit-non-zero-on-fix]
# Run the formatter.
- id: ruff-format

## If like to embrace black styles even in the docs:
# - repo: https://github.com/asottile/blacken-docs
2 changes: 1 addition & 1 deletion AUTHORS.md
@@ -1,3 +1,3 @@
# Contributors

* jkanche [[email protected]](mailto:[email protected])
* Jayaram Kancherla [[email protected]](mailto:[email protected])
8 changes: 6 additions & 2 deletions CHANGELOG.md
@@ -1,13 +1,17 @@
# Changelog

## Development
## Version 0.5.0

- Fix github issue with showing incorrect package version on github pages.
- Complete overhaul of the codebase using pybind11
- Streamlined readers for R data types
- Updated API for all classes and methods
- Updated documentation and tests.

## Version 0.4.5

- Switch to pybind11 to implement the bindings to rds2cpp.
- Update tests, documentation and actions.
- Fix github issue with showing incorrect package version on github pages.

## Version 0.4.4

107 changes: 71 additions & 36 deletions README.md
@@ -4,69 +4,104 @@

# rds2py

Parse and construct Python representations for datasets stored in RDS files. `rds2py` supports a few base classes from R and Bioconductor's `SummarizedExperiment` and `SingleCellExperiment` S4 classes. **_This is possible because of [Aaron's rds2cpp library](https://github.com/LTLA/rds2cpp)._**

The package uses memory views (except for strings) to access the same memory from C++ in Python (through Cython of course). This is especially useful for large datasets so we don't make multiple copies of data.

## Install
Parse and construct Python representations for datasets stored in RDS files. `rds2py` supports various base classes from R, and Bioconductor's `SummarizedExperiment` and `SingleCellExperiment` S4 classes. ***For more details, check out the [rds2cpp library](https://github.com/LTLA/rds2cpp).***

> **Important Version Notice**
>
> Version 0.5.0 brings major changes to the package:
> - Complete overhaul of the codebase using pybind11
> - Streamlined readers for R data types
> - Updated API for all classes and methods
>
> Please refer to the [documentation](https://biocpy.github.io/rds2py/) for the latest usage guidelines. Previous versions may have incompatible APIs.
The package provides:

- Efficient parsing of RDS files with *minimal* memory overhead
- Support for R's basic data types and complex S4 objects
  - Vectors (numeric, character, logical)
  - Factors
  - Data frames
  - Matrices (dense and sparse)
  - Run-length encoded vectors (Rle)
- Conversion to appropriate Python/NumPy/SciPy data structures
  - dgCMatrix (sparse column matrix)
  - dgRMatrix (sparse row matrix)
  - dgTMatrix (sparse triplet matrix)
- Preservation of metadata and attributes from R objects
- Integration with BiocPy ecosystem for Bioconductor classes
  - SummarizedExperiment
  - RangedSummarizedExperiment
  - SingleCellExperiment
  - GenomicRanges
  - MultiAssayExperiment

## Installation

The package is published to [PyPI](https://pypi.org/project/rds2py/):

```shell
pip install rds2py
```

## Usage

If you do not have an RDS object handy, feel free to download one from [single-cell-test-files](https://github.com/jkanche/random-test-files/releases).
## Quick Start

```python
from rds2py import as_summarized_experiment, read_rds
from rds2py import read_rds

r_obj = read_rds(<path_to_file>)
# Read any RDS file
r_obj = read_rds("path/to/file.rds")
```

This `r_obj` holds a dictionary representation of the RDS file, we can now transform this object into Python representations.

`rObj` always contains two keys
## Usage

- `data`: If atomic entities, contains the NumPy view of the array.
- `attributes`: Additional properties available for the object.
If you do not have an RDS object handy, feel free to download one from [single-cell-test-files](https://github.com/jkanche/random-test-files/releases).

In addition, the package provides functions to convert parsed R objects into Python representations.
### Basic Usage

```python
from rds2py import as_spase_matrix, as_summarized_experiment

# to convert an robject to a sparse matrix
sp_mat = as_sparse(rObj)

# to convert an robject to SCE
sce = as_summarized_experiment(rObj)
from rds2py import read_rds
r_obj = read_rds("path/to/file.rds")
```

For more examples converting `data.frame`, `dgCMatrix`, `dgRMatrix`, `dgTMatrix` to Python, checkout the [documentation](https://biocpy.github.io/rds2py/).
`read_rds` returns an appropriate Python object if a parser is already implemented for the R class; otherwise it returns a dictionary containing the data from the RDS file.

## Developer Notes
## Write-your-own-reader

This project uses Cython to provide bindings from C++ to Python.
In addition, the package provides the dictionary representation of the RDS file, allowing users to write their own custom readers into appropriate Python representations.

Steps to setup dependencies -
```python
from rds2py import parse_rds

- git submodules is initialized in `extern/rds2cpp`
- `cmake .` in `extern/rds2cpp` directory to download dependencies, especially the `byteme` library
data = parse_rds("path/to/file.rds")
print(data)
```

First one needs to build the extern library, this would generate a shared object file to `src/rds2py/core-[*].so`
If you know this RDS file contains a `GenomicRanges` object, you can use or modify the provided reader, or write your own parser to convert this dictionary.

```shell
python setup.py build_ext --inplace
```python
from rds2py.read_granges import read_genomic_ranges

gr = read_genomic_ranges(data)
```
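
The "use a provided reader if one exists, otherwise fall back to the dictionary" behaviour described above can be pictured as a small dispatch table. The following is only a sketch under stated assumptions: the `READERS` registry, `register_reader`, `convert`, the `my_point` class, and the `class_name`/`data`/`attributes` keys are all hypothetical stand-ins for illustration, not the package's actual API or dictionary layout.

```python
from typing import Any, Callable, Dict

# Hypothetical registry mapping an R class name to a reader function.
READERS: Dict[str, Callable[[Dict[str, Any]], Any]] = {}


def register_reader(class_name: str):
    """Decorator that records a reader function for the given R class name."""

    def decorator(func):
        READERS[class_name] = func
        return func

    return decorator


@register_reader("my_point")
def read_my_point(robj: Dict[str, Any]):
    """Convert a hypothetical 'my_point' dictionary into an (x, y) tuple."""
    x, y = robj["data"]
    return (float(x), float(y))


def convert(robj: Dict[str, Any]):
    """Return a Python object if a reader is registered, else the dictionary itself."""
    reader = READERS.get(robj.get("class_name"))
    return reader(robj) if reader is not None else robj


# Hand-written stand-ins for parsed output; the key names are assumptions.
parsed = {"class_name": "my_point", "data": [1, 2], "attributes": {}}
unknown = {"class_name": "something_else", "data": [], "attributes": {}}

point = convert(parsed)
fallback = convert(unknown)
print(point)
print(fallback is unknown)
```

Keeping readers in a lookup table means a custom parser can be added for a new R class without touching the fallback path.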

For typical development workflows, run
## Type Conversion Reference

```shell
python setup.py build_ext --inplace && tox
```
| R Type | Python/NumPy Type |
|--------|------------------|
| numeric | numpy.ndarray (float64) |
| integer | numpy.ndarray (int32) |
| character | list of str |
| logical | numpy.ndarray (bool) |
| factor | list |
| data.frame | BiocFrame |
| matrix | numpy.ndarray or scipy.sparse matrix |
| dgCMatrix | scipy.sparse.csc_matrix |
| dgRMatrix | scipy.sparse.csr_matrix |
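
The `factor` row maps to a plain list. R stores a factor as 1-based integer codes plus a vector of level labels, so producing a list of labels means looking each code up in the levels. A minimal sketch of that lookup follows; `decode_factor` is a hypothetical helper written for illustration (not a function of this package), and representing missing values as `None` is an assumption.

```python
from typing import List, Optional, Sequence


def decode_factor(
    codes: Sequence[Optional[int]], levels: Sequence[str]
) -> List[Optional[str]]:
    """Map 1-based R factor codes to level labels; missing codes become None."""
    decoded: List[Optional[str]] = []
    for code in codes:
        if code is None:
            # Assumption: missing values are passed in as None.
            decoded.append(None)
        else:
            # R codes start at 1, Python indices at 0.
            decoded.append(levels[code - 1])
    return decoded


codes = [1, 3, 2, None, 1]
levels = ["ctrl", "drug_a", "drug_b"]
labels = decode_factor(codes, levels)
print(labels)
```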

## Developer Notes

This project uses pybind11 to provide bindings to the rds2cpp library. Please make sure a C++ compiler is installed on your system.

<!-- pyscaffold-notes -->

72 changes: 24 additions & 48 deletions docs/tutorial.md
@@ -2,64 +2,40 @@

If you do not have an RDS object handy, feel free to download one from [single-cell-test-files](https://github.com/jkanche/random-test-files/releases).

## Step 1: Read a RDS file in Python

First we need to read the RDS file that can be easily explored in Python. The `read_rds` parses the R object and returns
a dictionary of the R object.
### Basic Usage

```python
from rds2py import read_rds

rObj = read_rds(<path_to_file>)
r_obj = read_rds("path/to/file.rds")
```

Once we have a realized structure, we can now convert this object to useful Python representations. It contains two keys

- `data`: If atomic entities, contains the numpy view of the memory space.
- `attributes`: Additional properties available for the object.

The package provides friendly functions to convert some R representations to useful Python representations.

## Step 2: Python representations
`read_rds` returns an appropriate Python object if a parser is already implemented for the R class; otherwise it returns a dictionary containing the data from the RDS file.

### Matrices
## Write-your-own-reader

Use these methods if the RDS file contains either a sparse matrix (`dgCMatrix`, `dgRMatrix`, or `dgTMatrix`) or a dense matrix.

**_Note: If an R object contains `dims` in the `attributes`, we consider this as a matrix._**
In addition, the package provides the dictionary representation of the RDS file, allowing users to write their own custom readers into appropriate Python representations.

```python
from rds2py import as_spase_matrix, as_dense_matrix

# to convert an robject to a sparse matrix
sp_mat = as_sparse_matrix(rObj)

# to convert an robject to a sparse matrix
dense_mat = as_dense_matrix(rObj)
```

### Pandas DataFrame
from rds2py import parse_rds

Methods are available to construct a pandas `DataFrame` from data stored in an RDS file. The package supports two R classes for this operation - `data.frame` and `DFrame` classes.

```python
from rds2py import as_pandas

# to convert an robject to DF
df = as_pandas(rObj)
```

### S4 classes: specifically `SingleCellExperiment` or `SummarizedExperiment`

We also support `SingleCellExperiment` or `SummarizedExperiment` from Bioconductor. the `as_summarized_experiment` method is how we one can do this operation.

**_Note: This method also serves as an example on how to convert complex R structures into Python representations._**

```python
from rds2py import as_summarized_experiment
data = parse_rds("path/to/file.rds")
print(data)

# to convert an robject to SCE
sp_mat = as_summarized_experiment(rObj)
# now write your own parser to convert this dictionary.
```

Well thats it, hack on & create more base representations to encapsulate complex structures. If you want to add more representations, feel free to send a PR!
## Type Conversion Reference

| R Type | Python/NumPy Type |
|--------|------------------|
| numeric | numpy.ndarray (float64) |
| integer | numpy.ndarray (int32) |
| character | list of str |
| logical | numpy.ndarray (bool) |
| factor | list |
| data.frame | BiocFrame |
| matrix | numpy.ndarray or scipy.sparse matrix |
| dgCMatrix | scipy.sparse.csc_matrix |
| dgRMatrix | scipy.sparse.csr_matrix |

Check out the module reference for more information on these classes.
2 changes: 1 addition & 1 deletion lib/CMakeLists.txt
@@ -31,6 +31,6 @@ set_property(TARGET ${TARGET} PROPERTY CXX_STANDARD 17)
target_link_libraries(${TARGET} PRIVATE rds2cpp pybind11::pybind11)

set_target_properties(${TARGET} PROPERTIES
OUTPUT_NAME rds_parser
OUTPUT_NAME lib_rds_parser
PREFIX ""
)
4 changes: 2 additions & 2 deletions lib/src/rdswrapper.cpp
@@ -20,12 +20,12 @@ class RdsReader {
if (!ptr) throw std::runtime_error("Null pointer in 'get_rtype'.");
// py::print("arg::", static_cast<int>(ptr->type()));
switch (ptr->type()) {
case rds2cpp::SEXPType::S4: return "S4";
case rds2cpp::SEXPType::INT: return "integer";
case rds2cpp::SEXPType::REAL: return "double";
case rds2cpp::SEXPType::STR: return "string";
case rds2cpp::SEXPType::LGL: return "boolean";
case rds2cpp::SEXPType::VEC: return "vector";
case rds2cpp::SEXPType::S4: return "S4";
case rds2cpp::SEXPType::NIL: return "null";
default: return "other";
}
@@ -164,7 +164,7 @@ class RdsObject {
}
};

PYBIND11_MODULE(rds_parser, m) {
PYBIND11_MODULE(lib_rds_parser, m) {
py::register_exception<std::runtime_error>(m, "RdsParserError");

py::class_<RdsObject>(m, "RdsObject")
13 changes: 6 additions & 7 deletions setup.cfg
@@ -5,7 +5,7 @@

[metadata]
name = rds2py
description = Parse and read RDS files as Python representations
description = Parse and construct Python representations for datasets stored in RDS files
author = jkanche
author_email = [email protected]
license = MIT
@@ -50,11 +50,13 @@ python_requires = >=3.8
install_requires =
importlib-metadata; python_version<"3.8"
numpy
pandas
scipy
biocutils>=0.1.5
singlecellexperiment>=0.4.1
summarizedexperiment>=0.4.1
genomicranges>=0.4.9
biocframe
multiassayexperiment

[options.packages.find]
where = src
@@ -65,17 +67,14 @@ exclude =
# Add here additional requirements for extra features, to install with:
# `pip install rds2py[PDF]` like:
# PDF = ReportLab; RXP
optional =
pandas

# Add here test requirements (semicolon/line-separated)
testing =
setuptools
pytest
pytest-cov
numpy
pandas
scipy
singlecellexperiment
summarizedexperiment

[options.entry_points]
# Add here console scripts like:
5 changes: 1 addition & 4 deletions setup.py
@@ -38,10 +38,7 @@ def build_cmake(self, ext):
"lib",
"-B",
build_temp,
"-Dpybind11_DIR="
+ os.path.join(
os.path.dirname(pybind11.__file__), "share", "cmake", "pybind11"
),
"-Dpybind11_DIR=" + os.path.join(os.path.dirname(pybind11.__file__), "share", "cmake", "pybind11"),
"-DPYTHON_EXECUTABLE=" + sys.executable,
]
if os.name != "nt":