Overhaul of the package #33

Merged
merged 64 commits into from
Oct 25, 2024
Commits
90ede9d
starting to rewrite
jkanche Jan 25, 2024
46c1f95
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 25, 2024
08f809a
a few more things
jkanche Jan 31, 2024
fbf9a26
Merge branch 'rewrite' of https://github.com/BiocPy/rds2py into rewrite
jkanche Jan 31, 2024
4beb157
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 31, 2024
db94a55
Merge branch 'master' of https://github.com/BiocPy/rds2py into rewrite
jkanche Mar 18, 2024
eec5503
Merge branch 'rewrite' of https://github.com/BiocPy/rds2py into rewrite
jkanche Mar 18, 2024
b63e8d2
reading atomics
jkanche Mar 18, 2024
214a263
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 18, 2024
b90452a
remove cython generated file
jkanche Mar 18, 2024
bc76935
import all from core
jkanche Mar 19, 2024
db0522a
very confused why this fails
jkanche Mar 19, 2024
6826586
always use utf8 encoding/decoding
jkanche Mar 19, 2024
95d4b20
remove top level import
jkanche Mar 19, 2024
adc2c5c
renaming cython bindings
jkanche Mar 26, 2024
2b09962
commit the last thing
jkanche Jun 27, 2024
049f685
Merge branch 'master' into rewrite
jkanche Oct 9, 2024
4d37553
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 9, 2024
d16aeec
read atomics
jkanche Oct 9, 2024
7ae1481
Merge branch 'rewrite' of https://github.com/BiocPy/rds2py into rewrite
jkanche Oct 9, 2024
b03d52f
remove build file
jkanche Oct 9, 2024
84a297e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 9, 2024
69efe1f
EOD
jkanche Oct 17, 2024
5831e9a
Merge branch 'master' into rewrite
jkanche Oct 22, 2024
1735103
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 22, 2024
75a64e0
minor changes
jkanche Oct 22, 2024
8050a2b
Merge branch 'rewrite' of https://github.com/BiocPy/rds2py into rewrite
jkanche Oct 22, 2024
68f2c66
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 22, 2024
982c654
finishing up parsers for base classes
jkanche Oct 22, 2024
968f414
Merge branch 'rewrite' of https://github.com/BiocPy/rds2py into rewrite
jkanche Oct 22, 2024
03f959e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 22, 2024
cb7a948
EOD
jkanche Oct 22, 2024
0f0b021
Merge branch 'rewrite' of https://github.com/BiocPy/rds2py into rewrite
jkanche Oct 22, 2024
10b253e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 22, 2024
05fe1a8
finished granges
jkanche Oct 22, 2024
758ec17
Merge branch 'rewrite' of https://github.com/BiocPy/rds2py into rewrite
jkanche Oct 22, 2024
01df16a
macos-11 is deprecated
jkanche Oct 22, 2024
8078532
add se, rse
jkanche Oct 23, 2024
ff9c534
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 23, 2024
03f45ac
EOD
jkanche Oct 23, 2024
dfc8d65
Merge branch 'rewrite' of https://github.com/BiocPy/rds2py into rewrite
jkanche Oct 23, 2024
56fd31b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 23, 2024
6ab9a69
EOD: parse dicts, fix for dframes with nrows, and trying to parse SCE
jkanche Oct 23, 2024
b0aeada
Merge branch 'rewrite' of https://github.com/BiocPy/rds2py into rewrite
jkanche Oct 24, 2024
d7440c2
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 24, 2024
7ac0aec
streamlining dict objects
jkanche Oct 24, 2024
f7301f6
Merge branch 'rewrite' of https://github.com/BiocPy/rds2py into rewrite
jkanche Oct 24, 2024
d8d1b0e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 24, 2024
7c2cae7
remove history file
jkanche Oct 24, 2024
95d495f
parse sce's
jkanche Oct 24, 2024
7c838fc
can now read MAE's
jkanche Oct 25, 2024
e683821
remove files
jkanche Oct 25, 2024
fd9df49
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 25, 2024
869cd7d
update pre-commit
jkanche Oct 25, 2024
fe3c3d1
Merge branch 'rewrite' of https://github.com/BiocPy/rds2py into rewrite
jkanche Oct 25, 2024
cb4fd8c
add ruffs formatter
jkanche Oct 25, 2024
1cbbd2c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 25, 2024
b3f253f
remove top level exports
jkanche Oct 25, 2024
3a82df2
Merge branch 'rewrite' of https://github.com/BiocPy/rds2py into rewrite
jkanche Oct 25, 2024
647a2a9
adding docstrings
jkanche Oct 25, 2024
bd513af
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 25, 2024
bf84197
almost ready for release, update docs
jkanche Oct 25, 2024
ad3ed29
Merge branch 'rewrite' of https://github.com/BiocPy/rds2py into rewrite
jkanche Oct 25, 2024
c890f3b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 25, 2024
2 changes: 1 addition & 1 deletion .github/workflows/build-docs.yml
@@ -29,7 +29,7 @@ jobs:
- name: Build docs
run: |
python setup.py build_ext --inplace
cp build/lib*/rds2py/rds_parser* src/rds2py/
cp build/lib*/rds2py/lib_rds_parser* src/rds2py/
tox -e docs
touch ./docs/_build/html/.nojekyll

2 changes: 1 addition & 1 deletion .github/workflows/publish-pypi.yml
@@ -28,7 +28,7 @@ jobs:

build_macosx_x86_64:
name: Build wheels for macosx x86_64
runs-on: macos-11
runs-on: macos-13
steps:
- name: Check out repository
uses: actions/checkout@v3
3 changes: 3 additions & 0 deletions .gitignore
@@ -52,3 +52,6 @@ MANIFEST
.venv*/
.conda*/
.python-version

extern/rds2cpp*
src/rds2py/lib/parser.cpp
12 changes: 7 additions & 5 deletions .pre-commit-config.yaml
@@ -25,18 +25,20 @@ repos:
args: [--in-place, --wrap-descriptions=120, --wrap-summaries=120]
# --config, ./pyproject.toml

- repo: https://github.com/psf/black
rev: 24.8.0
hooks:
- id: black
language_version: python3
# - repo: https://github.com/psf/black
# rev: 24.8.0
# hooks:
# - id: black
# language_version: python3

- repo: https://github.com/astral-sh/ruff-pre-commit
# Ruff version.
rev: v0.6.8
hooks:
- id: ruff
args: [--fix, --exit-non-zero-on-fix]
# Run the formatter.
- id: ruff-format

## If like to embrace black styles even in the docs:
# - repo: https://github.com/asottile/blacken-docs
2 changes: 1 addition & 1 deletion AUTHORS.md
@@ -1,3 +1,3 @@
# Contributors

* jkanche [[email protected]](mailto:[email protected])
* Jayaram Kancherla [[email protected]](mailto:[email protected])
8 changes: 6 additions & 2 deletions CHANGELOG.md
@@ -1,13 +1,17 @@
# Changelog

## Development
## Version 0.5.0

- Fix github issue with showing incorrect package version on github pages.
- Complete overhaul of the codebase using pybind11
- Streamlined readers for R data types
- Updated API for all classes and methods
- Updated documentation and tests.

## Version 0.4.5

- Switch to pybind11 to implement the bindings to rds2cpp.
- Update tests, documentation and actions.
- Fix github issue with showing incorrect package version on github pages.

## Version 0.4.4

107 changes: 71 additions & 36 deletions README.md
@@ -4,69 +4,104 @@

# rds2py

Parse and construct Python representations for datasets stored in RDS files. `rds2py` supports a few base classes from R and Bioconductor's `SummarizedExperiment` and `SingleCellExperiment` S4 classes. **_This is possible because of [Aaron's rds2cpp library](https://github.com/LTLA/rds2cpp)._**

The package uses memory views (except for strings) to access the same memory from C++ in Python (through Cython of course). This is especially useful for large datasets so we don't make multiple copies of data.

## Install
Parse and construct Python representations for datasets stored in RDS files. `rds2py` supports various base classes from R, and Bioconductor's `SummarizedExperiment` and `SingleCellExperiment` S4 classes. ***For more details, check out the [rds2cpp library](https://github.com/LTLA/rds2cpp).***

> **Important Version Notice**
>
> Version 0.5.0 brings major changes to the package:
> - Complete overhaul of the codebase using pybind11
> - Streamlined readers for R data types
> - Updated API for all classes and methods
>
> Please refer to the [documentation](https://biocpy.github.io/rds2py/) for the latest usage guidelines. Previous versions may have incompatible APIs.

The package provides:

- Efficient parsing of RDS files with *minimal* memory overhead
- Support for R's basic data types and complex S4 objects
  - Vectors (numeric, character, logical)
  - Factors
  - Data frames
  - Matrices (dense and sparse)
  - Run-length encoded vectors (Rle)
- Conversion to appropriate Python/NumPy/SciPy data structures
  - dgCMatrix (sparse column matrix)
  - dgRMatrix (sparse row matrix)
  - dgTMatrix (sparse triplet matrix)
- Preservation of metadata and attributes from R objects
- Integration with BiocPy ecosystem for Bioconductor classes
  - SummarizedExperiment
  - RangedSummarizedExperiment
  - SingleCellExperiment
  - GenomicRanges
  - MultiAssayExperiment
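
As an illustration of the run-length encoded vectors listed above: an Rle stores each run once, as a value plus a run length (the `values` and `lengths` slots of Bioconductor's `Rle` class). A minimal sketch of expanding such a pair into a plain Python list follows. `expand_rle` is a hypothetical helper for illustration only and is not part of the rds2py API; how rds2py itself represents an Rle is not shown in this diff.

```python
from itertools import chain, repeat


def expand_rle(values, lengths):
    """Expand run-length encoded values into a plain list (hypothetical helper)."""
    if len(values) != len(lengths):
        raise ValueError("'values' and 'lengths' must have the same length")
    # Repeat each value by its run length, then flatten the runs in order.
    return list(chain.from_iterable(repeat(v, n) for v, n in zip(values, lengths)))


# Two runs: "chr1" three times, then "chr2" twice.
expanded = expand_rle(["chr1", "chr2"], [3, 2])
print(expanded)
```

Keeping the compressed form and expanding only on demand is usually the point of an Rle, so a real reader may prefer to hold on to `values` and `lengths` rather than materialize the full list.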

## Installation

The package is published to [PyPI](https://pypi.org/project/rds2py/):

```shell
pip install rds2py
```

## Usage

If you do not have an RDS object handy, feel free to download one from [single-cell-test-files](https://github.com/jkanche/random-test-files/releases).
## Quick Start

```python
from rds2py import as_summarized_experiment, read_rds
from rds2py import read_rds

r_obj = read_rds(<path_to_file>)
# Read any RDS file
r_obj = read_rds("path/to/file.rds")
```

This `r_obj` holds a dictionary representation of the RDS file, we can now transform this object into Python representations.

`rObj` always contains two keys
## Usage

- `data`: If atomic entities, contains the NumPy view of the array.
- `attributes`: Additional properties available for the object.
If you do not have an RDS object handy, feel free to download one from [single-cell-test-files](https://github.com/jkanche/random-test-files/releases).

In addition, the package provides functions to convert parsed R objects into Python representations.
### Basic Usage

```python
from rds2py import as_spase_matrix, as_summarized_experiment

# to convert an robject to a sparse matrix
sp_mat = as_sparse(rObj)

# to convert an robject to SCE
sce = as_summarized_experiment(rObj)
from rds2py import read_rds
r_obj = read_rds("path/to/file.rds")
```

For more examples converting `data.frame`, `dgCMatrix`, `dgRMatrix`, `dgTMatrix` to Python, checkout the [documentation](https://biocpy.github.io/rds2py/).
The returned `r_obj` is either an instance of an appropriate Python class, if a parser is already implemented for the R object, or a dictionary containing the data from the RDS file.

## Developer Notes
## Write-your-own-reader

This project uses Cython to provide bindings from C++ to Python.
In addition, the package exposes the dictionary representation of the RDS file through `parse_rds`, so users can write custom readers that convert it into appropriate Python representations.

Steps to setup dependencies -
```python
from rds2py import parse_rds

- git submodules is initialized in `extern/rds2cpp`
- `cmake .` in `extern/rds2cpp` directory to download dependencies, especially the `byteme` library
data = parse_rds("path/to/file.rds")
print(data)
```

First one needs to build the extern library, this would generate a shared object file to `src/rds2py/core-[*].so`
If you know this RDS file contains a `GenomicRanges` object, you can use or modify the provided list reader, or write your own parser to convert this dictionary.

```shell
python setup.py build_ext --inplace
```python
from rds2py.read_granges import read_genomic_ranges

gr = read_genomic_ranges(data)
```
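
One way to organize custom readers is a small registry that maps an R class name to a reader function and falls back to the raw dictionary when no reader is registered, mirroring the behaviour described for `read_rds`. The sketch below is self-contained and does not call rds2py. The dictionary keys `class_name` and `data`, and the names `READERS`, `register` and `convert`, are assumptions made for illustration; inspect the output of `parse_rds` to see the actual layout.

```python
from typing import Any, Callable, Dict

# Hypothetical registry: R class name -> reader function.
READERS: Dict[str, Callable[[dict], Any]] = {}


def register(class_name: str):
    """Decorator that records a reader for the given R class name."""
    def decorator(func):
        READERS[class_name] = func
        return func
    return decorator


@register("my_custom_class")
def read_my_custom_class(parsed: dict) -> dict:
    # Illustrative conversion: keep only the data payload as a list.
    return {"values": list(parsed["data"])}


def convert(parsed: dict, class_key: str = "class_name") -> Any:
    """Apply the registered reader, or return the dictionary unchanged."""
    reader = READERS.get(parsed.get(class_key))
    return reader(parsed) if reader is not None else parsed


# Hand-built stand-in for a parsed RDS dictionary (layout is assumed).
example = {"class_name": "my_custom_class", "data": [1, 2, 3]}
print(convert(example))
```

Returning the untouched dictionary for unknown classes keeps the fallback explicit, so callers can check `isinstance(result, dict)` to decide whether further parsing is needed.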

For typical development workflows, run
## Type Conversion Reference

```shell
python setup.py build_ext --inplace && tox
```
| R Type | Python/NumPy Type |
|--------|------------------|
| numeric | numpy.ndarray (float64) |
| integer | numpy.ndarray (int32) |
| character | list of str |
| logical | numpy.ndarray (bool) |
| factor | list |
| data.frame | BiocFrame |
| matrix | numpy.ndarray or scipy.sparse matrix |
| dgCMatrix | scipy.sparse.csc_matrix |
| dgRMatrix | scipy.sparse.csr_matrix |
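
To make the `dgCMatrix` row concrete: R's Matrix package stores a `dgCMatrix` in compressed sparse column form, using the slots `i` (0-based row indices), `p` (column pointers), `x` (non-zero values) and `Dim`. The sketch below expands that layout into a dense list of rows in plain Python. `csc_to_dense` is a hypothetical helper, not an rds2py function, and how rds2py exposes these slots is not shown in this diff.

```python
def csc_to_dense(i, p, x, dim):
    """Expand compressed-sparse-column slots into a dense list of rows."""
    nrow, ncol = dim
    dense = [[0.0] * ncol for _ in range(nrow)]
    for col in range(ncol):
        # Entries of column `col` live in positions p[col] .. p[col + 1] - 1.
        for k in range(p[col], p[col + 1]):
            dense[i[k]][col] = x[k]
    return dense


# A 3 x 2 matrix with non-zeros at (0, 0), (2, 0) and (1, 1).
i = [0, 2, 1]
p = [0, 2, 3]
x = [1.0, 2.0, 3.0]
dense = csc_to_dense(i, p, x, (3, 2))
print(dense)
```

The same three arrays map directly onto SciPy's constructor, `scipy.sparse.csc_matrix((x, i, p), shape=dim)`, which avoids materializing the zeros.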

## Developer Notes

This project uses pybind11 to provide bindings to the rds2cpp library. Please make sure a C++ compiler is installed on your system.

<!-- pyscaffold-notes -->

72 changes: 24 additions & 48 deletions docs/tutorial.md
@@ -2,64 +2,40 @@

If you do not have an RDS object handy, feel free to download one from [single-cell-test-files](https://github.com/jkanche/random-test-files/releases).

## Step 1: Read a RDS file in Python

First we need to read the RDS file that can be easily explored in Python. The `read_rds` parses the R object and returns
a dictionary of the R object.
### Basic Usage

```python
from rds2py import read_rds

rObj = read_rds(<path_to_file>)
r_obj = read_rds("path/to/file.rds")
```

Once we have a realized structure, we can now convert this object to useful Python representations. It contains two keys

- `data`: If atomic entities, contains the numpy view of the memory space.
- `attributes`: Additional properties available for the object.

The package provides friendly functions to convert some R representations to useful Python representations.

## Step 2: Python representations
The returned `r_obj` is either an instance of an appropriate Python class, if a parser is already implemented for the R object, or a dictionary containing the data from the RDS file.

### Matrices
## Write-your-own-reader

Use these methods if the RDS file contains either a sparse matrix (`dgCMatrix`, `dgRMatrix`, or `dgTMatrix`) or a dense matrix.

**_Note: If an R object contains `dims` in the `attributes`, we consider this as a matrix._**
In addition, the package exposes the dictionary representation of the RDS file through `parse_rds`, so users can write custom readers that convert it into appropriate Python representations.

```python
from rds2py import as_spase_matrix, as_dense_matrix

# to convert an robject to a sparse matrix
sp_mat = as_sparse_matrix(rObj)

# to convert an robject to a sparse matrix
dense_mat = as_dense_matrix(rObj)
```

### Pandas DataFrame
from rds2py import parse_rds

Methods are available to construct a pandas `DataFrame` from data stored in an RDS file. The package supports two R classes for this operation - `data.frame` and `DFrame` classes.

```python
from rds2py import as_pandas

# to convert an robject to DF
df = as_pandas(rObj)
```

### S4 classes: specifically `SingleCellExperiment` or `SummarizedExperiment`

We also support `SingleCellExperiment` or `SummarizedExperiment` from Bioconductor. the `as_summarized_experiment` method is how we one can do this operation.

**_Note: This method also serves as an example on how to convert complex R structures into Python representations._**

```python
from rds2py import as_summarized_experiment
data = parse_rds("path/to/file.rds")
print(data)

# to convert an robject to SCE
sp_mat = as_summarized_experiment(rObj)
# now write your own parser to convert this dictionary.
```
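
As a sketch of what such a parser might look like, consider a dense matrix. R stores matrix values as a flat vector in column-major order, with the shape held in a `dim` attribute. The example below rebuilds the rows from a hand-built dictionary using plain Python. The keys `data`, `attributes` and `dim` are assumptions about the dictionary layout, and `to_rows` is a hypothetical helper rather than part of rds2py.

```python
def to_rows(parsed):
    """Convert a column-major flat vector with a 'dim' attribute into rows."""
    data = list(parsed["data"])
    dims = parsed.get("attributes", {}).get("dim")
    if dims is None:
        # No shape information: treat the object as a plain vector.
        return data
    nrow, ncol = dims
    if nrow * ncol != len(data):
        raise ValueError("'dim' does not match the number of values")
    # Element (r, c) of a column-major matrix sits at index r + c * nrow.
    return [[data[r + c * nrow] for c in range(ncol)] for r in range(nrow)]


# Stand-in for R's matrix(1:6, nrow = 2); the layout of this dict is assumed.
parsed = {"data": [1, 2, 3, 4, 5, 6], "attributes": {"dim": [2, 3]}}
rows = to_rows(parsed)
print(rows)
```

With NumPy, the equivalent step would be `numpy.asarray(data).reshape((nrow, ncol), order="F")`.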

Well thats it, hack on & create more base representations to encapsulate complex structures. If you want to add more representations, feel free to send a PR!
## Type Conversion Reference

| R Type | Python/NumPy Type |
|--------|------------------|
| numeric | numpy.ndarray (float64) |
| integer | numpy.ndarray (int32) |
| character | list of str |
| logical | numpy.ndarray (bool) |
| factor | list |
| data.frame | BiocFrame |
| matrix | numpy.ndarray or scipy.sparse matrix |
| dgCMatrix | scipy.sparse.csc_matrix |
| dgRMatrix | scipy.sparse.csr_matrix |

Check out the module reference for more information on these classes.
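
For the `factor` row in the table above, it may help to recall how R encodes a factor: as 1-based integer codes plus a `levels` vector of labels. A minimal sketch of decoding that pair into a list of labels follows. `decode_factor` is a hypothetical helper, and representing missing values as `None` is an assumption made here, not necessarily how rds2py reports them.

```python
def decode_factor(codes, levels):
    """Map 1-based factor codes to their level labels (None stays None)."""
    return [None if code is None else levels[code - 1] for code in codes]


# Equivalent of factor(c("low", "high", "low", NA), levels = c("low", "high")).
labels = decode_factor([1, 2, 1, None], ["low", "high"])
print(labels)
```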
2 changes: 1 addition & 1 deletion lib/CMakeLists.txt
@@ -31,6 +31,6 @@ set_property(TARGET ${TARGET} PROPERTY CXX_STANDARD 17)
target_link_libraries(${TARGET} PRIVATE rds2cpp pybind11::pybind11)

set_target_properties(${TARGET} PROPERTIES
OUTPUT_NAME rds_parser
OUTPUT_NAME lib_rds_parser
PREFIX ""
)
4 changes: 2 additions & 2 deletions lib/src/rdswrapper.cpp
@@ -20,12 +20,12 @@ class RdsReader {
if (!ptr) throw std::runtime_error("Null pointer in 'get_rtype'.");
// py::print("arg::", static_cast<int>(ptr->type()));
switch (ptr->type()) {
case rds2cpp::SEXPType::S4: return "S4";
case rds2cpp::SEXPType::INT: return "integer";
case rds2cpp::SEXPType::REAL: return "double";
case rds2cpp::SEXPType::STR: return "string";
case rds2cpp::SEXPType::LGL: return "boolean";
case rds2cpp::SEXPType::VEC: return "vector";
case rds2cpp::SEXPType::S4: return "S4";
case rds2cpp::SEXPType::NIL: return "null";
default: return "other";
}
@@ -164,7 +164,7 @@ class RdsObject {
}
};

PYBIND11_MODULE(rds_parser, m) {
PYBIND11_MODULE(lib_rds_parser, m) {
py::register_exception<std::runtime_error>(m, "RdsParserError");

py::class_<RdsObject>(m, "RdsObject")
13 changes: 6 additions & 7 deletions setup.cfg
@@ -5,7 +5,7 @@

[metadata]
name = rds2py
description = Parse and read RDS files as Python representations
description = Parse and construct Python representations for datasets stored in RDS files
author = jkanche
author_email = [email protected]
license = MIT
@@ -50,11 +50,13 @@ python_requires = >=3.8
install_requires =
importlib-metadata; python_version<"3.8"
numpy
pandas
scipy
biocutils>=0.1.5
singlecellexperiment>=0.4.1
summarizedexperiment>=0.4.1
genomicranges>=0.4.9
biocframe
multiassayexperiment

[options.packages.find]
where = src
@@ -65,17 +67,14 @@ exclude =
# Add here additional requirements for extra features, to install with:
# `pip install rds2py[PDF]` like:
# PDF = ReportLab; RXP
optional =
pandas

# Add here test requirements (semicolon/line-separated)
testing =
setuptools
pytest
pytest-cov
numpy
pandas
scipy
singlecellexperiment
summarizedexperiment

[options.entry_points]
# Add here console scripts like:
5 changes: 1 addition & 4 deletions setup.py
@@ -38,10 +38,7 @@ def build_cmake(self, ext):
"lib",
"-B",
build_temp,
"-Dpybind11_DIR="
+ os.path.join(
os.path.dirname(pybind11.__file__), "share", "cmake", "pybind11"
),
"-Dpybind11_DIR=" + os.path.join(os.path.dirname(pybind11.__file__), "share", "cmake", "pybind11"),
"-DPYTHON_EXECUTABLE=" + sys.executable,
]
if os.name != "nt":