Python bindings for tatami

Overview

The mattress package implements Python bindings to the tatami C++ library for matrix representations. Downstream packages can use mattress to develop C++ extensions that are interoperable with many different matrix classes, e.g., dense, sparse, delayed or file-backed. mattress is inspired by the beachmat Bioconductor package, which does the same thing for R packages.

Instructions

mattress is published to PyPI, so installation is simple:

pip install mattress

mattress is intended for Python package developers writing C++ extensions that operate on matrices. The aim is to allow a package's C++ code to accept all types of matrix representations without requiring re-compilation of that code. To achieve this:

  1. Add mattress.includes() and assorthead.includes() to the compiler's include path. This can be done through include_dirs= of the Extension() definition in setup.py or by adding a target_include_directories() in CMake, depending on the build system.
  2. Call mattress.initialize() on a Python matrix object to wrap it in a tatami-compatible C++ representation. This returns an InitializedMatrix with a ptr property that contains a pointer to the C++ matrix.
  3. Pass ptr to C++ code as a uintptr_t referencing a tatami::Matrix, which can be interrogated as described in the tatami documentation.
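
Step 1 could be sketched for a setuptools build roughly as follows. The module name `downstream.lib_downstream`, the source path `src/lib.cpp` and the `-std=c++17` flag are assumptions made for illustration; only `mattress.includes()` and `assorthead.includes()` come from the steps above.

```python
# setup.py (sketch): adding the mattress and assorthead headers to an extension.


def mattress_include_dirs():
    # Imported lazily so this file can be parsed before the build-time
    # dependencies are installed.
    import assorthead
    import mattress

    return [mattress.includes(), assorthead.includes()]


def extension_kwargs(include_dirs):
    # Keyword arguments for setuptools.Extension. The module name, source
    # path and compiler flag are hypothetical placeholders.
    return {
        "name": "downstream.lib_downstream",
        "sources": ["src/lib.cpp"],
        "include_dirs": list(include_dirs),
        "language": "c++",
        "extra_compile_args": ["-std=c++17"],  # assumed; adjust for your compiler
    }


def main():
    # Builds the extension with the header directories from both packages.
    from setuptools import Extension, setup

    setup(ext_modules=[Extension(**extension_kwargs(mattress_include_dirs()))])
```

Calling `main()` at the bottom of `setup.py` would run the build. Both mattress and assorthead then need to be installed at build time, for example by listing them among the build requirements.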

For example, the C++ code in a downstream package might look like this:

#include <cstdint> // for uintptr_t
#include "pybind11/pybind11.h"
#include "mattress.h"

int do_something(uintptr_t ptr) {
    const auto& mat_ptr = mattress::cast(ptr)->ptr;
    // Do something with the tatami interface.
    return 1;
}

// Assuming we're using pybind11, but any framework that can accept a uintptr_t is fine.
PYBIND11_MODULE(lib_downstream, m) {
    m.def("do_something", &do_something);
}

This function can then be called from Python:

from . import lib_downstream as lib
from mattress import initialize

def do_something(x):
    tmat = initialize(x)
    return lib.do_something(tmat.ptr)

Check out the included header for more definitions.
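
The handoff above amounts to passing a memory address across the language boundary as a plain integer. The sketch below illustrates that idea with the standard-library `ctypes` module only; it is an analogy and not mattress code, and the buffer and the `sum_from_address` helper are hypothetical.

```python
import ctypes

# A C-level buffer standing in for the C++ matrix object.
values = (ctypes.c_double * 3)(1.0, 2.0, 3.0)

# The address as a plain Python int, analogous to the ptr property.
address = ctypes.addressof(values)


def sum_from_address(addr: int, n: int) -> float:
    # Reinterpret the integer as a typed pointer, much as the C++ side
    # casts the uintptr_t back to a pointer before using it.
    view = ctypes.cast(addr, ctypes.POINTER(ctypes.c_double))
    return sum(view[i] for i in range(n))


total = sum_from_address(address, len(values))
```

The integer is only meaningful while the object it points to is alive, which is why the Python wrapper should be kept around for as long as the address is in use.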

Supported matrices

Dense numpy matrices of varying numeric type:

import numpy as np
from mattress import initialize
x = np.random.rand(1000, 100)
init = initialize(x)

ix = (x * 100).astype(np.uint16)
init2 = initialize(ix)

Compressed sparse matrices from scipy with varying index/data types:

import numpy as np
from scipy import sparse as sp
from mattress import initialize

xc = sp.random(100, 20, format="csc")
init = initialize(xc)

xr = sp.random(100, 20, format="csr", dtype=np.uint8)
init2 = initialize(xr)

Delayed arrays from the delayedarray package:

from delayedarray import DelayedArray
from scipy import sparse as sp
from mattress import initialize
import numpy

xd = DelayedArray(sp.random(100, 20, format="csc"))
xd = numpy.log1p(xd * 5)

init = initialize(xd)

Sparse arrays from delayedarray are also supported:

import delayedarray
from numpy import float64, int32
from mattress import initialize
sa = delayedarray.SparseNdarray((50, 20), None, dtype=float64, index_dtype=int32)
init = initialize(sa)

See below to extend initialize() to custom matrix representations.

Utility methods

The InitializedMatrix instance returned by initialize() provides a few Python-visible methods for querying the C++ matrix.

init.nrow()     # number of rows
init.column(1)  # contents of column 1
init.sparse()   # whether the matrix is sparse

It also has a few methods for computing common statistics:

init.row_sums()
init.column_variances(num_threads=2)

grouping = [i % 3 for i in range(init.ncol())]
init.row_medians_by_group(grouping)

init.row_nan_counts()
init.column_ranges()

These are mostly intended for non-intensive work or for testing and debugging. Any serious computation should instead be performed by iterating over the matrix in C++.
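
When using these methods in tests, it can help to cross-check them against a slow reference. The following pure-Python sketch computes the median of each row within each group of columns, which is what we would expect a grouped row median to mean. The function name `reference_row_medians_by_group` and its return layout are assumptions and may not match what mattress returns.

```python
from statistics import median


def reference_row_medians_by_group(rows, grouping):
    # rows: list of equal-length lists; grouping: one label per column.
    levels = sorted(set(grouping))
    medians = []
    for row in rows:
        # For each group, take the median of this row's values in that group.
        medians.append(
            [median(v for v, g in zip(row, grouping) if g == level) for level in levels]
        )
    return medians, levels


rows = [[1, 2, 3, 4, 5, 6], [6, 5, 4, 3, 2, 1]]
grouping = [i % 3 for i in range(6)]  # same grouping pattern as above
medians, levels = reference_row_medians_by_group(rows, grouping)
```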

Operating on an existing pointer

If we already have an InitializedMatrix, we can easily apply additional operations by wrapping it in the relevant delayedarray layers and calling initialize() afterwards. For example, if we want to add a scalar, we might do:

from delayedarray import DelayedArray
from mattress import initialize
import numpy

x = numpy.random.rand(1000, 10)
init = initialize(x)

wrapped = DelayedArray(init) + 1
init2 = initialize(wrapped)

This is more efficient as it re-uses the InitializedMatrix already generated from x. It is also more convenient as we don't have to carry around x to generate init2.

Extending to custom matrices

Developers can extend mattress to custom matrix classes by registering new methods with the initialize() generic. Each method should return an InitializedMatrix object containing a uintptr_t cast from a pointer to a tatami::Matrix (see the included header). Once this is done, all calls to initialize() will be able to handle matrices of the newly registered types.

from . import lib_downstream as lib
import mattress

@mattress.initialize.register
def _initialize_my_custom_matrix(x: MyCustomMatrix):
    data = x.some_internal_data
    return mattress.InitializedMatrix(lib.initialize_custom(data))
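
The `.register` decorator suggests that initialize() dispatches on the type of its argument, much like a `functools.singledispatch` generic. The standalone sketch below assumes that behaviour and uses stand-in classes throughout; `InitializedMatrixStub` and the fake address from `id()` are hypothetical and only show how a registered method is picked up.

```python
from functools import singledispatch


class InitializedMatrixStub:
    """Hypothetical stand-in for mattress.InitializedMatrix."""

    def __init__(self, ptr):
        self.ptr = ptr


@singledispatch
def initialize(x):
    # Fallback for types without a registered method.
    raise NotImplementedError(f"no initialize() method for {type(x).__name__}")


class MyCustomMatrix:
    def __init__(self, data):
        self.some_internal_data = data


@initialize.register
def _initialize_my_custom_matrix(x: MyCustomMatrix):
    # A real method would call into C++ and receive a uintptr_t;
    # here id() merely provides an integer to carry around.
    return InitializedMatrixStub(id(x.some_internal_data))


result = initialize(MyCustomMatrix([1, 2, 3]))
```

With this pattern, callers keep using the single initialize() entry point and never need to know which package supplied the method for a given class.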

If the initialized tatami::Matrix contains references to Python-managed data, e.g., in NumPy arrays, we must ensure that the data is not garbage-collected during the lifetime of the tatami::Matrix. This is achieved by storing a reference to the data in the original member of the mattress::BoundMatrix.
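
The same lifetime rule can be illustrated in pure Python. This sketch is only an analogy for the C++ behaviour: `BoundMatrixSketch` and `Payload` are hypothetical classes, and the immediate release after the last reference is dropped relies on CPython's reference counting.

```python
import gc
import weakref


class Payload:
    """Stand-in for Python-managed data such as a NumPy array."""

    def __init__(self, values):
        self.values = values


class BoundMatrixSketch:
    """Hypothetical Python analogue of mattress::BoundMatrix."""

    def __init__(self, data):
        self.ptr = id(data)   # pretend address of the wrapped matrix
        self.original = data  # strong reference that keeps the data alive


data = Payload([1.0, 2.0])
probe = weakref.ref(data)  # lets us observe whether the data still exists
bound = BoundMatrixSketch(data)

del data
gc.collect()
alive_while_bound = probe() is not None  # the wrapper still holds the data

del bound
gc.collect()
alive_after_release = probe() is not None  # nothing refers to the data now
```

Without the `original` attribute, the data could be freed as soon as the caller dropped its own reference, leaving the stored address dangling.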