feat(python): Add user-facing `Array` class #396

paleolimbot · 2024-03-01T16:05:08Z

This PR implements the nanoarrow.Array which basically a pyarrow.ChunkedArray. This can represent a Table, RecordBatch, ChunkedArray, and Array. It doesn't quite play nicely with pyarrow's ChunkedArray (but will after the next release, since __arrow_c_stream__ was just added there).

The user-facing class is backed by a Cython class, the CMaterializedArrayStream, which manages some of the c-level details like resolving a chunk + offset when there is more than one chunk in the array. An early version of this PR implemented the CMaterializedArrayStream using C pointers (e.g., ArrowArray* arrays), but I decided that was to complex and went back to List[CArray]. I think this is also better for managing ownership (e.g., more unneeded CArray instances can be released by the garbage collector).

The Array class as implemented here is device-aware, although until we have non-CPU support it's difficult to test this. The methods I added here are basically stubs just to demonstrate the intention.

This PR also implements the Scalar, whose main purpose for testing and other non-performance sensitive things (like lazier reprs for very large items or interactive inspection). They may also be useful for working with arrays that contain elements with very long strings or large arrays (e.g., geometry).

I also added some basic accessors like buffer(), child(), and some ways one might want to iterate over an Array to make the utility of this class more clear.

Basic usage:

import nanoarrow as na

na.Array(range(100), na.int64())

nanoarrow.Array<int64>[100]
0
1
2
3
4
5
6
7
8
9
...and 90 more items

More involved example reading from an IPC stream:

import nanoarrow as na
from nanoarrow.ipc import Stream

url = "https://github.com/apache/arrow-testing/raw/master/data/arrow-ipc-stream/integration/1.0.0-littleendian/generated_primitive.stream"

with Stream.from_url(url) as inp:
    array = na.Array(inp)

array.child(25)

nanoarrow.Array<string>[37]
'co矢2p矢m'
'wÂ€acrd'
'kjd1dlô'
'pib矢d5w'
'6nnpwôg'
'ndj£h£4'
'ôôf4aµg'
'kwÂh£fr'
'°g5dk€e'
'r€cbmdn'
...and 27 more items

paleolimbot · 2024-03-06T18:22:06Z

python/src/nanoarrow/_ipc_lib.pyx

These changes are because Array(Stream.from_url(...)) crashed due to these methods not managing the GIL properly. I can move these to a separate PR.

I think leaving it in is fine!

jorisvandenbossche · 2024-03-07T15:09:16Z

An early version of this PR implemented the CMaterializedArrayStream using C pointers (e.g., ArrowArray* arrays), but I decided that was to complex and went back to List[CArray]. I think this is also better for managing ownership (e.g., more unneeded CArray instances can be released by the garbage collector).

If it would at some point turn out to be a perf bottleneck to use an untyped list, we can also consider compiling the cython code to C++ instead of C and then can use things like vector

jorisvandenbossche

Nice work!
Some general thoughts:

I think I certainly like having a "container" Array class that hides the complexity of Array vs ChunkedArray like we have in pyarrow. For certain use cases, that is definitely a lot easier to work with.
But one downside of this is that there still isn't an easy to use but still low-level "ArrowArray" class that combines some features from CArray and CArrayView. For other use cases, it would be nice to dig into the object you have, check the buffers, understand what data you got, etc. Right now the Array class still gives you somewhat access to the CArray (see bullet point below as well), but when getting that you still need to do na.c_array_view(c_arr) to then get access to eg the Buffers.
I am the least sure (or enthusiastic, that's probably the better word ;)) about adding a Scalar class. Probably there is no way around needing it, if you want to be able to represent single values while adhering to the C interface type system. But it would also be nice to keep it as limited as possible (for example, do we think someone should ever iterate over all values of the Array as Scalar objects?)
Do we still want to expose the underlying CArrays a bit more publicly? Right now I can get them with list(arr._data.arrays), but we should maybe make that public (maybe a data attribute that returns that list? Although "data" is of course a very generic term ;))

jorisvandenbossche · 2024-03-07T15:34:00Z

python/src/nanoarrow/array.py

+
+
+class Scalar:
+    """Generic wrapper around an :class:`Array` element


Should the scalar class implement the __arrow_c_array/schema__ dunders? (like the underlying CScalar does)

And is it needed to have an additional class that wraps CScalar?
It feels as some unnecessary complexity, but I assume there are some things that need to be done in cython (the __arrow_c_array__ impl), while some other things here require the python classes (eg Schema).

Actually, I didn't think of it when I wrote this, but the CScalar probably should not implement __arrow_c_array__ because it is not necessarily representing a CPU array. To me the main utility of the Scalar class is on the interactive end of things (where the non-Cython wrapper gives a better doc/IDE experience).

Hmm...then the CArray would face the same problem and it would be very strange if the CArray did not implement __arrow_c_array__. I'll add it to the Scalar for now...we probably need the device_id and device_type in the CArray 🙁

Good call...I removed the CScalar!

Actually, I didn't think of it when I wrote this, but the CScalar probably should not implement __arrow_c_array__ because it is not necessarily representing a CPU array.

Yeah, it's true that things are not fully correct regarding devices right now, but I think that is something we can tackle later. i.e. when there is better device support, we would add __arrow_c_device_array__ in addition, which always works, and let __arrow_c_array__ raise an error or something like that.

The same happens in pyarrow actually. A pyarrow.Array can right now be backed by non-CPU data, but we still have a _arrow_c_array__ defined on it.

jorisvandenbossche · 2024-03-07T16:53:53Z

python/src/nanoarrow/array.py

+    2
+    3
+    """
+    return Array(obj, schema=schema)


If the array function is essentially just an alias for Array, there is not that much value to it. But I assume the intent is that later this constructor can be expanded?

For example, I assumed that I would be able to do something like

na.array([1,2,3])

but that doesn't seem to work (yet)

Ah, the error message is a bit confusing:

In [36]: na.array([1,2,3]) ... TypeError: Can't convert object of type list to nanoarrow.c_array_stream or nanoarrow.c_array

because the above actually does work when specifying a type:

In [35]: na.array([1,2,3], na.int64()) Out[35]: Array<INT64>[3] 1 2 3

I modelled this after the fact that we have na.schema(). I would be game to drop na.schema() and na.array() in favour of the regular class constructor!

Good call on the error!

python/src/nanoarrow/array.py

jorisvandenbossche · 2024-03-07T17:06:36Z

python/src/nanoarrow/_lib.pyx

+
+    cdef int _resolve_chunk(self, const int64_t* sorted_offsets, int64_t index, int64_t start_offset_i,
+                           int64_t end_offset_i) noexcept nogil:
+        if start_offset_i >= (end_offset_i - 1):


Can you add a bit of documentation here about the keywords?

optional: i challenge you to implement this iteratively instead of recursively to save on stack space usage ;)

Good call! Just leaving myself a mental note that the Arrow C++ implementation is here: https://github.com/apache/arrow/blob/1eb46f763a73d313466fdc895eae1f35fac37945/cpp/src/arrow/chunk_resolver.h#L133-L162

Ok, this is now ArrowResolveChunk64() in the C library. It will also be needed for random access into REE arrays and I'd like to use it in the R package to do something similar to this PR in the R package.

python/src/nanoarrow/_lib.pyx

jorisvandenbossche · 2024-03-07T17:13:32Z

python/src/nanoarrow/_lib.pyx

+            kint = k
+            if kint < 0:
+                kint += self._ptr.length
+            return CScalar(self, kint)


Is this needed for CArray? We could also leave this for just the higher-level Array object to give scalars?

Good call! The CScalar is no more!

danepitkin · 2024-03-07T19:19:26Z

I am the least sure (or enthusiastic, that's probably the better word ;)) about adding a Scalar class. Probably there is no way around needing it, if you want to be able to represent single values while adhering to the C interface type system. But it would also be nice to keep it as limited as possible (for example, do we think someone should ever iterate over all values of the Array as Scalar objects?)

This idea came out of #397 where I suggested having types implement their own reprs vs having a top-down function that introspects how to represent types. Does this feel like a good use case for the Scalar class?

danepitkin

Awesome work!

danepitkin · 2024-03-07T19:20:06Z

python/src/nanoarrow/_ipc_lib.pyx

I think leaving it in is fine!

danepitkin · 2024-03-07T19:28:21Z

python/src/nanoarrow/_ipc_lib.pyx

+        return self._obj
+
+    @property
+    def close_stream(self):


Why does this object hold close_stream state? Do we not want py_input_stream_release to always close the stream?

It currently only closes a file object if the higher level Python wrapping this was the code that opened it (e.g., it won't close a connection that a user opened in case the user had other plans or the IPC data was only part of the file payload).

Probably this should be close_file_obj to disambiguate the three types of streams we have going on here (ArrowArray, ArrowIpcInput, Python file object).

danepitkin · 2024-03-07T19:31:49Z

python/src/nanoarrow/_lib.pyx

+
+    This is an implementation of a "Scalar" that is a thin
+    wrapper around a (Python) CArray with the ability to
+    materialize a length-one array.


danepitkin · 2024-03-07T19:43:23Z

python/src/nanoarrow/_lib.pyx

+
+    cdef int _resolve_chunk(self, const int64_t* sorted_offsets, int64_t index, int64_t start_offset_i,
+                           int64_t end_offset_i) noexcept nogil:
+        if start_offset_i >= (end_offset_i - 1):


optional: i challenge you to implement this iteratively instead of recursively to save on stack space usage ;)

danepitkin · 2024-03-07T19:49:11Z

python/src/nanoarrow/array.py

+        return next(iterator(self._c_scalar))
+
+    def __repr__(self) -> str:
+        width_hint = 80


Should we make the width_hint configureable via an additional api such as to_string()? Then repr can call self.to_string(length=80)

danepitkin · 2024-03-07T19:53:07Z

python/src/nanoarrow/array.py

+        return Schema(self._data.schema)
+
+    @property
+    def n_chunks(self) -> int:


nit: I'd prefer num_chunks for naming

Good call! That's the pa.ChunkedArray name too 👍

I went back to n_chunks here because everywhere else in the package uses n_xxxxx. I can't revert some of those because they're the same as the ArrowArray/ArrowSchema C structs and it would be strange if they were different 🤷

danepitkin · 2024-03-07T19:55:05Z

python/src/nanoarrow/array.py

+        return self._data.n_arrays
+
+    @property
+    def chunks(self) -> Iterable:


nit: I think iter_chunks() could be a better name and is better defined as a function. I typically think of properties as a static value.

For better or worse, nanoarrow is pretty deep in the "properties return iterators" pattern (e.g., schema.children and many others). pyarrow's ChunkedArray has a chunks property that returns a list() rather than an iterator, which was the vague inspiration. For now maybe I'll keep this returning an iterator but I'm definitely down to circle back and fix all the existing usage since I agree that it's a potentially confusing use of the @property.

Just kidding...good call! I organized all the fun ways you can iterate over an Array into a bunch of iter_xxx() methods, including iter_chunks().

For better or worse, nanoarrow is pretty deep in the "properties return iterators" pattern (e.g., schema.children and many others). pyarrow's ChunkedArray has a chunks property that returns a list() rather than an iterator, which was the vague inspiration.

Is there a performance reason that you prefer to work with iterators?
In most of those cases, the data already exists (most what might happen is that it gets wrapped in a new class), and so all those things should be pretty cheap?

If that's the case, I think it would actually be nice to keep the properties but let them just return a list.

Co-authored-by: Joris Van den Bossche <[email protected]>

paleolimbot · 2024-03-11T13:17:24Z

I am the least sure (or enthusiastic, that's probably the better word ;)) about adding a Scalar class.

I tried very hard to avoid it because I also highly dislike whenever I arrive at a Arrow C++/pyarrow's scalar. I think once the scalar has some more functionality (e.g., getting struct children, or list elements) it will be a little more clear. The main thing I wanted to use it for, as Dane mentioned, is an abbreviated repr implementation that wasn't severely misusing iterators; however, I think they are (or will be) useful in any context where somebody is interactively exploring something.

do we think someone should ever iterate over all values of the Array as Scalar objects?

Not really, no. I suppose __iter__ could error with a message listing some more useful ways in which one could iterate over an Array (chunks, scalars, python objects, buffers). That would keep the concept but prevent things like list(array) from doing something slow and not very useful.

paleolimbot · 2024-03-11T13:37:28Z

(Sorry for the spam, I keep finding things I haven't responded to!)

But one downside of this is that there still isn't an easy to use but still low-level "ArrowArray"

There's definitely room to expand any one of the Array, the CArray, or the CArrayView to fit what people actually want to do with nanoarrow in Python. For me the CArrayView is more of an internal tool to reduce the amount of C anybody needs to write to expose iterators...it's best used if you instantiate it once and reuse it for all the chunks in your array. I'll workshop some buffer iteration and see if that solves at least part of your desire for introspecting the data!

codecov-commenter · 2024-03-12T01:36:18Z

Codecov Report

Attention: Patch coverage is 90.37037% with 26 lines in your changes are missing coverage. Please review.

Project coverage is 88.75%. Comparing base (7cf50a3) to head (3041a02).
Report is 3 commits behind head on main.

Files	Patch %	Lines
python/src/nanoarrow/_ipc_lib.pyx	38.70%	19 Missing ⚠️
python/src/nanoarrow/_lib.pyx	92.95%	5 Missing ⚠️
python/src/nanoarrow/array.py	99.30%	1 Missing ⚠️
python/src/nanoarrow/iterator.py	85.71%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #396      +/-   ##
==========================================
+ Coverage   88.74%   88.75%   +0.01%     
==========================================
  Files          81       82       +1     
  Lines       14398    14634     +236     
==========================================
+ Hits        12778    12989     +211     
- Misses       1620     1645      +25

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

danepitkin · 2024-03-12T16:42:14Z

src/nanoarrow/buffer_inline.h

@@ -28,6 +28,28 @@
 extern "C" {
 #endif

+// Modified from Arrow C++ (1eb46f76) cpp/src/arrow/chunk_resolver.h#L133-L162


paleolimbot · 2024-03-13T02:40:56Z

@danepitkin @jorisvandenbossche With apologies for the roller coaster of commits today, I think I made it through all the comments. I updated the description up top too, but basically I added a few things like child() and buffer() and iter_buffers() to expose some utility (and make it more clear what the Array is actually supposed to do.

danepitkin · 2024-03-13T19:47:50Z

From a user perspective, I would guess that most all python users want to interact with objects like Array and Scalar, while seeing other objects as implementation details to be avoided, such as CArray and CMaterializedArray.

danepitkin

Wow, this PR is beautiful. I gave it a quick pass and overall love the direction its taken. It generally LGTM, but I'll go back and review the smaller details while giving Joris a chance for a re-review as well.

danepitkin · 2024-03-13T19:51:15Z

python/src/nanoarrow/_lib.pyx

+        return self._arrays[i]
+
+    @property
+    def n_arrays(self):


Suggested change

def n_arrays(self):

def num_arrays(self):

just if we want to be consistent with num_chunks

Hmm, I think num_chunks was removed in a follow up commit already!

I'm pretty sure they're all n_xxx now but will double check! I agree that num_xxx is better, but that's a separate change to consider since we already have CArray.n_children, CArray.n_buffers, CArrayView.n_buffers, CArrayView.n_children, Schema.n_fields, CSchema.n_fields, CLayout.n_buffers, etc.

No worries, if there's already a precedent then you can ignore me. n_* isn't bad either

danepitkin · 2024-03-13T19:52:58Z

python/src/nanoarrow/array.py

+        return Schema(self._data.schema)
+
+    @property
+    def n_buffers(self) -> int:


Suggested change

def n_buffers(self) -> int:

def num_buffers(self) -> int:

You should consider my n vs num suggestion optional, but I do think its best to be consistent. I personally try to avoid as much ambiguity when naming things, hence my minor preference for num.

danepitkin

LGTM! Just a few small optional comments

danepitkin · 2024-03-19T21:24:48Z

python/src/nanoarrow/_lib.pyx

+        cdef int array_i
+        cdef const int64_t* sorted_offsets = <int64_t*>self._array_ends._ptr.data
+
+        if not isinstance(k, slice):


nit: I like to group errors close to the logic e.g.

Suggested change

if not isinstance(k, slice):

if isinstance(k, slice):

raise NotImplementedError("index with slice")

danepitkin · 2024-03-19T21:30:38Z

python/src/nanoarrow/_lib.pyx

+        return self._schema
+
+    @property
+    def array_ends(self):


nit: should array_ends be encapsulated as an implementation detail?

paleolimbot · 2024-03-21T19:00:19Z

I'm going to merge this to move on with subsequent work...there's plenty of time to sort out details and I'm happy to follow up with PRs incorporating any further comments!

This PR implements the `nanoarrow.Array` which basically a `pyarrow.ChunkedArray`. This can represent a `Table`, `RecordBatch`, `ChunkedArray`, and `Array`. It doesn't quite play nicely with pyarrow's ChunkedArray (but will after the next release, since `__arrow_c_stream__` was just added there). The user-facing class is backed by a Cython class, the `CMaterializedArrayStream`, which manages some of the c-level details like resolving a chunk + offset when there is more than one chunk in the array. An early version of this PR implemented the `CMaterializedArrayStream` using C pointers (e.g., `ArrowArray* arrays`), but I decided that was to complex and went back to `List[CArray]`. I think this is also better for managing ownership (e.g., more unneeded `CArray` instances can be released by the garbage collector). The `Array` class as implemented here is device-aware, although until we have non-CPU support it's difficult to test this. The methods I added here are basically stubs just to demonstrate the intention. This PR also implements the `Scalar`, whose main purpose for testing and other non-performance sensitive things (like lazier reprs for very large items or interactive inspection). They may also be useful for working with arrays that contain elements with very long strings or large arrays (e.g., geometry). I also added some basic accessors like `buffer()`, `child()`, and some ways one might want to iterate over an `Array` to make the utility of this class more clear. Basic usage: ```python import nanoarrow as na na.Array(range(100), na.int64()) ``` ``` nanoarrow.Array<int64>[100] 0 1 2 3 4 5 6 7 8 9 ...and 90 more items ``` More involved example reading from an IPC stream: ```python import nanoarrow as na from nanoarrow.ipc import Stream url = "https://github.com/apache/arrow-testing/raw/master/data/arrow-ipc-stream/integration/1.0.0-littleendian/generated_primitive.stream" with Stream.from_url(url) as inp: array = na.Array(inp) array.child(25) ``` ``` nanoarrow.Array<string>[37] 'co矢2p矢m' 'wÂ€acrd' 'kjd1dlô' 'pib矢d5w' '6nnpwôg' 'ndj£h£4' 'ôôf4aµg' 'kwÂh£fr' '°g5dk€e' 'r€cbmdn' ...and 27 more items ``` --------- Co-authored-by: Joris Van den Bossche <[email protected]>

paleolimbot and others added 16 commits March 1, 2024 11:49

first pass at arrays

1d01e20

some interop for the array

ea3d3de

a few tests

7d77d48

check array

cd4ec88

scalar

89af676

refactor the materialized array stream to prepare for scalars

2d3a531

some scalar things

4f5e729

update length method

8c77260

reprs

8627704

make sure gil is aquired in ipc reader

e65c333

fix one more

2c7b8f8

start documenting things

1eb5fba

more docs

1c788b1

simplify materialized array

40b2748

vague device awareness

75236b1

pre-commit

6a80275

paleolimbot commented Mar 6, 2024

View reviewed changes

paleolimbot marked this pull request as ready for review March 6, 2024 18:23

jorisvandenbossche reviewed Mar 7, 2024

View reviewed changes

danepitkin reviewed Mar 7, 2024

View reviewed changes

paleolimbot and others added 2 commits March 8, 2024 14:57

fix iterator for timestamp

cee5458

Update python/src/nanoarrow/array.py

2a8caa6

Co-authored-by: Joris Van den Bossche <[email protected]>

paleolimbot added 4 commits March 11, 2024 12:07

scalar repr

9928651

better repr

c22f5d1

docs

4b8f275

fix pre-commit

21bd84c

paleolimbot added 2 commits March 11, 2024 22:29

iterative chunk resolving

d283736

flake8

92b4187

paleolimbot added 6 commits March 11, 2024 22:37

n_chunks -> num_chunks

33916a4

better names for iterators

7d32b72

rename more iterator

d8d0614

switch back to n_chunks

d7aac4f

stub some accessors

a48fc45

stub child

efc1302

danepitkin reviewed Mar 12, 2024

View reviewed changes

paleolimbot added 9 commits March 12, 2024 16:24

test buffers

351dd2c

test buffers and children

e10781b

more testing

ea48978

more missing coverage

b7c0ed7

doctest

66ba4eb

document user-facing methods

b75b35f

a few more docs

646eceb

remove timezone support

8ed89a9

add test for chunk resolver

3041a02

danepitkin reviewed Mar 13, 2024

View reviewed changes

paleolimbot mentioned this pull request Mar 19, 2024

[Python] Conventions around PyCapsule Interface and choosing Array/Stream export apache/arrow#40648

Open

danepitkin approved these changes Mar 19, 2024

View reviewed changes

suggestions

38398a7

paleolimbot merged commit 7af6dff into apache:main Mar 21, 2024
31 checks passed

paleolimbot deleted the python-user-array branch April 8, 2024 12:19

paleolimbot added this to the nanoarrow 0.5.0 milestone May 22, 2024



		class Scalar:
		"""Generic wrapper around an :class:`Array` element

	if not isinstance(k, slice):
	if isinstance(k, slice):
	raise NotImplementedError("index with slice")

feat(python): Add user-facing Array class #396

feat(python): Add user-facing Array class #396

Conversation

paleolimbot commented Mar 1, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisvandenbossche commented Mar 7, 2024

jorisvandenbossche left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

danepitkin commented Mar 7, 2024

danepitkin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

paleolimbot commented Mar 11, 2024

paleolimbot commented Mar 11, 2024

codecov-commenter commented Mar 12, 2024 • edited Loading

Codecov Report

Choose a reason for hiding this comment

paleolimbot commented Mar 13, 2024

danepitkin commented Mar 13, 2024

danepitkin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

danepitkin Mar 14, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

danepitkin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

paleolimbot commented Mar 21, 2024

feat(python): Add user-facing `Array` class #396

feat(python): Add user-facing `Array` class #396

paleolimbot commented Mar 1, 2024 •

edited

Loading

codecov-commenter commented Mar 12, 2024 •

edited

Loading

danepitkin Mar 14, 2024 •

edited

Loading