Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python): Add user-facing Array class #396

Merged
merged 46 commits into from
Mar 21, 2024

Conversation

paleolimbot
Copy link
Member

@paleolimbot paleolimbot commented Mar 1, 2024

This PR implements the nanoarrow.Array which basically a pyarrow.ChunkedArray. This can represent a Table, RecordBatch, ChunkedArray, and Array. It doesn't quite play nicely with pyarrow's ChunkedArray (but will after the next release, since __arrow_c_stream__ was just added there).

The user-facing class is backed by a Cython class, the CMaterializedArrayStream, which manages some of the c-level details like resolving a chunk + offset when there is more than one chunk in the array. An early version of this PR implemented the CMaterializedArrayStream using C pointers (e.g., ArrowArray* arrays), but I decided that was to complex and went back to List[CArray]. I think this is also better for managing ownership (e.g., more unneeded CArray instances can be released by the garbage collector).

The Array class as implemented here is device-aware, although until we have non-CPU support it's difficult to test this. The methods I added here are basically stubs just to demonstrate the intention.

This PR also implements the Scalar, whose main purpose for testing and other non-performance sensitive things (like lazier reprs for very large items or interactive inspection). They may also be useful for working with arrays that contain elements with very long strings or large arrays (e.g., geometry).

I also added some basic accessors like buffer(), child(), and some ways one might want to iterate over an Array to make the utility of this class more clear.

Basic usage:

import nanoarrow as na

na.Array(range(100), na.int64())
nanoarrow.Array<int64>[100]
0
1
2
3
4
5
6
7
8
9
...and 90 more items

More involved example reading from an IPC stream:

import nanoarrow as na
from nanoarrow.ipc import Stream

url = "https://github.com/apache/arrow-testing/raw/master/data/arrow-ipc-stream/integration/1.0.0-littleendian/generated_primitive.stream"

with Stream.from_url(url) as inp:
    array = na.Array(inp)

array.child(25)
nanoarrow.Array<string>[37]
'co矢2p矢m'
'w€acrd'
'kjd1dlô'
'pib矢d5w'
'6nnpwôg'
'ndj£h£4'
'ôôf4aµg'
'kwÂh£fr'
'°g5dk€e'
'r€cbmdn'
...and 27 more items

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes are because Array(Stream.from_url(...)) crashed due to these methods not managing the GIL properly. I can move these to a separate PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think leaving it in is fine!

@paleolimbot paleolimbot marked this pull request as ready for review March 6, 2024 18:23
@jorisvandenbossche
Copy link
Member

An early version of this PR implemented the CMaterializedArrayStream using C pointers (e.g., ArrowArray* arrays), but I decided that was to complex and went back to List[CArray]. I think this is also better for managing ownership (e.g., more unneeded CArray instances can be released by the garbage collector).

If it would at some point turn out to be a perf bottleneck to use an untyped list, we can also consider compiling the cython code to C++ instead of C and then can use things like vector

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!
Some general thoughts:

  • I think I certainly like having a "container" Array class that hides the complexity of Array vs ChunkedArray like we have in pyarrow. For certain use cases, that is definitely a lot easier to work with.

  • But one downside of this is that there still isn't an easy to use but still low-level "ArrowArray" class that combines some features from CArray and CArrayView. For other use cases, it would be nice to dig into the object you have, check the buffers, understand what data you got, etc. Right now the Array class still gives you somewhat access to the CArray (see bullet point below as well), but when getting that you still need to do na.c_array_view(c_arr) to then get access to eg the Buffers.

  • I am the least sure (or enthusiastic, that's probably the better word ;)) about adding a Scalar class. Probably there is no way around needing it, if you want to be able to represent single values while adhering to the C interface type system. But it would also be nice to keep it as limited as possible (for example, do we think someone should ever iterate over all values of the Array as Scalar objects?)

  • Do we still want to expose the underlying CArrays a bit more publicly? Right now I can get them with list(arr._data.arrays), but we should maybe make that public (maybe a data attribute that returns that list? Although "data" is of course a very generic term ;))



class Scalar:
"""Generic wrapper around an :class:`Array` element
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the scalar class implement the __arrow_c_array/schema__ dunders? (like the underlying CScalar does)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And is it needed to have an additional class that wraps CScalar?
It feels as some unnecessary complexity, but I assume there are some things that need to be done in cython (the __arrow_c_array__ impl), while some other things here require the python classes (eg Schema).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I didn't think of it when I wrote this, but the CScalar probably should not implement __arrow_c_array__ because it is not necessarily representing a CPU array. To me the main utility of the Scalar class is on the interactive end of things (where the non-Cython wrapper gives a better doc/IDE experience).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm...then the CArray would face the same problem and it would be very strange if the CArray did not implement __arrow_c_array__. I'll add it to the Scalar for now...we probably need the device_id and device_type in the CArray 🙁

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call...I removed the CScalar!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I didn't think of it when I wrote this, but the CScalar probably should not implement __arrow_c_array__ because it is not necessarily representing a CPU array.

Yeah, it's true that things are not fully correct regarding devices right now, but I think that is something we can tackle later. i.e. when there is better device support, we would add __arrow_c_device_array__ in addition, which always works, and let __arrow_c_array__ raise an error or something like that.

The same happens in pyarrow actually. A pyarrow.Array can right now be backed by non-CPU data, but we still have a _arrow_c_array__ defined on it.

2
3
"""
return Array(obj, schema=schema)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the array function is essentially just an alias for Array, there is not that much value to it. But I assume the intent is that later this constructor can be expanded?

For example, I assumed that I would be able to do something like

na.array([1,2,3])

but that doesn't seem to work (yet)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, the error message is a bit confusing:

In [36]: na.array([1,2,3])
...
TypeError: Can't convert object of type list to nanoarrow.c_array_stream or nanoarrow.c_array

because the above actually does work when specifying a type:

In [35]: na.array([1,2,3], na.int64())
Out[35]: 
Array<INT64>[3]
1
2
3

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I modelled this after the fact that we have na.schema(). I would be game to drop na.schema() and na.array() in favour of the regular class constructor!

Good call on the error!


cdef int _resolve_chunk(self, const int64_t* sorted_offsets, int64_t index, int64_t start_offset_i,
int64_t end_offset_i) noexcept nogil:
if start_offset_i >= (end_offset_i - 1):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a bit of documentation here about the keywords?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional: i challenge you to implement this iteratively instead of recursively to save on stack space usage ;)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call! Just leaving myself a mental note that the Arrow C++ implementation is here: https://github.com/apache/arrow/blob/1eb46f763a73d313466fdc895eae1f35fac37945/cpp/src/arrow/chunk_resolver.h#L133-L162

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this is now ArrowResolveChunk64() in the C library. It will also be needed for random access into REE arrays and I'd like to use it in the R package to do something similar to this PR in the R package.

kint = k
if kint < 0:
kint += self._ptr.length
return CScalar(self, kint)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed for CArray? We could also leave this for just the higher-level Array object to give scalars?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call! The CScalar is no more!

@danepitkin
Copy link
Member

I am the least sure (or enthusiastic, that's probably the better word ;)) about adding a Scalar class. Probably there is no way around needing it, if you want to be able to represent single values while adhering to the C interface type system. But it would also be nice to keep it as limited as possible (for example, do we think someone should ever iterate over all values of the Array as Scalar objects?)

This idea came out of #397 where I suggested having types implement their own reprs vs having a top-down function that introspects how to represent types. Does this feel like a good use case for the Scalar class?

Copy link
Member

@danepitkin danepitkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think leaving it in is fine!

return self._obj

@property
def close_stream(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this object hold close_stream state? Do we not want py_input_stream_release to always close the stream?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It currently only closes a file object if the higher level Python wrapping this was the code that opened it (e.g., it won't close a connection that a user opened in case the user had other plans or the IPC data was only part of the file payload).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably this should be close_file_obj to disambiguate the three types of streams we have going on here (ArrowArray, ArrowIpcInput, Python file object).


This is an implementation of a "Scalar" that is a thin
wrapper around a (Python) CArray with the ability to
materialize a length-one array.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clever! :)


cdef int _resolve_chunk(self, const int64_t* sorted_offsets, int64_t index, int64_t start_offset_i,
int64_t end_offset_i) noexcept nogil:
if start_offset_i >= (end_offset_i - 1):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional: i challenge you to implement this iteratively instead of recursively to save on stack space usage ;)

return next(iterator(self._c_scalar))

def __repr__(self) -> str:
width_hint = 80
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make the width_hint configureable via an additional api such as to_string()? Then repr can call self.to_string(length=80)

return Schema(self._data.schema)

@property
def n_chunks(self) -> int:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd prefer num_chunks for naming

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call! That's the pa.ChunkedArray name too 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went back to n_chunks here because everywhere else in the package uses n_xxxxx. I can't revert some of those because they're the same as the ArrowArray/ArrowSchema C structs and it would be strange if they were different 🤷

return self._data.n_arrays

@property
def chunks(self) -> Iterable:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think iter_chunks() could be a better name and is better defined as a function. I typically think of properties as a static value.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For better or worse, nanoarrow is pretty deep in the "properties return iterators" pattern (e.g., schema.children and many others). pyarrow's ChunkedArray has a chunks property that returns a list() rather than an iterator, which was the vague inspiration. For now maybe I'll keep this returning an iterator but I'm definitely down to circle back and fix all the existing usage since I agree that it's a potentially confusing use of the @property.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just kidding...good call! I organized all the fun ways you can iterate over an Array into a bunch of iter_xxx() methods, including iter_chunks().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For better or worse, nanoarrow is pretty deep in the "properties return iterators" pattern (e.g., schema.children and many others). pyarrow's ChunkedArray has a chunks property that returns a list() rather than an iterator, which was the vague inspiration.

Is there a performance reason that you prefer to work with iterators?
In most of those cases, the data already exists (most what might happen is that it gets wrapped in a new class), and so all those things should be pretty cheap?

If that's the case, I think it would actually be nice to keep the properties but let them just return a list.

@paleolimbot
Copy link
Member Author

I am the least sure (or enthusiastic, that's probably the better word ;)) about adding a Scalar class.

I tried very hard to avoid it because I also highly dislike whenever I arrive at a Arrow C++/pyarrow's scalar. I think once the scalar has some more functionality (e.g., getting struct children, or list elements) it will be a little more clear. The main thing I wanted to use it for, as Dane mentioned, is an abbreviated repr implementation that wasn't severely misusing iterators; however, I think they are (or will be) useful in any context where somebody is interactively exploring something.

do we think someone should ever iterate over all values of the Array as Scalar objects?

Not really, no. I suppose __iter__ could error with a message listing some more useful ways in which one could iterate over an Array (chunks, scalars, python objects, buffers). That would keep the concept but prevent things like list(array) from doing something slow and not very useful.

@paleolimbot
Copy link
Member Author

(Sorry for the spam, I keep finding things I haven't responded to!)

But one downside of this is that there still isn't an easy to use but still low-level "ArrowArray"

There's definitely room to expand any one of the Array, the CArray, or the CArrayView to fit what people actually want to do with nanoarrow in Python. For me the CArrayView is more of an internal tool to reduce the amount of C anybody needs to write to expose iterators...it's best used if you instantiate it once and reuse it for all the chunks in your array. I'll workshop some buffer iteration and see if that solves at least part of your desire for introspecting the data!

@codecov-commenter
Copy link

codecov-commenter commented Mar 12, 2024

Codecov Report

Attention: Patch coverage is 90.37037% with 26 lines in your changes are missing coverage. Please review.

Project coverage is 88.75%. Comparing base (7cf50a3) to head (3041a02).
Report is 3 commits behind head on main.

Files Patch % Lines
python/src/nanoarrow/_ipc_lib.pyx 38.70% 19 Missing ⚠️
python/src/nanoarrow/_lib.pyx 92.95% 5 Missing ⚠️
python/src/nanoarrow/array.py 99.30% 1 Missing ⚠️
python/src/nanoarrow/iterator.py 85.71% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #396      +/-   ##
==========================================
+ Coverage   88.74%   88.75%   +0.01%     
==========================================
  Files          81       82       +1     
  Lines       14398    14634     +236     
==========================================
+ Hits        12778    12989     +211     
- Misses       1620     1645      +25     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -28,6 +28,28 @@
extern "C" {
#endif

// Modified from Arrow C++ (1eb46f76) cpp/src/arrow/chunk_resolver.h#L133-L162
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@paleolimbot
Copy link
Member Author

@danepitkin @jorisvandenbossche With apologies for the roller coaster of commits today, I think I made it through all the comments. I updated the description up top too, but basically I added a few things like child() and buffer() and iter_buffers() to expose some utility (and make it more clear what the Array is actually supposed to do.

@danepitkin
Copy link
Member

From a user perspective, I would guess that most all python users want to interact with objects like Array and Scalar, while seeing other objects as implementation details to be avoided, such as CArray and CMaterializedArray.

Copy link
Member

@danepitkin danepitkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, this PR is beautiful. I gave it a quick pass and overall love the direction its taken. It generally LGTM, but I'll go back and review the smaller details while giving Joris a chance for a re-review as well.

return self._arrays[i]

@property
def n_arrays(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def n_arrays(self):
def num_arrays(self):

just if we want to be consistent with num_chunks

Copy link
Member

@danepitkin danepitkin Mar 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think num_chunks was removed in a follow up commit already!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure they're all n_xxx now but will double check! I agree that num_xxx is better, but that's a separate change to consider since we already have CArray.n_children, CArray.n_buffers, CArrayView.n_buffers, CArrayView.n_children, Schema.n_fields, CSchema.n_fields, CLayout.n_buffers, etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries, if there's already a precedent then you can ignore me. n_* isn't bad either

return Schema(self._data.schema)

@property
def n_buffers(self) -> int:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def n_buffers(self) -> int:
def num_buffers(self) -> int:

You should consider my n vs num suggestion optional, but I do think its best to be consistent. I personally try to avoid as much ambiguity when naming things, hence my minor preference for num.

Copy link
Member

@danepitkin danepitkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just a few small optional comments

cdef int array_i
cdef const int64_t* sorted_offsets = <int64_t*>self._array_ends._ptr.data

if not isinstance(k, slice):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I like to group errors close to the logic e.g.

Suggested change
if not isinstance(k, slice):
if isinstance(k, slice):
raise NotImplementedError("index with slice")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

return self._schema

@property
def array_ends(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should array_ends be encapsulated as an implementation detail?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@paleolimbot
Copy link
Member Author

I'm going to merge this to move on with subsequent work...there's plenty of time to sort out details and I'm happy to follow up with PRs incorporating any further comments!

@paleolimbot paleolimbot merged commit 7af6dff into apache:main Mar 21, 2024
31 checks passed
@paleolimbot paleolimbot deleted the python-user-array branch April 8, 2024 12:19
eddelbuettel pushed a commit to eddelbuettel/arrow-nanoarrow that referenced this pull request Apr 10, 2024
This PR implements the `nanoarrow.Array` which basically a
`pyarrow.ChunkedArray`. This can represent a `Table`, `RecordBatch`,
`ChunkedArray`, and `Array`. It doesn't quite play nicely with pyarrow's
ChunkedArray (but will after the next release, since
`__arrow_c_stream__` was just added there).

The user-facing class is backed by a Cython class, the
`CMaterializedArrayStream`, which manages some of the c-level details
like resolving a chunk + offset when there is more than one chunk in the
array. An early version of this PR implemented the
`CMaterializedArrayStream` using C pointers (e.g., `ArrowArray*
arrays`), but I decided that was to complex and went back to
`List[CArray]`. I think this is also better for managing ownership
(e.g., more unneeded `CArray` instances can be released by the garbage
collector).

The `Array` class as implemented here is device-aware, although until we
have non-CPU support it's difficult to test this. The methods I added
here are basically stubs just to demonstrate the intention.

This PR also implements the `Scalar`, whose main purpose for testing and
other non-performance sensitive things (like lazier reprs for very large
items or interactive inspection). They may also be useful for working
with arrays that contain elements with very long strings or large arrays
(e.g., geometry).

I also added some basic accessors like `buffer()`, `child()`, and some
ways one might want to iterate over an `Array` to make the utility of
this class more clear.

Basic usage:

```python
import nanoarrow as na

na.Array(range(100), na.int64())
```

```
nanoarrow.Array<int64>[100]
0
1
2
3
4
5
6
7
8
9
...and 90 more items
```

More involved example reading from an IPC stream:

```python
import nanoarrow as na
from nanoarrow.ipc import Stream

url = "https://github.com/apache/arrow-testing/raw/master/data/arrow-ipc-stream/integration/1.0.0-littleendian/generated_primitive.stream"

with Stream.from_url(url) as inp:
    array = na.Array(inp)

array.child(25)
```

```
nanoarrow.Array<string>[37]
'co矢2p矢m'
'w€acrd'
'kjd1dlô'
'pib矢d5w'
'6nnpwôg'
'ndj£h£4'
'ôôf4aµg'
'kwÂh£fr'
'°g5dk€e'
'r€cbmdn'
...and 27 more items
```

---------

Co-authored-by: Joris Van den Bossche <[email protected]>
@paleolimbot paleolimbot added this to the nanoarrow 0.5.0 milestone May 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants