-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(python): Add user-facing Array
class
#396
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes are because Array(Stream.from_url(...))
crashed due to these methods not managing the GIL properly. I can move these to a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think leaving it in is fine!
If it would at some point turn out to be a perf bottleneck to use an untyped list, we can also consider compiling the cython code to C++ instead of C and then can use things like vector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work!
Some general thoughts:
-
I think I certainly like having a "container" Array class that hides the complexity of Array vs ChunkedArray like we have in pyarrow. For certain use cases, that is definitely a lot easier to work with.
-
But one downside of this is that there still isn't an easy to use but still low-level "ArrowArray" class that combines some features from
CArray
andCArrayView
. For other use cases, it would be nice to dig into the object you have, check the buffers, understand what data you got, etc. Right now the Array class still gives you somewhat access to the CArray (see bullet point below as well), but when getting that you still need to dona.c_array_view(c_arr)
to then get access to eg the Buffers. -
I am the least sure (or enthusiastic, that's probably the better word ;)) about adding a Scalar class. Probably there is no way around needing it, if you want to be able to represent single values while adhering to the C interface type system. But it would also be nice to keep it as limited as possible (for example, do we think someone should ever iterate over all values of the Array as Scalar objects?)
-
Do we still want to expose the underlying CArrays a bit more publicly? Right now I can get them with
list(arr._data.arrays)
, but we should maybe make that public (maybe adata
attribute that returns that list? Although "data" is of course a very generic term ;))
|
||
|
||
class Scalar: | ||
"""Generic wrapper around an :class:`Array` element |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the scalar class implement the __arrow_c_array/schema__
dunders? (like the underlying CScalar does)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And is it needed to have an additional class that wraps CScalar?
It feels as some unnecessary complexity, but I assume there are some things that need to be done in cython (the __arrow_c_array__
impl), while some other things here require the python classes (eg Schema).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I didn't think of it when I wrote this, but the CScalar
probably should not implement __arrow_c_array__
because it is not necessarily representing a CPU array. To me the main utility of the Scalar class is on the interactive end of things (where the non-Cython wrapper gives a better doc/IDE experience).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm...then the CArray
would face the same problem and it would be very strange if the CArray
did not implement __arrow_c_array__
. I'll add it to the Scalar
for now...we probably need the device_id
and device_type
in the CArray
🙁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call...I removed the CScalar!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I didn't think of it when I wrote this, but the
CScalar
probably should not implement__arrow_c_array__
because it is not necessarily representing a CPU array.
Yeah, it's true that things are not fully correct regarding devices right now, but I think that is something we can tackle later. i.e. when there is better device support, we would add __arrow_c_device_array__
in addition, which always works, and let __arrow_c_array__
raise an error or something like that.
The same happens in pyarrow actually. A pyarrow.Array can right now be backed by non-CPU data, but we still have a _arrow_c_array__
defined on it.
python/src/nanoarrow/array.py
Outdated
2 | ||
3 | ||
""" | ||
return Array(obj, schema=schema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the array
function is essentially just an alias for Array
, there is not that much value to it. But I assume the intent is that later this constructor can be expanded?
For example, I assumed that I would be able to do something like
na.array([1,2,3])
but that doesn't seem to work (yet)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, the error message is a bit confusing:
In [36]: na.array([1,2,3])
...
TypeError: Can't convert object of type list to nanoarrow.c_array_stream or nanoarrow.c_array
because the above actually does work when specifying a type:
In [35]: na.array([1,2,3], na.int64())
Out[35]:
Array<INT64>[3]
1
2
3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I modelled this after the fact that we have na.schema()
. I would be game to drop na.schema()
and na.array()
in favour of the regular class constructor!
Good call on the error!
python/src/nanoarrow/_lib.pyx
Outdated
|
||
cdef int _resolve_chunk(self, const int64_t* sorted_offsets, int64_t index, int64_t start_offset_i, | ||
int64_t end_offset_i) noexcept nogil: | ||
if start_offset_i >= (end_offset_i - 1): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a bit of documentation here about the keywords?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
optional: i challenge you to implement this iteratively instead of recursively to save on stack space usage ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call! Just leaving myself a mental note that the Arrow C++ implementation is here: https://github.com/apache/arrow/blob/1eb46f763a73d313466fdc895eae1f35fac37945/cpp/src/arrow/chunk_resolver.h#L133-L162
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, this is now ArrowResolveChunk64()
in the C library. It will also be needed for random access into REE arrays and I'd like to use it in the R package to do something similar to this PR in the R package.
python/src/nanoarrow/_lib.pyx
Outdated
kint = k | ||
if kint < 0: | ||
kint += self._ptr.length | ||
return CScalar(self, kint) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed for CArray? We could also leave this for just the higher-level Array object to give scalars?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call! The CScalar is no more!
This idea came out of #397 where I suggested having types implement their own |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome work!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think leaving it in is fine!
python/src/nanoarrow/_ipc_lib.pyx
Outdated
return self._obj | ||
|
||
@property | ||
def close_stream(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this object hold close_stream
state? Do we not want py_input_stream_release
to always close the stream?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It currently only closes a file object if the higher level Python wrapping this was the code that opened it (e.g., it won't close a connection that a user opened in case the user had other plans or the IPC data was only part of the file payload).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably this should be close_file_obj
to disambiguate the three types of streams we have going on here (ArrowArray, ArrowIpcInput, Python file object).
python/src/nanoarrow/_lib.pyx
Outdated
|
||
This is an implementation of a "Scalar" that is a thin | ||
wrapper around a (Python) CArray with the ability to | ||
materialize a length-one array. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clever! :)
python/src/nanoarrow/_lib.pyx
Outdated
|
||
cdef int _resolve_chunk(self, const int64_t* sorted_offsets, int64_t index, int64_t start_offset_i, | ||
int64_t end_offset_i) noexcept nogil: | ||
if start_offset_i >= (end_offset_i - 1): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
optional: i challenge you to implement this iteratively instead of recursively to save on stack space usage ;)
python/src/nanoarrow/array.py
Outdated
return next(iterator(self._c_scalar)) | ||
|
||
def __repr__(self) -> str: | ||
width_hint = 80 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make the width_hint configureable via an additional api such as to_string()
? Then repr
can call self.to_string(length=80)
python/src/nanoarrow/array.py
Outdated
return Schema(self._data.schema) | ||
|
||
@property | ||
def n_chunks(self) -> int: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I'd prefer num_chunks
for naming
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call! That's the pa.ChunkedArray
name too 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went back to n_chunks
here because everywhere else in the package uses n_xxxxx
. I can't revert some of those because they're the same as the ArrowArray
/ArrowSchema
C structs and it would be strange if they were different 🤷
python/src/nanoarrow/array.py
Outdated
return self._data.n_arrays | ||
|
||
@property | ||
def chunks(self) -> Iterable: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think iter_chunks()
could be a better name and is better defined as a function. I typically think of properties as a static value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For better or worse, nanoarrow is pretty deep in the "properties return iterators" pattern (e.g., schema.children
and many others). pyarrow's ChunkedArray has a chunks
property that returns a list()
rather than an iterator, which was the vague inspiration. For now maybe I'll keep this returning an iterator but I'm definitely down to circle back and fix all the existing usage since I agree that it's a potentially confusing use of the @property
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just kidding...good call! I organized all the fun ways you can iterate over an Array
into a bunch of iter_xxx()
methods, including iter_chunks()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For better or worse, nanoarrow is pretty deep in the "properties return iterators" pattern (e.g.,
schema.children
and many others). pyarrow's ChunkedArray has achunks
property that returns alist()
rather than an iterator, which was the vague inspiration.
Is there a performance reason that you prefer to work with iterators?
In most of those cases, the data already exists (most what might happen is that it gets wrapped in a new class), and so all those things should be pretty cheap?
If that's the case, I think it would actually be nice to keep the properties but let them just return a list.
Co-authored-by: Joris Van den Bossche <[email protected]>
I tried very hard to avoid it because I also highly dislike whenever I arrive at a Arrow C++/pyarrow's scalar. I think once the scalar has some more functionality (e.g., getting struct children, or list elements) it will be a little more clear. The main thing I wanted to use it for, as Dane mentioned, is an abbreviated repr implementation that wasn't severely misusing iterators; however, I think they are (or will be) useful in any context where somebody is interactively exploring something.
Not really, no. I suppose |
(Sorry for the spam, I keep finding things I haven't responded to!)
There's definitely room to expand any one of the |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #396 +/- ##
==========================================
+ Coverage 88.74% 88.75% +0.01%
==========================================
Files 81 82 +1
Lines 14398 14634 +236
==========================================
+ Hits 12778 12989 +211
- Misses 1620 1645 +25 ☔ View full report in Codecov by Sentry. |
@@ -28,6 +28,28 @@ | |||
extern "C" { | |||
#endif | |||
|
|||
// Modified from Arrow C++ (1eb46f76) cpp/src/arrow/chunk_resolver.h#L133-L162 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
@danepitkin @jorisvandenbossche With apologies for the roller coaster of commits today, I think I made it through all the comments. I updated the description up top too, but basically I added a few things like |
From a user perspective, I would guess that most all python users want to interact with objects like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, this PR is beautiful. I gave it a quick pass and overall love the direction its taken. It generally LGTM, but I'll go back and review the smaller details while giving Joris a chance for a re-review as well.
return self._arrays[i] | ||
|
||
@property | ||
def n_arrays(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def n_arrays(self): | |
def num_arrays(self): |
just if we want to be consistent with num_chunks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I think num_chunks was removed in a follow up commit already!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure they're all n_xxx
now but will double check! I agree that num_xxx
is better, but that's a separate change to consider since we already have CArray.n_children
, CArray.n_buffers
, CArrayView.n_buffers
, CArrayView.n_children
, Schema.n_fields
, CSchema.n_fields
, CLayout.n_buffers
, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries, if there's already a precedent then you can ignore me. n_* isn't bad either
return Schema(self._data.schema) | ||
|
||
@property | ||
def n_buffers(self) -> int: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def n_buffers(self) -> int: | |
def num_buffers(self) -> int: |
You should consider my n
vs num
suggestion optional, but I do think its best to be consistent. I personally try to avoid as much ambiguity when naming things, hence my minor preference for num
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Just a few small optional comments
python/src/nanoarrow/_lib.pyx
Outdated
cdef int array_i | ||
cdef const int64_t* sorted_offsets = <int64_t*>self._array_ends._ptr.data | ||
|
||
if not isinstance(k, slice): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I like to group errors close to the logic e.g.
if not isinstance(k, slice): | |
if isinstance(k, slice): | |
raise NotImplementedError("index with slice") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
python/src/nanoarrow/_lib.pyx
Outdated
return self._schema | ||
|
||
@property | ||
def array_ends(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: should array_ends
be encapsulated as an implementation detail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
I'm going to merge this to move on with subsequent work...there's plenty of time to sort out details and I'm happy to follow up with PRs incorporating any further comments! |
This PR implements the `nanoarrow.Array` which basically a `pyarrow.ChunkedArray`. This can represent a `Table`, `RecordBatch`, `ChunkedArray`, and `Array`. It doesn't quite play nicely with pyarrow's ChunkedArray (but will after the next release, since `__arrow_c_stream__` was just added there). The user-facing class is backed by a Cython class, the `CMaterializedArrayStream`, which manages some of the c-level details like resolving a chunk + offset when there is more than one chunk in the array. An early version of this PR implemented the `CMaterializedArrayStream` using C pointers (e.g., `ArrowArray* arrays`), but I decided that was to complex and went back to `List[CArray]`. I think this is also better for managing ownership (e.g., more unneeded `CArray` instances can be released by the garbage collector). The `Array` class as implemented here is device-aware, although until we have non-CPU support it's difficult to test this. The methods I added here are basically stubs just to demonstrate the intention. This PR also implements the `Scalar`, whose main purpose for testing and other non-performance sensitive things (like lazier reprs for very large items or interactive inspection). They may also be useful for working with arrays that contain elements with very long strings or large arrays (e.g., geometry). I also added some basic accessors like `buffer()`, `child()`, and some ways one might want to iterate over an `Array` to make the utility of this class more clear. Basic usage: ```python import nanoarrow as na na.Array(range(100), na.int64()) ``` ``` nanoarrow.Array<int64>[100] 0 1 2 3 4 5 6 7 8 9 ...and 90 more items ``` More involved example reading from an IPC stream: ```python import nanoarrow as na from nanoarrow.ipc import Stream url = "https://github.com/apache/arrow-testing/raw/master/data/arrow-ipc-stream/integration/1.0.0-littleendian/generated_primitive.stream" with Stream.from_url(url) as inp: array = na.Array(inp) array.child(25) ``` ``` nanoarrow.Array<string>[37] 'co矢2p矢m' 'w€acrd' 'kjd1dlô' 'pib矢d5w' '6nnpwôg' 'ndj£h£4' 'ôôf4aµg' 'kwÂh£fr' '°g5dk€e' 'r€cbmdn' ...and 27 more items ``` --------- Co-authored-by: Joris Van den Bossche <[email protected]>
This PR implements the
nanoarrow.Array
which basically apyarrow.ChunkedArray
. This can represent aTable
,RecordBatch
,ChunkedArray
, andArray
. It doesn't quite play nicely with pyarrow's ChunkedArray (but will after the next release, since__arrow_c_stream__
was just added there).The user-facing class is backed by a Cython class, the
CMaterializedArrayStream
, which manages some of the c-level details like resolving a chunk + offset when there is more than one chunk in the array. An early version of this PR implemented theCMaterializedArrayStream
using C pointers (e.g.,ArrowArray* arrays
), but I decided that was to complex and went back toList[CArray]
. I think this is also better for managing ownership (e.g., more unneededCArray
instances can be released by the garbage collector).The
Array
class as implemented here is device-aware, although until we have non-CPU support it's difficult to test this. The methods I added here are basically stubs just to demonstrate the intention.This PR also implements the
Scalar
, whose main purpose for testing and other non-performance sensitive things (like lazier reprs for very large items or interactive inspection). They may also be useful for working with arrays that contain elements with very long strings or large arrays (e.g., geometry).I also added some basic accessors like
buffer()
,child()
, and some ways one might want to iterate over anArray
to make the utility of this class more clear.Basic usage:
More involved example reading from an IPC stream: