Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(r): Provide LinkingTo headers for extension packages #332

Merged
merged 11 commits into from
Jan 8, 2024

Conversation

paleolimbot
Copy link
Member

In developing geoarrow, an R extension that imports/exports a number of Arrow C data structures wrapped by nanoarrow S3 objects, it has become apparent that the sanitize and allocate operations are non-trivial and basically have to be copied in every package that wants to import/export nanoarrow things. @eddelbuettel has run up against some valid use-cases as well! eddelbuettel/linesplitter#1 , #187

This PR makes the definition of how Arrow C Data interface objects are encoded as R external pointers available as a public header (such that downstream packages can LinkingTo: nanoarrow and #include <nanoarrow/r.h>. I think the initial target will just be allocate an owning external pointer and sanitize an input SEXP.

@eddelbuettel Are there any other operations that are blocking any of your projects that would be must-haves in this header?

(I know it's missing array_stream...I forgot about the ability to supply R finalizers and so it's a slightly more complicated change)

@eddelbuettel
Copy link
Contributor

We are making use of Arrow exports from C++ using just the void* pointer. What we have was written before / without nanoarrow but I have started to bring it on the R bindings site. There are a number of things there I'd like to revisit / extend. From my looking around at some of the packages I see using nanoarrow (including adbc* and duckdb) I have the feeling that there are commonalities: those packages may benefit too.

I quite like nanoarrow but (if I may) also find it really frustrating at times because it essentially is three somewhat distinct and non-overlapping parts. First, the vendorable header / c file pair which are great. Then the R package which is nice for some things from R but which, frustratingly does not (did not?) make any tools available for re-use in conjunction with the vendorable part. And then there is the budding Python package which may one day cover what the R package does today, and also provide object creation help.

Now this PR looks promising/ I will look more closely at it and then release 0.4.0. This clearly seems to move in the right direction so three cheers for that! If you want to brainstorm / chat for a few minutes I am sure we can find some time around the two hours of timezone difference.

@codecov-commenter
Copy link

codecov-commenter commented Dec 5, 2023

Codecov Report

Attention: 7 lines in your changes are missing coverage. Please review.

Comparison is base (38b5201) 88.08% compared to head (8827bdb) 88.08%.

Files Patch % Lines
r/inst/include/nanoarrow/r.h 90.90% 7 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #332      +/-   ##
==========================================
- Coverage   88.08%   88.08%   -0.01%     
==========================================
  Files          72       68       -4     
  Lines       11677    11583      -94     
==========================================
- Hits        10286    10203      -83     
+ Misses       1391     1380      -11     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@eddelbuettel
Copy link
Contributor

FYI I just checked in on this. Rebases nearly perfectly (one edit of a comment) and the clang-format call is simple too.

@paleolimbot
Copy link
Member Author

Thanks for checking in! This will be included in nanoarrow 0.4 (release in early January) but I want to add documentation to the header before merging since it would effectively freeze the R representation of these three structures. I am going to mark it as "experimental": while I don't envision it changing, I do want to leave some time for others to try and comment.

@eddelbuettel
Copy link
Contributor

No problem. My curiosity was triggered when I saw the ❌ there but it is really just clang-format being itself on one line.

@paleolimbot paleolimbot force-pushed the r-linking-to-headers branch from fba765f to 5886751 Compare January 3, 2024 18:39
@eddelbuettel
Copy link
Contributor

It needs a one-liner from clang-format.

@paleolimbot
Copy link
Member Author

@eddelbuettel Can you take a look at this to ensure that what I've described matches your expectations?

/// The "tag" of an external pointer to an ArrowArray must be R_NilValue or an external
/// pointer to an ArrowSchema that may be used to interpret the pointed-to ArrowArray. The
/// "tag" of a nanoarrow external pointer to an ArrowSchema or ArrowArrayStream is
/// reserved for future use and must be R_NilValue.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the above is quite good and helpful, and I wish I would have had that earlier.

Some of it is still a little vague (ie if this is different and distinguishable from Arrow, why is that so and where are some example use cases -- my use of nanoarrow has been to 'be like Arrow' without requiring and I do not seem alone in this).

Let me chew over it some more -- but it looks like a nice step forward. So thanks for that!

@eddelbuettel
Copy link
Contributor

The (arguably very minimal) test of rewriting linesplitter to use nanoarrow/r.h via LinkingTo in lieu of a vendored nanoarrow.hpp passes so that is good.

@eddelbuettel
Copy link
Contributor

eddelbuettel commented Jan 8, 2024

Switched a few more local copied / variants to what nanoarrow/r.h now exposes and it is mostly good. So I'd say you can merge this and we can always refine.

I do have one test function though where I currently do

    auto sxpsch = nanoarrow_schema_owning_xptr();      				// allocates and wraps as XP
    auto sch = nanoarrow_schema_from_xptr(sxpsch);
    ArrowSchemaInitFromType(sch, NANOARROW_TYPE_STRING);

that ends in error with nanoarrow_schema() has already been released. Have not poked much yet.

@paleolimbot
Copy link
Member Author

That's a good point...the schema_from_xptr() is intended for sanitizing input. Output arguments should just use R_ExternalPtrAddr() (or I'll add a wrapper shortly).

@eddelbuettel
Copy link
Contributor

eddelbuettel commented Jan 8, 2024

Hah! These subtle 'documentation bugs' are a wee bit tricky and one of the reasons I <cough, cough> blegged you to chip in with a reference example for something as simple as the linesplitter repo.

I still have high hopes for all this as it moves in the right direction. One goal is to be able to start from a pair of nanoarrow::UniqueArray and nanoarrow::UniqueSchema, operate on them in C++ and then towards the end return them to R via R-specific helpers ... and do the same to Python with Python-specific helpers. I know Python is behind but some small steps would help me convince my colleagues that this is a bet we can make in a mulitlingual repo.

@paleolimbot
Copy link
Member Author

Python is scheduled for this quarter and should have very similar tools to what's here! Notably, the initial version will provide the ability to implement the capsule interface, which is basically what this PR documents somewhat less officially for R ( https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html ).

I agree that the intended pattern is starting with one or more nanoarrow:UniqueXXX objects and use ArrowXXXMove() to move them to caller-allocated output structures.

@paleolimbot paleolimbot merged commit 34b0ca5 into apache:main Jan 8, 2024
13 checks passed
@paleolimbot paleolimbot deleted the r-linking-to-headers branch January 8, 2024 15:11
@paleolimbot paleolimbot added this to the nanoarrow 0.4.0 milestone Jan 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants