feat(r): Implement extension type registration/conversion #288
Conversation
Codecov Report
@@ Coverage Diff @@
## main #288 +/- ##
==========================================
- Coverage 86.97% 86.79% -0.19%
==========================================
Files 66 64 -2
Lines 10322 9981 -341
==========================================
- Hits 8978 8663 -315
+ Misses 1344 1318 -26
... and 3 files with indirect coverage changes
…ert_array() everywhere.
@krlmlr I hate to ask for your time again; however, this PR would definitely benefit from your 👀 if you have the bandwidth to take a look!
Thanks, very exciting!
I only skimmed over it. I wonder if it's worth splitting out some of the refactorings and merging them individually.
schema <- .Call(nanoarrow_c_infer_schema_array, array)
parsed <- .Call(nanoarrow_c_schema_parse, schema)
Since this seems to throw away `schema`, perhaps offer a C entry point that combines both functions?
This is a good observation that exposes a larger design flaw in the current system: at the point that `convert_array()` is called, we pretty much need the parsed version of a schema (i.e., `ArrowSchemaView`) and the parsed version of the array (i.e., `ArrowArrayView`). I wrote the Python bindings to handle this from the ground up, but here there are a number of hacks like this one (and more than a few in C). I'm going to try to fix all of those in a more systematic way (but don't quite have the time to do it yet).
r/R/convert-array.R
Outdated
if (is.data.frame(values)) {
  return(values[indices + 1L, , drop = FALSE])
} else {
  return(values[indices + 1L])
}
This is `vec_slice()`:
- if (is.data.frame(values)) {
-   return(values[indices + 1L, , drop = FALSE])
- } else {
-   return(values[indices + 1L])
- }
+ return(vec_slice(values, indices + 1L))
This also exposes a larger design question...currently, nanoarrow does not require vctrs for most data types (it does require it for converting lists-of-things into R vectors, and I think this is a good thing). I will extract a utility function so that it is more obvious what this does and easier to switch if we take on a vctrs dependency.
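A minimal sketch of what such a utility could look like (the name `slice_indices` is hypothetical and not part of nanoarrow); it mirrors the data-frame-vs.-vector branch above without taking on a vctrs dependency:

```r
# Hypothetical helper (not in nanoarrow): slice a vector or data frame by a
# 1-based integer index vector, covering the common cases that
# vctrs::vec_slice() would handle, without requiring vctrs.
slice_indices <- function(values, i) {
  if (is.data.frame(values)) {
    values[i, , drop = FALSE]
  } else {
    values[i]
  }
}

slice_indices(letters[1:5], c(1L, 3L))  # "a" "c"
```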
spec <- resolve_nanoarrow_extension(parsed$extension_name)
  return(convert_array_extension(spec, array, to, ...))
}

# Handle default dictionary conversion since it's the same for all types
dictionary <- array$dictionary

if (!is.null(dictionary)) {
  values <- .Call(nanoarrow_c_convert_array, dictionary, to)
  array$dictionary <- NULL
  indices <- .Call(nanoarrow_c_convert_array, array, integer())
Perhaps enable this function to add an offset directly, to avoid the `+ 1L` below?
That is definitely a good optimization...ideally there would be an offset + scale for all numerics (scale is used for timestamp and difftime conversion). I would like to move the conversion process toward something like `get_converter(array, to)` returning something that could be constructed like `int_vector_converter(offset = 1)`. I'm currently trying to (for better or possibly worse) provide full type coverage (and associated test coverage), at the expense of speed for the less frequently used types.
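A rough sketch of that converter-factory idea (all names here are hypothetical; neither `get_converter()` nor `int_vector_converter()` exists in nanoarrow as of this PR):

```r
# Hypothetical converter factory: returns a closure that applies an offset
# (and optionally a scale) during conversion, so the shift from 0-based
# Arrow indices to 1-based R indices happens in a single pass.
int_vector_converter <- function(offset = 0L, scale = 1L) {
  function(x) as.integer(x * scale + offset)
}

conv <- int_vector_converter(offset = 1L)
conv(c(0L, 2L, 4L))  # 1 3 5
```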
  }
  break;
default:
  Rf_error("Unhandled SEXP type in copy_vec_into()");
`CPLXSXP`?
posixlt = as.POSIXlt("2000-01-01 12:23", tz = "UTC"),
date = as.Date("2000-01-01"),
difftime = as.difftime(123, units = "secs"),
data_frame_simple = tibble::tibble(x = 1:5),
Do you support packed columns (structs) here too?
Yes (I think what you're referring to is tested on the next line?)
data_frame_nested = tibble::tibble(x = 1:5, y = tibble::tibble(z = letters[1:5]))
)

for (nm in names(vectors)) {
Experience from DBItest has taught me that you'll be much better off writing multiple `test_that(...)` blocks, one per element. Why not create a small script that generates this code for you? Canonical reference: https://mtlynch.io/good-developers-bad-tests/ .
It's a good point...I wish testthat had better support for parameterized tests or `SCOPED_TRACE()` like in GTest. I will see if there is any way to restructure this to improve reporting.
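One way to restructure along those lines (a sketch; the expectation inside the loop is a placeholder for whatever the real roundtrip check does):

```r
library(testthat)

# Generate one test_that() block per element so each case is reported (and
# can fail) independently, instead of looping inside a single test block.
vectors <- list(
  date = as.Date("2000-01-01"),
  difftime = as.difftime(123, units = "secs")
)

for (nm in names(vectors)) {
  test_that(paste0("conversion roundtrip works for ", nm), {
    x <- vectors[[nm]]
    expect_identical(x, x)  # placeholder for the real roundtrip expectation
  })
}
```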
stream <- basic_array_stream(list(array, array))
expect_identical(convert_array_stream(stream), vctrs::vec_rep(vctr, 2))

if (requireNamespace("arrow", quietly = TRUE)) {
This will also allow you to use `skip_if_not_installed("arrow")` here.
)
on.exit(unregister_nanoarrow_extension("some_ext"))

infer_nanoarrow_ptype_extension.some_spec_class <- function(spec, x, ...) {
There's `rlang::local_bindings()`.
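For reference, a small illustration of how `rlang::local_bindings()` scopes a temporary binding (shown here with a plain variable rather than the S3 method above):

```r
# local_bindings() installs a binding and restores the previous value when
# the calling frame exits -- an alternative to manual assignment + on.exit().
x <- 1

f <- function() {
  rlang::local_bindings(x = 2, .env = globalenv())
  get("x", envir = globalenv())  # 2 while f() is running
}

f()  # returns 2
x    # back to 1 after f() exits
```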
Co-authored-by: Kirill Müller <[email protected]>
Thank you!
This PR implements extension type registration, enabling nanoarrow or other R packages to register conversions to and from R vectors. It also implements the "vctrs" extension type, which is also implemented in the Arrow R package: this extension type allows any R vector whose storage type is supported by nanoarrow to be roundtripped through an Arrow array. It's also a good test for the extension type mechanism since it has diverse storage and R vector type requirements.
This required some changes to the conversion system: when calling `as_nanoarrow_array(x, type = some_extension_type())`, the extension needs to have the first chance at handling the conversion. Similarly, when calling `convert_array(some_extension_array)`, it needs to have a chance to perform the conversion. The conversion process in both directions has become sufficiently complicated that I think it may need some refactoring; however, I think the intended logic and test coverage are there in this PR.

Created on 2023-08-29 with reprex v2.0.2