feat(r): Implement extension type registration/conversion #288
Conversation
Codecov Report
@@ Coverage Diff @@
## main #288 +/- ##
==========================================
- Coverage 86.97% 86.79% -0.19%
==========================================
Files 66 64 -2
Lines 10322 9981 -341
==========================================
- Hits 8978 8663 -315
+ Misses 1344 1318 -26
... and 3 files with indirect coverage changes
…ert_array() everywhere.
@krlmlr I hate to ask for your time again; however, this PR would definitely benefit from your 👀 if you have the bandwidth to take a look!
Thanks, very exciting!
I only skimmed over it. I wonder if it's worth splitting out some of the refactorings and merging them individually.
schema <- .Call(nanoarrow_c_infer_schema_array, array)
parsed <- .Call(nanoarrow_c_schema_parse, schema)
Since this seems to throw away `schema`, perhaps offer a C entry point that combines both functions?
This is a good observation that exposes a larger design flaw in the current system: at the point that `convert_array()` is called, we pretty much need the parsed version of a schema (i.e., `ArrowSchemaView`) and the parsed version of the array (i.e., `ArrowArrayView`). I wrote the Python bindings to handle this from the ground up, but here there are a number of hacks like this one (and more than a few in C). I'm going to try to fix all of those in a more systematic way (but don't quite have the time to do it yet).
r/R/convert-array.R
Outdated
if (is.data.frame(values)) {
  return(values[indices + 1L, , drop = FALSE])
} else {
  return(values[indices + 1L])
}
This is `vec_slice()`:
- if (is.data.frame(values)) {
-   return(values[indices + 1L, , drop = FALSE])
- } else {
-   return(values[indices + 1L])
- }
+ return(vec_slice(values, indices + 1L))
This also exposes a larger design question...currently, nanoarrow does not require vctrs for most data types (it does require it for converting lists-of-things into R vectors, and I think this is a good thing). I will extract a utility function so that it is more obvious what this does and easier to switch if we take on a vctrs dependency.
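A minimal sketch of what such a utility could look like (the name `slice_indices` is hypothetical and not part of nanoarrow); it mirrors the data-frame-vs.-vector branch above without taking on a vctrs dependency:

```r
# Hypothetical helper (not in nanoarrow): slice a vector or data frame by a
# 1-based integer index vector, covering the common cases that
# vctrs::vec_slice() would handle, without requiring vctrs.
slice_indices <- function(values, i) {
  if (is.data.frame(values)) {
    values[i, , drop = FALSE]
  } else {
    values[i]
  }
}

slice_indices(letters[1:5], c(1L, 3L))  # "a" "c"
```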
spec <- resolve_nanoarrow_extension(parsed$extension_name)
  return(convert_array_extension(spec, array, to, ...))
}

# Handle default dictionary conversion since it's the same for all types
dictionary <- array$dictionary

if (!is.null(dictionary)) {
  values <- .Call(nanoarrow_c_convert_array, dictionary, to)
  array$dictionary <- NULL
  indices <- .Call(nanoarrow_c_convert_array, array, integer())
Perhaps enable this function to add an offset directly, to avoid the `+ 1L` below?
That is definitely a good optimization...ideally there would be an offset + scale for all numerics (scale is used for timestamp and difftime conversion). I would like to move the conversion process toward something like `get_converter(array, to)` returning something that could be constructed like `int_vector_converter(offset = 1)`. I'm currently trying to (for better or possibly worse) provide full type coverage (and associated test coverage), at the expense of speed for the less frequently used types.
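A rough sketch of that converter-factory idea (all names here are hypothetical; neither `get_converter()` nor `int_vector_converter()` exists in nanoarrow as of this PR):

```r
# Hypothetical converter factory: returns a closure that applies an offset
# (and optionally a scale) during conversion, so the shift from 0-based
# Arrow indices to 1-based R indices happens in a single pass.
int_vector_converter <- function(offset = 0L, scale = 1L) {
  function(x) as.integer(x * scale + offset)
}

conv <- int_vector_converter(offset = 1L)
conv(c(0L, 2L, 4L))  # 1 3 5
```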
  }
  break;
default:
  Rf_error("Unhandled SEXP type in copy_vec_into()");
`CPLXSXP`?
posixlt = as.POSIXlt("2000-01-01 12:23", tz = "UTC"),
date = as.Date("2000-01-01"),
difftime = as.difftime(123, units = "secs"),
data_frame_simple = tibble::tibble(x = 1:5),
Do you support packed columns (structs) here too?
Yes (I think what you're referring to is tested on the next line?)
data_frame_nested = tibble::tibble(x = 1:5, y = tibble::tibble(z = letters[1:5]))
)

for (nm in names(vectors)) {
Experience from DBItest has taught me that you'll be much better off writing multiple `test_that(...)` blocks, one per element. Why not create a small script that generates this code for you? Canonical reference: https://mtlynch.io/good-developers-bad-tests/ .
It's a good point...I wish testthat had better support for parameterized tests or `SCOPED_TRACE()` like in GTest. I will see if there is any way to restructure this to improve reporting.
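One way to restructure along those lines (a sketch; the expectation inside the loop is a placeholder for whatever the real roundtrip check does):

```r
library(testthat)

# Generate one test_that() block per element so each case is reported (and
# can fail) independently, instead of looping inside a single test block.
vectors <- list(
  date = as.Date("2000-01-01"),
  difftime = as.difftime(123, units = "secs")
)

for (nm in names(vectors)) {
  test_that(paste0("conversion roundtrip works for ", nm), {
    x <- vectors[[nm]]
    expect_identical(x, x)  # placeholder for the real roundtrip expectation
  })
}
```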
stream <- basic_array_stream(list(array, array))
expect_identical(convert_array_stream(stream), vctrs::vec_rep(vctr, 2))

if (requireNamespace("arrow", quietly = TRUE)) {
This will also allow you to use `skip_if_not_installed("arrow")` here.
)
on.exit(unregister_nanoarrow_extension("some_ext"))

infer_nanoarrow_ptype_extension.some_spec_class <- function(spec, x, ...) {
There's `rlang::local_bindings()`.
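For reference, a small illustration of how `rlang::local_bindings()` scopes a temporary binding (shown here with a plain variable rather than the S3 method above):

```r
# local_bindings() installs a binding and restores the previous value when
# the calling frame exits -- an alternative to manual assignment + on.exit().
x <- 1

f <- function() {
  rlang::local_bindings(x = 2, .env = globalenv())
  get("x", envir = globalenv())  # 2 while f() is running
}

f()  # returns 2
x    # back to 1 after f() exits
```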
Co-authored-by: Kirill Müller <[email protected]>
Thank you!
This PR implements extension type registration, enabling nanoarrow or other R packages to register conversions to and from R vectors. It also implements the "vctrs" extension type, which is also implemented in the Arrow R package: this extension type allows any R vector whose storage type is supported by nanoarrow to be roundtripped through an Arrow array. It's also a good test for the extension type mechanism since it has diverse storage and R vector type requirements.
This required some changes to the conversion system: when calling `as_nanoarrow_array(x, type = some_extension_type())`, the extension needs to have the first chance at handling the conversion. Similarly, when calling `convert_array(some_extension_array)`, it needs to have a chance to perform the conversion. The conversion process in both directions has become sufficiently complicated that I think it may need some refactoring; however, I think the intended logic and test coverage are there in this PR.

Created on 2023-08-29 with reprex v2.0.2