
feat(r): Implement dictionary conversion #285

Merged · 25 commits · Aug 23, 2023

Conversation

@paleolimbot (Member) commented Aug 21, 2023

This PR implements conversion from dictionary-encoded arrays to R vectors. To make this happen, it was also necessary to refactor a few pieces of the conversion process so that conversions are easier to write in R. This is also with an eye towards extension type conversion (up next), where extension types are expected to mostly implement conversions as S3 convert_array() methods. Previously this didn't work when converting a stream with multiple batches; now it does (although it will incur S3 dispatch overhead for each batch for conversions that are not handled in C).
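
As a rough sketch of what such a method might look like (not code from this PR, and assuming convert_array() dispatches on the class of the target prototype; the upper_chr class below is made up for illustration):

library(nanoarrow)

# Hypothetical target ptype: a character vector that is upper-cased on conversion
upper_chr <- function() structure(character(), class = c("upper_chr", "character"))

# S3 method; with this refactor it would also be called once per batch when
# converting a stream (hence the per-batch dispatch overhead mentioned above)
convert_array.upper_chr <- function(array, to, ...) {
  toupper(convert_array(array, character()))
}

convert_array(as_nanoarrow_array(c("a", "b")), upper_chr())
# expected: "A" "B"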

Conversion from a dictionary always defaults to the conversion that would happen from the dictionary value type. This is because in Arrow the dictionary is a property of the array, not the type (whereas in R, a factor's levels are a property of the type, sort of). For a dictionary-encoded array that arrives in chunks, the only type-stable way to perform the conversion is by resolving the dictionary into the value type. This also makes it a more predictable default for other dictionary-encoded types (none of which have an R analogue).

In practice this has low memory overhead because the most commonly dictionary-encoded type is a string, and because R has a global string pool; however, it is quite a bit slower than decoding directly to factor().

Like other conversions (and like vctrs::vec_cast()), you can also request a target ptype. This is implemented for factor(). If converting an array stream containing dictionary-encoded values to a factor, the levels need to be specified explicitly.

library(nanoarrow)

dict_array <- as_nanoarrow_array(sample(0:9, 10, replace = TRUE))
dict_array$dictionary <- as_nanoarrow_array(stringr::words[1:10])

# default conversion is always the dictionary (value) type
infer_nanoarrow_ptype(dict_array)
#> character(0)
convert_array(dict_array)
#>  [1] "absolute" "a"        "across"   "achieve"  "accept"   "a"       
#>  [7] "active"   "a"        "a"        "across"

# this makes it type stable when consuming streams
dict1 <- as_nanoarrow_array(factor(letters[1:5]))
dict2 <- as_nanoarrow_array(factor(letters[6:10]))
stream <- basic_array_stream(list(dict1, dict2))
convert_array_stream(stream)
#>  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

# ...and more consistent when dealing with other value types
dict_array$dictionary <- as_nanoarrow_array(arrow::as_arrow_array(lapply(1:10, function(i) 1:5 * i)))
infer_nanoarrow_ptype(dict_array)
#> <list_of<integer>[0]>

# If you want a factor, you can specify it explicitly
dict_array$dictionary <- as_nanoarrow_array(stringr::words[1:10])
convert_array(dict_array, vctrs::partial_factor())
#>  [1] "absolute" "a"        "across"   "achieve"  "accept"   "a"       
#>  [7] "active"   "a"        "a"        "across"
convert_array(dict_array, factor(levels = stringr::words[10:1]))
#>  [1] absolute a        across   achieve  accept   a        active   a       
#>  [9] a        across  
#> Levels: active act across achieve account accept absolute about able a

# Also works for nesting
nested_with_dict <- as_nanoarrow_array(data.frame(x = factor(letters[1:5])))

convert_array(nested_with_dict)
#>   x
#> 1 a
#> 2 b
#> 3 c
#> 4 d
#> 5 e
convert_array(nested_with_dict, vctrs::partial_frame(x = factor()))
#>   x
#> 1 a
#> 2 b
#> 3 c
#> 4 d
#> 5 e
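
# (Sketch, not in the original reprex) the two printed data frames look the
# same, but the second conversion should return x as a factor; str() would
# show the difference
str(convert_array(nested_with_dict, vctrs::partial_frame(x = factor())))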

It's hard to benchmark this because ALTREP defers the effect of some conversions, but the performance characteristics seem vaguely similar to what happens in Arrow (Arrow is probably faster when there are many chunks):

library(nanoarrow)

num_levels <- 1e1
vec_len <- 1e6

vec <- nanoarrow:::vec_gen(character(), n = num_levels)
vec_chr <- do.call(c, rep(list(vec), vec_len / num_levels))
vec_chr <- nanoarrow:::vec_shuffle(vec_chr)
vec_fct <- factor(vec_chr)
levels <- levels(vec_fct)

array_chr <- as_nanoarrow_array(vec_chr)
array_fct <- as_nanoarrow_array(vec_fct)

arrow_chr <- arrow::as_arrow_array(array_chr)
arrow_fct <- arrow::as_arrow_array(array_fct)

bench::mark(
  convert_array(array_chr),
  as.vector(arrow_chr)
)
#> # A tibble: 2 × 6
#>   expression                    min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>               <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 convert_array(array_chr)    2.3µs   2.71µs   328706.   23.95KB     65.8
#> 2 as.vector(arrow_chr)       4.26µs    4.8µs   191042.    7.87KB     38.2

bench::mark(
  convert_array(array_fct),
  convert_array(array_fct, factor()),
  convert_array(array_fct, factor(levels = levels)),
  as.vector(arrow_fct$cast(arrow::string())),
  as.vector(arrow_fct),
  check = FALSE
)
#> # A tibble: 5 × 6
#>   expression                                             min   median `itr/sec`
#>   <bch:expr>                                        <bch:tm> <bch:tm>     <dbl>
#> 1 convert_array(array_fct)                           40.55ms  41.06ms      24.4
#> 2 convert_array(array_fct, factor())                  2.97ms   3.18ms     306. 
#> 3 convert_array(array_fct, factor(levels = levels))   2.83ms   3.15ms     314. 
#> 4 as.vector(arrow_fct$cast(arrow::string()))         48.35ms  49.18ms      20.2
#> 5 as.vector(arrow_fct)                                6.19µs   7.22µs  110825. 
#> # ℹ 2 more variables: mem_alloc <bch:byt>, `gc/sec` <dbl>

# Mitigate ALTREP
bench::mark(
  convert_array(array_fct)[1e6:1],
  convert_array(array_fct, factor())[1e6:1],
  convert_array(array_fct, factor(levels = levels))[1e6:1],
  as.vector(arrow_fct$cast(arrow::string()))[1e6:1],
  as.vector(arrow_fct)[1e6:1],
  check = FALSE
)
#> # A tibble: 5 × 6
#>   expression                                                      min   median
#>   <bch:expr>                                                 <bch:tm> <bch:tm>
#> 1 convert_array(array_fct)[1e+06:1]                           45.87ms  45.87ms
#> 2 convert_array(array_fct, factor())[1e+06:1]                  5.42ms   5.78ms
#> 3 convert_array(array_fct, factor(levels = levels))[1e+06:1]   5.39ms   5.79ms
#> 4 as.vector(arrow_fct$cast(arrow::string()))[1e+06:1]         92.85ms  94.93ms
#> 5 as.vector(arrow_fct)[1e+06:1]                               15.42ms  15.76ms
#> # ℹ 3 more variables: `itr/sec` <dbl>, mem_alloc <bch:byt>, `gc/sec` <dbl>

Created on 2023-08-23 with reprex v2.0.2

@codecov-commenter commented Aug 21, 2023

Codecov Report

Merging #285 (7b54981) into main (7fecddf) will decrease coverage by 0.20%.
Report is 1 commit behind head on main.
The diff coverage is 83.96%.

@@            Coverage Diff             @@
##             main     #285      +/-   ##
==========================================
- Coverage   87.12%   86.93%   -0.20%     
==========================================
  Files          66       63       -3     
  Lines       10246     9965     -281     
==========================================
- Hits         8927     8663     -264     
+ Misses       1319     1302      -17     
Files Changed                     Coverage             Δ
r/src/convert.c                    92.41% <ø>          (-0.77%) ⬇️
r/src/materialize_blob.h           87.50% <0.00%>      (ø)
r/src/materialize_date.h           62.50% <0.00%>      (ø)
r/R/infer-ptype.R                  97.67% <50.00%>     (-2.33%) ⬇️
r/src/materialize_unspecified.h    94.44% <50.00%>     (-5.56%) ⬇️
r/src/materialize.c                85.29% <79.31%>     (-1.81%) ⬇️
r/R/convert-array.R                91.54% <87.80%>     (-5.96%) ⬇️
r/src/convert_array.c              89.56% <100.00%>    (-2.40%) ⬇️
r/src/materialize_chr.h           100.00% <100.00%>    (ø)
r/src/materialize_dbl.h           100.00% <100.00%>    (+2.00%) ⬆️
... and 2 more

... and 5 files with indirect coverage changes

@paleolimbot marked this pull request as ready for review on August 23, 2023 14:05

@krlmlr (Contributor) left a comment

This looks very clean to me. I have only skimmed through the C changes; the protection looks sane.

@paleolimbot (Member, Author) commented

Thank you for the review!
