
feat(r): Implement dictionary conversion #285

Merged · 25 commits · Aug 23, 2023

Conversation

@paleolimbot (Member) commented Aug 21, 2023

This PR implements conversion from dictionary-encoded arrays to R vectors. To make this happen, it was also necessary to refactor a few pieces of the conversion process so that conversions are easier to write in R. This is also with an eye towards extension type conversion (up next), where extension types are expected to mostly implement conversions as S3 convert_array() methods. Previously this didn't work when converting a stream with multiple batches; now it does (although it will incur S3 dispatch overhead for each batch for conversions that are not handled in C).
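
As a rough sketch of what such a method might look like (not code from this PR, and assuming convert_array() dispatches on the class of the target prototype; the upper_chr class below is made up for illustration):

library(nanoarrow)

# Hypothetical target ptype: a character vector that is upper-cased on conversion
upper_chr <- function() structure(character(), class = c("upper_chr", "character"))

# S3 method; with this refactor it would also be called once per batch when
# converting a stream (hence the per-batch dispatch overhead mentioned above)
convert_array.upper_chr <- function(array, to, ...) {
  toupper(convert_array(array, character()))
}

convert_array(as_nanoarrow_array(c("a", "b")), upper_chr())
# expected: "A" "B"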

Conversion from a dictionary always defaults to the conversion that would happen from the dictionary value type. This is because in Arrow the dictionary is a property of the array, not the type (whereas in R, a factor's levels are a property of the type, sort of). For a dictionary-encoded array that arrives in chunks, the only type-stable way to perform the conversion is by resolving the dictionary into the value type. This also makes it a more predictable default for other dictionary-encoded types (none of which have an R analogue).

In practice this has low memory overhead because the most commonly dictionary-encoded type is a string, and because R has a global string pool; however, it is quite a bit slower than decoding directly to factor().

Like other conversions (and like vctrs::vec_cast()), you can also request a target ptype. This is implemented for factor(). If converting an array stream containing dictionary-encoded values to a factor, the levels need to be specified explicitly.

library(nanoarrow)

dict_array <- as_nanoarrow_array(sample(0:9, 10, replace = TRUE))
dict_array$dictionary <- as_nanoarrow_array(stringr::words[1:10])

# default conversion is always the dictionary (value) type
infer_nanoarrow_ptype(dict_array)
#> character(0)
convert_array(dict_array)
#>  [1] "absolute" "a"        "across"   "achieve"  "accept"   "a"       
#>  [7] "active"   "a"        "a"        "across"

# this makes it type stable when consuming streams
dict1 <- as_nanoarrow_array(factor(letters[1:5]))
dict2 <- as_nanoarrow_array(factor(letters[6:10]))
stream <- basic_array_stream(list(dict1, dict2))
convert_array_stream(stream)
#>  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

# ...and more consistent when dealing with other value types
dict_array$dictionary <- as_nanoarrow_array(arrow::as_arrow_array(lapply(1:10, function(i) 1:5 * i)))
infer_nanoarrow_ptype(dict_array)
#> <list_of<integer>[0]>

# If you want a factor, you can specify it explicitly
dict_array$dictionary <- as_nanoarrow_array(stringr::words[1:10])
convert_array(dict_array, vctrs::partial_factor())
#>  [1] "absolute" "a"        "across"   "achieve"  "accept"   "a"       
#>  [7] "active"   "a"        "a"        "across"
convert_array(dict_array, factor(levels = stringr::words[10:1]))
#>  [1] absolute a        across   achieve  accept   a        active   a       
#>  [9] a        across  
#> Levels: active act across achieve account accept absolute about able a

# Also works for nesting
nested_with_dict <- as_nanoarrow_array(data.frame(x = factor(letters[1:5])))

convert_array(nested_with_dict)
#>   x
#> 1 a
#> 2 b
#> 3 c
#> 4 d
#> 5 e
convert_array(nested_with_dict, vctrs::partial_frame(x = factor()))
#>   x
#> 1 a
#> 2 b
#> 3 c
#> 4 d
#> 5 e
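
# (Sketch, not in the original reprex) the two printed data frames look the
# same, but the second conversion should return x as a factor; str() would
# show the difference
str(convert_array(nested_with_dict, vctrs::partial_frame(x = factor())))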

It's hard to benchmark this because ALTREP defers the effect of some conversions, but the performance characteristics seem vaguely similar to what happens in Arrow (Arrow is probably faster when there are many chunks):

library(nanoarrow)

num_levels <- 1e1
vec_len <- 1e6

vec <- nanoarrow:::vec_gen(character(), n = num_levels)
vec_chr <- do.call(c, rep(list(vec), vec_len / num_levels))
vec_chr <- nanoarrow:::vec_shuffle(vec_chr)
vec_fct <- factor(vec_chr)
levels <- levels(vec_fct)

array_chr <- as_nanoarrow_array(vec_chr)
array_fct <- as_nanoarrow_array(vec_fct)

arrow_chr <- arrow::as_arrow_array(array_chr)
arrow_fct <- arrow::as_arrow_array(array_fct)

bench::mark(
  convert_array(array_chr),
  as.vector(arrow_chr)
)
#> # A tibble: 2 × 6
#>   expression                    min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>               <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 convert_array(array_chr)    2.3µs   2.71µs   328706.   23.95KB     65.8
#> 2 as.vector(arrow_chr)       4.26µs    4.8µs   191042.    7.87KB     38.2

bench::mark(
  convert_array(array_fct),
  convert_array(array_fct, factor()),
  convert_array(array_fct, factor(levels = levels)),
  as.vector(arrow_fct$cast(arrow::string())),
  as.vector(arrow_fct),
  check = FALSE
)
#> # A tibble: 5 × 6
#>   expression                                             min   median `itr/sec`
#>   <bch:expr>                                        <bch:tm> <bch:tm>     <dbl>
#> 1 convert_array(array_fct)                           40.55ms  41.06ms      24.4
#> 2 convert_array(array_fct, factor())                  2.97ms   3.18ms     306. 
#> 3 convert_array(array_fct, factor(levels = levels))   2.83ms   3.15ms     314. 
#> 4 as.vector(arrow_fct$cast(arrow::string()))         48.35ms  49.18ms      20.2
#> 5 as.vector(arrow_fct)                                6.19µs   7.22µs  110825. 
#> # ℹ 2 more variables: mem_alloc <bch:byt>, `gc/sec` <dbl>

# Mitigate ALTREP
bench::mark(
  convert_array(array_fct)[1e6:1],
  convert_array(array_fct, factor())[1e6:1],
  convert_array(array_fct, factor(levels = levels))[1e6:1],
  as.vector(arrow_fct$cast(arrow::string()))[1e6:1],
  as.vector(arrow_fct)[1e6:1],
  check = FALSE
)
#> # A tibble: 5 × 6
#>   expression                                                      min   median
#>   <bch:expr>                                                 <bch:tm> <bch:tm>
#> 1 convert_array(array_fct)[1e+06:1]                           45.87ms  45.87ms
#> 2 convert_array(array_fct, factor())[1e+06:1]                  5.42ms   5.78ms
#> 3 convert_array(array_fct, factor(levels = levels))[1e+06:1]   5.39ms   5.79ms
#> 4 as.vector(arrow_fct$cast(arrow::string()))[1e+06:1]         92.85ms  94.93ms
#> 5 as.vector(arrow_fct)[1e+06:1]                               15.42ms  15.76ms
#> # ℹ 3 more variables: `itr/sec` <dbl>, mem_alloc <bch:byt>, `gc/sec` <dbl>

Created on 2023-08-23 with reprex v2.0.2

@codecov-commenter commented Aug 21, 2023

Codecov Report

Merging #285 (7b54981) into main (7fecddf) will decrease coverage by 0.20%.
Report is 1 commit behind head on main.
The diff coverage is 83.96%.

@@            Coverage Diff             @@
##             main     #285      +/-   ##
==========================================
- Coverage   87.12%   86.93%   -0.20%     
==========================================
  Files          66       63       -3     
  Lines       10246     9965     -281     
==========================================
- Hits         8927     8663     -264     
+ Misses       1319     1302      -17     
Files Changed                     Coverage             Δ
r/src/convert.c                    92.41% <ø>          (-0.77%) ⬇️
r/src/materialize_blob.h           87.50% <0.00%>      (ø)
r/src/materialize_date.h           62.50% <0.00%>      (ø)
r/R/infer-ptype.R                  97.67% <50.00%>     (-2.33%) ⬇️
r/src/materialize_unspecified.h    94.44% <50.00%>     (-5.56%) ⬇️
r/src/materialize.c                85.29% <79.31%>     (-1.81%) ⬇️
r/R/convert-array.R                91.54% <87.80%>     (-5.96%) ⬇️
r/src/convert_array.c              89.56% <100.00%>    (-2.40%) ⬇️
r/src/materialize_chr.h           100.00% <100.00%>    (ø)
r/src/materialize_dbl.h           100.00% <100.00%>    (+2.00%) ⬆️
... and 2 more

... and 5 files with indirect coverage changes

@paleolimbot marked this pull request as ready for review on August 23, 2023 14:05

@krlmlr (Contributor) left a comment

This looks very clean to me. I have only skimmed through the C changes; the protection looks sane.

@paleolimbot (Member, Author) commented

Thank you for the review!
