Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

could pin_write() support type = "arrow" uncompressed? #631

Closed
ijlyttle opened this issue Aug 9, 2022 · 10 comments · Fixed by #658
Closed

could pin_write() support type = "arrow" uncompressed? #631

ijlyttle opened this issue Aug 9, 2022 · 10 comments · Fixed by #658
Labels
feature a feature request or enhancement

Comments

@ijlyttle
Copy link
Contributor

ijlyttle commented Aug 9, 2022

This may also impact pins for Python, but I thought I would start here.

The motivation for this question is that Arrow for JavaScript does not support compression. I know there's a workaround using pin_upload() and pin_download(), but it seems to add friction for R/Python users.

To be clear, I'm not suggesting that any default behavior change; just to ask about the possibility for something like a compression = FALSE argument somewhere.

Here's a larger writeup describing my motivation, with a first hack on how a BoardUrl could work using JavaScript.

Thanks!

@machow
Copy link
Collaborator

machow commented Aug 9, 2022

Ah, that's too bad about compression in js :/ . Since pins stores metadata for everything, it seems like having a strategy for specifying save/load options there could be useful..

Maybe related in long-term to rstudio/pins-python#18

Right now arguments aren't passed to savers when someone does pin_write, so a quick fix would be adding a new type (e.g. "uncompressed_arrow"?)

@juliasilge
Copy link
Member

The way we handle the dots now from the user-facing function pin_write() passes them through to pin_store(), which means we can do things as outlined in #628 but we can't use the dots for this which happens in object_write().

I wonder if a new "uncompressed_arrow" type is perhaps the best way to go? Or should we try to support the different compress options that RDS has as well when considering this? We could add a new compression or compress argument to pin_write() after the dots without causing problems.

@ijlyttle
Copy link
Contributor Author

ijlyttle commented Aug 10, 2022

Could something like write_args, a list with arguments to pass to the writing-function, work here?
(I'm certain there's a better name)

object_write <- function(x, path, type = "rds", write_args = NULL) {
  type <- arg_match0(type, setdiff(object_types, "file"))

  switch(type,
    arrow = rlang::exec(arrow::write_feather, x, path, !!!write_args) 
    # you get the idea
  )

  path
}

This might present a problem where write_args wants to override an already-named argument, e.g.:

 json = jsonlite::write_json(x, path, auto_unbox = TRUE)

@juliasilge juliasilge added the feature a feature request or enhancement label Aug 10, 2022
@juliasilge
Copy link
Member

That could be a great option, I think. This would still have to go after the dots to avoid a breaking change; the function signature would end up looking like:

pin_write(
  board,
  x,
  name = NULL,
  type = NULL,
  title = NULL,
  description = NULL,
  metadata = NULL,
  versioned = NULL,
  ...
  write_args = list()
)

@ijlyttle
Copy link
Contributor Author

I am on board!

At the risk of being pedantic, I wonder if the argument should be named .write_args, as it comes after the dots - I think this is to help avoid collisions with arg names that could be sent using the dots.

Happy to start on a PR, if you like.

@machow
Copy link
Collaborator

machow commented Aug 11, 2022

I think adding the ability for people to customize how pins are saved will be really useful--but would prefer we separate out the immediate value of supporting uncompressed arrow, from the more high risk / high reward task of how to provide general extension for save / load.

Looking at tools like vetiver, it's clear that how things are saved and how they're loaded are coupled to each other. My concern is that addressing save on its own may work for self-describing formats (e.g. Rds, arrow), but I also expect customizing write to create new surprising behaviors.

If we could empower devs to steward the save/load process (e.g. create a new type that can do w/e it wants with something like write_args), it can help identify who is responsible when extension goes awry (e.g. error hit while using writing "some_extension_type", so we know it is not a pins responsibility to fix).

@ijlyttle
Copy link
Contributor Author

Holding for now :)

@juliasilge
Copy link
Member

@ijlyttle Would you be open to doing a PR where we add a new type, perhaps called "custom" (what could be a better name?) here:

object_types <- c("rds", "json", "arrow", "pickle", "csv", "qs", "file")

We would expect folks to write code like pin_write(board, x, "my-name", .f = arrow::write_feather(compression = "uncompressed")). The argument .f would be new, after the dots.

And then the entry in switch() for the new custom type:

pins-r/R/pin-read-write.R

Lines 126 to 133 in debc945

switch(type,
rds = write_rds(x, path),
json = jsonlite::write_json(x, path, auto_unbox = TRUE),
arrow = arrow::write_feather(x, path),
pickle = abort("'pickle' pins not supported in R"),
csv = utils::write.csv(x, path, row.names = FALSE),
qs = write_qs(x, path)
)

would be something like eval(rlang::call_modify(.f, x, path)). That seems to work pretty well:

library(rlang)

call <- quote(arrow::write_feather(compression = "uncompressed"))
path <- tempfile()
x <- mtcars
new_call <- call_modify(call, x, path)
eval(new_call)
head(arrow::read_feather(path))
#>    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Created on 2022-08-18 with reprex v2.0.2

We maybe should do a bit more careful checking of the call and use something like call_match(new_call, new_call[[1]])? And put a nice error message for users who pass in something that won't work.

What do you think?

@ijlyttle
Copy link
Contributor Author

ijlyttle commented Aug 19, 2022

I like it! And yes, happy to propose a PR!

Just so I don't forget, I'd want to be able to use pin_read() and have it read it like an Arrow file. This may be a bit ambitious, but I figured I'd get the thought out.

@github-actions
Copy link

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Oct 27, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
feature a feature request or enhancement
Projects
None yet
3 participants