Add type = "parquet" #729

Merged
merged 6 commits into from
Mar 6, 2023
Changes from 4 commits
2 changes: 2 additions & 0 deletions NEWS.md
@@ -4,6 +4,8 @@

* `board_s3()` now uses pagination for listing and versioning (#719, @mzorko).

* Added `type = "parquet"` to read and write Parquet files (#729).

# pins 1.1.0

## Breaking changes
21 changes: 17 additions & 4 deletions R/pin-read-write.R
@@ -56,9 +56,9 @@ pin_read <- function(board, name, version = NULL, hash = NULL, ...) {
#' When retrieving the pin, this will be stored in the `user` key, to
#' avoid potential clashes with the metadata that pins itself uses.
#' @param type File type used to save `x` to disk. Must be one of
#' "csv", "json", "rds", "arrow", or "qs". If not supplied, will use JSON for
#' bare lists and RDS for everything else. Be aware that CSV and JSON are
#' plain text formats, while RDS, Arrow, and
#' "csv", "json", "rds", "parquet", "arrow", or "qs". If not supplied, will
#' use JSON for bare lists and RDS for everything else. Be aware that CSV and
#' JSON are plain text formats, while RDS, Parquet, Arrow, and
#' [qs](https://CRAN.R-project.org/package=qs) are binary formats.
#' @param versioned Should the pin be versioned? The default, `NULL`, will
#' use the default for `board`
@@ -133,6 +133,7 @@ object_write <- function(x, path, type = "rds") {
switch(type,
rds = write_rds(x, path),
json = jsonlite::write_json(x, path, auto_unbox = TRUE),
parquet = write_parquet(x, path),
arrow = write_arrow(x, path),
pickle = abort("'pickle' pins not supported in R"),
joblib = abort("'joblib' pins not supported in R"),
@@ -168,13 +169,19 @@ write_qs <- function(x, path) {
invisible(path)
}

write_parquet <- function(x, path) {
check_installed("arrow")
arrow::write_parquet(x, path)
invisible(path)
}

write_arrow <- function(x, path) {
check_installed("arrow")
arrow::write_feather(x, path)
invisible(path)
}

object_types <- c("rds", "json", "arrow", "pickle", "csv", "qs", "file")
object_types <- c("rds", "json", "parquet", "arrow", "pickle", "csv", "qs", "file")

object_read <- function(meta) {
path <- fs::path(meta$local$dir, meta$file)
@@ -189,6 +196,7 @@ object_read <- function(meta) {
switch(type,
rds = readRDS(path),
json = jsonlite::read_json(path, simplifyVector = TRUE),
parquet = read_parquet(path),
arrow = read_arrow(path),
pickle = abort("'pickle' pins not supported in R"),
joblib = abort("'joblib' pins not supported in R"),
@@ -217,6 +225,11 @@ read_qs <- function(path) {
qs::qread(path, strict = TRUE)
}

read_parquet <- function(path) {
check_installed("arrow")
arrow::read_parquet(path)
}

read_arrow <- function(path) {
check_installed("arrow")
arrow::read_feather(path)
2 changes: 1 addition & 1 deletion README.Rmd
@@ -70,7 +70,7 @@ It takes three arguments: the board to pin to, an object, and a name:
board %>% pin_write(head(mtcars), "mtcars")
```

As you can see, the data is saved as an `.rds` file by default, but depending on what you're saving and who else you want to read it, you might use the `type` argument to instead save it as a `csv`, `json`, or `arrow` file.
As you can see, the data is saved as an `.rds` file by default, but depending on what you're saving and who else you want to read it, you might use the `type` argument to instead save it as a Parquet, Arrow, CSV, or JSON file.

You can later retrieve the pinned data with `pin_read()`:

7 changes: 4 additions & 3 deletions README.md
@@ -61,7 +61,7 @@ board <- board_temp()
board
#> Pin board <pins_board_folder>
#> Path:
#> '/var/folders/hv/hzsmmyk9393_m7q3nscx1slc0000gn/T/RtmpTxyyP1/pins-114c073a9ddd2'
#> '/var/folders/hv/hzsmmyk9393_m7q3nscx1slc0000gn/T/RtmpwGre3p/pins-15a8b4f3f602c'
#> Cache size: 0
```

@@ -71,13 +71,14 @@ arguments: the board to pin to, an object, and a name:
``` r
board %>% pin_write(head(mtcars), "mtcars")
#> Guessing `type = 'rds'`
#> Creating new version '20230223T220424Z-a800d'
#> Creating new version '20230303T233508Z-a800d'
#> Writing to pin 'mtcars'
```

As you can see, the data is saved as an `.rds` file by default, but depending on
what you’re saving and who else you want to read it, you might use the
`type` argument to instead save it as a `csv`, `json`, or `arrow` file.
`type` argument to instead save it as a Parquet, Arrow, CSV, or JSON
file.

You can later retrieve the pinned data with `pin_read()`:

6 changes: 3 additions & 3 deletions man/pin_read.Rd

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion tests/testthat/_snaps/pin-read-write.md
@@ -23,7 +23,7 @@
pin_write(board, mtcars, name = "mtcars", type = "froopy-loops")
Condition
Error in `object_write()`:
! `type` must be one of "rds", "json", "arrow", "pickle", "csv", or "qs", not "froopy-loops".
! `type` must be one of "rds", "json", "parquet", "arrow", "pickle", "csv", or "qs", not "froopy-loops".
Code
pin_write(board, mtcars, name = "mtcars", metadata = 1)
Condition
9 changes: 6 additions & 3 deletions tests/testthat/test-pin-read-write.R
@@ -8,13 +8,16 @@ test_that("can round trip all types", {
pin_write(board, df, "df-1", type = "rds")
expect_equal(pin_read(board, "df-1"), df)

pin_write(board, df, "df-2", type = "arrow")
pin_write(board, df, "df-2", type = "parquet")
expect_equal(pin_read(board, "df-2"), df)

pin_write(board, df, "df-3", type = "csv")
pin_write(board, df, "df-3", type = "arrow")
expect_equal(pin_read(board, "df-3"), df)

pin_write(board, df, "df-4", type = "csv")
expect_equal(pin_read(board, "df-4"), df)

pin_write(board, df, "df-4", type = "qs")
pin_write(board, df, "df-5", type = "qs")
expect_equal(pin_read(board, "df-5"), df)

# List
3 changes: 2 additions & 1 deletion vignettes/pins.Rmd
@@ -63,7 +63,8 @@ But you can choose another option depending on your goals:

- `type = "rds"` uses `saveRDS()` to create a binary R data file. It can save any R object but it's only readable from R, not other languages.
- `type = "csv"` uses `write.csv()` to create a `.csv` file. CSVs can be read by any application, but only support simple columns (e.g. numbers, strings, dates), can take up a lot of disk space, and can be slow to read.
- `type = "arrow"` uses `arrow::write_feather()` to create an arrow/feather file. [Arrow](https://arrow.apache.org) is a modern, language-independent, high-performance file format designed for data science. Not every tool can read arrow files, but support is growing rapidly.
- `type = "parquet"` uses `arrow::write_parquet()` to create a Parquet file. [Parquet](https://parquet.apache.org/) is a modern, language-independent, column-oriented file format for efficient data storage and retrieval. Parquet is a storage format used with [Arrow](https://arrow.apache.org), an in-memory columnar format.
- `type = "arrow"` uses `arrow::write_feather()` to create an Arrow/Feather file. Read the [FAQs from the Arrow project](https://arrow.apache.org/faq/) for more on the differences between Arrow and Parquet as file formats.
Member Author

I updated the language around Parquet and Arrow in the main vignette. Any suggestions for this?

Member

I'd soften the language for arrow further (or just remove the second sentence); given the current position on that website, I don't think you'd ever want to use the arrow on-disk format.

- `type = "json"` uses `jsonlite::write_json()` to create a `.json` file. Pretty much every programming language can read json files, but they only work well for nested lists.
- `type = "qs"` uses `qs::qsave()` to create a binary R data file, like `saveRDS()`. This format achieves faster read/write speeds than RDS, and compresses data more efficiently, making it a good choice for larger objects. Read more on the [qs package](https://github.com/traversc/qs).
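
As a rough illustration of the new option in the list above, the sketch below writes a small data frame as a Parquet pin and reads it back. It assumes a version of pins that includes this change and that the arrow package is installed; the pin name `example-parquet` and the data frame are made up for the example.

```r
library(pins)

# A temporary board, as used in the README; it is removed when the session ends
board <- board_temp()

# A small illustrative data frame (hypothetical data)
df_in <- data.frame(x = 1:3, y = c("a", "b", "c"))

# Write the data frame as a Parquet file; this needs the arrow package
pin_write(board, df_in, "example-parquet", type = "parquet")

# Read it back; pins picks the reader from the type stored in the pin metadata
df_out <- pin_read(board, "example-parquet")

# The recorded file type can be inspected through the pin metadata
pin_meta(board, "example-parquet")$type
```

Because the type is stored with the pin, readers do not need to pass `type` to `pin_read()`; they only need arrow installed.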
