Add type = "parquet" (#729)
* Add `type = "parquet"`

* Update tests

* Update vignette/README, plus redocument

* Update NEWS

* Update vignettes/pins.Rmd

Co-authored-by: Hadley Wickham <[email protected]>

* Update advice on `type`

---------

Co-authored-by: Hadley Wickham <[email protected]>
juliasilge and hadley authored Mar 6, 2023
1 parent 36e86d5 commit 3406105
Showing 8 changed files with 39 additions and 19 deletions.
2 changes: 2 additions & 0 deletions NEWS.md
@@ -4,6 +4,8 @@

* `board_s3()` now uses pagination for listing and versioning (#719, @mzorko).

* Added `type = "parquet"` to read and write Parquet files (#729).

# pins 1.1.0

## Breaking changes
21 changes: 17 additions & 4 deletions R/pin-read-write.R
@@ -56,9 +56,9 @@ pin_read <- function(board, name, version = NULL, hash = NULL, ...) {
#' When retrieving the pin, this will be stored in the `user` key, to
#' avoid potential clashes with the metadata that pins itself uses.
#' @param type File type used to save `x` to disk. Must be one of
#' "csv", "json", "rds", "arrow", or "qs". If not supplied, will use JSON for
#' bare lists and RDS for everything else. Be aware that CSV and JSON are
#' plain text formats, while RDS, Arrow, and
#' "csv", "json", "rds", "parquet", "arrow", or "qs". If not supplied, will
#' use JSON for bare lists and RDS for everything else. Be aware that CSV and
#' JSON are plain text formats, while RDS, Parquet, Arrow, and
#' [qs](https://CRAN.R-project.org/package=qs) are binary formats.
#' @param versioned Should the pin be versioned? The default, `NULL`, will
#' use the default for `board`
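The default described in the `type` documentation (JSON for bare lists, RDS for everything else) can be sketched in base R. The helper name `guess_type_sketch()` is hypothetical and this is not the pins implementation; it only illustrates the documented rule, assuming a "bare list" means a list without a class attribute.

```r
# Hypothetical helper, not the pins implementation: pick a default type
# following the documented rule (JSON for bare lists, RDS otherwise).
guess_type_sketch <- function(x) {
  # A "bare" list here means a list with no class attribute. Data frames
  # are lists too, but they carry a class, so they fall through to "rds".
  if (is.list(x) && !is.object(x)) "json" else "rds"
}

guess_type_sketch(list(a = 1, b = "x"))
guess_type_sketch(head(mtcars))
guess_type_sketch(1:10)
```

Passing `type` explicitly, for example `type = "parquet"`, bypasses any guessing of this kind.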
@@ -133,6 +133,7 @@ object_write <- function(x, path, type = "rds") {
switch(type,
rds = write_rds(x, path),
json = jsonlite::write_json(x, path, auto_unbox = TRUE),
parquet = write_parquet(x, path),
arrow = write_arrow(x, path),
pickle = abort("'pickle' pins not supported in R"),
joblib = abort("'joblib' pins not supported in R"),
@@ -168,13 +169,19 @@ write_qs <- function(x, path) {
invisible(path)
}

write_parquet <- function(x, path) {
check_installed("arrow")
arrow::write_parquet(x, path)
invisible(path)
}

write_arrow <- function(x, path) {
check_installed("arrow")
arrow::write_feather(x, path)
invisible(path)
}

object_types <- c("rds", "json", "arrow", "pickle", "csv", "qs", "file")
object_types <- c("rds", "json", "parquet", "arrow", "pickle", "csv", "qs", "file")

object_read <- function(meta) {
path <- fs::path(meta$local$dir, meta$file)
@@ -189,6 +196,7 @@ object_read <- function(meta) {
switch(type,
rds = readRDS(path),
json = jsonlite::read_json(path, simplifyVector = TRUE),
parquet = read_parquet(path),
arrow = read_arrow(path),
pickle = abort("'pickle' pins not supported in R"),
joblib = abort("'joblib' pins not supported in R"),
@@ -217,6 +225,11 @@ read_qs <- function(path) {
qs::qread(path, strict = TRUE)
}

read_parquet <- function(path) {
check_installed("arrow")
arrow::read_parquet(path)
}

read_arrow <- function(path) {
check_installed("arrow")
arrow::read_feather(path)
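The changes to this file follow one pattern: a `switch()` on `type` that dispatches to a small wrapper, with the wrapper checking for an optional package before calling into it. A minimal, self-contained sketch of that pattern follows. The names `write_sketch()` and `read_sketch()` are hypothetical, and `requireNamespace()` stands in for the `check_installed()` call used in the diff.

```r
# Hypothetical names throughout; this mirrors the shape of object_write() /
# object_read() in the diff but is not the pins code.
write_sketch <- function(x, path, type = "rds") {
  switch(type,
    rds = saveRDS(x, path),
    csv = utils::write.csv(x, path, row.names = FALSE),
    parquet = {
      # arrow is optional, so fail with a clear message if it is missing
      if (!requireNamespace("arrow", quietly = TRUE)) {
        stop("Package 'arrow' is required for type = \"parquet\"")
      }
      arrow::write_parquet(x, path)
    },
    # Unnamed final argument is the fallback for unknown types
    stop("Unsupported type: ", type)
  )
  invisible(path)
}

read_sketch <- function(path, type = "rds") {
  switch(type,
    rds = readRDS(path),
    csv = utils::read.csv(path),
    parquet = {
      if (!requireNamespace("arrow", quietly = TRUE)) {
        stop("Package 'arrow' is required for type = \"parquet\"")
      }
      arrow::read_parquet(path)
    },
    stop("Unsupported type: ", type)
  )
}

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
path <- tempfile(fileext = ".rds")
write_sketch(df, path, type = "rds")
read_sketch(path, type = "rds")
```

Keeping the package check inside each wrapper means the optional dependency is only needed when that particular `type` is actually requested.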
2 changes: 1 addition & 1 deletion README.Rmd
@@ -70,7 +70,7 @@ It takes three arguments: the board to pin to, an object, and a name:
board %>% pin_write(head(mtcars), "mtcars")
```

As you can see, the data is saved as an `.rds` file by default, but depending on what you're saving and who else you want to read it, you might use the `type` argument to instead save it as a `csv`, `json`, or `arrow` file.
As you can see, the data is saved as an `.rds` file by default, but depending on what you're saving and who else you want to read it, you might use the `type` argument to instead save it as a Parquet, Arrow, CSV, or JSON file.

You can later retrieve the pinned data with `pin_read()`:

7 changes: 4 additions & 3 deletions README.md
@@ -61,7 +61,7 @@ board <- board_temp()
board
#> Pin board <pins_board_folder>
#> Path:
#> '/var/folders/hv/hzsmmyk9393_m7q3nscx1slc0000gn/T/RtmpTxyyP1/pins-114c073a9ddd2'
#> '/var/folders/hv/hzsmmyk9393_m7q3nscx1slc0000gn/T/RtmpwGre3p/pins-15a8b4f3f602c'
#> Cache size: 0
```

@@ -71,13 +71,14 @@ arguments: the board to pin to, an object, and a name:
``` r
board %>% pin_write(head(mtcars), "mtcars")
#> Guessing `type = 'rds'`
#> Creating new version '20230223T220424Z-a800d'
#> Creating new version '20230303T233508Z-a800d'
#> Writing to pin 'mtcars'
```

As you can see, the data is saved as an `.rds` file by default, but depending on
what you’re saving and who else you want to read it, you might use the
`type` argument to instead save it as a `csv`, `json`, or `arrow` file.
`type` argument to instead save it as a Parquet, Arrow, CSV, or JSON
file.
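As a rough illustration of the `type` argument described here, the following sketch pins a small data frame as Parquet and reads it back. It assumes a version of pins that includes this change and that the arrow package is installed; the pin name `"example-parquet"` is made up for the example.

```r
# Assumes pins (with Parquet support) and arrow are installed;
# the pin name "example-parquet" is made up for this sketch.
if (requireNamespace("pins", quietly = TRUE) &&
    requireNamespace("arrow", quietly = TRUE)) {
  board <- pins::board_temp()
  df <- data.frame(x = 1:3, y = c("a", "b", "c"))

  # An explicit type skips the default of saving as RDS
  pins::pin_write(board, df, "example-parquet", type = "parquet")
  parquet_pin <- pins::pin_read(board, "example-parquet")
  parquet_pin
}
```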

You can later retrieve the pinned data with `pin_read()`:

6 changes: 3 additions & 3 deletions man/pin_read.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion tests/testthat/_snaps/pin-read-write.md
@@ -23,7 +23,7 @@
pin_write(board, mtcars, name = "mtcars", type = "froopy-loops")
Condition
Error in `object_write()`:
! `type` must be one of "rds", "json", "arrow", "pickle", "csv", or "qs", not "froopy-loops".
! `type` must be one of "rds", "json", "parquet", "arrow", "pickle", "csv", or "qs", not "froopy-loops".
Code
pin_write(board, mtcars, name = "mtcars", metadata = 1)
Condition
9 changes: 6 additions & 3 deletions tests/testthat/test-pin-read-write.R
@@ -8,13 +8,16 @@ test_that("can round trip all types", {
pin_write(board, df, "df-1", type = "rds")
expect_equal(pin_read(board, "df-1"), df)

pin_write(board, df, "df-2", type = "arrow")
pin_write(board, df, "df-2", type = "parquet")
expect_equal(pin_read(board, "df-2"), df)

pin_write(board, df, "df-3", type = "csv")
pin_write(board, df, "df-3", type = "arrow")
expect_equal(pin_read(board, "df-2"), df)

pin_write(board, df, "df-4", type = "csv")
expect_equal(pin_read(board, "df-3"), df)

pin_write(board, df, "df-4", type = "qs")
pin_write(board, df, "df-5", type = "qs")
expect_equal(pin_read(board, "df-4"), df)

# List
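A round-trip check is easiest to follow when each read targets the pin that was just written. The sketch below derives the pin name from the type so the two always match. It is not the package's test code: it uses base R comparisons rather than testthat, the `df-<type>` names are made up, and it assumes a pins version that includes this change.

```r
# Sketch: write and read each type under its own pin name, so every
# read checks the pin that was just written.
if (requireNamespace("pins", quietly = TRUE)) {
  board <- pins::board_temp()
  df <- data.frame(x = 1:3, y = c("a", "b", "c"))

  types <- c("rds", "csv")
  # Parquet and Arrow both go through the arrow package
  if (requireNamespace("arrow", quietly = TRUE)) {
    types <- c(types, "parquet", "arrow")
  }

  round_trips <- vapply(types, function(type) {
    name <- paste0("df-", type)
    pins::pin_write(board, df, name, type = type)
    # Some readers return a tibble, so normalise before comparing
    out <- as.data.frame(pins::pin_read(board, name))
    isTRUE(all.equal(out, df))
  }, logical(1))
  round_trips
}
```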
9 changes: 5 additions & 4 deletions vignettes/pins.Rmd
@@ -61,10 +61,11 @@ The only rule for a pin name is that it can't contain slashes.
As you can see from the output, pins has chosen to save this data to an `.rds` file.
But you can choose another option depending on your goals:

- `type = "rds"` uses `saveRDS()` to create a binary R data file. It can save any R object, but it's only readable from R, not other languages.
- `type = "csv"` uses `write.csv()` to create a `.csv` file. CSVs can be read by any application, but only support simple columns (e.g. numbers, strings, dates), can take up a lot of disk space, and can be slow to read.
- `type = "arrow"` uses `arrow::write_feather()` to create an arrow/feather file. [Arrow](https://arrow.apache.org) is a modern, language-independent, high-performance file format designed for data science. Not every tool can read arrow files, but support is growing rapidly.
- `type = "json"` uses `jsonlite::write_json()` to create a `.json` file. Pretty much every programming language can read JSON files, but they only work well for nested lists.
- `type = "rds"` uses `saveRDS()` to create a binary R data file. It can save any R object (including trained models), but it's only readable from R, not other languages.
- `type = "csv"` uses `write.csv()` to create a CSV file. CSVs are plain text and can be read easily by many applications, but they only support simple columns (e.g. numbers, strings), can take up a lot of disk space, and can be slow to read.
- `type = "parquet"` uses `arrow::write_parquet()` to create a Parquet file. [Parquet](https://parquet.apache.org/) is a modern, language-independent, column-oriented file format for efficient data storage and retrieval. Parquet is an excellent choice for storing tabular data but requires the [arrow](https://arrow.apache.org/docs/r/) package.
- `type = "arrow"` uses `arrow::write_feather()` to create an Arrow/Feather file.
- `type = "json"` uses `jsonlite::write_json()` to create a JSON file. Pretty much every programming language can read JSON files, but they only work well for nested lists.
- `type = "qs"` uses `qs::qsave()` to create a binary R data file, like `saveRDS()`. This format achieves faster read/write speeds than RDS, and compresses data more efficiently, making it a good choice for larger objects. Read more on the [qs package](https://github.com/traversc/qs).
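
One way to weigh the formats listed above is to write the same data in two of them and compare the file sizes. The sketch below does this for RDS and CSV using base R only; the data frame is invented for the example, and the actual numbers will vary with the data and compression settings.

```r
# Invented example data: 10,000 rows with integer, numeric, and character columns
set.seed(1)
df <- data.frame(
  id = seq_len(1e4),
  value = rnorm(1e4),
  group = sample(letters, 1e4, replace = TRUE)
)

rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

saveRDS(df, rds_path)                       # binary, R-only
write.csv(df, csv_path, row.names = FALSE)  # plain text, widely readable

# File sizes in bytes for the two formats
sizes <- c(rds = file.size(rds_path), csv = file.size(csv_path))
sizes
```

The same comparison can be extended to Parquet or Arrow files with `arrow::write_parquet()` and `arrow::write_feather()` when the arrow package is available.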

After you've pinned an object, you can read it back with `pin_read()`: