difference between duckplyr and dbplyr? #145

Closed
ywhcuhk opened this issue Apr 22, 2024 · 13 comments · Fixed by #191

Comments

@ywhcuhk

ywhcuhk commented Apr 22, 2024

I find it quite convenient to use duckdb as a backend for dplyr (through dbplyr). All you need to do is create a duckdb connection and read the data through one of duckdb's table functions; then you can manipulate the resulting table with dplyr, e.g.:

library(duckdb)
library(dplyr)

con <- dbConnect(duckdb())

# Lazy table backed by DuckDB's read_parquet(); dplyr verbs are translated to SQL by dbplyr
df <- tbl(con, "read_parquet('yourdata.parquet')")

So what can duckplyr do that dbplyr can't?

@krlmlr
Member

krlmlr commented Apr 27, 2024

Thanks, good question.

I've started to add the following content to README.Rmd, but I'm not sure where it really belongs. A new FAQ section, perhaps?

The goal of duckplyr is to provide a fully compatible drop-in replacement for dplyr that uses DuckDB as a backend for fast operation.
DuckDB is an in-process OLAP database management system.
Unlike dbplyr with duckdb, this package aims to support all R operators, functions, and data types, in a fully compatible way.
If dplyr and duckplyr give different results on the same input data, this is considered a bug!

The operators, functions, and data types that currently cannot be translated to duckdb are executed in R.
The set of supported operations is continuously expanding.
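For concreteness, here is a minimal sketch of the drop-in idea (not from the thread; it assumes the as_duckplyr_df() constructor from the duckplyr API of the time, and the toy data frame is made up): the dplyr code stays the same, only the input is wrapped.

library(dplyr)

df <- data.frame(g = c("a", "a", "b"), x = c(1, 2, 3))

# Plain dplyr on a data frame
out_dplyr <- df |>
  summarise(total = sum(x), .by = g)

# Same pipeline, but the input is wrapped so duckplyr can hand the work to DuckDB
# (as_duckplyr_df() is an assumption here, not shown elsewhere in this thread)
out_duckplyr <- df |>
  duckplyr::as_duckplyr_df() |>
  summarise(total = sum(x), .by = g)

# The two results should match exactly; any difference would be considered a bug
all.equal(out_dplyr, as.data.frame(out_duckplyr))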

@krlmlr
Member

krlmlr commented Apr 29, 2024

This high-level blog post is a good intro too: https://posit.co/blog/duckplyr-dplyr-powered-by-duckdb/ .

@ywhcuhk
Author

ywhcuhk commented Apr 30, 2024

Thanks a lot for the clarification. Just to check, when you say:

Unlike dbplyr with duckdb, this package aims to support all R operators, functions, and data types, in a fully compatible way.

Does this mean that when duckplyr runs into a function that is not available in duckdb, the data will be collected and then processed in RAM? One major advantage of using duckdb is its "out-of-memory" capability. If the data gets collected during the query, one could run into memory issues.

@krlmlr
Member

krlmlr commented Apr 30, 2024

True. You can set the DUCKPLYR_FORCE environment variable to "TRUE" to avoid this, but many operations will be unavailable then.
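For reference, the variable can be set from the R session, e.g.:

# Opt in to DuckDB-only execution (variable name taken from the comment above);
# operations that cannot be translated will then fail instead of silently
# falling back to R data frames
Sys.setenv(DUCKPLYR_FORCE = "TRUE")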

@JosiahParry

One major advantage of using duckdb is the "out-of-memory" capabilities. If the data gets collected during the query, then one could run into the memory issue.

I came here to seek this specific point of clarification. My understanding (from intuition and from reading the docs) is that duckplyr is strictly for the in-memory DuckDB database and does not support out-of-core operations.

Is that safe to say? So, if one is working with larger-than-memory data, they should consider using duckdb("path/to/db-dir") instead (see the sketch below). Is that correct?
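A sketch of that route (not from the thread; the database path is a placeholder): a persistent DuckDB database used through dbplyr, so intermediate state can live on disk rather than in RAM.

library(duckdb)
library(dplyr)

# Persistent database file instead of the default :memory: database
con <- dbConnect(duckdb(dbdir = "path/to/db.duckdb"))

# Lazy table over a Parquet file; nothing is pulled into R until collect()
lazy_tbl <- tbl(con, dbplyr::sql("read_parquet('yourdata.parquet')"))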

@krlmlr
Member

krlmlr commented May 7, 2024

Here's how we're connecting to duckdb:

duckplyr:::create_default_duckdb_connection
#> function() {
#>   drv <- duckdb::duckdb()
#>   con <- DBI::dbConnect(drv)
#> 
#>   DBI::dbExecute(con, "set memory_limit='1GB'")
#>   DBI::dbExecute(con, paste0("pragma temp_directory='", tempdir(), "'"))
#> 
#>   duckdb$rapi_load_rfuns(drv@database_ref)
#> 
#>   for (i in seq_along(duckplyr_macros)) {
#>     sql <- paste0('CREATE MACRO "', names(duckplyr_macros)[[i]], '"', duckplyr_macros[[i]])
#>     DBI::dbExecute(con, sql)
#>   }
#> 
#>   con
#> }
#> <bytecode: 0x1135cd480>
#> <environment: namespace:duckplyr>

Created on 2024-05-07 with reprex v2.1.0

The memory is limited to 1 GB, and we set a temporary directory so that DuckDB can spill to disk. We also support processing from and to files with the duckplyr::*df_from_*() functions and duckplyr::df_to_parquet(); a sketch follows.
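The sketch below is an illustration, not taken from the thread: it assumes the two-argument form of duckplyr_df_from_file() works with its defaults and that df_to_parquet() takes the output path as its second argument; file and column names are placeholders. The point is that the result never has to be materialized as an R data frame.

library(dplyr)

duckplyr::duckplyr_df_from_file("input.parquet", "read_parquet") |>
  filter(x > 0) |>
  duckplyr::df_to_parquet("output.parquet")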

What do you mean by "support out-of-core operations"?

@JosiahParry

Thanks @krlmlr. I'm referring to this part of the DuckDB documentation:

[screenshot of the DuckDB documentation]

My understanding is that, by only being able to use the default :memory: database, the entire database is stored in RAM, which does not allow for larger-than-memory workloads.

As I understand it, one of the main motivating points of DuckDB is this capability.

@PMassicotte

PMassicotte commented May 7, 2024

This is interesting; I am also playing with both packages. I have noticed some differences in performance between the two libraries: dbplyr was a bit faster than duckplyr when querying Parquet files on S3.

Did I miss something when exploring duckplyr?

@krlmlr
Member

krlmlr commented May 7, 2024

Thanks, Philippe, interesting. With current duckplyr, I'm seeing that the filter is not pushed down to Parquet. Could that play a role? What does the plan look like for dbplyr?

options(conflicts.policy = list(warn = FALSE))
library(dplyr)
library(DBI)

con <- duckplyr:::get_default_duckdb_connection()

dbSendQuery(con, "INSTALL httpfs; LOAD httpfs;")
#> <duckdb_result e60e0 connection=e5160 statement='INSTALL httpfs; LOAD httpfs;'>
dbSendQuery(
  con,
  "SET s3_region='auto';SET s3_endpoint='';"
)
#> <duckdb_result e63e0 connection=e5160 statement='SET s3_region='auto';SET s3_endpoint='';'>

out <- duckplyr::duckplyr_df_from_file(
  "s3://duckplyr-demo-taxi-data/taxi-data-2019-partitioned/*/*.parquet",
  "read_parquet",
  options = list(hive_partitioning = TRUE),
  class = class(tibble())
) |>
  filter(total_amount > 0L) |>
  filter(!is.na(passenger_count)) |>
  mutate(tip_pct = 100 * tip_amount / total_amount) |>
  summarise(
    avg_tip_pct = median(tip_pct),
    n = n(),
    .by = passenger_count
  ) |>
  arrange(desc(passenger_count))

out |>
  explain()
#> ┌───────────────────────────┐
#> │          ORDER_BY         │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │          ORDERS:          │
#> │           #3 ASC          │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │      passenger_count      │
#> │        avg_tip_pct        │
#> │             n             │
#> │     -(passenger_count)    │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │      passenger_count      │
#> │        avg_tip_pct        │
#> │             n             │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │       HASH_GROUP_BY       │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │             #0            │
#> │         median(#1)        │
#> │        count_star()       │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │      passenger_count      │
#> │          tip_pct          │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │      passenger_count      │
#> │          tip_pct          │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │      passenger_count      │
#> │         tip_amount        │
#> │        total_amount       │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │           FILTER          │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │(r_base::>(total_amount, 0)│
#> │ AND (NOT ((passenger_count│
#> │   IS NULL) OR isnan(CAST  │
#> │(passenger_count AS DO...  │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │        EC: 17623488       │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │       READ_PARQUET        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │        total_amount       │
#> │      passenger_count      │
#> │         tip_amount        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │        EC: 88117440       │
#> └───────────────────────────┘

Created on 2024-05-07 with reprex v2.1.0
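Not part of the original reprex, but for comparison, a sketch of the dbplyr side (same S3 path as above; it assumes dbplyr's sql() table source and DuckDB's httpfs extension), to check where the filters land in its plan:

library(dplyr)
library(DBI)

con <- dbConnect(duckdb::duckdb())
dbExecute(con, "INSTALL httpfs")
dbExecute(con, "LOAD httpfs")
dbExecute(con, "SET s3_region='auto'")
dbExecute(con, "SET s3_endpoint=''")

# Lazy dbplyr table over the partitioned Parquet files
taxi <- tbl(con, dbplyr::sql(
  "SELECT * FROM read_parquet('s3://duckplyr-demo-taxi-data/taxi-data-2019-partitioned/*/*.parquet', hive_partitioning = true)"
))

taxi |>
  filter(total_amount > 0, !is.na(passenger_count)) |>
  mutate(tip_pct = 100 * tip_amount / total_amount) |>
  group_by(passenger_count) |>
  summarise(avg_tip_pct = median(tip_pct, na.rm = TRUE), n = n()) |>
  arrange(desc(passenger_count)) |>
  explain()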

@krlmlr
Member

krlmlr commented May 7, 2024

Josiah: duckplyr operates directly on data frames; it never creates persistent tables in duckdb's table store, so the location of the database doesn't play much of a role. The DBI equivalents are perhaps duckdb::duckdb_register() (create a view of the data frame in the database) vs. DBI::dbWriteTable() (write a table to the table store). You can write the results of a duckplyr operation directly to Parquet with df_to_parquet() without materializing them as a data frame. However, as soon as duckplyr encounters an operation it can't translate, it falls back to data frames; set the DUCKPLYR_FORCE environment variable to avoid this.
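A sketch of that DBI-level contrast (toy data, illustration only): a registered view leaves the data frame in R and merely exposes it to DuckDB, while dbWriteTable() copies it into DuckDB's own table store.

library(DBI)

con <- dbConnect(duckdb::duckdb())
df <- data.frame(x = 1:3)

duckdb::duckdb_register(con, "df_view", df)  # view over the R data frame, not copied
dbWriteTable(con, "df_table", df)            # materialized in DuckDB's table store

dbGetQuery(con, "SELECT count(*) FROM df_view")
dbGetQuery(con, "SELECT count(*) FROM df_table")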


@krlmlr
Member

krlmlr commented Jun 30, 2024

When documenting this, we need to mention that we never generate SQL: #132.
