difference between duckplyr and dbplyr? #145

Closed
ywhcuhk opened this issue Apr 22, 2024 · 13 comments · Fixed by #191

Comments

@ywhcuhk

ywhcuhk commented Apr 22, 2024

I find it quite convenient to use duckdb as a backend for dplyr (through dbplyr). All you need to do is create a duckdb connection and read the data through one of duckdb's table functions; then you can manipulate the resulting table with dplyr, e.g.:

library(duckdb)
library(dplyr)

con <- dbConnect(duckdb())

# Lazy table backed by DuckDB's read_parquet(); dplyr verbs are translated to SQL by dbplyr
df <- tbl(con, "read_parquet('yourdata.parquet')")

So what can duckplyr do that dbplyr can't?

@krlmlr
Member

krlmlr commented Apr 27, 2024

Thanks, good question.

I've started to add the following content to README.Rmd, but I'm not sure where it really belongs. A new FAQ section, perhaps?

The goal of duckplyr is to provide a fully compatible drop-in replacement for dplyr that uses DuckDB as a backend for fast operation.
DuckDB is an in-process OLAP database management system.
Unlike dbplyr with duckdb, this package aims to support all R operators, functions, and data types, in a fully compatible way.
If dplyr and duckplyr give different results on the same input data, this is considered a bug!

The operators, functions, and data types that currently cannot be translated to duckdb are executed in R.
The set of supported operations is continuously expanding.
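For concreteness, here is a minimal sketch of the drop-in idea (not from the thread; it assumes the as_duckplyr_df() constructor from the duckplyr API of the time, and the toy data frame is made up): the dplyr code stays the same, only the input is wrapped.

library(dplyr)

df <- data.frame(g = c("a", "a", "b"), x = c(1, 2, 3))

# Plain dplyr on a data frame
out_dplyr <- df |>
  summarise(total = sum(x), .by = g)

# Same pipeline, but the input is wrapped so duckplyr can hand the work to DuckDB
# (as_duckplyr_df() is an assumption here, not shown elsewhere in this thread)
out_duckplyr <- df |>
  duckplyr::as_duckplyr_df() |>
  summarise(total = sum(x), .by = g)

# The two results should match exactly; any difference would be considered a bug
all.equal(out_dplyr, as.data.frame(out_duckplyr))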

@krlmlr
Member

krlmlr commented Apr 29, 2024

This high-level blog post is a good intro too: https://posit.co/blog/duckplyr-dplyr-powered-by-duckdb/ .

@ywhcuhk
Author

ywhcuhk commented Apr 30, 2024

Thanks a lot for the clarification. Just to check, when you say:

Unlike dbplyr with duckdb, this package aims to support all R operators, functions, and data types, in a fully compatible way.

Does this mean that when duckplyr runs into a function that is not available in duckdb, the data will be collected and then processed in RAM? One major advantage of using duckdb is its "out-of-memory" capability. If the data gets collected during the query, one could run into memory issues.

@krlmlr
Member

krlmlr commented Apr 30, 2024

True. You can set the DUCKPLYR_FORCE environment variable to "TRUE" to avoid this, but many operations will be unavailable then.
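For reference, the variable can be set from the R session, e.g.:

# Opt in to DuckDB-only execution (variable name taken from the comment above);
# operations that cannot be translated will then fail instead of silently
# falling back to R data frames
Sys.setenv(DUCKPLYR_FORCE = "TRUE")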

@JosiahParry

One major advantage of using duckdb is the "out-of-memory" capabilities. If the data gets collected during the query, then one could run into the memory issue.

I came here to seek this specific point of clarification. My understanding (from intuition and from reading the docs) is that duckplyr is strictly for the in-memory DuckDB database and does not support out-of-core operations.

Is that safe to say? So, if one is working with larger-than-memory data, they should consider using duckdb("path/to/db-dir") instead (see the sketch below). Is that correct?
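A sketch of that route (not from the thread; the database path is a placeholder): a persistent DuckDB database used through dbplyr, so intermediate state can live on disk rather than in RAM.

library(duckdb)
library(dplyr)

# Persistent database file instead of the default :memory: database
con <- dbConnect(duckdb(dbdir = "path/to/db.duckdb"))

# Lazy table over a Parquet file; nothing is pulled into R until collect()
lazy_tbl <- tbl(con, dbplyr::sql("read_parquet('yourdata.parquet')"))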

@krlmlr
Member

krlmlr commented May 7, 2024

Here's how we're connecting to duckdb:

duckplyr:::create_default_duckdb_connection
#> function() {
#>   drv <- duckdb::duckdb()
#>   con <- DBI::dbConnect(drv)
#> 
#>   DBI::dbExecute(con, "set memory_limit='1GB'")
#>   DBI::dbExecute(con, paste0("pragma temp_directory='", tempdir(), "'"))
#> 
#>   duckdb$rapi_load_rfuns(drv@database_ref)
#> 
#>   for (i in seq_along(duckplyr_macros)) {
#>     sql <- paste0('CREATE MACRO "', names(duckplyr_macros)[[i]], '"', duckplyr_macros[[i]])
#>     DBI::dbExecute(con, sql)
#>   }
#> 
#>   con
#> }
#> <bytecode: 0x1135cd480>
#> <environment: namespace:duckplyr>

Created on 2024-05-07 with reprex v2.1.0

The memory is limited to 1 GB, and we set a temporary directory so that DuckDB can spill to disk. We also support processing from and to files with the duckplyr::*df_from_*() functions and duckplyr::df_to_parquet(); a sketch follows.
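The sketch below is an illustration, not taken from the thread: it assumes the two-argument form of duckplyr_df_from_file() works with its defaults and that df_to_parquet() takes the output path as its second argument; file and column names are placeholders. The point is that the result never has to be materialized as an R data frame.

library(dplyr)

duckplyr::duckplyr_df_from_file("input.parquet", "read_parquet") |>
  filter(x > 0) |>
  duckplyr::df_to_parquet("output.parquet")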

What do you mean by "support out-of-core operations"?

@JosiahParry

Thanks @krlmlr. I'm referring to this part of the DuckDB documentation:

[screenshot of the DuckDB documentation]

My understanding is that, by only being able to use the default :memory: database, the entire database is stored in RAM, which does not allow for larger-than-memory workloads.

As I understand it, one of the main motivating points of DuckDB is this capability.

@PMassicotte

PMassicotte commented May 7, 2024

This is interesting; I am also playing with both packages. I have noticed some differences in performance between the two libraries: dbplyr was a bit faster than duckplyr when querying Parquet files on S3.

Did I miss something when exploring duckplyr?

@krlmlr
Member

krlmlr commented May 7, 2024

Thanks, Philippe, interesting. With current duckplyr, I'm seeing that the filter is not pushed down to Parquet. Could that play a role? What does the plan look like for dbplyr?

options(conflicts.policy = list(warn = FALSE))
library(dplyr)
library(DBI)

con <- duckplyr:::get_default_duckdb_connection()

dbSendQuery(con, "INSTALL httpfs; LOAD httpfs;")
#> <duckdb_result e60e0 connection=e5160 statement='INSTALL httpfs; LOAD httpfs;'>
dbSendQuery(
  con,
  "SET s3_region='auto';SET s3_endpoint='';"
)
#> <duckdb_result e63e0 connection=e5160 statement='SET s3_region='auto';SET s3_endpoint='';'>

out <- duckplyr::duckplyr_df_from_file(
  "s3://duckplyr-demo-taxi-data/taxi-data-2019-partitioned/*/*.parquet",
  "read_parquet",
  options = list(hive_partitioning = TRUE),
  class = class(tibble())
) |>
  filter(total_amount > 0L) |>
  filter(!is.na(passenger_count)) |>
  mutate(tip_pct = 100 * tip_amount / total_amount) |>
  summarise(
    avg_tip_pct = median(tip_pct),
    n = n(),
    .by = passenger_count
  ) |>
  arrange(desc(passenger_count))

out |>
  explain()
#> ┌───────────────────────────┐
#> │          ORDER_BY         │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │          ORDERS:          │
#> │           #3 ASC          │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │      passenger_count      │
#> │        avg_tip_pct        │
#> │             n             │
#> │     -(passenger_count)    │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │      passenger_count      │
#> │        avg_tip_pct        │
#> │             n             │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │       HASH_GROUP_BY       │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │             #0            │
#> │         median(#1)        │
#> │        count_star()       │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │      passenger_count      │
#> │          tip_pct          │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │      passenger_count      │
#> │          tip_pct          │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │      passenger_count      │
#> │         tip_amount        │
#> │        total_amount       │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │           FILTER          │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │(r_base::>(total_amount, 0)│
#> │ AND (NOT ((passenger_count│
#> │   IS NULL) OR isnan(CAST  │
#> │(passenger_count AS DO...  │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │        EC: 17623488       │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │       READ_PARQUET        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │        total_amount       │
#> │      passenger_count      │
#> │         tip_amount        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │        EC: 88117440       │
#> └───────────────────────────┘

Created on 2024-05-07 with reprex v2.1.0
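Not part of the original reprex, but for comparison, a sketch of the dbplyr side (same S3 path as above; it assumes dbplyr's sql() table source and DuckDB's httpfs extension), to check where the filters land in its plan:

library(dplyr)
library(DBI)

con <- dbConnect(duckdb::duckdb())
dbExecute(con, "INSTALL httpfs")
dbExecute(con, "LOAD httpfs")
dbExecute(con, "SET s3_region='auto'")
dbExecute(con, "SET s3_endpoint=''")

# Lazy dbplyr table over the partitioned Parquet files
taxi <- tbl(con, dbplyr::sql(
  "SELECT * FROM read_parquet('s3://duckplyr-demo-taxi-data/taxi-data-2019-partitioned/*/*.parquet', hive_partitioning = true)"
))

taxi |>
  filter(total_amount > 0, !is.na(passenger_count)) |>
  mutate(tip_pct = 100 * tip_amount / total_amount) |>
  group_by(passenger_count) |>
  summarise(avg_tip_pct = median(tip_pct, na.rm = TRUE), n = n()) |>
  arrange(desc(passenger_count)) |>
  explain()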

@krlmlr
Member

krlmlr commented May 7, 2024

Josiah: duckplyr operates directly on data frames; it never creates persistent tables in duckdb's table store, so the location of the database doesn't play much of a role. The DBI equivalents are perhaps duckdb::duckdb_register() (create a view of the data frame in the database) vs. DBI::dbWriteTable() (write a table to the table store). You can write the results of a duckplyr operation directly to Parquet with df_to_parquet() without materializing them as a data frame. However, as soon as duckplyr encounters an operation it can't translate, it falls back to data frames; set the DUCKPLYR_FORCE environment variable to avoid this.
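A sketch of that DBI-level contrast (toy data, illustration only): a registered view leaves the data frame in R and merely exposes it to DuckDB, while dbWriteTable() copies it into DuckDB's own table store.

library(DBI)

con <- dbConnect(duckdb::duckdb())
df <- data.frame(x = 1:3)

duckdb::duckdb_register(con, "df_view", df)  # view over the R data frame, not copied
dbWriteTable(con, "df_table", df)            # materialized in DuckDB's table store

dbGetQuery(con, "SELECT count(*) FROM df_view")
dbGetQuery(con, "SELECT count(*) FROM df_table")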


@krlmlr
Member

krlmlr commented Jun 30, 2024

When documenting this, we need to mention that we never generate SQL: #132.
