Difference between duckplyr and dbplyr? #145
Thanks, good question. I've started to add the following content to
|
This high-level blog post is a good intro too: https://posit.co/blog/duckplyr-dplyr-powered-by-duckdb/ |
Thanks a lot for the clarification. Just to check, when you say:
Does this mean that when we run into a function that is not available in |
True. You can set the |
I came here to seek this specific point of clarification. My understanding (through intuition and reading the docs) is that duckplyr is strictly for the in-memory DuckDB database and does not support out-of-core operations. Is that safe to say? So, if one is working with larger-than-memory data, they should consider using |
Here's how we're connecting to duckdb:

duckplyr:::create_default_duckdb_connection
#> function() {
#> drv <- duckdb::duckdb()
#> con <- DBI::dbConnect(drv)
#>
#> DBI::dbExecute(con, "set memory_limit='1GB'")
#> DBI::dbExecute(con, paste0("pragma temp_directory='", tempdir(), "'"))
#>
#> duckdb$rapi_load_rfuns(drv@database_ref)
#>
#> for (i in seq_along(duckplyr_macros)) {
#> sql <- paste0('CREATE MACRO "', names(duckplyr_macros)[[i]], '"', duckplyr_macros[[i]])
#> DBI::dbExecute(con, sql)
#> }
#>
#> con
#> }
#> <bytecode: 0x1135cd480>
#> <environment: namespace:duckplyr>

Created on 2024-05-07 with reprex v2.1.0

The memory is limited, and we enable the temporary directory. We also support processing from and to files with

What do you mean by "support out-of-core operations"? |
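For readers who want to check these two settings on their own connection, here is a minimal sketch. It opens a separate in-memory DuckDB connection through DBI rather than duckplyr's internal one, and the 1GB limit simply mirrors the value shown above; both are assumptions made for illustration.

```r
library(DBI)

# Open an in-memory DuckDB database (a plain DBI connection,
# not duckplyr's internal default connection).
con <- dbConnect(duckdb::duckdb())

# Cap memory and point DuckDB at a directory it can spill to.
dbExecute(con, "SET memory_limit = '1GB'")
dbExecute(con, paste0("SET temp_directory = '", tempdir(), "'"))

# Read the settings back to confirm they took effect.
settings <- dbGetQuery(
  con,
  "SELECT current_setting('memory_limit') AS memory_limit,
          current_setting('temp_directory') AS temp_directory"
)
print(settings)

dbDisconnect(con, shutdown = TRUE)
```

With a memory limit and a temporary directory set, DuckDB can spill intermediate results to disk for operators that support it, even when the database itself lives in memory.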
Thanks @krlmlr. I'm referring to this part of the DuckDB documentation

My understanding is that by only being able to use the default driver

As I understand it, one of the main motivating points of DuckDB is this capability. |
This is interesting, I am also playing with both packages. I have noticed some differences in performance between these two libraries. I found out that

Did I miss something when exploring |
Thanks, Philippe, interesting. With current duckplyr, I'm seeing that the filter is not pushed down to Parquet. Could that play a role? What does the plan look like for dbplyr?

options(conflicts.policy = list(warn = FALSE))
library(dplyr)
library(DBI)
con <- duckplyr:::get_default_duckdb_connection()
dbSendQuery(con, "INSTALL httpfs; LOAD httpfs;")
#> <duckdb_result e60e0 connection=e5160 statement='INSTALL httpfs; LOAD httpfs;'>
dbSendQuery(
con,
"SET s3_region='auto';SET s3_endpoint='';"
)
#> <duckdb_result e63e0 connection=e5160 statement='SET s3_region='auto';SET s3_endpoint='';'>
out <- duckplyr::duckplyr_df_from_file(
"s3://duckplyr-demo-taxi-data/taxi-data-2019-partitioned/*/*.parquet",
"read_parquet",
options = list(hive_partitioning = TRUE),
class = class(tibble())
) |>
filter(total_amount > 0L) |>
filter(!is.na(passenger_count)) |>
mutate(tip_pct = 100 * tip_amount / total_amount) |>
summarise(
avg_tip_pct = median(tip_pct),
n = n(),
.by = passenger_count
) |>
arrange(desc(passenger_count))
out |>
explain()
#> ┌───────────────────────────┐
#> │ ORDER_BY │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ ORDERS: │
#> │ #3 ASC │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ PROJECTION │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ passenger_count │
#> │ avg_tip_pct │
#> │ n │
#> │ -(passenger_count) │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ PROJECTION │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ passenger_count │
#> │ avg_tip_pct │
#> │ n │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ HASH_GROUP_BY │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ #0 │
#> │ median(#1) │
#> │ count_star() │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ PROJECTION │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ passenger_count │
#> │ tip_pct │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ PROJECTION │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ passenger_count │
#> │ tip_pct │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ PROJECTION │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ passenger_count │
#> │ tip_amount │
#> │ total_amount │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ FILTER │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │(r_base::>(total_amount, 0)│
#> │ AND (NOT ((passenger_count│
#> │ IS NULL) OR isnan(CAST │
#> │(passenger_count AS DO... │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ EC: 17623488 │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ READ_PARQUET │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ total_amount │
#> │ passenger_count │
#> │ tip_amount │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ EC: 88117440 │
#> └───────────────────────────┘

Created on 2024-05-07 with reprex v2.1.0 |
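To compare against dbplyr, one way to get its plan is to wrap `read_parquet()` in a lazy table and call `explain()` on it. The sketch below is an assumption-laden stand-in: it uses a small local Parquet file written from `mtcars` instead of the S3 taxi data, and a simplified pipeline, so the plan will not match the one above line for line.

```r
library(DBI)
library(dplyr)

con <- dbConnect(duckdb::duckdb())

# Hypothetical stand-in for the taxi data: a small local Parquet file.
path <- tempfile(fileext = ".parquet")
dbWriteTable(con, "cars", mtcars)
dbExecute(con, paste0("COPY cars TO '", path, "' (FORMAT PARQUET)"))

# Lazy dbplyr table backed by read_parquet(); dbplyr translates the verbs to SQL.
plan_input <- tbl(con, sql(paste0("SELECT * FROM read_parquet('", path, "')"))) |>
  filter(mpg > 20) |>
  group_by(cyl) |>
  summarise(n = n())

# Prints the generated SQL followed by DuckDB's query plan.
explain(plan_input)

result <- collect(plan_input)
dbDisconnect(con, shutdown = TRUE)
```

In the printed plan, checking whether the filter appears inside the Parquet scan node or as a separate FILTER node above it shows whether pushdown happened.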
Josiah: duckplyr operates directly on data frames; it never creates persistent tables in duckdb's table store. The location of the database doesn't play much of a role. The DBI equivalents are perhaps |
When documenting, need to mention that we never generate SQL: #132. |
I find it quite convenient to use duckdb as a backend of dplyr (through dbplyr). All you need to do is to specify a duckdb connection, and read data through duckdb's function. Then you can manipulate df using dplyr. e.g.,

So what can duckplyr do that dbplyr can't?
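The dbplyr workflow described in this post can be sketched roughly as follows. The original example is not preserved here, so this is only an illustration: the table name `cars`, the use of `duckdb::duckdb_register()` to expose a data frame, and the `mtcars` data are all assumptions.

```r
library(DBI)
library(dplyr)

con <- dbConnect(duckdb::duckdb())

# Expose an R data frame to DuckDB as a view, without copying it.
duckdb::duckdb_register(con, "cars", mtcars)

# tbl() returns a lazy table; the verbs below build SQL, nothing runs yet.
query <- tbl(con, "cars") |>
  filter(cyl == 4) |>
  summarise(avg_mpg = mean(mpg, na.rm = TRUE))

# Inspect the SQL that dbplyr generated.
show_query(query)

# collect() runs the query in DuckDB and returns a tibble.
result <- collect(query)

dbDisconnect(con, shutdown = TRUE)
```

Note the explicit connection, the lazy table, and the final `collect()`: these are the pieces of the dbplyr route that the question contrasts with duckplyr.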