-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Study of dplyr SQL generation #5
Comments
What about this going into the abstract tables interface instead? Then Jplyr can just call those generic methods. That would be more consistent with dataframes and DBs with specific SQL dialects can just implement the interface themselves. Though it seems to require a dag/lazy eval, which brings us back to this package. |
Let's worry about that stuff later. There's so much work to be done here on just representing things with a DAG. Once that's done, it will be easier to address extensions. |
Though I do think it is worth keeping the organization of AbstractTables vis a vis this package in the back of our minds. We may want to represent the notion of a @datnamer I'm not quite sure I understand the suggestion, and I'd like to understand it even though I agree it probably won't be relevant until some time later. |
@davidagold - You pretty much got it there. |
Following up on Part 2, to answer:
by tracing the sequence of calls from > my_db <- src_sqlite("my_db.sqlite3", create = TRUE)
> library(nycflights13)
> flights_sqlite <- copy_to(my_db, flights, temporary = FALSE, indexes = c("year", "month", "day"), "carrier", "tailnum")) src_sqlite()Examining src_sqlite <- function(path, create = FALSE) {
if (!requireNamespace("RSQLite", quietly = TRUE)) {
stop("RSQLite package required to connect to sqlite db", call. = FALSE)
}
if (!create && !file.exists(path)) {
stop("Path does not exist and create = FALSE", call. = FALSE)
}
con <- DBI::dbConnect(RSQLite::SQLite(), path)
RSQLite::initExtension(con)
src_sql("sqlite", con, path = path)
} We see that src_sql <- function(subclass, con, ...) {
subclass <- paste0("src_", subclass)
structure(list(con = con, ...), class = c(subclass, "src_sql", "src"))
} under the subclass > str(my_db)
List of 2
$ con :Formal class 'SQLiteConnection' [package "RSQLite"] with 5 slots
.. ..@ Id :<externalptr>
.. ..@ dbname : chr "my_db.sqlite3"
.. ..@ loadable.extensions: logi TRUE
.. ..@ flags : int 6
.. ..@ vfs : chr ""
$ path: chr "my_db.sqlite3"
- attr(*, "class")= chr [1:3] "src_sqlite" "src_sql" "src" copy_to()We now trace the sequence of calls in flights_sqlite <- copy_to(my_db, flights, temporary = FALSE, indexes = c("year", "month", "day"), "carrier", "tailnum")) We begin with copy_to.src_sql <- function(dest, df, name = deparse(substitute(df)),
types = NULL, temporary = TRUE,
unique_indexes = NULL, indexes = NULL,
analyze = TRUE, ...) {
assert_that(is.data.frame(df), is.string(name), is.flag(temporary))
class(df) <- "data.frame" # avoid S4 dispatch problem in dbSendPreparedQuery
if (isTRUE(db_has_table(dest$con, name))) {
stop("Table ", name, " already exists.", call. = FALSE)
}
types <- types %||% db_data_type(dest$con, df)
names(types) <- names(df)
con <- dest$con
db_begin(con)
on.exit(db_rollback(con))
db_create_table(con, name, types, temporary = temporary)
db_insert_into(con, name, df)
db_create_indexes(con, name, unique_indexes, unique = TRUE)
db_create_indexes(con, name, indexes, unique = FALSE)
if (analyze) db_analyze(con, name)
db_commit(con)
on.exit(NULL)
tbl(dest, name)
}
Here's the cascade of calls from tbl.src_sqlite <- function(src, from, ...) {
tbl_sql("sqlite", src = src, from = from, ...)
} tbl_sql <- function(subclass, src, from, ..., vars = attr(from, "vars")) {
make_tbl(
c(subclass, "sql", "lazy"),
src = src,
ops = op_base_remote(src, from, vars)
)
} make_tbl <- function(subclass, ...) {
subclass <- paste0("tbl_", subclass)
structure(list(...), class = c(subclass, "tbl"))
} which explains the hierarchy of classes op_base_remote()Let's wrap up the exploration with an examination of op_base_remote <- function(src, x, vars = NULL) {
# If not literal sql, must be a table identifier
if (!is.sql(x)) {
x <- ident(x)
}
if (is.null(vars)) {
vars <- db_query_fields(src$con, x)
}
op_base("remote", src, x, vars)
} where
op_base <- function(name, src, x, vars) {
stopifnot(is.character(vars))
structure(
list(
src = src,
x = x,
vars = vars
),
class = c(paste0("op_base_", name), "op_base", "op")
)
} |
Going forward, I think we should maybe do one/two more case studies on
For example, But before then, I think the final point of this exercise might be to decide whether to build up an (intermediate) "query object", that goes like
and the factors that goes into such a decision. Relevant reading(s): |
Maybe it would be easiest to contact hadley himself and have a chat? I bet he would be willing to give some advice. If that ever happens, please record, because that would be a fascinating discussion to watch. |
The purpose of this issue is to provide a description of the internals of dplyr's SQL generation. I intend it to serve as reference/inspiration for the development of such functionality in this package.
We first create a local SQLite3 database according to the vignette :
(We are not concerned with the generation of indices at this point.)
It is worth examining the structure of
flights_sqlite
object:We will see that the data manipulation commands provided by dplyr dispatch on the
tbl_lazy
class inflights_sqlite
's class structure and are accumulated in itsops
field.For instance, let's trace the cascade of functions resulting from a call to
The adventure starts in R/manip.R:
The filtering expressions (in this case
year == 2013
,month == 1
,day == 1
) are passed tolazyeval
, which produces the followinglazy_dots
objectwhich encodes the information (i.e. the un-evaluated filtering expressions) passed to the original
filter
command. The original data source, i.e.flights_sqlite
and.dots
are then passed to the eagerly-evaluatingfilter_
generic.To find the specific method called on
(flights_sqlite, .dots)
, we need to identify the class structure offlights_sqlite
:(TODO: understand how
flights_sqlite
comes to have this class structure.)The relevant class, i.e. the one on which
filter_
dispatches, turns out to betbl_lazy
, and so we examine the relevant method given in R/tbl_lazy.R:Since the
...
argument is empty, thedots
object returned by thelazyeval::all_dots
call is "equivalent" (i.e., has the same structure and components) to the original.dots
argument.The purpose of
filter_.tbl_lazy
is ultimately to collect the information about the filtering expressions and store it in anop
object, which is in turn stored in theops
field of the originaltbl_lazy
data source, i.e.flights_sqlite
. This takes us to the implementation ofadd_op_single
in the land ofR/lazy_ops.R
:(We note that the
ops
field offlights_sqlite
is initiated with theop_base_remote
method).op_single
in turn returns anop
object whose most specific class is determined by the manipulation command from which the call toadd_op_single
originated:Thus we find that
add_op_single
(the "single" refers to the fact thatfilter
is a single-table verb) modifies theops
field of the.data
argument, i.e.flights_sqlite
. In particular, sinceflights_sqlite$ops
is passed as thex
argument toop_single
, the resultantop_filter
object "points to" the originalop_base_remote
object initialized at creation offlights_sqlite
. In this way dplyr lazily accumulates manipulation commands applied to remote data sources in a graph structure that can then be analyzed, flattened, and translated into SQL.Indeed, if we examine the
ops
field ofc1
, we see that its value is anop_filter
object that contains both the originalop_base_remote
object as well as all the requisite information about the filtering expressions passed tofilter
above:To be continued.
The text was updated successfully, but these errors were encountered: