Parametrized queries/interpolation #22

Open
davidagold opened this issue Sep 5, 2016 · 11 comments

@davidagold
Owner

This issue is a proper home for the discussion that emerged in #20 regarding syntax and semantics for parametrized queries (i.e. prepared statements with ?-style placeholders) and interpolation (the original issue concerned extending existing Querys via @query).

By interpolation we mean the ability for users to designate a name within an @query invocation as referring to a value in the scope surrounding the @query context. For instance, letting $ denote an interpolated name (value?), the following query

c = .5
@query filter(tbl, A > $c)

would be equivalent to

@query filter(tbl, A > .5)

This sort of functionality is (I think) relatively straightforward. However, it is not the most efficient means of satisfying what I expect will be the dominant use patterns involving this functionality. These use patterns involve collecting a Query over an array of values for query parameters, e.g.

results = Float64[]
for c in consts
    res = @collect tbl |>
        groupby(A > $c) |>
        summarize(avg_B = mean(B))
    push!(results, res)
end

The above is suboptimal because it generates a new Query object, with a new graph and new QueryHelpers, for each value of c, even though the structure of each of these objects is invariant across iterations. What would be most efficient would be to generate a single Query with a parameter c that can be assigned values from consts without having to regenerate the Query each time. This requires syntaxes that (i) allow users to designate parameters within their queries and (ii) allow users to bind parameters to specific values before collecting the Querys. I envision something like the following in place of the interpolation solution above:

results = Float64[]
qry = @query tbl |>
    groupby(A > $c) |>
    summarize(avg_B = mean(B))
for c in consts
    bind!(qry, c = c)
    res = collect(qry)
    push!(results, res)
end

My plan is that parameter information will be stored in some mutable aspect of the Query object, say in an object of type Parameters, and that users will use bind! to bind each parameter to a specific value via keyword arguments, where each key specifies the name of the parameter to bind.
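
For concreteness, here's a minimal sketch of what I have in mind (hypothetical names, Julia 0.5 syntax; the qry.parameters field is assumed). Note that this naive version stores values untyped, which runs into the inference problem described below:

# Sketch only -- not an actual API.
type Parameters                       # mutable, so parameter values can be re-bound
    values::Dict{Symbol, Any}         # untyped storage: simple, but discards type information
end
Parameters() = Parameters(Dict{Symbol, Any}())

# Bind each keyword argument as a named parameter value on the query's
# (assumed) parameters field.
function bind!(qry; kwargs...)
    for (name, val) in kwargs
        qry.parameters.values[name] = val
    end
    return qry
end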

It will be important, at least for the collect machinery for in-memory Julia tabular data structures, that pulling a query parameter from a Query object and passing it, say, to a filtering kernel (lambda) be type-inferable, in order to avoid boxing the result of applying the lambda to the row and parameter arguments.

In the normal case of applying such a lambda just to the values of an iterator over Table rows (i.e. without any query parameters), type information about such row values (which are tuples) is propagated through the type of the iterator, which is just the result of zipping together the individual columns that are arguments to the filtering lambda. That is to say, the type parameters of the zipped iterator convey the element types of the relevant columns of the Table data source, since those element types are themselves represented as type parameters of the columns (which are NullableArrays). As long as this iterator is passed through a function barrier before being iterated over, type inference can identify the type of the argument (a tuple) passed to the filtering kernel.
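
Roughly, the pattern looks like this (illustrative only, not the actual collect internals):

# The element types of the columns travel in the type of the zipped iterator,
# and the inner function is the barrier behind which rows arrive fully typed.
function _apply_filter(f, rows)            # function barrier
    keep = Int[]
    for (i, row) in enumerate(rows)        # row is a tuple with inferable element types
        f(row) && push!(keep, i)
    end
    return keep
end

filter_rows(f, cols...) = _apply_filter(f, zip(cols...))

# e.g. filter_rows(row -> row[1] > 0.5, [0.2, 0.7, 0.9]) returns [2, 3]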

In the case of a parametrized query, an obvious implementation of this functionality involves passing the parameters from the appropriate field of the Query object (after they've been bound) to, in this case, the filtering kernel as a second argument tuple. However, if the mutable structure in which the parameters are stored does not convey the type information of the parameters, then this information may not propagate to the point at which the kernel is actually applied to the row-tuple and the parameter-tuple, which may result in the compiler boxing the result of applying the kernel. Of course, we may assume that a filtering kernel always returns a Nullable{Bool}, but in the case of, say, a select kernel we can't make any such assumptions. I think (but could be wrong about this) that the only sure-fire way to avoid such boxing is to convey the type information of the parameters in the type of the Parameters object that wraps them.
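
To make the contrast concrete, here is a sketch (hypothetical types, not actual jplyr code) of why a plain container loses the information while a parametric wrapper keeps it:

type UntypedParams
    values::Tuple                  # abstract field type: inference only sees ::Tuple
end

type TypedParams{T<:Tuple}
    values::T                      # the type parameter carries the parameters' element types
end

# The kernel is applied to the row tuple and the parameter tuple together.
apply_kernel(f, row, params) = f(row, params.values)

# With TypedParams{Tuple{Float64}} the compiler knows the parameter is a Float64
# and can infer the kernel's result; with UntypedParams it may have to box it.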

Thus Parameters should be a parametric type, and in order for us to generate useful type parameters, we need to know the types of the parameters before they are ever bound in the Query. This suggests a syntax for declaring query parameters and their types. A natural candidate is type assertion:

qry = @query filter(tbl, A > c::Float64) # we declare c to be an unbound query parameter of type Float64

This syntax choice would preclude use of the :: type assertion syntax in query arguments. I would be okay paying this price.

It may turn out, however, that we can find a way to communicate the type information of parameters to the compiler without storing it in the Parameters wrapper, in which case $ syntax for designating parameters would be sufficient.

Note that query parametrization is solely a matter of designating parameters within "query arguments", that is, non-data source arguments to manipulation verbs within an @query invocation. The present package achieves an analogue of parametrization for data sources by means of dummy sources. As mentioned in the comments in #20, this syntactic distinction reflects a conceptual distinction: one collects the same query against multiple backends by means of dummy sources, and one collects a query with constant structure but varying values against a fixed backend by using parametrization.

Thoughts?

@davidanthoff

Not sure this can also work for jplyr, but in Query I'm using closures for this kind of scenario:

data = collect(1:10)
c = 3
q = @from i in data begin
    @where i>c
    @select i
end

for j in 1:5
    c = j
    println(collect(q))
end

This does what you would want it to do, and it does so by creating a closure that captures c, so during collection I can always get the current value of c.

@yeesian
Collaborator

yeesian commented Sep 6, 2016

I can see one-off queries that use interpolation being a sufficiently common use case that it's worth combining the steps of preparing statements and binding parameters. The third of my proposals is whether we should have

qry = @query tbl |>
    where(A > $c) |>
    summarize(avg_B = mean(B))

as a stand-in for

qry = @query tbl |>
    where(A > :c) |>
    summarize(avg_B = mean(B))
bind!(qry, c = c)

Responding to the issues that have surfaced:

parsing the first argument

I do like having the syntax reflect the conceptual distinction between collecting a (fixed) query against different sources and collecting a prepared query with varying parameter values against a fixed source.

My preference is for us to not make the parsing of the first argument ("primal data source"?) too special or different from the parsing of the query arguments, because conceptually we could view

@query src |> filter(...) |> select(...)

as

@query from(src) |> filter(...) |> select(...)

in which case src can/should be thought of as a query argument, so that its appearance isn't too different from sources that appear in other verbs (e.g. join), or possibly in correlated subqueries in the future.

specifying data type for parameterized queries

The second point concerning using : for both dummy sources and parametrized queries is that I think the syntax for the latter may need to include some way of specifying the type of values that the parameter will take.

If we want to provide type information, can we do $(x::DataType) or :(x::DataType) in place of $x or x::DataType?
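
Either spelling should be easy to recognize during macro expansion -- a rough sketch (hypothetical helper) of what the $-form looks like to the macro:

# Inside the macro, $(x::Float64) arrives as Expr(:$, Expr(:(::), :x, :Float64)),
# and :(x::Float64) as a quoted :: expression, so either form exposes both the
# name and the declared type.
function param_declaration(ex)
    if isa(ex, Expr) && ex.head == Symbol("\$") &&
            isa(ex.args[1], Expr) && ex.args[1].head == :(::)
        inner = ex.args[1]
        return inner.args[1], inner.args[2]    # (parameter name, declared type)
    end
    return nothing
end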

using : to designate query parameters

using : to designate query parameters, [...] renders the user unable to talk about Symbol literals in query arguments

That's true. But users can also write

qry = @query tbl |> filter(A == :c)

as

qry = @query tbl |> filter(A == symbol("c"))

to disambiguate :c from quoting expressions. The choice of : is somewhat arbitrary -- it feels "natural" and is semantically consistent with the introduction of dummy sources -- but there's the possibility of switching to ? or other prefixes.

@davidagold
Owner Author

@davidanthoff Yeesian and I have experimented with using closures, but we had some difficulty getting type inference to correctly predict the return type of a lambda that produces the captured value in cases where that value is re-assigned, even inside a function. Compare the following:

# this is fine
function test1()
    c = 1
    λ = () -> c
    @code_warntype λ()
end

# this is not fine
function test2()
    c = 1
    λ = () -> c
    for _c in [1, 2, 3]
        c = _c
        @code_warntype λ()
    end
end

The only way I've thought of to fix this is to wrap the lambda in some Parameters type that conveys the type information of the stored value -- essentially the solution above. You might be able to get part of the way there with a simple typeassert, but I don't think this will avoid having initially to box the return value from the lambda. I think this is a difference between the lambda approach and a mutable wrapper approach. In the first, type inference thinks, "I have no idea what type c is when it's produced inside the lambda, but I do know that when I return it from the lambda it'll be converted to a T". In the second, type inference knows that the only type of object that can be stored in the container is an object of type Nullable{T}, so (hopefully) no boxing is necessary anywhere.
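
To make that concrete, here's test2 reworked around a typed mutable wrapper (hypothetical Param type -- essentially the Parameters idea in miniature): re-bind by mutation instead of re-assignment, and the closure's return type stays inferable.

# Sketch only (Julia 0.5 syntax).
type Param{T}
    value::T
end

function test3()
    p = Param(1)                 # Param{Int}; p itself is never re-assigned
    λ = () -> p.value            # return type inferable as Int
    for _c in [1, 2, 3]
        p.value = _c             # re-bind by mutation, not by re-assignment
        @code_warntype λ()
    end
end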

The question then is, how should users declare the type of c? The lambda approach requires that users initialize a value for c beforehand, and one could use the value of this initial c to set the type parameter of the container, or the typeassert on the lambda. Or, since in our world we need to designate such names as not referring to columns anyway, we could use the same syntax that designates a name as not referring to a column to also declare its type, thereby introducing it as an uninitialized query parameter.

@davidanthoff

@davidagold I think these are all instances of JuliaLang/julia#15276, so I'm kind of hoping that at some point that gets fixed and then the closure approach should always work :) I'll add your latest example to that issue as well, if you don't mind.

@davidagold
Owner Author

@davidanthoff Please do.

@davidagold
Owner Author

@yeesian I should be clearer. By a "query argument" I just mean an expression (that may consist solely of a variable name) in which unadorned names are assumed to refer to attributes (column names) of a pre-specified data source. So, in filter(tbl, A > .5), the argument A > .5 is a query argument, since in that expression we assume that all unadorned names (i.e. A) refer to attributes of a pre-specified source (i.e. tbl).

But there need to be contexts within an @query invocation in which we don't assume that unadorned names refer to column names of a pre-specified data source, i.e. when that pre-specified data source is designated. It's not so much that the data source is special; it's that the context in which it is passed as an argument to a manipulation verb is different from that in which an argument like A > .5 is passed to a manipulation verb. In the latter context, the name A is assumed to refer to an attribute, but in the former the name tbl is not.

The same is true for joins. In the following,

@query join(tbl1, tbl2, A == B)

the first two arguments of join are not assumed to be attributes of any pre-specified data sources. However, in the third argument, A == B, A and B are assumed to be attributes of the two data sources.

I'm realizing now that the present discussion needs to be part of a larger discussion about how names are resolved inside of a "query context" in an @query invocation, since this will determine when interpolation is necessary. It may be best for now to just stick with $ interpolation and leave a parametrized query API for later. The former can be used for parametrized queries, albeit with the type inference difficulties mentioned above -- which may in fact be less of a problem than I originally thought, depending on the effects of well-placed function barriers, though it will probably still incur some amount of boxing.

@davidanthoff

@davidagold I know you previously wrote that jplyr would work with any data source, not just things that have a column-like structure, but I don't understand how that works with the syntax here. It seems that e.g. the filter function only works with things that have attributes, or not? But in this context isn't that just a different name for columns?

How would one, for example, query an array of Ints and then filter out the even ones? Or say you have an array of some custom type, and a function that takes such a custom type as input and returns a Bool -- how would you express a filter that is based on evaluating that function for each element of the source?

@davidagold
Owner Author

davidagold commented Sep 13, 2016

Those are good questions. It really comes down to defining a collection machinery that implements the semantics you're looking for. Part of this is understanding what is stored in a QueryNode. Let's take querying an Array{Int} and filtering out even elements as an example.

julia> using jplyr

julia> A = collect(1:10);

julia> qry = @query filter(A, isodd(a))
Query with Array{Int64,1} source

If I look at the qry.graph I see the above query represented in terms of a FilterNode:

julia> dump(qry.graph)
jplyr.FilterNode
  input: jplyr.DataNode
    input: #undef
  args: Array{Union{Expr,Symbol}}((1,))
    1: Expr
      head: Symbol call
      args: Array{Any}((2,))
        1: Symbol isodd
        2: Symbol a
      typ: Any
  helpers: Array{jplyr.FilterHelper{F}}((1,))
    1: jplyr.FilterHelper{##3#4}
      f: #3 (function of type ##3#4)
      arg_fields: Array{Symbol}((1,))
        1: Symbol a

The FilterNode contains information about the input data source (in this case, a DataNode signifies that a source, rather than the result of another manipulation, is being fed to the filter), the raw query argument isodd(a) as an Expr, and an array of FilterHelpers. Each FilterHelper represents a predicate argument to the filter verb, so in this case there's just one. It contains a function that represents the structure of the predicate isodd(a), with some modifications. I can use this function to extend the default collection machinery and thereby define behavior for collecting an Array{Int} against a FilterNode:

julia> function jplyr._collect(A::Array{Int}, q::jplyr.FilterNode)
           res = Array{Int}(0)
           f = q.helpers[1].f
           for a in A
               f((a,)) && push!(res, a)
           end
           return res
       end

julia> collect(qry)
5-element Array{Int64,1}:
 1
 3
 5
 7
 9

In these semantics, the name a as it appears in the query is just a placeholder for an element of A. As the provider of these querying semantics, I'd make sure to document this behavior. You're right that in the context of querying against tabular data structures, the name a is taken to refer to an attribute of the table. But that is just a particular semantics provided by the collection machinery defined on tabular data structures.

It would probably be useful to add some sort of syntax akin to what you have in Query.jl, e.g.

qry = @query a in A, b in B begin
    filter(a * b > .5)
end

since without it there's no way to enforce coherence of placeholder names from verb to verb. But it's not strictly necessary, as evidenced above.
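
And for the custom-type case you asked about, the same hook generalizes -- a sketch, assuming the kernel is generated the same way as in the isodd example (Point and inupperhalf are made up):

immutable Point
    x::Float64
    y::Float64
end

inupperhalf(p::Point) = p.y > 0.5

# Extend the same _collect hook used above for Array{Int}.
function jplyr._collect(A::Array{Point}, q::jplyr.FilterNode)
    res = Point[]
    f = q.helpers[1].f
    for a in A
        f((a,)) && push!(res, a)   # apply the generated predicate to each element
    end
    return res
end

# pts = [Point(rand(), rand()) for _ in 1:10]
# qry = @query filter(pts, inupperhalf(p))
# collect(qry)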

Does this help?

EDIT: This also perhaps isn't the most on-topic discussion.

EDIT2: But it is making me think about some interesting syntax issues/enhancements, so I do appreciate it.

@davidanthoff

Thanks for the explanation!

@davidagold
Owner Author

My pleasure.

@davidagold
Owner Author

davidagold commented Sep 14, 2016

Also, I suppose I was saying things like "assumed to be an attribute" above =p So really, I should generalize the vocabulary that deals with names in query arguments. Rather than saying that names in the context of query arguments are assumed to be attributes, I should say that they are assumed to refer to some aspect of the data source, where this aspect is (generally) an attribute when the data source is a tabular data structure. So, David's comment is really more relevant to the present discussion than I initially acknowledged.
