To pipe or not to pipe? #16

Open
davidagold opened this issue Aug 10, 2016 · 16 comments
Comments

@davidagold
Owner

davidagold commented Aug 10, 2016

This issue does not concern the "one-off" macros (@select, etc.), which continue to be defunct. Rather, it concerns syntax within the @query macro. Currently, one conveys the intention to pipe the result of one query command to the next with the use of the pipe operator:

qry = @query tbl |>
    filter(name == "Niamh") |>
    select(age)

But strictly speaking this is unnecessary. The pipe operator is seen by the macro, which could just as easily see the separation of Exprs within a :block Expr. Indeed, I can see three (EDIT: four) reasons to remove the use of pipes within the @query macro:

  1. Minimize the number of function calls that the user expresses within @query but which are never actually run.
  2. Minimize keystrokes for the user. (see @tcovert 's comment below)
  3. The future of |> is uncertain anyway, so perhaps it is best not to rely on it to convey any one particular thing.
  4. Less clutter
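
To make the point about what the macro sees concrete, here is a minimal sketch in current Julia syntax. The helpers `pipe_steps` and `block_steps` are hypothetical names invented for illustration and are not part of this package; the sketch only shows that a piped chain and a `:block` Expr carry the same sequence of steps.

```julia
# Flatten a left-nested pipe chain into its individual steps.
# `a |> f(x) |> g(y)` parses as `(a |> f(x)) |> g(y)`.
function pipe_steps(ex)
    if ex isa Expr && ex.head == :call && ex.args[1] == Symbol("|>")
        return vcat(pipe_steps(ex.args[2]), Any[ex.args[3]])
    end
    return Any[ex]
end

# Collect the statements of a :block Expr, skipping line-number metadata.
block_steps(ex::Expr) = Any[a for a in ex.args if !(a isa LineNumberNode)]

piped = :(tbl |> filter(name == "Niamh") |> select(age))
blocked = quote
    tbl
    filter(name == "Niamh")
    select(age)
end

from_pipes = pipe_steps(piped)
from_block = block_steps(blocked)
```

Neither `filter` nor `select` is ever called here; both forms are just data that a macro can walk.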

The one good reason I can see for keeping |> is that it makes the intention of piping data explicit. But this could be served just as well by the newline once that is established as a convention. So, would people prefer

qry = @query tbl
    filter(name == "Niamh")
    select(age)

?? I'm leaning that way myself.

EDIT: This could also remove the need to have both @query and @qcollect. We could make it so that one-line @query invocations collect automatically, whereas multiline invocations return a graph. That is,

@query filter(tbl, name == "Niamh")

would automatically collect, whereas

@query tbl
    filter(name == "Niamh")

would return a graph.

EDIT EDIT: I suppose the above suggestion could be carried out with |> as well. I just happened to think of it while writing this issue.

@tcovert

tcovert commented Aug 11, 2016

Maybe this is a silly question but without some kind of piping operator, how would a text editor know when to stop indenting a query expression? For that matter, how would the interpreter know when the query expression is over?
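
The parsing concern can be checked directly. The following is a small sketch, assuming a current Julia where `Meta.parse(str, start)` returns one expression plus the position where parsing stopped; `@query` does not need to be defined for the text to parse.

```julia
src = """
qry = @query tbl
    filter(name == "Niamh")
    select(age)
"""

# Parse one expression at a time, as Julia does for top-level code.
first_ex, pos = Meta.parse(src, 1)
second_ex, _ = Meta.parse(src, pos)

# The macro call ends at the first newline, so it only receives `tbl`;
# the indented `filter(...)` line is a separate top-level expression.
macro_call = first_ex.args[2]
```

So without a pipe or some other delimiter, the parser hands the macro only the first line.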

@davidagold
Owner Author

Actually, that's not a silly question at all. That's an excellent observation. So, it'd be instead

qry = @query tbl begin
    filter(name == "Niamh")
    select(age)
end

Hmm. Well, now I don't know. Still not bad.
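
As a rough sketch of how the begin/end form could reach a macro: the macro below, `@query_steps`, is a hypothetical stand-in and not the real @query. It only returns the source and the list of verbs as unevaluated data, to show that the block arrives as a second argument.

```julia
# Hypothetical toy macro: splits `@query_steps src begin ... end` into
# the source expression and the verbs listed in the block.
macro query_steps(source, block)
    (block isa Expr && block.head == :block) || error("expected a begin ... end block")
    verbs = Any[ex for ex in block.args if !(ex isa LineNumberNode)]
    # Return the pieces as data; nothing inside the block is evaluated.
    return QuoteNode((source, verbs))
end

result = @query_steps tbl begin
    filter(name == "Niamh")
    select(age)
end

source, verbs = result
```

Here `tbl`, `filter`, and `select` never need to exist as Julia bindings, since the macro only inspects the expressions.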

@tcovert

tcovert commented Aug 11, 2016

An 8-character fixed cost vs. a 2-character variable cost. How many verbs are in the typical query?

The begin/end syntax matches what's in Lazy.jl and the @byrow! macro in DataFramesMeta.jl.

@nalimilan

I think I'd prefer the begin ... end version, which is more standard in Julia than repeating |> at the end of each line. Also, if you count the number of syntax markers instead of the number of characters (which is another interesting metric of cognitive load), begin ... end is a fixed cost of 2 words, while |> is a variable cost of one "word" per additional verb, so at least 2 "words" for any query with three or more steps.

@johnmyleswhite
Collaborator

I think the big question is whether we ever intend to support functions that have more than one argument that needs to come from the previous step in the computation. If so, we might need something more complicated than line breaks. If not, I agree with @nalimilan that minimizing typing is nice.

@davidagold
Owner Author

@johnmyleswhite But in that case we'd also need something more complicated than pipes, too, right? Can you give an example of such a situation?

@johnmyleswhite
Collaborator

I don't have one offhand, but I imagine we'll need to be careful with things like:

SELECT
    x
FROM
    table1
WHERE
    y IN (SELECT y FROM table2 WHERE y > 0)

@richardreeve

As a sad old-fashioned Unix type, I am extremely fond of pipes, and I was delighted when magrittr introduced them into R. I would be equally delighted if they were introduced into Julia. However, they are simple Unix tools ("do one job, do it well") and do not obviously improve the situation when there are multiple inputs (although I still use them then!). As such, @davidagold is right that they don't help in @johnmyleswhite's example. I also think that minimising typing is a distraction.

The question (for me, and maybe not for others) is whether we want to introduce new syntax into the language in general. Is |> an infix operator that means "take the LHS and insert it as the first argument of the RHS of this operator"? That is what magrittr does in R:

> max(1, 2)
[1] 2
> runif(1)
[1] 0.1256291

becomes

> library(magrittr)
> 1 %>% max(2)
[1] 2
> 1 %>% runif
[1] 0.1256291

I think this would be great, and I would be wholeheartedly behind it, and it would enhance the language. If not, then having this weird pipe operator that only worked in the context of @query calls would be a terrible mistake, and even as a pipe enthusiast I would discourage it...

Just my 2¢.

@davidagold
Owner Author

Is |> an infix operator that means "take the LHS and insert it as the first argument of the RHS of this operator"?

Yes.

julia> 5 |> x -> x + 1
6

@yeesian
Collaborator

yeesian commented Sep 1, 2016

@johnmyleswhite But in that case we'd also need something more complicated than pipes, too, right? Can you give an example of such a situation?

We might also want to consider reserving block statements for the analog of common-table-expressions (CTEs), e.g.

result = @query source(s) begin
    x = source(s) |> ... # x is an alias for a subquery/CTE
    y = x/source(s) |> ... # y is an alias for a subquery/CTE
    x/y/source(s) |> ... # the last expression is the result
end

@davidagold
Owner Author

davidagold commented Sep 1, 2016

^ In which case we want both the pipe operator and block expressions.

EDIT: John's example could possibly be expressed as

qry = @query begin
    subq = table2 |>
        filter(y > 0) |>
        select(y)
    table1 |>
        filter(y in subq) |>
        select(x)
end

The pipes are kind of noisy.

EDIT^2: Actually, one could analyze the data flow through

qry = @query begin
    subq = table2
        filter(y > 0)
        select(y)
    table1
        filter(y in subq)
        select(x)
end

based purely on the contents of the block expressions alone. An expression consisting solely of a symbol (e.g. table1) or of an assignment (e.g. subq = table2) -- call these data source expressions -- could signal the start of data flow through subsequent manipulation verbs until the next data source expression or the end of the block is reached.
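
The grouping rule described above could be sketched roughly as follows. The names `is_source` and `flows` are hypothetical, and the rule is taken literally from the paragraph: a bare symbol or an assignment starts a new flow, and every other expression is a verb in the current flow.

```julia
# A "data source expression" is a bare symbol or an assignment.
is_source(ex) = ex isa Symbol || (ex isa Expr && ex.head == :(=))

# Group the statements of a block into flows, each starting at a data source.
function flows(block::Expr)
    out = Vector{Vector{Any}}()
    for ex in block.args
        ex isa LineNumberNode && continue
        if is_source(ex)
            push!(out, Any[ex])
        else
            isempty(out) && error("manipulation verb appears before any data source")
            push!(out[end], ex)
        end
    end
    return out
end

block = quote
    subq = table2
        filter(y > 0)
        select(y)
    table1
        filter(y in subq)
        select(x)
end

grouped = flows(block)
```

Indentation plays no role in the grouping; only the kind of each expression does.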

@yeesian
Collaborator

yeesian commented Sep 1, 2016

A counter-argument to my proposal would be the possibility of doing the assignments, etc. outside of the macro rather than inside it, i.e.

subq = @query table2 |>
        filter(y > 0) |>
        select(y)
qry = @query table1 |>
        filter(y in subq) |>
        select(x)

or correspondingly,

subq = @query table2 begin
        filter(y > 0)
        select(y)
end
qry = @query table1 begin
        filter(y in subq)
        select(x)
end

@richardreeve

@davidagold Good point. I had never noticed that |> was in core Julia, and now I'm feeling a bit stupid. I had only tried it as:

julia> max(5, 6)
6

julia> 5 |> max(6)
ERROR: MethodError: objects of type Int64 are not callable
 in |>(::Int64, ::Int64) at ./operators.jl:345

and not thought through what the error message was. Given that is true, and looking at your followup comment, I would definitely advocate @davidagold's EDIT to EDIT^2, because the latter only makes sense in the context of the whitespace to me.

However, the problem with this version of pipes (not an R-like macro and without unix-like -X option syntax) is that presumably it becomes very hard to write functions that use pipe inputs because a separate method has to be written for every non-pipe method that lacks the first argument and returns a function instead of a result. Or am I misunderstanding this?
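
For illustration, this is what such a hand-written companion method looks like in plain Julia, where `x |> f` simply calls `f(x)`. The name `max_with` is hypothetical and made up for this sketch.

```julia
# Hypothetical curried companion to `max`: it takes the remaining argument
# and returns a function that waits for the piped value.
max_with(y) = x -> max(x, y)

a = 5 |> max_with(6)      # applies the returned closure to 5
b = 5 |> x -> max(x, 6)   # same effect with an inline anonymous function
```

Every function that should accept piped input this way needs either a companion like `max_with` or an anonymous function at the call site.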

@davidagold
Owner Author

I would definitely advocate @davidagold's EDIT to EDIT^2, because the latter only makes sense in the context of the whitespace to me.

I understand your concern, but I do rather like the look of the EDIT^2 version.

Or am I misunderstanding this?

No, you're right -- it's not ideal.

a counter-argument to my proposal will be the possibility of having the assignments/etc done outside of the macro, rather than inside the macro,

@yeesian True, though I think this may be less flexible. It's not clear to me how you'd be able to generate the single SQL statement from John's example if you make two separate macro invocations. Part of the difficulty is that there's no type information for subq in the second macro (indeed, without interpolation syntax it would be interpreted as a column name). On the other hand, if subq = table2 ... is seen by the same @query call that sees table1 ..., then it can register the LHS of subq = table2 as an alias for a subquery and act on each subsequent instance of the symbol subq accordingly. This may make a difference if the result of collecting on subq as you've defined it above doesn't fit in memory.
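
A rough sketch of the alias idea, assuming the macro receives the whole block: collect the left-hand sides of assignments, then check which later expressions mention them. `aliases` and `mentions` are hypothetical helper names, not existing functions in this package.

```julia
# Left-hand sides of top-level assignments in the block are subquery aliases.
aliases(block::Expr) =
    Set{Symbol}(ex.args[1] for ex in block.args if ex isa Expr && ex.head == :(=))

# Does an expression refer to any of the given names, at any depth?
function mentions(ex, names)
    ex isa Symbol && return ex in names
    ex isa Expr && return any(a -> mentions(a, names), ex.args)
    return false
end

block = quote
    subq = table2 |>
        filter(y > 0) |>
        select(y)
    table1 |>
        filter(y in subq) |>
        select(x)
end

names = aliases(block)
main_query = block.args[end]
uses_alias = mentions(main_query, names)
```

With this information in hand, a single macro call could treat `subq` as a subquery rather than as a column name.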

@yeesian
Collaborator

yeesian commented Sep 2, 2016

True, though I think this may be less flexible.

Which matches the intuition that there might be things we want to express in a block statement that are otherwise hard or impossible to express with the piping proposal. It remains unclear exactly what those are, though.

Part of the difficulty is that there's no type information for subq in the second macro (indeed, without interpolation syntax it would be interpreted as a column name).

Yeah, that's true. What about the case when there's interpolation syntax for it?

@davidanthoff

This is how these latest examples look in Query:

q = @from i in table1 begin
    @where i.y in @from j in table2 begin
                      @where j.y>0
                      @select j.y
                  end
    @select i.x
end

Or you can also split this into two queries:

subq = @from i in table2 begin
    @where i.y>0
    @select i.y
end

q = @from i in table1 begin
    @where i.y in subq
    @select i.x
end

I get around the type issue for the split case by having a two-pass system. The macros only syntactically transform the queries into a series of function calls, and those function calls generate an object graph that has type information in it. The end result is that the object graphs that get created by the first and second example are identical (modulo one unimportant small difference). Not sure whether something similar could be done here, but maybe that would also work for jplyr.
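
A very loose sketch of the two-pass idea, not Query's actual implementation: all type and function names below (`QueryNode`, `Source`, `FilterNode`, `SelectNode`, `source`, `where_node`, `select_node`) are hypothetical. It assumes a macro has already rewritten the query syntax into ordinary function calls, and shows only that those calls build a typed graph that comes out the same whether the query is written in one piece or split in two.

```julia
abstract type QueryNode end

struct Source{T} <: QueryNode
    data::T
end

struct FilterNode{S<:QueryNode,F} <: QueryNode
    input::S
    pred::F
end

struct SelectNode{S<:QueryNode,F} <: QueryNode
    input::S
    proj::F
end

# Plain functions that a purely syntactic macro pass could emit calls to.
source(data) = Source(data)
where_node(input::QueryNode, pred) = FilterNode(input, pred)
select_node(input::QueryNode, proj) = SelectNode(input, proj)

table2 = [(y = -1,), (y = 2,)]
positive = r -> r.y > 0
gety = r -> r.y

# Written as one nested query.
one_shot = select_node(where_node(source(table2), positive), gety)

# Written as two separate steps, with the first bound to a variable.
subq = where_node(source(table2), positive)
two_step = select_node(subq, gety)
```

Because the type parameters record the whole chain, the second step knows what `subq` is without the macro having to.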

This was referenced Sep 4, 2016

7 participants