Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to perform by on two variables? Should we auto-splat? #1935

Closed
xiaodaigh opened this issue Aug 28, 2019 · 41 comments
Closed

How to perform by on two variables? Should we auto-splat? #1935

xiaodaigh opened this issue Aug 28, 2019 · 41 comments
Labels
breaking The proposed change is breaking.
Milestone

Comments

@xiaodaigh
Copy link
Contributor

Suppose I want to compute the correlation between two variable within each group, how do I do that e.g. I want to compute the correlation between x and y within each group of grp?

See MWE

using DataFrames
df = DataFrame(grp = rand(["a","b"], 100), x= rand(100), y = rand(100))
by(df, :grp, (:x, :y)=>cor)

I though the above would be ok but only the below works. The below also doesn't work.

by(df, :grp, (:x, :y)=>((x,y))->cor(x,y))
by(df, :grp, (:x, :y)=>(x,y)->cor(x,y))
by(df, :grp, (:x, :y)=>(xy)->cor(xy[1],xy[2))
@xiaodaigh
Copy link
Contributor Author

y(df,:grp,(:x,:y)=>xy->cor(xy...))

the above works but auto-splatting is a nicer interface IMO

by(df, :grp, (:x, :y)=>cor)

@xiaodaigh
Copy link
Contributor Author

I think the auto-splat version is more intuitive because

f = (x,y)->x+y
f(1,2)

It's not call using f((1,2)), so (:x,:y) = > cor invokes the feeling that cor is a two argument function. So auto-splatting makes more sense.

@bkamins
Copy link
Member

bkamins commented Aug 28, 2019

@nalimilan was designing this API so probably he can comment best.

The problem with auto-splatting is that it loses some information (you are not able to pass column names then to the called function).
However, I agree that auto-splatting is something I typically expect and the pattern I use very often is x -> f(x...) as you have noted. Actually I almost never need to know column names.

Also if we went for this change this would be breaking.

@nalimilan
Copy link
Member

Yes, the problem is that with the current system you can both apply splatting manually, or access columns by name (or do anything with the named tuple). As @bkamins noted, if we splat arguments automatically, there's no way to retrieve the column names. Also, in some cases you don't want splatting, e.g. to call sum on several columns.

Something we could do is use All(:x, :y) to enable automatic splatting. Or conversely deprecate the current behavior and require people to use All(:x, :y) to get a named tuple. Not sure which one is best, but note that our behavior was inspired by JuliaDB, and consistency is somewhat nice.

@bkamins
Copy link
Member

bkamins commented Aug 29, 2019

A good point. We even could add Splat (although probably reusing All is good enough).

@xiaodaigh
Copy link
Contributor Author

This is more popular than Juliadb. And also which will lead to less typing? And generally more convenient to use? It will be painful, if we switch syntax now as it may break old code, but long term gain will be worth it.

@bkamins
Copy link
Member

bkamins commented Aug 29, 2019

Therefore for backward compatibility reasons and clarity I would support Splat(:x, :y) for splatting, but using All is also OK (it is not immediately clear if All would mean splatting or not, so it is better to keep backward compatibility).

@nalimilan
Copy link
Member

One consideration to have in mind is that this should be composable with Not, Between and regexes.

@bkamins
Copy link
Member

bkamins commented Sep 1, 2019

Good point - so maybe the rule should be:

  • we could allow not only Tuple but any column selector (actually Tuple is not allowed in indexing, but we could start allowing it which would not be bad in general)
  • then we can add Splat that splats anything that is passed into it - this would be groupby specific (and it would work as All so, All could be omitted if we wanted to).

How does it sound?

@nalimilan
Copy link
Member

Actually the need for something like Splat is more general than data frames (as I realized when reviewing your PRs). Maybe we should defineSplat(f) = x -> f(x...) instead? Then you'd write by(df, :key, selector => Splat(cor)), which would work with any selector. We may file an issue in Base about this.

A related issue that was raised previously (and which I somewhat confused with the present one) is that one may want to apply a function separately to each provided column. But maybe that can be fixed independently from this one, by using All(selector) .=> function.

@bkamins
Copy link
Member

bkamins commented Sep 1, 2019

selector .=> function "just works" right now, see e.g.:

julia> df = DataFrame(rand(3,4))
3×4 DataFrame
│ Row │ x1       │ x2        │ x3       │ x4        │
│     │ Float64  │ Float64   │ Float64  │ Float64   │
├─────┼──────────┼───────────┼──────────┼───────────┤
│ 1   │ 0.68207  │ 0.464943  │ 0.381817 │ 0.971913  │
│ 2   │ 0.954418 │ 0.0947486 │ 0.500607 │ 0.0992424 │
│ 3   │ 0.306492 │ 0.824458  │ 0.778946 │ 0.563767  │

julia> by(df, :x1, (:x2,:x3).=>sum)
3×3 DataFrame
│ Row │ x1       │ x2_sum    │ x3_sum   │
│     │ Float64  │ Float64   │ Float64  │
├─────┼──────────┼───────────┼──────────┤
│ 1   │ 0.68207  │ 0.464943  │ 0.381817 │
│ 2   │ 0.954418 │ 0.0947486 │ 0.500607 │
│ 3   │ 0.306492 │ 0.824458  │ 0.778946 │

but I have not thought about it earlier - so this is a nice idea.

Splat(f) = x -> f(x...) is also excellent - but where should it live (probably some more general place than DataFrames.jl - but we could add it here and later move to a more general place). A minor thing is that it should probably be splat then, as it is just a function (not a type).

Actually given your comments the only thing that needs to be added is this splat function.

@nalimilan
Copy link
Member

selector .=> function "just works" right now, see e.g.:

Only for simple cases, but not (as we have discussed somewhere else) for All, Not or regexes.

Splat/splat could indeed live in Base or in a fundamental package. @xiaodaigh Would you file an issue in Julia to discuss this?

@bkamins
Copy link
Member

bkamins commented Sep 2, 2019

Only for simple cases, but not (as we have discussed somewhere else) for All, Not or regexes.

True, but I guess this is not a crucial functionality (still - we can add it if you feel it would be valuable).

@nalimilan
Copy link
Member

Well it would be nice e.g. to be able to get the mean of all variables matching a pattern or all variables in a range. Not essential, but something that it would be nice to be able to do at some point.

@bkamins
Copy link
Member

bkamins commented Sep 2, 2019

Agreed (the only thing to think of is column name generation pattern)

@bkamins bkamins added the non-breaking The proposed change is not breaking label Feb 12, 2020
@bkamins
Copy link
Member

bkamins commented Feb 12, 2020

@nalimilan - is there anything left to decide here (except for Splat which is a nice idea, but I understand you would prefer to have it in base?)

@nalimilan
Copy link
Member

It would be nice to raise the Splat idea on Julia's GitHub or Discourse. If it's rejected we could add it here (or in another package).

@bkamins
Copy link
Member

bkamins commented Feb 13, 2020

https://discourse.julialang.org/t/consider-adding-splat-and-slurp-functions/34606

@dgkf
Copy link

dgkf commented Feb 14, 2020

How does this extend to situations where you want to pass, for example, the first two arguments, but also some fixed arguments?

using DataFrames
df = DataFrame(grp = rand(["a","b"], 100), x= rand(100), y = rand(100))
by(df, :grp, (:x, :y) => (x, y) -> cov(x, y; corrected = false))

My personal preference would be to use .=> for application across many columns simultaneously, or even multiple sets of columns.

# applying a unary function over multiple columns
by(df, :grp, (:x, :y) .=> maximum)

# applying a binary function over a single pair of columns
by(df, :grp, (:x, :y) => (x, y) -> cov(x, y; corrected = false))

# applying a binary function over multiple pairs of columns
by(df, :grp, ((:x, :y), (:a, :b)) .=> (x, y) -> cov(x, y; corrected = false))

@bkamins
Copy link
Member

bkamins commented Feb 14, 2020

So currently you would have to write (listing those of your examples that would need a change):

by(df, :grp, (:x, :y) => ((x, y),) -> cov(x, y; corrected = false))
by(df, :grp, ((:x, :y), (:a, :b)) .=> ((x, y),) -> cov(x, y; corrected = false))

and with splat you could write

by(df, :grp, (:x, :y) => splat((x, y) -> cov(x, y; corrected = false)))
by(df, :grp, ((:x, :y), (:a, :b)) .=> splat((x, y) -> cov(x, y; corrected = false)))

if you did not want to pass a kwarg it would be even simpler (and this is the key intended use):

by(df, :grp, (:x, :y) => splat(cov))
by(df, :grp, ((:x, :y), (:a, :b)) .=> splat(cov))

so this is only a syntactic sugar

@bkamins
Copy link
Member

bkamins commented Feb 14, 2020

@nalimilan as a separate comment, having the src => fun => target syntax would be super nice here, as then you could write something like:

by(df, :grp, ((:x, :y), (:a, :b)) .=> splat(cov) .=> (:res1, :res2))

to give names for the results. Currently, with kwarg only syntax I think it is not possible.

@dgkf
Copy link

dgkf commented Feb 14, 2020

I see - I'm starting to put these pieces together and I think I'd prefer the "auto-splatting". Conceptually it seems easier to me. If you ever need the opposite of the default behavior, it's also much cleaner to pattern match on (args...) -> than ((x, y),) -> .

The splat function seems weird to me in general. For a language with such rich native splatting syntax, I have trouble seeing the need for an obscure and inflexible higher order function. If you're set on this style, you could easily add a bit more flexibility by defining splat as

splat(f, args...; kwargs...) = x -> f(x..., args...; kwargs...)

so that you can effectively partially evaluate f with fixed arguments before receiving x. Then you could do:

# "manual-splat" (having to splat yourself)... this term is killing me >.<
by(df, :grp, ((:x, :y), (:a, :b)) .=> splat(cov; corrected = false) .=> (:res1, :res2))

# "auto-splat"
# cons: more characters, pros: more obvious/readable
by(df, :grp, ((:x, :y), (:a, :b)) .=> (x, y) -> cov(x, y, corrected = false) .=> (:res1, :res2))

@bkamins
Copy link
Member

bkamins commented Feb 14, 2020

I think I'd prefer the "auto-splatting"

I agree that this is more intuitive, however, this approach has two problems:

  1. (major) this change would be breaking in a major way (still if we wanted to do this this is the last moment to make this change before we tag 1.0, also it would influence how Add transformation and renaming to select and select! #2080 is implemented)
  2. (minor) the reason why I understand @nalimilan implemented it without splatting is that the reverse operation is not loseless as (args...) what yo have proposed drops column names. A typical intended pattern of usage of the current syntax is:
by(df, :grp, [:a, :b] => x -> x.a + x.b)

Still I personally agree that auto-splatting is more intuitive and retaining column names is not that crucial, especially that when we pass a single column we drop its name.
@nalimilan - what do you think?

Given this proposal I am marking this with 1.0 and "breaking" as we need to decide it soon.

@bkamins bkamins added breaking The proposed change is breaking. and removed non-breaking The proposed change is not breaking labels Feb 14, 2020
@bkamins bkamins added this to the 1.0 milestone Feb 14, 2020
@dgkf
Copy link

dgkf commented Feb 14, 2020

the reverse operation is not loseless as (args...) [...] drops column names
by(df, :grp, [:a, :b] => x -> x.a + x.b)

Thanks, I didn't realize this was also a target behavior. That makes sense.

My inclination is that this should be a special case - perhaps only treated this way if a Function (or Pair beginning with a Function) is passed, as the column selection seems syntactically redundant in this case (perhaps it's to avoid large tuple allocations?).

# `Function` or `Pair{Function,}`
by(df, :grp, (x -> x.a + x.b) => :z)

# `Pair{<:Any,Function}` or `Pair{<:Any,Pair{Function,}}`
by(df, :grp, (:a, :b) => ((xa, xb) -> xa + xb) => :z)

Then whenever a column selection is provided, always use the auto-splatting. Would something like that break other desired behaviors? Another concern is that this style would also make it difficult to ever support Functions as selectors without wrapping in some specific types.

@bkamins
Copy link
Member

bkamins commented Feb 14, 2020

A few comments.

Your both statements are invalid - you have to wrap a function in ( ... ) as otherwise the operator precedence makes it wrong. So you have to write e.g. by(df, :grp, (x -> x.a + x.b) => :z)

The form (x -> x.a + x.b) => :z is something we intend to add (currently it does not work); but then what is passed to a function will be a SubDataFrame not a NamedTuple (but this is a minimal difference). I think we would leave this behavior (for wide DataFrames we do not want to run into compilation overhead)

Then whenever a column selection is provided, always use the auto-splatting.

As I have said - I agree that auto-splatting is more natural. We just have to wait for @nalimilan to comment as he might have some strong arguments in favor of NamedTuples (he has spent much more time on designing this than all other people taken together 😄)

Would something like that break other desired behaviors?

It would be breaking, but I would say that (:a, :b) => x -> x.a + x.b is less convenient/intuitive than (:a, :b) => (a, b) -> a + b also when we consider broadcasting of .=> then auto-splatting is much more natural as in [(:a, :b), (:c, :d)] .=> do_something the column names would change between calls to do_something so passing the column names in a NamedTuple is not that much useful.

In summary:

  • I am in favor of auto-splatting (and thank you for raising this point)
  • when no columns are passed then SubDataFrame is passed, this is intended, and I would leave it as is
  • this move would be breaking so we should hear what @nalimilan thinks about it

@dgkf
Copy link

dgkf commented Feb 14, 2020

Your both statements are invalid.

Thanks. Updated for clarity. I hope the intent was still clear as they were just for illustrative purposes anyways.

@bkamins
Copy link
Member

bkamins commented Feb 14, 2020

Sure - I just wanted to make this comment as this is a typical error (and the problem is that the unintended semantics is silently accepted, so you can learn about the problem much later in the code execution than you make it) I was not sure you are aware of.

And just to add. Maybe we will reject the proposal to auto-splat just because it is breaking. Then your proposal of enhanced splat is very nice. I would just discuss whether it should be:

splat(f, args...; kwargs...) = x -> f(x..., args...; kwargs...)

or

splat(f, args...; kwargs...) = x -> f(args..., x...; kwargs...)

as the latter (making args... first arguments of f) is more similar to standard currying.

@bkamins
Copy link
Member

bkamins commented Feb 15, 2020

See https://stackoverflow.com/questions/60237725/julia-dataframe-same-column-in-a-multiple-input-function-with-by/ for a related question from today.

Actually it is an argument for auto-splatting, as in the question asked there it clearly makes sense in general to run a code like (I am using current syntax):

by(df, :grp, result= (col1, col2) => ((x, y),) -> cov(x, y))

where col1 and col2 are variables holding column names. Such generic code should work even if col1==col2.

With auto-splatting it would become:

by(df, :grp, result= (col1, col2) => cov)

and it should not be a problem that col1 and col2 are the same names (internally we would not care about column names - only about the vectors stored in them).

@dgkf
Copy link

dgkf commented Feb 15, 2020

That's a really interesting use case identified in that stackoverflow thread. For me it makes it clear that what I've been referring to as a selector - the first part of the Pair(s) - is serving two distinct purposes.

  1. It can be used to select columns. This could be things like All, Not, or a Regex.
  2. To build Tuples (or SubDataFrames) of arguments for transformations.

It's somewhat obvious in retrospect, but drawing a line between these has helped me conceptualize it. I suppose this is where the Col is needed to differentiate tuples that are meant to collapse multiple selectors (Col(Not([:a]), Not([:b]))) from tuples that are meant to be arguments to a multivariate transformation ((Not([:a]), Not([:b])))

More generally, this type of dual-purpose starts to make me a bit worried about the robustness of such a syntax. A lot of issues with relational columns like this, in my experience, are best solved by first reshaping to a more normalized data model and regrouping the data. At that point you don't need extremely flexible column selection syntax - although this might by my dplyr preconceptions trickling in.

@nalimilan
Copy link
Member

So yes as @bkamins said the main reason for passing a named tuple rather than splatting is that it preserves the names: so you can splat a named tuple, but conversely you cannot construct a named tuple if we pass splatted arguments. @bkamins and @piever also liked named tuples better because they are a kind of small table which can easily be manipulated. Finally, that's inspired by JuliaDB.

But indeed using named tuples may not be that common and we could change the default, introducing another syntax to request a named tuple (e.g. something like by(df, :key, Named(:x, y) => f)).

I'd say that which approach is the most convenient depends on the use case. In particular, splatting only makes sense when you write column names explicitly by hand: if you use a more complex selector, you're unlikely to know the exact number and order of variables, so you wouldn't pass them as positional argument. As examples of conflicting goals, if we splatted by default, by(df, key, (:x, :y) => cov) would indeed be very nice, but we would lose things like select(df, (:x, :y) => ByRow(sum)) and even more select(df, r("x") => ByRow(sum)) (possible with the current design, with the new select improvements).

At some point I had played with the idea of retrieving the arguments names in the Function object, so that e.g. by(df, :key, (x, y) -> cov) would be equivalent to by(df, :key, (:x, :y) => r -> cov(r.x, r.y)). But people weren't too enthusiastic about that.

Overall, I think one conclusion is that DataFrames should provide a powerful and flexible syntax, but not necessarily super convenient, and that DataFramesMeta should build on top of this. And DataFramesMeta doesn't have this problem, as you can specify to which argument each column corresponds without thinking about splatting.

More generally, this type of dual-purpose starts to make me a bit worried about the robustness of such a syntax. A lot of issues with relational columns like this, in my experience, are best solved by first reshaping to a more normalized data model and regrouping the data. At that point you don't need extremely flexible column selection syntax - although this might by my dplyr preconceptions trickling in.

What do you mean exactly? Can you develop?

@bkamins
Copy link
Member

bkamins commented Feb 15, 2020

Good points.

So maybe we should change nothing in what we have, but add the following syntax:

by(df, :grp, Splat(:x, :y) => cov)

Then Splat would be applied to column names or column indices and as opposed to standard selectors it would allow passing duplicates. Then if Splat is used on LHS of => then on RHS we pass data to the function using splatting. Having Splat on LHS is more flexible as it allows for duplicates (which is OK when we splat). What do you think?

@nalimilan
Copy link
Member

It would be interesting to collect examples of by, select and filter used with multiple input columns. That would help deciding which pattern is the more common to choose the default. Maybe we could ask on #data.

@bkamins
Copy link
Member

bkamins commented Feb 16, 2020

OK - I have asked. @nalimilan, @xiaodaigh, @dgkf - can you please comment there so that we have a single place to discuss?

@dgkf
Copy link

dgkf commented Feb 16, 2020

can you please comment there

Where exactly do you want to move the conversation? To #2080?

@bkamins
Copy link
Member

bkamins commented Feb 16, 2020

To Slack as @nalimilan suggested. See the post https://julialang.slack.com/archives/C674VR0HH/p1581873612348200

@dgkf
Copy link

dgkf commented Feb 16, 2020

Commenting here because I don't want to derail the slack conversation further than I already have

@nalimilan #1935

More generally, this type of dual-purpose starts to make me a bit worried about the robustness of such a syntax. A lot of issues with relational columns like this, in my experience, are best solved by first reshaping to a more normalized data model and regrouping the data. At that point you don't need extremely flexible column selection syntax - although this might by my dplyr preconceptions trickling in.

What do you mean exactly? Can you develop?

Expanding on both points

"this type of dual-purpose starts to make me a bit worried about the robustness of such a syntax"

It seems like these two behaviors (column selection and argument building) are difficult to build simple syntactic rules around, which starts to make them very difficult to reason about. Initially I thought a simple rule might be "if a symbol is paired to a function, apply the function to the column data. If a tuple of columns is paired with a function auto-splat the tuple as arguments," but this logic breaks down with the additional selectors (All, Not, Regex, etc). Just trying to build up a few use cases:

# simplest case, perform a function on a specific column
:x => f

# apply a function over multiple columns
(:x, :y) .=> f  

# apply a function over multiple columns by programmatic criteria
# (?) when a regular expression is passed is it assumed to behave 
#     as a collection of columns or a single column - should multiple
#     columns be splatted?
r"_id$" .=> f

# applying a function over pairwise columns by programmatic criteria
# (?) do tuples of meta-selectors collapse into a Set (equivalent to a 
#     single `All()`), or do they form a tuple of `All()` results, or do 
#     they form pairwise permutations of tuples, or zipped tuples?
(All(), All()) .=> f

I don't think there's any way around this ambiguity without introducing some new Types that indicate how the column names translate into the various meanings of these columns as you won't have access to the DataFrame to collapse down into a set of columns until you're into the body of the by function. Because of all of the ways these can be interpreted, it might be necessary to have something like Splat(:x, :y), BroadcastSplatCols(:col_id, Not([:col_id])) PermuteSplatCols(All(), All())), ZipSplatCols(r"_valueA$", r"_valueB$")). Maybe there's another solution, but I can't think of one given the way that these selectors are implemented. In the absence of something better, this starts to feel like there must be a fundamental design issue when it's necessary to embed so many data operations as Types.

"... best solved by first reshaping to a more normalized data model and regrouping the data"

When I have problems that require operation across multiple columns at once, I usually do these steps. Granted, I think their may be performance concerns of constantly reshaping very large data, but usually I'm not working in that performance-limited domain. I'll use an example where you have average various medical measures for males and females in different countries.

country height_m height_f weight_m weight_f
The Netherlands 182.5 167.5 82.7 67.8
Germany 179.9 162.8 82.4 67.5
... ... ... ... ...

Typically I would first reshape the data to be

country measure value_m value_f
The Netherlands height 182.5 167.5
The Netherlands weight 82.7 67.8
... ... ... ...

Then I can group by [:country, :measure] and perform my operation across value_m and value_f without having to worry about specifying pairwise columns. At least this is how I would make things work in the dplyr-style of doing things.

@bkamins
Copy link
Member

bkamins commented Feb 16, 2020

I think the Splat type would exactly disambiguate these two approaches and that is why I proposed to stay non-breaking and add Splat (as in Splat you can have duplicate columns etc. as you do not care about their names).

@nalimilan
Copy link
Member

Thanks for developing. So yeah, a combination of what we have now and splatting would probably cover these cases. Things like Splat(:x, r"y") calling f(::AbstractVector, ys::NamedTuple{<:Any, <:AbstractVector}) could also be allowed at some point if we wanted, and it could even be combined with ByRow.

@bkamins
Copy link
Member

bkamins commented Feb 17, 2020

@nalimilan So the decision is:

  • leave NamedTuple as default
  • add Splat in the future

Please confirm so that we can move forward with implementation.

@nalimilan
Copy link
Member

That's fine with me.

@bkamins
Copy link
Member

bkamins commented Feb 17, 2020

OK - so I am closing this in favor of #2121.

Please reopen if you feel this needs to be further discussed.

@bkamins bkamins closed this as completed Feb 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking The proposed change is breaking.
Projects
None yet
Development

No branches or pull requests

4 participants