How to perform by on two variables? Should we auto-splat? #1935
Comments
by(df, :grp, (:x, :y) => xy -> cor(xy...))
The above works, but auto-splatting is a nicer interface IMO:
by(df, :grp, (:x, :y) => cor) |
I think the auto-splat version is more intuitive because
f = (x, y) -> x + y
f(1, 2)
It's not called using |
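To make the distinction concrete, here is a minimal plain-Julia sketch (no DataFrames involved; the names `t`, `positional`, `splatted` and `tuple_call_fails` are only illustrative) that calls a two-argument function positionally, with a manual splat, and with an unsplatted tuple:

```julia
# A two-argument function, as in the comment above.
f = (x, y) -> x + y

t = (1, 2)

positional = f(1, 2)   # arguments passed one by one
splatted = f(t...)     # the tuple is splatted manually into two arguments

# Passing the tuple itself is a one-argument call, which `f` does not accept.
tuple_call_fails = try
    f(t)
    false
catch err
    err isa MethodError
end
```

Auto-splatting, as discussed in this thread, would mean that the grouping function performs the `t...` step on the user's behalf.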
@nalimilan was designing this API so probably he can comment best. The problem with auto-splatting is that it loses some information (you are then not able to pass column names to the called function). Also, if we went for this change, it would be breaking. |
Yes, the problem is that with the current system you can both apply splatting manually, or access columns by name (or do anything with the named tuple). As @bkamins noted, if we splat arguments automatically, there's no way to retrieve the column names. Also, in some cases you don't want splatting, e.g. to call Something we could do is use |
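The asymmetry described above can be sketched in plain Julia, assuming a named tuple `cols` stands in for the columns of one group (the names `cols`, `by_name`, `by_splat` and `received` are made up for this illustration):

```julia
using Statistics

# Illustrative stand-in for the columns of one group, passed as a named tuple.
cols = (x = [1.0, 2.0, 3.0], y = [2.0, 4.0, 7.0])

# With a named tuple the callee can pick columns by name ...
by_name = nt -> cor(nt.x, nt.y)
# ... or splat the values manually.
by_splat = nt -> cor(nt...)

# A function that receives splatted arguments only sees a plain tuple of values.
received(args...) = args
```

Here `keys(cols)` still gives the column names, while `received(cols...)` is an ordinary `Tuple` from which the names cannot be recovered.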
A good point. We even could add |
This is more popular than JuliaDB. Also, which will lead to less typing and be generally more convenient to use? Switching syntax now will be painful, as it may break old code, but the long-term gain will be worth it. |
Therefore for backward compatibility reasons and clarity I would support |
One consideration to have in mind is that this should be composable with |
Good point - so maybe the rule should be:
How does it sound? |
Actually the need for something like A related issue that was raised previously (and which I somewhat confused with the present one) is that one may want to apply a function separately to each provided column. But maybe that can be fixed independently from this one, by using |
but I have not thought about it earlier - so this is a nice idea.
Actually given your comments the only thing that needs to be added is this |
Only for simple cases, but not (as we have discussed somewhere else) for
|
True, but I guess this is not a crucial functionality (still - we can add it if you feel it would be valuable). |
Well it would be nice e.g. to be able to get the mean of all variables matching a pattern or all variables in a range. Not essential, but something that it would be nice to be able to do at some point. |
Agreed (the only thing to think of is column name generation pattern) |
@nalimilan - is there anything left to decide here (except for |
It would be nice to raise the |
How does this extend to situations where you want to pass, for example, the first two arguments, but also some fixed arguments?
using DataFrames, Statistics
df = DataFrame(grp = rand(["a","b"], 100), x = rand(100), y = rand(100))
by(df, :grp, (:x, :y) => (x, y) -> cov(x, y; corrected = false))
My personal preference would be to use
# applying a unary function over multiple columns
by(df, :grp, (:x, :y) .=> maximum)
# applying a binary function over a single pair of columns
by(df, :grp, (:x, :y) => (x, y) -> cov(x, y; corrected = false))
# applying a binary function over multiple pairs of columns
by(df, :grp, ((:x, :y), (:a, :b)) .=> (x, y) -> cov(x, y; corrected = false)) |
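As a rough illustration of what the broadcasted forms above expand to, independent of how `by` would interpret them, the following plain-Julia sketch builds the pairs on their own. The variable names and the use of `maximum` as a placeholder function are assumptions made for the example:

```julia
# `f` is a placeholder for any function; only the structure of the pairs matters here.
f = maximum

# One pair per column.
unary = (:x, :y) .=> f

# One pair per tuple of columns: broadcasting only iterates the outer tuple.
pairs_of_cols = ((:x, :y), (:a, :b)) .=> f

# Chained `.=>` is right-associative, so each column tuple is matched
# with its own `f => name` pair.
named = ((:x, :y), (:a, :b)) .=> f .=> (:res1, :res2)
```

So `unary` is `(:x => maximum, :y => maximum)`, and `named` pairs `(:x, :y)` with `maximum => :res1` and `(:a, :b)` with `maximum => :res2`.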
So currently you would have to write (listing those of your examples that would need a change):
and with
if you did not want to pass a kwarg it would be even simpler (and this is the key intended use):
so this is only a syntactic sugar |
@nalimilan as a separate comment, having the
to give names for the results. Currently, with kwarg only syntax I think it is not possible. |
I see - I'm starting to put these pieces together and I think I'd prefer the "auto-splatting". Conceptually it seems easier to me. If you ever need the opposite of the default behavior, it's also much cleaner to pattern match on The
splat(f, args...; kwargs...) = x -> f(x..., args...; kwargs...)
so that you can effectively partially evaluate
# "manual-splat" (having to splat yourself)... this term is killing me >.<
by(df, :grp, ((:x, :y), (:a, :b)) .=> splat(cov; corrected = false) .=> (:res1, :res2))
# "auto-splat"
# cons: more characters, pros: more obvious/readable
by(df, :grp, ((:x, :y), (:a, :b)) .=> ((x, y) -> cov(x, y, corrected = false)) .=> (:res1, :res2)) |
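The helper proposed in this comment can be tried outside of `by`. The sketch below uses the hypothetical name `splat_with` (chosen here to avoid clashing with Base's own `splat`) and a made-up named tuple `cols` in place of real grouped columns:

```julia
using Statistics

# Hypothetical helper, following the one-line definition in the comment above:
# returns a function that splats its single argument into `f`,
# followed by any fixed positional and keyword arguments.
splat_with(f, args...; kwargs...) = x -> f(x..., args...; kwargs...)

# Partially apply the keyword argument once.
pop_cov = splat_with(cov; corrected = false)

# Illustrative stand-in for two columns of one group.
cols = (x = [1.0, 2.0, 3.0, 4.0], y = [2.0, 4.0, 6.0, 8.0])

result = pop_cov(cols)  # same as cov(cols.x, cols.y; corrected = false)
```

This is only the "manual-splat" side of the comparison; whether such a helper is preferable to an explicit `(x, y) -> ...` lambda is exactly what the thread is debating.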
I agree that this is more intuitive, however, this approach has two problems:
Still, I personally agree that auto-splatting is more intuitive and that retaining column names is not that crucial, especially since we drop the name when we pass a single column. Given this proposal I am marking this with 1.0 and "breaking" as we need to decide it soon. |
Thanks, I didn't realize this was also a target behavior. That makes sense. My inclination is that this should be a special case - perhaps only treated this way if a
# `Function` or `Pair{Function,}`
by(df, :grp, (x -> x.a + x.b) => :z)
# `Pair{<:Any,Function}` or `Pair{<:Any,Pair{Function,}}`
by(df, :grp, (:a, :b) => ((xa, xb) -> xa + xb) => :z)
Then whenever a column selection is provided, always use the auto-splatting. Would something like that break other desired behaviors? Another concern is that this style would also make it difficult to ever support |
A few comments. Both of your statements are invalid - you have to wrap a function in The form
As I have said - I agree that auto-splatting is more natural. We just have to wait for @nalimilan to comment as he might have some strong arguments in favor of
It would be breaking, but I would say that In summary:
|
Thanks. Updated for clarity. I hope the intent was still clear as they were just for illustrative purposes anyways. |
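For readers who have not hit this before, the parsing pitfall behind the remark about wrapping the function (assumed here to mean wrapping it in parentheses, since the inline code is missing from this copy) can be reproduced in plain Julia. The names `unwrapped` and `wrapped` are illustrative:

```julia
# Without parentheses the body of the anonymous function extends to the end,
# so `=> :z` becomes part of what the function returns.
unwrapped = :a => x -> x + 1 => :z

# With parentheses the function ends where intended and `:z` is a separate target.
wrapped = :a => (x -> x + 1) => :z
```

`unwrapped.second` is a function returning the pair `2 => :z` when called with `1`, whereas `wrapped.second` is itself a pair of a function and `:z`. Both expressions are valid Julia, which is why the mistake is accepted silently.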
Sure - I just wanted to make this comment because this is a typical error that I was not sure you were aware of (the problem is that the unintended semantics is silently accepted, so you may learn about the problem much later in the code execution than where you made it). One more thing: maybe we will reject the proposal to auto-splat just because it is breaking. Then your proposal of enhanced
or
as the latter (making |
See https://stackoverflow.com/questions/60237725/julia-dataframe-same-column-in-a-multiple-input-function-with-by/ for a related question from today. Actually it is an argument for auto-splatting, as in the question asked there it clearly makes sense in general to run a code like (I am using current syntax):
where With auto-splatting it would become:
and it should not be a problem that |
That's a really interesting use case identified in that stackoverflow thread. For me it makes it clear that what I've been referring to as a selector - the first part of the Pair(s) - is serving two distinct purposes.
It's somewhat obvious in retrospect, but drawing a line between these has helped me conceptualize it. I suppose this is where the More generally, this type of dual-purpose starts to make me a bit worried about the robustness of such a syntax. A lot of issues with relational columns like this, in my experience, are best solved by first reshaping to a more normalized data model and regrouping the data. At that point you don't need extremely flexible column selection syntax - although this might be my |
So yes, as @bkamins said, the main reason for passing a named tuple rather than splatting is that it preserves the names: you can splat a named tuple, but conversely you cannot construct a named tuple if we pass splatted arguments. @bkamins and @piever also liked named tuples better because they are a kind of small table which can easily be manipulated. Finally, that's inspired by JuliaDB. But indeed using named tuples may not be that common and we could change the default, introducing another syntax to request a named tuple (e.g. something like I'd say that which approach is the most convenient depends on the use case. In particular, splatting only makes sense when you write column names explicitly by hand: if you use a more complex selector, you're unlikely to know the exact number and order of variables, so you wouldn't pass them as positional arguments. As examples of conflicting goals, if we splatted by default, At some point I had played with the idea of retrieving the argument names in the Overall, I think one conclusion is that DataFrames should provide a powerful and flexible syntax, but not necessarily a super convenient one, and that DataFramesMeta should build on top of this. And DataFramesMeta doesn't have this problem, as you can specify to which argument each column corresponds without thinking about splatting.
What do you mean exactly? Can you develop? |
Good points. So maybe we should change nothing in what we have, but add the following syntax:
Then |
It would be interesting to collect examples of |
OK - I have asked. @nalimilan, @xiaodaigh, @dgkf - can you please comment there so that we have a single place to discuss? |
Where exactly do you want to move the conversation? To #2080? |
To Slack as @nalimilan suggested. See the post https://julialang.slack.com/archives/C674VR0HH/p1581873612348200 |
Commenting here because I don't want to derail the slack conversation further than I already have
Expanding on both points
"this type of dual-purpose starts to make me a bit worried about the robustness of such a syntax"
It seems like these two behaviors (column selection and argument building) are difficult to build simple syntactic rules around, which starts to make them very difficult to reason about. Initially I thought a simple rule might be "if a symbol is paired to a function, apply the function to the column data. If a tuple of columns is paired with a function auto-splat the tuple as arguments," but this logic breaks down with the additional selectors (
# simplest case, perform a function on a specific column
:x => f
# apply a function over multiple columns
(:x, :y) .=> f
# apply a function over multiple columns by programmatic criteria
# (?) when a regular expression is passed is it assumed to behave
# as a collection of columns or a single column - should multiple
# columns be splatted?
r"_id$" .=> f
# applying a function over pairwise columns by programmatic criteria
# (?) do tuples of meta-selectors collapse into a Set (equivalent to a
# single `All()`), or do they form a tuple of `All()` results, or do
# they form pairwise permutations of tuples, or zipped tuples?
(All(), All()) .=> f
I don't think there's any way around this ambiguity without introducing some new Types that indicate how the column names translate into the various meanings of these columns as you won't have access to the
"... best solved by first reshaping to a more normalized data model and regrouping the data"
When I have problems that require operation across multiple columns at once, I usually do these steps. Granted, I think there may be performance concerns with constantly reshaping very large data, but usually I'm not working in that performance-limited domain. I'll use an example where you have averages of various medical measures for males and females in different countries.
Typically I would first reshape the data to be
Then I can group by |
I think the |
Thanks for developing. So yeah, a combination of what we have now and splatting would probably cover these cases. Things like |
@nalimilan So the decision is:
Please confirm so that we can move forward with implementation. |
That's fine with me. |
OK - so I am closing this in favor of #2121. Please reopen if you feel this needs to be further discussed. |
Suppose I want to compute the correlation between two variables within each group. How do I do that? E.g. I want to compute the correlation between x and y within each group of grp. See MWE
I thought the above would be OK, but only the below works. The below also doesn't work.