Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential confusion with subset #2740

Closed
matthieugomez opened this issue Apr 28, 2021 · 21 comments · Fixed by #2744 or #3032
Closed

Potential confusion with subset #2740

matthieugomez opened this issue Apr 28, 2021 · 21 comments · Fixed by #2744 or #3032
Labels
breaking The proposed change is breaking. decision question
Milestone

Comments

@matthieugomez
Copy link
Contributor

matthieugomez commented Apr 28, 2021

subset acts on vectors by default. This can introduce some confusion if users forget to use ByRow:

using DataFrames
df = DataFrame(x = [0, 1])
subset(df, :x => ==(0))
# 0×1 DataFrame
df2 = DataFrame(x = [0, missing])
subset(df2, :x => ismissing)
# 0×1 DataFrame

This happens because == and ismissing are applied to vectors

df.x == 0
# false
ismissing(df2.x)
# false

Unfortunately, this is very error prone. I was wondering whether something should be done about it.

@matthieugomez matthieugomez changed the title Potential confusiong with subset Potential confusion with subset Apr 28, 2021
@matthieugomez matthieugomez changed the title Potential confusion with subset Potential confusion with subset and == Apr 28, 2021
@matthieugomez matthieugomez changed the title Potential confusion with subset and == Potential confusion with subset Apr 28, 2021
@bkamins
Copy link
Member

bkamins commented Apr 28, 2021

We could add a check if the predicate returns AbstractVector{Bool}. This would be mildly breaking, but as you note - at least not error prone. @nalimilan - what do you think?

@bkamins bkamins added this to the patch milestone Apr 28, 2021
@bkamins bkamins added breaking The proposed change is breaking. decision labels Apr 28, 2021
@matthieugomez
Copy link
Contributor Author

matthieugomez commented Apr 28, 2021

@bkamins For this fix, the downside would be that one can no longer use subset to keep groups based on summary statistics right? e.g.

using DataFrames
df = DataFrame(id = [0, 0, 1, 1], x = [-1, 1, 3, 4])
subset(groupby(df, :id), :x => (x -> sum(x) > 0))

(not saying it's a bad idea, just trying to understand the tradeoff)

@bkamins
Copy link
Member

bkamins commented Apr 28, 2021

Yes - that would be the tradeoff if we wanted to be consistent. We could theoretically apply this rule only to data frames, but I am not sure it is a good idea. Note that filter would still allow you to filter whole groups, so maybe it is OK to disallow it?

@matthieugomez
Copy link
Contributor Author

yes, i think it'd be ok to disallow it. Also cc-ing @pdeffebach

@pdeffebach
Copy link
Contributor

Yes definitely!

I was not aware that subset could filter groups, but this seems like overly-context-dependen behavior.

@bkamins
Copy link
Member

bkamins commented Apr 28, 2021

OK. It will not be super simple so I will do it once we resolve performance issues in 1.0.1 release.
Essentially we need to rewrite function calls that are not ByRow from fun into (x...) -> (res = fun(x...); @assert res isa AbstractVector{Bool}; return res) (but with a nicer message. this should not affect performance significantly but I need to make a wrapper like ByRow for it to avoid compilation latency). And docs need updating.

@pdeffebach
Copy link
Contributor

yes. But you need to account for BitVectors as well. I found that out while implementing @where.

@bkamins
Copy link
Member

bkamins commented Apr 28, 2021

julia> BitVector <: AbstractVector{Bool}
true

@ericphanson
Copy link
Contributor

Would it make sense to have a ByGroup wrapper so you could do

using DataFrames
df = DataFrame(id = [0, 0, 1, 1], x = [-1, 1, 3, 4])
subset(groupby(df, :id), :x => ByGroup(x -> sum(x) > 0))

in order to filter groups? I think though that would kind of imply that returning a vector from subset for a GroupedDataFrame would mean which groups to filter but currently it means which rows to filter from each group (I think).

@bkamins
Copy link
Member

bkamins commented Apr 29, 2021

This is what filter currently does:

filter(:x => x -> sum(x) > 0, groupby(df, :id))

I suggested above to clearly differentiate subset and filter behavior in the documentation (as it turns out in the end that they are both useful).

@bkamins
Copy link
Member

bkamins commented May 2, 2021

see #2744 for a PR implementing this.

A careful look and maybe additional test proposals would be welcome (in particular: the old behavior was tested in the previous tests - so if you could kindly have a look at tests now and judge if they are doing what we want would be great).

@bkamins bkamins reopened this Feb 27, 2022
@bkamins bkamins modified the milestones: patch, 1.4 Feb 27, 2022
@bkamins
Copy link
Member

bkamins commented Feb 27, 2022

I have reopened this issue as it is constantly raised by users. What @matthieugomez reported is one side of the situation. The other side is that by disallowing scalars we do not have an easy way to filter whole groups. See e.g. https://stackoverflow.com/questions/71286257/filter-grouped-dataframe-in-julia for a recent discussion.

Let us discuss it again before mating a 1.4 release.

CC @pdeffebach @nalimilan

@bkamins
Copy link
Member

bkamins commented Feb 27, 2022

maybe we should add a kwarg that would allow scalars i.e. we would have broadcast::Bool=false by default but if you pass true you allow pseudo-broadcasting of scalars?

@nalimilan
Copy link
Member

That's a possibility, though it's not super user-friendly and not consistent with e.g. select. Maybe requiring AbstractVector was too strict? If we now consider that filter is not a convenient enough function given that it doesn't fit well with @chain blocks, and recommend users to rely on subset to filter groups, it could make sense to broadcast scalars by default.

Do we have evidence that it creates lots of confusion? The examples provided in the description seem legitimate, but they give so obviously incorrect results that at least people will notice that the syntax doesn't do what they expected.

OTOH if we keep disallowing scalar results by default but print an explicit error advising to use broadcast=true maybe that's user-friendly enough.

@matthieugomez
Copy link
Contributor Author

matthieugomez commented Mar 1, 2022 via email

@nalimilan
Copy link
Member

Warnings are not really a solution as they are annoying/confusing when you really intended to write what you wrote. So we would need a way to turn them off, in which case throwing an error would be better.

@bkamins
Copy link
Member

bkamins commented Mar 1, 2022

I agree we should not print warnings.

The crucial question is what @nalimilan asked:

Do we have evidence that it creates lots of confusion?

i.e. I know someone can make an error but how risky is this? I.e. how high is the risk that the user writes an incorrect condition and does not notice the problem. In particular, it mostly applies to ==, isequal, === and ismissing only in practical scenarios. All other cases would error. Maybe we should add an example in the docstring warning users about such cases and that would be enough?

@matthieugomez
Copy link
Contributor Author

matthieugomez commented Mar 5, 2022

What about allowing it for GroupedDataFrame but not for DataFrame (if it’s doable)?

@bkamins
Copy link
Member

bkamins commented Mar 5, 2022

It is technically possible, but I this would break one of the most important contracts of DataFrames.jl that operations on DataFrame work the same as operations on GroupedDataFrame. I think it is important to keep it as otherwise we will get a ton of complaints about such inconsistencies.

@bkamins
Copy link
Member

bkamins commented Mar 12, 2022

Can interested people vote under this post so that I can move forward with 1.4 release? Thank you!
👍 - make subset consistent with other functions and make effort in the documentation to explain the situation
👎 - keep what we have now (note that I have opened #3021 to make filter more user friendly for this case)

@matthieugomez
Copy link
Contributor Author

I have no preference.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking The proposed change is breaking. decision question
Projects
None yet
5 participants