Potential confusion with subset #2740

matthieugomez · 2021-04-28T16:18:31Z

subset acts on vectors by default. This can introduce some confusion if users forget to use ByRow:

using DataFrames
df = DataFrame(x = [0, 1])
subset(df, :x => ==(0))
# 0×1 DataFrame
df2 = DataFrame(x = [0, missing])
subset(df2, :x => ismissing)
# 0×1 DataFrame

This happens because == and ismissing are applied to the whole column vector rather than to each element:

df.x == 0
# false
ismissing(df2.x)
# false

Unfortunately, this is very error prone. I was wondering whether something should be done about it.
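
For reference, a minimal sketch of the row-wise form that was presumably intended in the examples above. Each predicate is wrapped in ByRow so that it is applied per element rather than to the whole column. The names r1 and r2 are only for illustration.

```julia
using DataFrames

df = DataFrame(x = [0, 1])
df2 = DataFrame(x = [0, missing])

# ByRow applies the predicate to each element of the column,
# so the result is one Bool per row.
r1 = subset(df, :x => ByRow(==(0)))        # keeps the row where x == 0
r2 = subset(df2, :x => ByRow(ismissing))   # keeps the row where x is missing
```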


bkamins · 2021-04-28T16:53:44Z

We could add a check that the predicate returns an AbstractVector{Bool}. This would be mildly breaking but, as you note, at least not error prone. @nalimilan - what do you think?

matthieugomez · 2021-04-28T17:00:51Z

@bkamins For this fix, the downside would be that one can no longer use subset to keep groups based on summary statistics, right? For example:

using DataFrames
df = DataFrame(id = [0, 0, 1, 1], x = [-1, 1, 3, 4])
subset(groupby(df, :id), :x => (x -> sum(x) > 0))

(not saying it's a bad idea, just trying to understand the tradeoff)

bkamins · 2021-04-28T17:11:55Z

Yes - that would be the tradeoff if we wanted to be consistent. We could theoretically apply this rule only to data frames, but I am not sure it is a good idea. Note that filter would still allow you to filter whole groups, so maybe it is OK to disallow it?

matthieugomez · 2021-04-28T17:17:06Z

Yes, I think it'd be OK to disallow it. Also cc-ing @pdeffebach

pdeffebach · 2021-04-28T17:42:00Z

Yes definitely!

I was not aware that subset could filter groups, but this seems like overly context-dependent behavior.

bkamins · 2021-04-28T17:49:58Z

OK. It will not be super simple, so I will do it once we resolve the performance issues in the 1.0.1 release.
Essentially we need to rewrite function calls that are not ByRow from fun into (x...) -> (res = fun(x...); @assert res isa AbstractVector{Bool}; return res), but with a nicer message. This should not affect performance significantly, but I need to make a wrapper like ByRow for it to avoid compilation latency. The docs also need updating.
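
A rough sketch of what such a wrapper could look like, written as a callable struct in the spirit of ByRow. The name RequireBoolVector and the error message are hypothetical; this is not the code from DataFrames.jl, and it ignores details such as missing values.

```julia
# Hypothetical wrapper: calls the wrapped function and checks the result type.
struct RequireBoolVector{F}
    fun::F
end

function (w::RequireBoolVector)(args...)
    res = w.fun(args...)
    # BitVector is a subtype of AbstractVector{Bool}, so broadcasted results pass.
    res isa AbstractVector{Bool} ||
        throw(ArgumentError("predicate returned a $(typeof(res)); " *
                            "expected an AbstractVector{Bool}. Did you forget ByRow?"))
    return res
end

check = RequireBoolVector(x -> x .> 0)
check([-1, 1])   # passes: the broadcasted comparison returns a BitVector

# RequireBoolVector(==(0))([0, 1]) would throw, because [0, 1] == 0 is a scalar Bool.
```

Using a concrete struct instead of an anonymous closure per call site is one way to keep the number of distinct function types small, which is the latency concern mentioned above.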

pdeffebach · 2021-04-28T17:53:59Z

Yes, but you need to account for BitVectors as well. I found that out while implementing @where.

bkamins · 2021-04-28T18:02:11Z

julia> BitVector <: AbstractVector{Bool}
true

ericphanson · 2021-04-28T23:10:44Z

Would it make sense to have a ByGroup wrapper so you could do

using DataFrames
df = DataFrame(id = [0, 0, 1, 1], x = [-1, 1, 3, 4])
subset(groupby(df, :id), :x => ByGroup(x -> sum(x) > 0))

in order to filter groups? I think, though, that this would imply that a vector returned from subset for a GroupedDataFrame says which groups to filter, whereas currently it says which rows to filter from each group (I think).

bkamins · 2021-04-29T07:17:38Z

This is what filter currently does:

filter(:x => x -> sum(x) > 0, groupby(df, :id))

I suggested above that we clearly differentiate subset and filter behavior in the documentation (as it turns out that both are useful).
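
To make the distinction concrete, here is a small sketch on the data from the earlier example. The names gdf, kept_groups and kept_rows are illustrative only.

```julia
using DataFrames

df = DataFrame(id = [0, 0, 1, 1], x = [-1, 1, 3, 4])
gdf = groupby(df, :id)

# filter: the predicate receives each group's :x column and returns
# a single Bool, so whole groups are kept or dropped.
kept_groups = filter(:x => x -> sum(x) > 0, gdf)   # only the id == 1 group remains

# subset: the predicate returns one Bool per row, so individual rows
# are kept or dropped within each group.
kept_rows = subset(gdf, :x => x -> x .> 0)         # drops only the row with x == -1
```

Note that filter returns a GroupedDataFrame here, while subset returns a DataFrame by default.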

bkamins · 2021-05-02T08:14:08Z

See #2744 for a PR implementing this.

A careful look and maybe additional test proposals would be welcome. In particular, the old behavior was covered by the previous tests, so it would be great if you could have a look at the tests now and judge whether they do what we want.

bkamins · 2022-02-27T21:12:34Z

I have reopened this issue as it is constantly raised by users. What @matthieugomez reported is one side of the situation. The other side is that by disallowing scalars we do not have an easy way to filter whole groups. See e.g. https://stackoverflow.com/questions/71286257/filter-grouped-dataframe-in-julia for a recent discussion.

Let us discuss it again before making a 1.4 release.

CC @pdeffebach @nalimilan

bkamins · 2022-02-27T21:17:01Z

Maybe we should add a kwarg that would allow scalars, i.e. we would have broadcast::Bool=false by default, but if you pass true you allow pseudo-broadcasting of scalars?

nalimilan · 2022-03-01T13:36:46Z

That's a possibility, though it's not super user-friendly and not consistent with e.g. select. Maybe requiring AbstractVector was too strict? If we now consider that filter is not a convenient enough function, given that it doesn't fit well with @chain blocks, and recommend that users rely on subset to filter groups, it could make sense to broadcast scalars by default.

Do we have evidence that it creates lots of confusion? The examples provided in the description seem legitimate, but they give so obviously incorrect results that at least people will notice that the syntax doesn't do what they expected.

OTOH if we keep disallowing scalar results by default but print an explicit error advising to use broadcast=true maybe that's user-friendly enough.

matthieugomez · 2022-03-01T14:00:32Z

I would prefer consistency over special cases with hard-to-find options. How about displaying a warning message for the examples in my original post?

nalimilan · 2022-03-01T14:26:34Z

Warnings are not really a solution as they are annoying/confusing when you really intended to write what you wrote. So we would need a way to turn them off, in which case throwing an error would be better.

bkamins · 2022-03-01T16:43:07Z

I agree we should not print warnings.

The crucial question is what @nalimilan asked:

Do we have evidence that it creates lots of confusion?

That is, I know someone can make an error, but how risky is this? How high is the risk that the user writes an incorrect condition and does not notice the problem? In practical scenarios it mostly applies only to ==, isequal, === and ismissing. All other cases would error. Maybe we should add an example in the docstring warning users about such cases and that would be enough?
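
The asymmetry can be seen in plain Julia, without DataFrames. In this small sketch the variable names are illustrative. The four functions listed above accept a whole vector and quietly return a scalar Bool, whereas a typical ordering comparison has no method for a vector and a number.

```julia
x = [0, 1]

# These accept a whole vector and return a single Bool:
a = x == 0           # false: compares the vector itself to 0
b = isequal(x, 0)    # false
c = x === 0          # false
d = ismissing(x)     # false: the vector is not `missing`

# Most other predicates fail loudly instead:
threw = try
    x > 0
    false
catch err
    err isa MethodError
end
```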

matthieugomez · 2022-03-05T17:59:04Z

What about allowing it for GroupedDataFrame but not for DataFrame (if it’s doable)?

bkamins · 2022-03-05T19:12:11Z

It is technically possible, but I think this would break one of the most important contracts of DataFrames.jl: that operations on DataFrame work the same as operations on GroupedDataFrame. I think it is important to keep it, as otherwise we will get a ton of complaints about such inconsistencies.

bkamins · 2022-03-12T19:14:48Z

Can interested people vote under this post so that I can move forward with 1.4 release? Thank you!
👍 - make subset consistent with other functions and make effort in the documentation to explain the situation
👎 - keep what we have now (note that I have opened #3021 to make filter more user friendly for this case)

matthieugomez · 2022-03-14T01:07:46Z

I have no preference.

matthieugomez changed the title ~~Potential confusiong with subset~~ Potential confusion with subset Apr 28, 2021

matthieugomez changed the title ~~Potential confusion with subset~~ Potential confusion with subset and == Apr 28, 2021

matthieugomez changed the title ~~Potential confusion with subset and ==~~ Potential confusion with subset Apr 28, 2021

bkamins added the question label Apr 28, 2021

bkamins added this to the patch milestone Apr 28, 2021

bkamins added the breaking and decision labels Apr 28, 2021

eloualiche mentioned this issue Apr 29, 2021

Update comparisons with data.table info #2725

Merged

bkamins mentioned this issue May 2, 2021

require AbstractVector from subset selectors #2744

Merged

bkamins closed this as completed in #2744 May 4, 2021

bkamins reopened this Feb 27, 2022

bkamins modified the milestones: patch, 1.4 Feb 27, 2022

bkamins mentioned this issue Mar 30, 2022

allow scalars in subset and subset! as conditions #3032

Merged

bkamins closed this as completed in #3032 Apr 21, 2022
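
As an illustration of the outcome, assuming a DataFrames.jl version that includes #3032 (taken here to be 1.4 or later), a scalar Bool returned by a subset condition is applied to the whole group, so the group-level example from earlier in the thread works with subset directly. The name kept is illustrative.

```julia
using DataFrames

df = DataFrame(id = [0, 0, 1, 1], x = [-1, 1, 3, 4])

# The condition returns one Bool per group; it is treated as applying
# to every row of that group (assumes scalars are allowed, per #3032).
kept = subset(groupby(df, :id), :x => x -> sum(x) > 0)   # keeps the id == 1 rows
```

The flip side is the original report: under the same assumption, subset(df, :x => ==(0)) again returns an empty data frame rather than an error, so ByRow is still needed for row-wise conditions.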
