Potential confusion with subset #2740
We could add a check if the predicate returns |
@bkamins For this fix, the downside would be that one can no longer use:
using DataFrames
df = DataFrame(id = [0, 0, 1, 1], x = [-1, 1, 3, 4])
subset(groupby(df, :id), :x => (x -> sum(x) > 0))
(not saying it's a bad idea, just trying to understand the tradeoff) |
Yes - that would be the tradeoff if we wanted to be consistent. We could theoretically apply this rule only to data frames, but I am not sure it is a good idea. Note that |
yes, i think it'd be ok to disallow it. Also cc-ing @pdeffebach |
Yes definitely! I was not aware that |
OK. It will not be super simple so I will do it once we resolve performance issues in 1.0.1 release. |
yes. But you need to account for |
|
Would it make sense to have a
using DataFrames
df = DataFrame(id = [0, 0, 1, 1], x = [-1, 1, 3, 4])
subset(groupby(df, :id), :x => ByGroup(x -> sum(x) > 0))
in order to filter groups? I think though that would kind of imply that returning a vector from |
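A minimal sketch of how such a group-level wrapper could behave. The name `bygroup` is hypothetical and not part of DataFrames.jl; the sketch assumes only that `subset` accepts a function returning a `Bool` vector with one entry per row of the group. The helper computes one `Bool` per group and repeats it for every row.

```julia
using DataFrames

# Hypothetical helper (NOT a DataFrames.jl API): wraps a predicate that
# returns a single Bool per group and expands it to one Bool per row.
bygroup(f) = (cols...) -> fill(f(cols...), length(first(cols)))

df = DataFrame(id = [0, 0, 1, 1], x = [-1, 1, 3, 4])

# Group id == 0 has sum(x) == 0 and is dropped; group id == 1 is kept.
result = subset(groupby(df, :id), :x => bygroup(x -> sum(x) > 0))
```

Because the wrapped function still returns a vector, this stays within the rule that `subset` predicates are vector-valued, while making the per-group intent explicit at the call site.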
This is what
I suggested above to clearly differentiate |
See #2744 for a PR implementing this. A careful look and maybe additional test proposals would be welcome. In particular, the old behavior was covered by the previous tests, so it would be great if you could have a look at the tests now and judge whether they do what we want. |
I have reopened this issue as it is constantly raised by users. What @matthieugomez reported is one side of the situation. The other side is that by disallowing scalars we do not have an easy way to filter whole groups. See e.g. https://stackoverflow.com/questions/71286257/filter-grouped-dataframe-in-julia for a recent discussion. Let us discuss it again before making a 1.4 release. |
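For reference, a short sketch of filtering whole groups with `filter` on a `GroupedDataFrame`, which accepts a predicate returning one `Bool` per group. The data reuses the example from earlier in this thread; the variable names are illustrative only.

```julia
using DataFrames

df = DataFrame(id = [0, 0, 1, 1], x = [-1, 1, 3, 4])
gdf = groupby(df, :id)

# The predicate receives the whole :x column of each group and
# returns a single Bool deciding whether the group is kept.
kept_groups = filter(:x => x -> sum(x) > 0, gdf)

# `filter` returns a GroupedDataFrame here; collect it back into a DataFrame.
kept = DataFrame(kept_groups)
```

This works, but it puts the data argument last, which is the ergonomic concern raised about using `filter` in pipelines.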
maybe we should add a kwarg that would allow scalars i.e. we would have |
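That's a possibility, though it's not super user-friendly and not consistent with e.g. select. Maybe requiring AbstractVector was too strict? If we now consider that filter is not a convenient enough function given that it doesn't fit well with @chain blocks, and recommend users to rely on subset to filter groups, it could make sense to broadcast scalars by default. Do we have evidence that it creates lots of confusion? The examples provided in the description seem legitimate, but they give so obviously incorrect results that at least people will notice that the syntax doesn't do what they expected. OTOH if we keep disallowing scalar results by default but print an explicit error advising to use broadcast=true maybe that's user-friendly enough. |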
I would rather have consistency than special cases with hard-to-find
options. How about displaying a warning message for the examples in my
original post?
…On Tue, Mar 1, 2022 at 8:37 AM Milan Bouchet-Valat ***@***.***> wrote:
That's a possibility, though it's not super user-friendly and not
consistent with e.g. select. Maybe requiring AbstractVector was too
strict? If we now consider that filter is not a convenient enough
function given that it doesn't fit well with @chain blocks, and recommend
users to rely on subset to filter groups, it could make sense to
broadcast scalars by default.
Do we have evidence that it creates lots of confusion? The examples
provided in the description seem legitimate, but they give so obviously
incorrect results that at least people will notice that the syntax doesn't
do what they expected.
OTOH if we keep disallowing scalar results by default but print an
explicit error advising to use broadcast=true maybe that's user-friendly
enough.
|
Warnings are not really a solution as they are annoying/confusing when you really intended to write what you wrote. So we would need a way to turn them off, in which case throwing an error would be better. |
I agree we should not print warnings. The crucial question is what @nalimilan asked:
i.e. I know someone can make an error but how risky is this? I.e. how high is the risk that the user writes an incorrect condition and does not notice the problem. In particular, it mostly applies to |
What about allowing it for GroupedDataFrame but not for DataFrame (if it’s doable)? |
It is technically possible, but I think this would break one of the most important contracts of DataFrames.jl that operations on |
Can interested people vote under this post so that I can move forward with 1.4 release? Thank you! |
I have no preference. |
subset acts on vectors by default. This can introduce some confusion if users forget to use ByRow:
This happens because == and ismissing are applied to vectors.
Unfortunately, this is very error prone. I was wondering whether something should be done about it.
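A small sketch of the pitfall, using made-up data. It shows what `==` and `ismissing` return when applied to a whole column, and the row-wise alternatives with `ByRow`. What `subset` does with the bare, unwrapped predicate depends on the DataFrames.jl version, so the sketch deliberately does not call it that way.

```julia
using DataFrames

df = DataFrame(x = [1, 2, missing, 1])
col = df.x

# Applied to the whole column, these return a single value, not a row mask:
whole_eq = col == 1               # a vector is never equal to the number 1
whole_missing = ismissing(col)    # the column object itself is not `missing`

# Row-wise versions: ByRow applies the predicate to each element.
# `skipmissing = true` treats a `missing` comparison result as false.
ones_only = subset(df, :x => ByRow(==(1)); skipmissing = true)
missing_only = subset(df, :x => ByRow(ismissing))
```

Here `ones_only` keeps the two rows where `x` is 1, and `missing_only` keeps the single row where `x` is missing.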