-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BREAKING] Handle zero groups #2324
Conversation
CC @pdeffebach - you might want to test it, as the cases are tricky. |
Thanks! I just played around with it and I think this is good. It basically just adds new columns so that the returned data frame has the correct names and types. I think this is convenient behavior since it requires less data validation on the user's side. |
Thank you for looking into this. I will re-read the whole code before @nalimilan goes back on-line to make sure we can merge this when he is available. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Looks mostly good. I have to trust you regarding the places where you added checks for zero groups as the code is really tricky...
collect(axes(df, 1)), [1], [nrow(df)], 1, nothing, | ||
Threads.ReentrantLock()) | ||
return GroupedDataFrame(df, Symbol[], ones(Int, nrow(df)), | ||
nothing, nothing, nothing, nrow(df) == 0 ? 0 : 1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not continue filling fields with vectors instead of nothing
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because they can have 0
or 1
element (this was a bug to fill them before). Now we could conditionally fill them like we fill number of groups, but as filling them later is very cheap anyway I felt that setting them to nothing
is OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If computing the actual value here is trivial I'd do it, otherwise I agree it's cheap to compute later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would leave it for later - this way code is more modular (otherwise we hardcode something here and can forget to update it if we change the default way to compute them in 5 years from now).
Thank you for a review.
I hope I did it right. The changes are in a mix of very old code and new code, so I tried to cover everything in tests. |
Co-authored-by: Milan Bouchet-Valat <[email protected]>
only coverage fails |
I have added the test. only coverage fails as usual |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay!
No problem - thank you for looking into it! |
Thank you! |
Fix #2322
Fix #2297
This is a major fix to split-apply-combine that introduces many internal changes and some breaking user visible changes.
What is chiefly changed:
cols
field holdsSymbol
notInt
; this was not strictly needed but asselect!
can mutate a parent of aGroupedDataFrame
it is better to keepSymbols
to avoid invalidating theGroupedDataFrame
transform!
andtransform
combine(arg, ::DataFrame)
when data frame passed has 0 rows which I leave for later as it is tricky to implement and would only obfuscate the code, and the use case is very limited)This is breaking so it will require a minor release to go in.