-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow containers with no columns in combine #2066
Conversation
Only benchmarking will tell, but I'm afraid this will make a difference. Would it be possible to have a loop in |
I was thinking about it, but it required more code changes. I will try to do it the way you propose to keep |
OK - I have reverted to use |
Really? I would have though |
There are following points:
In short - I felt that if I went this way I should change both |
OK, makes sense. This strategy will have to be revised if we want to allow returning an empty tuple for |
Though can you check that performance doesn't regress e.g. on benchmarks like the one at #1601 (also test a function taking several columns and returning a named tuple)? |
idx, outcols, propertynames(first) | ||
targetcolnames = tuple(propertynames(first)...) | ||
outcols, finalcolnames = first isa Union{AbstractDataFrame, | ||
NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}} ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weird indentation. Turn this into an if
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was not sure what would be best. Changed to if
# Handle remaining groups | ||
@inbounds for i in rowstart+1:len | ||
rows = wrap(do_call(f, gdidx, starts, ends, gd, incols, i)) | ||
if !(rows isa Union{AbstractDataFrame, NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check was missing in the existing code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would produce an error later (most likely inside append_rows!
). I think it was better to have it here because:
- it gives a cleaner error message early
- in a
_ncol(rows)
testrows
could be aDataFrameRow
and if it had 0 columns we would allow it bycontinue
and we should not. - a few lines below I do
if rows isa AbstractDataFrame
and inelse
I am sure it is aNamedTuple
of vectors inelse
branch (otherwise I would have to add a third branch throwing an error).
I came to the conclusion that |
Co-Authored-By: Milan Bouchet-Valat <[email protected]>
I have added a tests taking a single column and taking two columns |
Here are some benchmarks (but feel free to ask for more, as I might have missed some important case):
So it seems we have about 5% regression in the "slow" case. BTW. Interestingly |
df = DataFrame(a = 1:N, x1 = x1) | ||
res = by(sdf -> sdf.x1[1] ? fr : er, df, :a) | ||
@test res == DataFrame(map(sdf -> sdf.x1[1] ? fr : er, groupby_checked(df, :a))) | ||
if fr == [true] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have a column naming issue here. It is unrelated to this PR:
julia> by(:a => x -> x[1] != 1 ? (x1=[2],) : [1], DataFrame(a=1:2), :a)
2×2 DataFrame
│ Row │ a │ a_function │
│ │ Int64 │ Int64 │
├─────┼───────┼────────────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
julia> by(:a => x -> x[1] == 1 ? (x1=[1],) : [2], DataFrame(a=1:2), :a)
2×2 DataFrame
│ Row │ a │ x1 │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
What do you think we should do in such cases (now we inherit column name from the first row in a different way than we check when combining for name consistency).
The problematic line is https://github.com/JuliaData/DataFrames.jl/pull/2066/files#diff-23657e51a9cc9e627fc153ba1e6e04c1L799, as it "manually" overrides column naming mechanics using in combine code.
As you can see it can lead to significant inconsistency (column name passed in NamedTuple
got silently overriden by a_function
).
An alternative solution would be to disallow mixing vectors and named tuple of a single vector (but current design allows this).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that's not great, but I'm not sure what better solution we could find. Ideally I guess we could find whether at least one named tuple has been returned, and take its names (that would be doable if we pass a Boolean to _combine_to_with_first!
). Maybe file an issue about that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have opened #2071. Actually I think this should be an error (unless the automatic column name for a vector and the actual column name in named tuple match)
Thinking of it the regressions are in the table-oriented combine, which is slow anyway - we should expect some regression as we do additional checks in the loop for emptiness. I think this is an additional argument to avoid allowing dropping rows in row-oriented combine (as it should be as fast as possible). |
@nalimilan Steps to reproduce the problem with method dispatch at current commit:
The problem with writing a short code reproducing it is that it "hanging" happens randomly (i.e. at different iteration of this multiply-nested loop). Also if you e.g. extract a "failing" case to an outer expression and run it stand-alone it goes through without hanging most of the time (sometimes it hangs). Can you please let me know if you are able to reproduce the problem? |
OK, a 5% slowdown is acceptable. A type inference issue would have had dramatic effects. I'm surprised that returning a named tuple of vectors is slower, there must be a type instability somewhere. I've followed the steps you wrote, but unfortunately I can't reproduce the problem. :-/ Was that with Julia 1.3? |
Yes - it is Version 1.3.0 (2019-11-26) on Windows (maybe here is the difference). |
Have you tried pushing a branch to see whether the problem happens on Travis? I agree renaming makes sense here anyway, but we'd better track this bug down while we have a reproducer (or kind of...). |
@nalimilan - can you please have a final look at this PR if it requires something more. Then we could merge it and I would move to making a PR implementing the proposal in #1975 (probably in the limited version so that we can slowly move forward and simplify checking if all is correct) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks. I've just added a commit to break lines at 92 chars.
Thank you! |
@nalimilan - thank you! |
Fixes #1981
Allow
NamedTuple()
,DataFrame
and zero columns matrix as return value incombine
andmap
. They are ignored and drop columns.Additionally fixes some small issues in tests.
@nalimilan - one potential performance consideration is that now I use
colnames
asVector{Symbol}
- do you think it will be problematic?Also I had to separate
_combine_with_first_row!
and_combine_with_first!
as separate functions as for some strange reason alternatively in some cases Julia had problems with type inference.Importantly - we allow these 0-column containers only for vector-oriented containers. They are disallowed in scalar-oriented containers.
Finally - this PR touches parts of code that I do not know very well, so feel free to comment/fix/add test proposals, as some strange interactions might kick in.