Allow containers with no columns in combine #2066

Merged Jan 2, 2020 (9 commits)

@bkamins bkamins commented Dec 26, 2019

Fixes #1981

Allow NamedTuple(), DataFrame() and zero-column matrices as return values in combine and map. They are ignored and drop columns.

Additionally fixes some small issues in tests.

@nalimilan - one potential performance consideration is that now I use colnames as Vector{Symbol} - do you think it will be problematic?

Also, I had to split _combine_with_first_row! and _combine_with_first! into separate functions because, for some strange reason, Julia otherwise had problems with type inference in some cases.

Importantly - we allow these 0-column containers only for vector-oriented containers. They are disallowed in scalar-oriented containers.
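For illustration only, here is a minimal sketch of the behavior described above, written in plain Julia so it runs without DataFrames.jl. The function `combine_sketch` and the representation of a "table" as a NamedTuple of vectors are assumptions made for this example; this is not the code in the PR.

```julia
# Hypothetical stand-in for combining per-group results; NOT the DataFrames.jl code.
# Each group result is a NamedTuple of vectors (a "table" with named columns).
function combine_sketch(f, groups)
    colnames = nothing                  # fixed by the first result that has columns
    cols = Vector{Any}[]                # one output vector per column
    for g in groups
        res = f(g)
        isempty(res) && continue        # zero-column result: ignore this group
        if colnames === nothing
            colnames = keys(res)
            cols = [Any[] for _ in colnames]
        elseif keys(res) != colnames
            throw(ArgumentError("return values must have the same column names"))
        end
        for (col, v) in zip(cols, values(res))
            append!(col, v)
        end
    end
    # If every group returned a zero-column result, there is nothing to build.
    return colnames === nothing ? NamedTuple() : NamedTuple{colnames}(Tuple(cols))
end

groups = [[1, 2], [3], [4, 5, 6]]
# Groups with fewer than two elements return NamedTuple() and are skipped.
out = combine_sketch(g -> length(g) < 2 ? NamedTuple() : (n = [length(g)], s = [sum(g)]), groups)
```

In this sketch the second group contributes no rows, so `out` holds one row per remaining group.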

Finally - this PR touches parts of code that I do not know very well, so feel free to comment/fix/add test proposals, as some strange interactions might kick in.

@nalimilan
Member

> @nalimilan - one potential performance consideration is that now I use colnames as Vector{Symbol} - do you think it will be problematic?

Only benchmarking will tell, but I'm afraid this will make a difference. Would it be possible to have a loop in _combine_with_first to call the function on each group until a non-empty group is returned, and only then call combine_with_first!?

@bkamins
Member Author

bkamins commented Dec 26, 2019

> Would it be possible to have a loop in _combine_with_first to call the function on each group until a non-empty group is returned, and only then call combine_with_first!?

I was thinking about it, but it required more code changes. I will try to do it the way you propose to keep colnames a Tuple.

@bkamins
Member Author

bkamins commented Dec 26, 2019

OK - I have reverted to use Tuple for colnames.

@nalimilan
Member

> I was thinking about it, but it required more code changes.

Really? I would have thought _combine_with_first could just iterate over groups and check whether the result is empty. Then _combine_with_first! would just have to skip empty groups, without having to care about calling itself recursively. Am I missing something?
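A minimal sketch of the loop suggested here, in plain Julia and not taken from the PR. The helper name `first_nonempty` is hypothetical, and group results are modeled as NamedTuples of vectors so the snippet runs without DataFrames.jl.

```julia
# Hypothetical helper, not part of DataFrames.jl: apply `f` to each group in turn
# and return the index and result of the first group whose result has columns.
function first_nonempty(f, groups)
    for (i, g) in enumerate(groups)
        res = f(g)
        isempty(res) || return (i, res)
    end
    return nothing      # every group returned a zero-column result
end

groups = [[1], [2], [3, 4], [5, 6]]
f = g -> length(g) < 2 ? NamedTuple() : (s = [sum(g)],)
found = first_nonempty(f, groups)   # the third group is the first with columns
```

The `nothing` return corresponds to the case where all groups return a no-column table, which the caller would have to handle separately.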

@bkamins
Member Author

bkamins commented Dec 27, 2019

There are the following points:

  • I would have to change both _combine(f::AbstractVector{<:Pair}, gd::GroupedDataFrame) and _combine(f::Any, gd::GroupedDataFrame)
  • we would have to remember that if we skipped rows then we should not allow single-row return values
  • we would have to special-case the situation when all groups return a no-column table.

In short - I felt that if I went this way I should change both _combine functions more significantly. But if you prefer I can rewrite it this way. This will save us one level of recursion (and probably one recompilation). What do you think?

@nalimilan
Member

OK, makes sense. This strategy will have to be revised if we want to allow returning an empty tuple for _combine_with_first_row!, but that's not a priority anyway.

@nalimilan
Member

Though can you check that performance doesn't regress e.g. on benchmarks like the one at #1601 (also test a function taking several columns and returning a named tuple)?

src/groupeddataframe/splitapplycombine.jl (outdated)
    idx, outcols, propertynames(first)
    targetcolnames = tuple(propertynames(first)...)
    outcols, finalcolnames = first isa Union{AbstractDataFrame,
            NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}} ?
Member

Weird indentation. Turn this into an if?

Member Author

I was not sure what would be best. Changed to if

src/groupeddataframe/splitapplycombine.jl (outdated)
    # Handle remaining groups
    @inbounds for i in rowstart+1:len
        rows = wrap(do_call(f, gdidx, starts, ends, gd, incols, i))
        if !(rows isa Union{AbstractDataFrame, NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}})
Member

This check was missing in the existing code?

Member Author

It would produce an error later (most likely inside append_rows!). I think it is better to have it here because:

  1. it gives a cleaner error message early;
  2. in the _ncol(rows) test, rows could be a DataFrameRow, and if it had 0 columns we would accept it via continue, which we should not;
  3. a few lines below I do if rows isa AbstractDataFrame, and in the else branch I can be sure it is a NamedTuple of vectors (otherwise I would have to add a third branch throwing an error).
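The early check can be illustrated with a small sketch in plain Julia. It reuses the NamedTuple type from the quoted diff but leaves out AbstractDataFrame so that it runs without DataFrames.jl; the names `VectorTable` and `check_table` are hypothetical and do not appear in the PR.

```julia
# Type from the check quoted above, minus AbstractDataFrame so that the snippet
# runs without DataFrames.jl loaded (an assumption made for illustration).
const VectorTable = NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}

# Hypothetical validator: fail early with a clear message instead of erroring
# later while appending rows.
function check_table(rows)
    rows isa VectorTable ||
        throw(ArgumentError("return value must be a table with vector columns, " *
                            "got $(typeof(rows))"))
    return rows
end

check_table((a = [1, 2], b = ["x", "y"]))   # passes: every column is a vector
check_table(NamedTuple())                   # passes: zero columns still matches
```

A NamedTuple of scalars such as `(a = 1,)` does not match the type and would throw in this sketch, which mirrors the point that single-row values must not slip through the zero-column test.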

@bkamins
Member Author

bkamins commented Dec 27, 2019

> OK, makes sense. This strategy will have to be revised if we want to allow returning an empty tuple for _combine_with_first_row!, but that's not a priority anyway.

I came to the conclusion that _combine_with_first_row! should not allow empty return values. It is a nice invariant to have that guarantees that we know how many rows we are going to get.
(but maybe in the future I will change my mind 😄)

@bkamins
Member Author

bkamins commented Dec 27, 2019

> (also test a function taking several columns and returning a named tuple)

I have added tests taking a single column and taking two columns.

@bkamins
Member Author

bkamins commented Dec 28, 2019

Here are some benchmarks (but feel free to ask for more, as I might have missed some important case):

  • Data preparation:
using DataFrames, Random
Random.seed!(1234);
df = DataFrame(a = categorical(rand(1:10000, 10^6)), b=rand(10^6));
gdf = groupby(df, :a);
  • running on 0.20 (after precompilation):
julia> @time combine(gdf, first);
  1.249979 seconds (166.29 k allocations: 15.258 MiB)

julia> @time combine(gdf, (:a, :b) => x -> (a=x.a[1], b=x.b[1]));
  1.436761 seconds (480.75 k allocations: 29.987 MiB)

julia> @time combine(gdf, sdf -> sdf[1:2, :]);
 35.002942 seconds (101.96 M allocations: 22.739 GiB, 5.82% gc time)

julia> @time combine(gdf, (:a, :b) => x -> (a=x.a[1:2], b=x.b[1:2]));
 51.706808 seconds (203.13 M allocations: 37.077 GiB, 5.64% gc time)
  • running on this PR
julia> @time combine(gdf, first);
  1.269710 seconds (166.28 k allocations: 15.258 MiB)

julia> @time combine(gdf, (:a, :b) => x -> (a=x.a[1], b=x.b[1]));
  1.507670 seconds (641.45 k allocations: 37.083 MiB, 0.49% gc time)

julia> @time combine(gdf, sdf -> sdf[1:2, :]);
 37.266022 seconds (102.03 M allocations: 22.742 GiB, 6.47% gc time)

julia> @time combine(gdf, (:a, :b) => x -> (a=x.a[1:2], b=x.b[1:2]));
 54.060982 seconds (203.44 M allocations: 37.091 GiB, 6.62% gc time)

So it seems we have about 5% regression in the "slow" case.
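As a quick check of that figure, the relative slowdowns can be computed directly from the timings above. The labels and the helper `slowdown` below are made up for this illustration; the numbers are copied from the @time output.

```julia
# Timings in seconds, copied from the @time output above: (0.20, this PR).
timings = [
    "first"                  => (1.249979, 1.269710),
    "named tuple of scalars" => (1.436761, 1.507670),
    "sdf[1:2, :]"            => (35.002942, 37.266022),
    "named tuple of vectors" => (51.706808, 54.060982),
]

# Relative slowdown in percent: 100 * (pr - base) / base.
slowdown(base, pr) = 100 * (pr - base) / base

for (label, (base, pr)) in timings
    println(rpad(label, 24), round(slowdown(base, pr), digits = 1), "%")
end
```

On these single-run numbers the two table-oriented cases come out at roughly 6.5% and 4.6%, consistent with "about 5%"; since each case was timed once, the differences are only indicative.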

BTW. Interestingly DataFrame/DataFrameRow is processed faster than NamedTuple in output. Do you know why?

df = DataFrame(a = 1:N, x1 = x1)
res = by(sdf -> sdf.x1[1] ? fr : er, df, :a)
@test res == DataFrame(map(sdf -> sdf.x1[1] ? fr : er, groupby_checked(df, :a)))
if fr == [true]
Member Author

we have a column naming issue here. It is unrelated to this PR:

julia> by(:a => x -> x[1] != 1 ? (x1=[2],) : [1], DataFrame(a=1:2), :a)
2×2 DataFrame
│ Row │ a     │ a_function │
│     │ Int64 │ Int64      │
├─────┼───────┼────────────┤
│ 1   │ 1     │ 1          │
│ 2   │ 2     │ 2          │

julia> by(:a => x -> x[1] == 1 ? (x1=[1],) : [2], DataFrame(a=1:2), :a)
2×2 DataFrame
│ Row │ a     │ x1    │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 2     │

What do you think we should do in such cases? (Currently we inherit the column name from the first row in a different way than we check for name consistency when combining.)

The problematic line is https://github.com/JuliaData/DataFrames.jl/pull/2066/files#diff-23657e51a9cc9e627fc153ba1e6e04c1L799, as it "manually" overrides the column naming mechanics used in the combine code.

As you can see, it can lead to a significant inconsistency (the column name passed in the NamedTuple got silently overridden by a_function).

An alternative solution would be to disallow mixing vectors and named tuple of a single vector (but current design allows this).

Member

Yeah that's not great, but I'm not sure what better solution we could find. Ideally I guess we could find whether at least one named tuple has been returned, and take its names (that would be doable if we pass a Boolean to _combine_to_with_first!). Maybe file an issue about that?

Member Author

I have opened #2071. Actually I think this should be an error (unless the automatic column name for a vector and the actual column name in named tuple match)
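One way to read the rule proposed here, sketched in plain Julia under stated assumptions: every group returns either a bare vector or a single-column NamedTuple, and `autoname` stands for the automatically generated name (such as `:a_function`). The functions `result_name` and `check_consistent_names` are hypothetical and are not DataFrames.jl API.

```julia
# Hypothetical rule sketched from the comment above; not DataFrames.jl behavior.
# A bare vector gets the automatic name, a single-column NamedTuple its own name.
result_name(res::AbstractVector, autoname::Symbol) = autoname
result_name(res::NamedTuple, autoname::Symbol) = only(keys(res))

# Error unless all groups agree on the output column name.
function check_consistent_names(results, autoname::Symbol)
    seen = unique(result_name(r, autoname) for r in results)
    length(seen) == 1 ||
        throw(ArgumentError("inconsistent column names across groups: $seen"))
    return only(seen)
end

check_consistent_names([(x1 = [1],), (x1 = [2],)], :a_function)   # returns :x1
```

Mixing `(x1 = [2],)` with a bare vector `[1]` would throw under this rule unless `autoname` is `:x1`, whereas the two `by` calls shown above silently pick different names depending on which group comes first.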

@bkamins
Member Author

bkamins commented Dec 28, 2019

Thinking about it, the regressions are in the table-oriented combine, which is slow anyway - we should expect some regression as we do additional checks for emptiness in the loop.

I think this is an additional argument to avoid allowing dropping rows in row-oriented combine (as it should be as fast as possible).

@bkamins
Member Author

bkamins commented Dec 28, 2019

@nalimilan Steps to reproduce the problem with method dispatch at current commit:

  1. Everywhere in src/groupeddataframe/splitapplycombine.jl replace function names _combine_tables_with_first! and _combine_rows_with_first! with some common name (e.g. _combine_with_first! as it was in the past)
  2. run test/grouping.jl file
  3. The test will get stuck at some operation run in line https://github.com/JuliaData/DataFrames.jl/blob/master/test/grouping.jl#L824

The problem with writing a short code reproducing it is that the "hanging" happens randomly (i.e. at a different iteration of this multiply-nested loop each time). Also, if you e.g. extract a "failing" case to an outer expression and run it stand-alone, it goes through without hanging most of the time (sometimes it hangs).

Can you please let me know if you are able to reproduce the problem?

@nalimilan
Member

OK, a 5% slowdown is acceptable. A type inference issue would have had dramatic effects. I'm surprised that returning a named tuple of vectors is slower, there must be a type instability somewhere.

I've followed the steps you wrote, but unfortunately I can't reproduce the problem. :-/ Was that with Julia 1.3?

@bkamins
Member Author

bkamins commented Dec 28, 2019

Yes - it is Version 1.3.0 (2019-11-26) on Windows (maybe here is the difference).
Anyway - I think that for our purposes it is better to have two separate names (the code is easier to understand this way).

@nalimilan
Member

Have you tried pushing a branch to see whether the problem happens on Travis? I agree renaming makes sense here anyway, but we'd better track this bug down while we have a reproducer (or kind of...).

@bkamins
Member Author

bkamins commented Jan 1, 2020

@nalimilan - can you please have a final look at this PR to check whether it requires anything more? Then we could merge it and I would move on to making a PR implementing the proposal in #1975 (probably in a limited version so that we can move forward slowly and simplify checking that all is correct).

Member

@nalimilan nalimilan left a comment

Looks good, thanks. I've just added a commit to break lines at 92 chars.

@bkamins
Member Author

bkamins commented Jan 2, 2020

Thank you!

@bkamins bkamins merged commit f9a0f7c into JuliaData:master Jan 2, 2020
@bkamins bkamins deleted the allow_no_cols_combine branch January 2, 2020 21:40
@bkamins
Member Author

bkamins commented Jan 2, 2020

@nalimilan - thank you!

Successfully merging this pull request may close these issues.

Allow returning DataFrame() or NamedTuple() in combine
2 participants