Allow containers with no columns in combine #2066

bkamins · 2019-12-26T15:01:43Z

Fixes #1981

Allow NamedTuple(), a DataFrame with no columns, and a zero-column matrix as return values in combine and map. Such return values are ignored, so the group contributes nothing to the result.

Additionally fixes some small issues in tests.

@nalimilan - one potential performance consideration is that now I use colnames as Vector{Symbol} - do you think it will be problematic?

Also, I had to split _combine_with_first_row! and _combine_with_first! into separate functions because otherwise, for some strange reason, Julia had problems with type inference in some cases.

Importantly - we allow these 0-column containers only for vector-oriented containers. They are disallowed in scalar-oriented containers.

Finally - this PR touches parts of code that I do not know very well, so feel free to comment/fix/add test proposals, as some strange interactions might kick in.

nalimilan · 2019-12-26T18:40:45Z

> @nalimilan - one potential performance consideration is that now I use colnames as Vector{Symbol} - do you think it will be problematic?

Only benchmarking will tell, but I'm afraid this will make a difference. Would it be possible to have a loop in _combine_with_first to call the function on each group until a non-empty group is returned, and only then call combine_with_first!?
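
The strategy proposed here can be sketched in plain Julia, independently of DataFrames.jl. This is only an illustration under stated assumptions: the names ncols, first_nonempty and combine_skipping_empty are hypothetical and are not the package internals, groups are plain vectors, and every result is a NamedTuple of vectors.

```julia
# Hypothetical helpers for illustration; these are NOT DataFrames.jl internals.
ncols(x::NamedTuple) = length(x)

# Call `f` on each group until a result with at least one column appears.
# Returns (group index, result), or `nothing` if every group is empty.
function first_nonempty(f, groups)
    for (i, g) in enumerate(groups)
        res = f(g)
        ncols(res) > 0 && return (i, res)
    end
    return nothing
end

# Combine per-group results, skipping groups that return zero columns.
function combine_skipping_empty(f, groups)
    found = first_nonempty(f, groups)
    found === nothing && return NamedTuple()   # all groups returned no columns
    start, firstres = found
    names = keys(firstres)                     # column names stay a Tuple of Symbols
    cols = map(copy, values(firstres))         # one output vector per column
    for g in groups[start+1:end]
        res = f(g)
        ncols(res) == 0 && continue            # zero columns: drop this group
        keys(res) == names || throw(ArgumentError("inconsistent column names"))
        foreach(append!, cols, values(res))    # append each column's rows
    end
    return NamedTuple{names}(cols)
end

groups = [[1, 2], [3], [4, 5, 6]]
# Single-element groups return a zero-column result and are dropped.
f(g) = length(g) == 1 ? NamedTuple() : (total = [sum(g)], n = [length(g)])
result = combine_skipping_empty(f, groups)
```

The point of the sketch is that the column names are fixed only once a non-empty result is seen, so they can remain a tuple rather than a Vector{Symbol}.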

bkamins · 2019-12-26T20:18:12Z

> Would it be possible to have a loop in _combine_with_first to call the function on each group until a non-empty group is returned, and only then call combine_with_first!?

I was thinking about it, but it required more code changes. I will try to do it the way you propose to keep colnames a Tuple.

bkamins · 2019-12-26T21:11:29Z

OK - I have reverted to use Tuple for colnames.

nalimilan · 2019-12-27T10:15:29Z

> I was thinking about it, but it required more code changes.

Really? I would have thought _combine_with_first could just iterate over groups and check whether the result is empty. Then _combine_with_first! would just have to skip empty groups, without having to care about calling itself recursively. Am I missing something?

bkamins · 2019-12-27T10:35:33Z

There are the following points:

- I would have to change both _combine(f::AbstractVector{<:Pair}, gd::GroupedDataFrame) and _combine(f::Any, gd::GroupedDataFrame);
- we would have to remember that if we skipped rows then we should not allow single-row return values;
- we would have to special-case the situation when all groups return a no-column table.

In short - I felt that if I went this way I should change both _combine functions more significantly. But if you prefer I can rewrite it this way. This will save us one level of recursion (and probably one recompilation). What do you think?

nalimilan · 2019-12-27T14:57:02Z

OK, makes sense. This strategy will have to be revised if we want to allow returning an empty tuple for _combine_with_first_row!, but that's not a priority anyway.

nalimilan · 2019-12-27T14:59:58Z

Though can you check that performance doesn't regress e.g. on benchmarks like the one at #1601 (also test a function taking several columns and returning a named tuple)?

src/groupeddataframe/splitapplycombine.jl

nalimilan · 2019-12-27T14:22:20Z

-    idx, outcols, propertynames(first)
+    targetcolnames = tuple(propertynames(first)...)
+    outcols, finalcolnames = first isa Union{AbstractDataFrame,
+                              NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}} ?


Weird indentation. Turn this into an if?

I was not sure what would be best. Changed to if

src/groupeddataframe/splitapplycombine.jl

nalimilan · 2019-12-27T14:31:20Z

    # Handle remaining groups
    @inbounds for i in rowstart+1:len
        rows = wrap(do_call(f, gdidx, starts, ends, gd, incols, i))
+        if !(rows isa Union{AbstractDataFrame, NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}})


This check was missing in the existing code?

It would produce an error later (most likely inside append_rows!). I think it is better to have it here because:

- it gives a cleaner error message early;
- in a _ncol(rows) test, rows could be a DataFrameRow, and if it had 0 columns we would accept it via continue, which we should not;
- a few lines below I do if rows isa AbstractDataFrame, and in the else branch I can then be sure it is a NamedTuple of vectors (otherwise I would have to add a third branch throwing an error).
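
The reasoning above can be illustrated with a small stand-alone sketch. It assumes nothing from DataFrames.jl: FakeRow is a hypothetical stand-in for a row-like return value such as DataFrameRow, and VectorTable, ncols and keep_group are illustrative names only. The type in VectorTable is the NamedTuple part of the union used in the diff.

```julia
# NamedTuple whose fields are all vectors (zero fields also qualifies).
const VectorTable = NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}

# Hypothetical stand-in for a row-like (scalar-oriented) return value.
struct FakeRow
    values::NamedTuple
end

ncols(x::VectorTable) = length(x)
ncols(x::FakeRow) = length(x.values)

# Decide whether a group's result should be appended (true) or skipped (false).
# The type check comes first, so a zero-column row-like value raises an error
# instead of being skipped silently by the column-count test.
function keep_group(rows)
    rows isa VectorTable ||
        throw(ArgumentError("expected a table of vectors, got $(typeof(rows))"))
    return ncols(rows) > 0
end

keep_group((a = [1, 2],))    # has one column, so it is kept
keep_group(NamedTuple())     # zero columns, so the group is skipped
```

With only the ncols(rows) > 0 test, FakeRow(NamedTuple()) would be skipped like an empty table; with the type check it throws an ArgumentError.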

test/grouping.jl

bkamins · 2019-12-27T22:46:43Z

> OK, makes sense. This strategy will have to be revised if we want to allow returning an empty tuple for _combine_with_first_row!, but that's not a priority anyway.

I came to the conclusion that _combine_with_first_row! should not allow empty return values. It is a nice invariant to have that guarantees that we know how many rows we are going to get.
(but maybe in the future I will change my mind 😄)

Co-Authored-By: Milan Bouchet-Valat <[email protected]>

bkamins · 2019-12-27T23:54:53Z

> (also test a function taking several columns and returning a named tuple)

I have added tests for a function taking a single column and for one taking two columns.

bkamins · 2019-12-28T00:35:01Z

Here are some benchmarks (but feel free to ask for more, as I might have missed some important case):

Data preparation:

using DataFrames, Random
Random.seed!(1234);
df = DataFrame(a = categorical(rand(1:10000, 10^6)), b=rand(10^6));
gdf = groupby(df, :a);

running on 0.20 (after precompilation):

julia> @time combine(gdf, first);
  1.249979 seconds (166.29 k allocations: 15.258 MiB)

julia> @time combine(gdf, (:a, :b) => x -> (a=x.a[1], b=x.b[1]));
  1.436761 seconds (480.75 k allocations: 29.987 MiB)

julia> @time combine(gdf, sdf -> sdf[1:2, :]);
 35.002942 seconds (101.96 M allocations: 22.739 GiB, 5.82% gc time)

julia> @time combine(gdf, (:a, :b) => x -> (a=x.a[1:2], b=x.b[1:2]));
 51.706808 seconds (203.13 M allocations: 37.077 GiB, 5.64% gc time)

running on this PR

julia> @time combine(gdf, first);
  1.269710 seconds (166.28 k allocations: 15.258 MiB)

julia> @time combine(gdf, (:a, :b) => x -> (a=x.a[1], b=x.b[1]));
  1.507670 seconds (641.45 k allocations: 37.083 MiB, 0.49% gc time)

julia> @time combine(gdf, sdf -> sdf[1:2, :]);
 37.266022 seconds (102.03 M allocations: 22.742 GiB, 6.47% gc time)

julia> @time combine(gdf, (:a, :b) => x -> (a=x.a[1:2], b=x.b[1:2]));
 54.060982 seconds (203.44 M allocations: 37.091 GiB, 6.62% gc time)

So it seems we have about 5% regression in the "slow" case.

BTW. Interestingly DataFrame/DataFrameRow is processed faster than NamedTuple in output. Do you know why?

bkamins · 2019-12-28T00:44:14Z

test/grouping.jl

+        df = DataFrame(a = 1:N, x1 = x1)
+        res = by(sdf -> sdf.x1[1] ? fr : er, df, :a)
+        @test res == DataFrame(map(sdf -> sdf.x1[1] ? fr : er, groupby_checked(df, :a)))
+        if fr == [true]


We have a column naming issue here. It is unrelated to this PR:

julia> by(:a => x -> x[1] != 1 ? (x1=[2],) : [1], DataFrame(a=1:2), :a)
2×2 DataFrame
│ Row │ a     │ a_function │
│     │ Int64 │ Int64      │
├─────┼───────┼────────────┤
│ 1   │ 1     │ 1          │
│ 2   │ 2     │ 2          │

julia> by(:a => x -> x[1] == 1 ? (x1=[1],) : [2], DataFrame(a=1:2), :a)
2×2 DataFrame
│ Row │ a     │ x1    │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 2     │

What do you think we should do in such cases? (Right now we inherit the column name from the first row in a different way than we check for name consistency when combining.)

The problematic line is https://github.com/JuliaData/DataFrames.jl/pull/2066/files#diff-23657e51a9cc9e627fc153ba1e6e04c1L799, as it "manually" overrides the column naming mechanics used in the combine code.

As you can see, it can lead to a significant inconsistency (the column name passed in the NamedTuple got silently overridden by a_function).

An alternative solution would be to disallow mixing vectors and named tuple of a single vector (but current design allows this).

Yeah that's not great, but I'm not sure what better solution we could find. Ideally I guess we could find whether at least one named tuple has been returned, and take its names (that would be doable if we pass a Boolean to _combine_to_with_first!). Maybe file an issue about that?

I have opened #2071. Actually I think this should be an error (unless the automatic column name for a vector and the actual column name in named tuple match)

bkamins · 2019-12-28T09:43:34Z

Thinking about it, the regressions are in the table-oriented combine, which is slow anyway. We should expect some regression there, as we now do additional emptiness checks in the loop.

I think this is an additional argument to avoid allowing dropping rows in row-oriented combine (as it should be as fast as possible).

bkamins · 2019-12-28T10:03:03Z

@nalimilan Steps to reproduce the problem with method dispatch at current commit:

1. Everywhere in src/groupeddataframe/splitapplycombine.jl replace the function names _combine_tables_with_first! and _combine_rows_with_first! with some common name (e.g. _combine_with_first! as it was in the past).
2. Run the test/grouping.jl file.
3. The test will get stuck at some operation run in line https://github.com/JuliaData/DataFrames.jl/blob/master/test/grouping.jl#L824

The problem with writing short code that reproduces it is that the hanging happens randomly (i.e. at a different iteration of this multiply-nested loop). Also, if you e.g. extract a "failing" case to an outer expression and run it stand-alone, it goes through without hanging most of the time (sometimes it hangs).

Can you please let me know if you are able to reproduce the problem?

nalimilan · 2019-12-28T12:32:39Z

OK, a 5% slowdown is acceptable. A type inference issue would have had dramatic effects. I'm surprised that returning a named tuple of vectors is slower; there must be a type instability somewhere.

I've followed the steps you wrote, but unfortunately I can't reproduce the problem. :-/ Was that with Julia 1.3?

bkamins · 2019-12-28T12:57:03Z

Yes - it is Version 1.3.0 (2019-11-26) on Windows (maybe that is the difference).
In any case, I think that for our purposes it is better to have two separate names (the code is easier to understand this way).

nalimilan · 2019-12-28T19:01:17Z

Have you tried pushing a branch to see whether the problem happens on Travis? I agree renaming makes sense here anyway, but we'd better track this bug down while we have a reproducer (or kind of...).

bkamins · 2020-01-01T21:40:04Z

@nalimilan - can you please have a final look at this PR to see whether it requires anything more? Then we could merge it and I would move on to making a PR implementing the proposal in #1975 (probably in a limited version, so that we can move forward slowly and make it simpler to check that everything is correct).

nalimilan

Looks good, thanks. I've just added a commit to break lines at 92 chars.

bkamins · 2020-01-02T17:35:11Z

Thank you!

bkamins · 2020-01-02T21:41:00Z

@nalimilan - thank you!

allow containers with no columns in combine

98c7480

bkamins added 2 commits December 26, 2019 21:24

add SubDataFrame to tests

a2bcaa5

retain colnames as Tuple

964520a

nalimilan reviewed Dec 27, 2019

bkamins and others added 2 commits December 27, 2019 23:48

Merge branch 'master' into allow_no_cols_combine

4b60f7a

Apply suggestions from code review

14ed8f2

Co-Authored-By: Milan Bouchet-Valat <[email protected]>

fixes after the code review

4c0a494

add more tests

d461a0e

bkamins commented Dec 28, 2019

improve docstring

39ab56d

bkamins mentioned this pull request Dec 28, 2019

[DO NOT MERGE] Method dispatch problem in combine #2072

Closed

Break lines at 92 chars

6b9b8aa

nalimilan approved these changes Jan 2, 2020

bkamins merged commit f9a0f7c into JuliaData:master Jan 2, 2020

bkamins deleted the allow_no_cols_combine branch January 2, 2020 21:40
