Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance joining and grouping #850

Closed
wants to merge 110 commits into from
Closed

Conversation

alyst
Copy link
Contributor

@alyst alyst commented Aug 7, 2015

PR addresses 4 limitations of current joining and grouping implementation:

  • the way row groups are indexed for multicolumn joining/grouping is very sparse (i.e. many usable indices have no rows assigned) and can lead to math overflows even for medium size frames
  • rows order in the left-joined or outer-joined frame doesn't match the rows order in the left frame (which is usually nice to maintain)
  • for any join kind, including :inner, :semi, :anti full matching of left and right frames is done, which is not necessary
  • poor performance for frames containing PooledDataVector with large pools due to slow setindex!(PooledDataArray).

Here's the simple benchmarking scripts to test the performance.

using Distributions, DataFrames

function random_frame(nrow::Int, col_values::Dict{Symbol, Any})
  DataFrame(Any[isa(col_values[key], PooledDataArray) ?
                @pdata(sample(col_values[key], nrow)) :
                @data(sample(col_values[key], nrow)) for key in keys(col_values)],
            keys(col_values) |> collect)
end

function random_join(kind::Symbol, nrow_left::Int, nrow_right::Int,
                     on_col_values::Dict{Symbol, Any},
                     left_col_values::Dict{Symbol, Any},
                     right_col_values::Dict{Symbol, Any})
  dfl = random_frame(nrow_left, merge(on_col_values, left_col_values))
  dfr = random_frame(nrow_right, merge(on_col_values, right_col_values))
  join(dfl, dfr, on = keys(on_col_values) |> collect, kind = kind)
end

function f(n::Int)
  for i in 1:n
    r = random_join(:outer, 1000, 2000,
                Dict{Symbol,Any}(:A => 1:10, :B => @data([:A, :B, :C, :D]),
                                 :C => 1:10, :D => 1:10),
                Dict{Symbol,Any}(:E => 1:10, :F => @data([:A, :B, :C, :D])),
                Dict{Symbol,Any}(:G => 1:10, :H => @data([:A, :B, :C, :D])))
  end
end

f(1)

@time f(100)

For the simple cases PR has almost the same join performance as the current implementation (increased GC overhead is likely due to the allocation of additional arrays that store the rows order in the resulting frame):

Current times:

  7.196765 seconds (60.15 M allocations: 1.920 GB, 3.94% gc time)
  7.134183 seconds (60.16 M allocations: 1.920 GB, 4.01% gc time)
  7.180465 seconds (60.15 M allocations: 1.920 GB, 3.95% gc time)

PR times:

  7.460440 seconds (139.77 M allocations: 3.300 GB, 5.42% gc time)
  7.377374 seconds (139.68 M allocations: 3.298 GB, 5.28% gc time)
  7.340252 seconds (139.88 M allocations: 3.301 GB, 5.60% gc time)

However, if in the previous random table generation test, some columns would be converted into pooled vectors, the current implementation would fail

function g(n::Int)
  for i in 1:n
    r = random_join(:outer, 1000, 2000,
                Dict{Symbol,Any}(:A => 1:10, :B => @data([:A, :B, :C, :D]),
                                 :C => @pdata(1:10), :D => 1:10),
                Dict{Symbol,Any}(:E => 1:10, :F => @pdata([:A, :B, :C, :D])),
                Dict{Symbol,Any}(:G => 1:10, :H => @data([:A, :B, :C, :D])))
  end
end
julia> g(1)
ERROR: InexactError()
 in setindex! at ./array.jl:303

(InexactError is a sign that group index exceeded the typemax(eltype(pooled_vector.refs)))

While for PR the times are

  7.760247 seconds (139.69 M allocations: 3.310 GB, 5.17% gc time)
  7.768788 seconds (139.61 M allocations: 3.308 GB, 5.11% gc time)
  7.776039 seconds (139.76 M allocations: 3.310 GB, 5.18% gc time)

The PR wins if the frames contain e.g. PooledDataVector columns with suffificiently large pools:

function h(n::Int)
    for i in 1:n
        r = random_join(:outer, 10000, 20000,
                       Dict{Symbol,Any}(:A => @pdata(1:10000)),
                       Dict{Symbol,Any}(:B => @pdata(1:10000)),
                       Dict{Symbol,Any}(:C => @pdata(1:10000)))
        end
    end

h(1)

@time h(100)

Current implementation:

 19.152983 seconds (63.97 M allocations: 2.064 GB, 1.13% gc time)
 19.126810 seconds (63.96 M allocations: 2.064 GB, 1.11% gc time)
 19.198262 seconds (63.97 M allocations: 2.064 GB, 1.02% gc time)

PR:

  5.145708 seconds (87.02 M allocations: 2.756 GB, 6.64% gc time)
  5.368688 seconds (87.02 M allocations: 2.755 GB, 6.75% gc time)
  5.236543 seconds (87.00 M allocations: 2.741 GB, 6.99% gc time)

@alyst alyst force-pushed the enhance_join branch 4 times, most recently from 9d29327 to df8411e Compare August 7, 2015 21:00
@alyst alyst mentioned this pull request Aug 22, 2015
@alyst alyst force-pushed the enhance_join branch 2 times, most recently from e697c4b to 4772d3b Compare August 26, 2015 15:57
@alyst
Copy link
Contributor Author

alyst commented Aug 26, 2015

I've updated the PR. The newer version should be faster and use less memory. Now _RowGroupDict implements its own memory-efficient hashing of rows instead of using Dict{DataFrameRow, Int}.

As requested by @matthieugomez, here's a small benchmark for groupby()

using RDatasets

diamonds = dataset("ggplot2", "diamonds");
#diamonds[:Cut] = convert(PooledDataArray{ASCIIString, UInt}, diamonds[:Cut])
diamonds[:Cut] = convert(DataArray{ASCIIString}, diamonds[:Cut]);
diamonds[:Clarity] = convert(DataArray{ASCIIString}, diamonds[:Clarity]);
f(n) = for i in 1:n groupby(diamonds, [:Clarity, :Carat]) end;
f(1)
Profile.clear_malloc_data()
@time f(100)

(without column :Cut conversion groupby() fails on master).

DataFrames.jl master:

2.258086 seconds (37.01 M allocations: 783.709 MB, 2.24% gc time)

This PR:

  1.833439 seconds (26.86 M allocations: 591.702 MB, 2.00% gc time)

In this example the new implementation is both faster and memory efficient. But I don't think the reported numbers reflect the actual efficiency of the implementations. Enormous number of allocations are probably due to elements indexing methods etc. Fixing just JuliaStats/DataArrays.jl#163 already reduced the number of allocations by 50% and there should be some other hotspots. In reality, the difference should be around few hundred allocations.

Note also that doing this on master (after converting resp. columns ref types to UInt):

groupby(diamonds, [:Cut, :Carat, :Clarity, :Depth, :Table, :Price, :Color])

could even crash some systems. The problem is that during grouping the estimated number of groups (ngroups) would grow very large. The current implementation calls DataArrays.groupsort_indexer() to throw away unused group indices, and it tries to allocate vector of ngroups length. So we hit the limitations of the current implementation much before overflowing UInt (see also #862). One potential fix would be to call groupsort_indexer() for each new column added into indexing with the obvious impact on performance.

@alyst alyst force-pushed the enhance_join branch 5 times, most recently from 08aec90 to 8802e10 Compare August 30, 2015 21:30
@alyst
Copy link
Contributor Author

alyst commented Aug 30, 2015

For big frames sorting of the row groups can take quite some time. Also, IMHO it's more logical to preserve the original order of rows as much as possible by default. So I've added sort= option (disabled by default) to by() and groupby().

@matthieugomez
Copy link
Contributor

Great. btw dplyr sorts and datatable does not

@tshort
Copy link
Contributor

tshort commented Nov 19, 2015

There's a lot to like about this PR. The problem is well written out. The added code has good comments and tests. Doc strings are updated.

My main hesitation is that there's a lot of code churn, and the code looks more complicated. The basic grouping code needs work to improve performance (probably a rewrite). We are at least a factor of ten slower than R's data.table and dplyr packages. This PR doesn't significantly improve grouping speeds based on this code. My review/opinion here is on the grouping part of the code and not joins.

@alyst
Copy link
Contributor Author

alyst commented Nov 20, 2015

@tshort Thanks for the review!
I think that to move forward we would also need to add the benchmarks for getindex()/setindex!() etc basic functions in DataArrays/NullableArrays (it would also make sense to compare the results with data.table/dplyr). The suspiciously high number of allocations in the @time output (see above) both for master and this PR suggests there could be some type stability problems upstream.

Gord Stephen and others added 8 commits September 14, 2016 10:13
…iaData#1042)

Add compatibility with pre-contrasts ModelFrame constructor
Completely remove support for DataArrays.
This depends on PRs moving these into NullableArrays.jl.
Also use isequal() instead of ==, as the latter is in Base and
unlikely to change its semantics.
groupby() did not follow the order of levels, and wasn't robust to reordering
levels. Add tests for corner cases.
Use the fallbacks for now, should be added back after
JuliaData/CategoricalArrays.jl#12 is fixed.
alyst added 14 commits January 9, 2017 13:27
so that DataFrameRow object doesn't need to be created
RowGroupDict that implements memory-efficient hashing of
the data frame rows and its methods
- don't encode the indexing columns, use DataFrameRow hashes instead
- do only the parts of left-right rows matching that are required for a
  particular join kind
- avoid vcat() that is very slow for [Nullable]CategoricalVector
- now join respects left-frame order for all join kinds, so the
  tests/data.jl test were updated
sorting order is changed from null first to null last (it matches
the default data frame sorting)
by default no sorting is applied to preserve original ordering
(the initial order of the 1st rows is preserved) and make things faster
refactor unsafe_hashindex() for data frame elements into hash_colel()
that is marked with @propagate_inbounds
@nalimilan
Copy link
Member

Did you intend to revive the PR?

@alyst
Copy link
Contributor Author

alyst commented Jan 22, 2017

@nalimilan It's not so dead, I've rebased it recently. :) I was just waiting for the NullableArrays stuff to settle down until thinking of merging this PR again. It relies on e.g. JuliaStats/NullableArrays.jl#158, which was not merged, because of the anticipated support of lifting in Base.
I'm not up-to-date with the Nullable progress, is it mostly done?

@nalimilan
Copy link
Member

I'm not up-to-date with the Nullable progress, is it mostly done?

AFAIK it's mostly done for a long time now, what's needed is the high-level API (StructuredQueries) to make it easier to use.

Regarding support in NullableArrays, I will comment on the other PR.

@nalimilan
Copy link
Member

nalimilan commented Feb 18, 2017

@alyst The rebase has gone wrong, it's very hard to see your changes now. EDIT: of course, that's because master switched to the old DataArrays based branch. I guess the best thing to do is move the PR to DataTables now.

Can you tell us more about the groupby algorithm you've adopted here? Cf. discussion at JuliaData/DataTables.jl#3.

Also, what changes do you need in NullableArrays exactly?

@alyst
Copy link
Contributor Author

alyst commented Feb 18, 2017

@nalimilan I must admit I was not following the recent development in the DataFrames.jl, the introduction of DataTables.jl etc. IIUC, it looks like the master branch was replaced at some point, that's why the merge has conflicts. I can try rebasing the PR again, if it would have any value given JuliaData/DataTables.jl#3.

The grouping algorithm (it's also used for joining and finding duplicate rows) uses the dedicated hash table that is optimized for data frames: hashes are generated along the columns for more effective memory access. The custom hash table solves the problem of generating unique indices for rows when doing multi-column groups/joins on the real data. The master implementation was "naive" and often led to integer overflows or tried to allocate insane amounts of memory.

Unfortunately, now I don't recall whether the changes in Nullable/Categorical were required to make the current tests pass or it was to address some bugs I discovered while using the PR for my data. Also at the time I rebased this PR to the NullableArrays-using master, there was some API polishing going on. Maybe all the problems are naturally resolved now.

@nalimilan
Copy link
Member

What happened is that the master branch was moved to DataTables. So if you do git format-patch and then git am in DataTables, it should work (maybe after replacing DataFrame with DataTable and df| with dt` in the resulting patch).

AFAIK, the current code was inspired by Pandas, except that it didn't implement the code to avoid overflow. Do you think your code is as fast as Pandas, including when all input variables are categorical?

@alyst
Copy link
Contributor Author

alyst commented Feb 19, 2017

@nalimilan I'm not a Python/Pandas user, so I cannot comment on the performance comparison. It could be that for single-column joins the current code is faster (it's hard to beat, because it just directly uses the column values for indexing), but I spent some time trying to optimize the code for more complex "real life" scenarios. There is a specialized version of column hashing for categorical arrays and nullable arrays. However, to make the joins between nullable/non-nullable and categorical/non-categorical columns work correctly the hashes have to use the underlying values.
It's discussed above that there were some type inference problems in DataArrays.jl, so subtle changes in how the values are accessed had big impact on the performance. That was also one of the things I was taking into account, though for NullableArrays it might be different.

@nalimilan
Copy link
Member

The PR has now been merged in DataTables (JuliaData/DataTables.jl#17).

@alyst
Copy link
Contributor Author

alyst commented Mar 6, 2017

JuliaData/DataTables.jl#17 contains many improvements over this PR. So if there is need to introduce these changes to DataFrames.jl, JuliaData/DataTables.jl#17 should be used as a reference. Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.