implement faster innerjoin #2612

Merged on Feb 13, 2021 (63 commits)
Commits
1509f63
implement faster innerjoin
bkamins Jan 26, 2021
2b222b7
add handling of sorted tables
bkamins Jan 27, 2021
0eb911e
fix eltype test
bkamins Jan 27, 2021
a16b6f2
use strategy with single index pool in case of duplicates
bkamins Jan 28, 2021
14652f0
add tests for innerjoin
bkamins Jan 29, 2021
6a5b6ca
fast path for PooledArrays
bkamins Jan 29, 2021
c9385da
update handling of PooledArrays and CategoricalArrays
bkamins Jan 30, 2021
b8907ae
add more tests
bkamins Jan 31, 2021
c4d2c46
Apply suggestions from code review
bkamins Feb 2, 2021
3f8c49f
Update src/abstractdataframe/join.jl
bkamins Feb 2, 2021
e306de1
add more comments and optimistically try sorted join algorithm
bkamins Feb 2, 2021
47fe234
fix lookup
bkamins Feb 2, 2021
c24f678
Apply suggestions from code review
bkamins Feb 2, 2021
928f372
use DataAPI.invrefpool
bkamins Feb 3, 2021
3842c3d
Merge remote-tracking branch 'origin/new_faster_innerjoin' into new_f…
bkamins Feb 3, 2021
dba0f36
Merge branch 'main' into new_faster_innerjoin
bkamins Feb 3, 2021
9cf62af
Apply suggestions from code review
bkamins Feb 3, 2021
3912ae6
use nothing as sentinel
bkamins Feb 3, 2021
68a8eaa
Apply suggestions from code review
bkamins Feb 3, 2021
c05410b
remove PooledArrays.jl specific code
bkamins Feb 4, 2021
a8b2702
Apply suggestions from code review
bkamins Feb 4, 2021
5ec767f
corrections after the review
bkamins Feb 4, 2021
1db13ef
Apply suggestions from code review
bkamins Feb 4, 2021
7f2d897
add OnCol
bkamins Feb 5, 2021
9208bff
add faster processing of integer columns
bkamins Feb 5, 2021
1735712
Apply suggestions from code review
bkamins Feb 5, 2021
bb7e8f1
minor changes
bkamins Feb 6, 2021
6d46f1b
fix test coverage
bkamins Feb 6, 2021
1ef1362
fix test coverage
bkamins Feb 6, 2021
eb50756
revert change for better clarity
bkamins Feb 6, 2021
7cfb5b4
another small fix
bkamins Feb 6, 2021
d9dd15f
fix method definition
bkamins Feb 6, 2021
fd03587
Apply suggestions from code review
bkamins Feb 6, 2021
e30f51a
change hash implementation
bkamins Feb 6, 2021
6aa95e3
fix typo
bkamins Feb 6, 2021
bdcaeef
fix tests
bkamins Feb 6, 2021
1150126
consistent detection of CategoricalArrays.jl types
bkamins Feb 6, 2021
0c1e8b6
Apply suggestions from code review
bkamins Feb 6, 2021
ca02cc9
Merge remote-tracking branch 'origin/new_faster_innerjoin' into new_f…
bkamins Feb 6, 2021
558129d
add hash test
bkamins Feb 6, 2021
cbe214e
in printing we might have union
bkamins Feb 6, 2021
e01c1fe
fix typo
bkamins Feb 6, 2021
f50b9a1
simplify isless and isequal for OnCol
bkamins Feb 6, 2021
a592b09
Update test/join.jl
bkamins Feb 7, 2021
1515c07
add more tests
bkamins Feb 7, 2021
75560b1
Merge remote-tracking branch 'origin/new_faster_innerjoin' into new_f…
bkamins Feb 7, 2021
d8f1fe4
additional tests
bkamins Feb 7, 2021
bb83527
add innerjoin benchmark
bkamins Feb 7, 2021
56c4c5e
more tests to ensure full coverage
bkamins Feb 7, 2021
f697177
add linebreaks at @info
bkamins Feb 7, 2021
fed4570
simplify loop in sorted case
bkamins Feb 9, 2021
d0fb0b9
use resize!
bkamins Feb 9, 2021
020eaae
add sizehint!
bkamins Feb 10, 2021
a3de1c2
avoid using internal functions
bkamins Feb 11, 2021
659ec7c
improved benchmark design
bkamins Feb 11, 2021
07ecb0a
Revert "avoid using internal functions"
bkamins Feb 11, 2021
58bdcf3
fix dict sizehint
bkamins Feb 11, 2021
d7bb989
add benchmark runner
bkamins Feb 11, 2021
f9882f8
Revert "Revert "avoid using internal functions""
bkamins Feb 11, 2021
8a31d99
clean up script
bkamins Feb 11, 2021
91df0e4
Update test/join.jl
bkamins Feb 11, 2021
0b21972
improve tests
bkamins Feb 12, 2021
1a9e664
Update NEWS.md
bkamins Feb 13, 2021
5 changes: 5 additions & 0 deletions NEWS.md
@@ -33,6 +33,11 @@

## Other relevant changes

* `innerjoin` is now much faster. It checks whether the passed data frames are
  sorted by the `on` columns and whether the shorter of the joined data frames
  has unique values in the `on` columns. These properties of the input data
  frames might affect the order of rows produced in the output
  ([#2612](https://github.com/JuliaData/DataFrames.jl/pull/2612))
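
A minimal sketch of what the row-order caveat means for callers, assuming made-up data frames `left` and `right` (they are illustrative and not taken from the PR). Since the output order may depend on whether the inputs are sorted or have unique keys, code that needs a specific order can sort the result explicitly:

```julia
using DataFrames

# Hypothetical inputs; the key column is deliberately unsorted on the left.
left = DataFrame(id = [3, 1, 2], l = ["c", "a", "b"])
right = DataFrame(id = [2, 3, 4], r = ["B", "C", "D"])

# Only ids 2 and 3 are present in both data frames.
joined = innerjoin(left, right, on = :id)

# Do not rely on the row order returned by the join; impose one if needed.
sort!(joined, :id)
```

After the explicit sort, the result no longer depends on which join strategy was selected internally.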

# DataFrames v0.22 Release Notes

2 changes: 1 addition & 1 deletion Project.toml
@@ -30,7 +30,7 @@ DataAPI = "1.4"
InvertedIndices = "1"
IteratorInterfaceExtensions = "0.1.1, 1"
Missings = "0.4.2"
-PooledArrays = "0.5, 1.0"
+PooledArrays = "1.1"
PrettyTables = "0.11"
Reexport = "0.1, 0.2, 1.0"
SortingAlgorithms = "0.1, 0.2, 0.3"
96 changes: 96 additions & 0 deletions benchmarks/innerjoin_performance.jl
@@ -0,0 +1,96 @@
using CategoricalArrays
using DataFrames
using PooledArrays
using Random

fullgc() = (GC.gc(true); GC.gc(true); GC.gc(true); GC.gc(true))

@assert length(ARGS) == 6
@assert ARGS[3] in ["int", "pool", "cat", "str"]
@assert ARGS[4] in ["uniq", "dup", "manydup"]
@assert ARGS[5] in ["sort", "rand"]
@assert ARGS[6] in ["1", "2"]

@info ARGS

llen = parse(Int, ARGS[1])
rlen = parse(Int, ARGS[2])
@assert llen > 1000
@assert rlen > 2000

pad = maximum(length.(string.((llen, rlen))))

if ARGS[3] == "int"
    if ARGS[4] == "uniq"
        col1 = [1:llen;]
        col2 = [1:rlen;]
    elseif ARGS[4] == "dup"
        col1 = repeat(1:llen ÷ 2, inner=2)
        col2 = repeat(1:rlen ÷ 2, inner=2)
    else
        @assert ARGS[4] == "manydup"
        col1 = repeat(1:llen ÷ 20, inner=20)
        col2 = repeat(1:rlen ÷ 20, inner=20)
    end
elseif ARGS[3] == "pool"
    if ARGS[4] == "dup"
        col1 = PooledArray(repeat(string.(1:llen ÷ 2, pad=pad), inner=2))
        col2 = PooledArray(repeat(string.(1:rlen ÷ 2, pad=pad), inner=2))
    else
        @assert ARGS[4] == "manydup"
        col1 = PooledArray(repeat(string.(1:llen ÷ 20, pad=pad), inner=20))
        col2 = PooledArray(repeat(string.(1:rlen ÷ 20, pad=pad), inner=20))
    end
elseif ARGS[3] == "cat"
    if ARGS[4] == "dup"
        col1 = categorical(repeat(string.(1:llen ÷ 2, pad=pad), inner=2))
        col2 = categorical(repeat(string.(1:rlen ÷ 2, pad=pad), inner=2))
    else
        @assert ARGS[4] == "manydup"
        col1 = categorical(repeat(string.(1:llen ÷ 20, pad=pad), inner=20))
        col2 = categorical(repeat(string.(1:rlen ÷ 20, pad=pad), inner=20))
    end
else
    @assert ARGS[3] == "str"
    if ARGS[4] == "uniq"
        col1 = string.(1:llen, pad=pad)
        col2 = string.(1:rlen, pad=pad)
    elseif ARGS[4] == "dup"
        col1 = repeat(string.(1:llen ÷ 2, pad=pad), inner=2)
        col2 = repeat(string.(1:rlen ÷ 2, pad=pad), inner=2)
    else
        @assert ARGS[4] == "manydup"
        col1 = repeat(string.(1:llen ÷ 20, pad=pad), inner=20)
        col2 = repeat(string.(1:rlen ÷ 20, pad=pad), inner=20)
    end
end

Random.seed!(1234)

if ARGS[5] == "rand"
    shuffle!(col1)
    shuffle!(col2)
else
    @assert ARGS[5] == "sort"
end

if ARGS[6] == "1"
    df1 = DataFrame(id1 = col1)
    df2 = DataFrame(id1 = col2)
    # warm-up runs on small slices so that compilation is not timed
    innerjoin(df1[1:1000, :], df2[1:2000, :], on=:id1)
    innerjoin(df2[1:2000, :], df1[1:1000, :], on=:id1)
    fullgc()
    @time innerjoin(df1, df2, on=:id1)
    fullgc()
    @time innerjoin(df2, df1, on=:id1)
else
    @assert ARGS[6] == "2"
    df1 = DataFrame(id1 = col1, id2 = col1)
    # build the second data frame from col2 so that rlen is also used here
    df2 = DataFrame(id1 = col2, id2 = col2)
    # warm-up runs on small slices so that compilation is not timed
    innerjoin(df1[1:1000, :], df2[1:2000, :], on=[:id1, :id2])
    innerjoin(df2[1:2000, :], df1[1:1000, :], on=[:id1, :id2])
    fullgc()
    @time innerjoin(df1, df2, on=[:id1, :id2])
    fullgc()
    @time innerjoin(df2, df1, on=[:id1, :id2])
end
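
The timing pattern used by this script (run the join once on small slices, force garbage collection, then time the full join) can be sketched in isolation. The helper `timed_innerjoin` and its `warm` keyword are hypothetical names introduced here for illustration; they are not part of the PR, and the sketch assumes that a join on small slices compiles the same methods as the full join:

```julia
using DataFrames

# Hypothetical helper (not in the PR): warm up, collect garbage, then time.
function timed_innerjoin(df1, df2, on; warm = 10)
    # Warm-up call on the first `warm` rows so compilation is excluded.
    innerjoin(first(df1, warm), first(df2, warm), on = on)
    # Full collection so earlier garbage does not distort the measurement.
    GC.gc(true)
    # @elapsed returns the wall-clock time of the expression in seconds.
    return @elapsed innerjoin(df1, df2, on = on)
end

df1 = DataFrame(id1 = 1:10_000)
df2 = DataFrame(id1 = 1:20_000)
t = timed_innerjoin(df1, df2, :id1)
```

Unlike `@time`, `@elapsed` does not report allocations, so this variant is only suitable when the elapsed time alone is of interest.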
2 changes: 2 additions & 0 deletions benchmarks/run.sh
@@ -0,0 +1,2 @@
julia runtests.jl 100000 50000000
julia runtests.jl 5000000 10000000
12 changes: 12 additions & 0 deletions benchmarks/runtests.jl
@@ -0,0 +1,12 @@
@assert length(ARGS) == 2
file_loc = joinpath(dirname(@__FILE__), "innerjoin_performance.jl")
llen = ARGS[1]
rlen = ARGS[2]

for a3 in ["str", "int", "pool", "cat"],
    a4 in ["uniq", "dup", "manydup"],
    a5 in ["sort", "rand"],
    a6 in ["1", "2"]
    a4 == "uniq" && a3 in ["pool", "cat"] && continue
    run(`julia $file_loc $llen $rlen $a3 $a4 $a5 $a6`)
end
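
To see how many benchmark processes the loop above launches per `runtests.jl` invocation, the same combinations can be enumerated without running anything. This is only an illustrative sketch; the name `combos` is introduced here and does not appear in the PR:

```julia
# Enumerate the argument combinations using the same skip rule as the runner:
# "uniq" keys are not benchmarked for pooled or categorical columns.
combos = [(a3, a4, a5, a6)
          for a3 in ["str", "int", "pool", "cat"]
          for a4 in ["uniq", "dup", "manydup"]
          for a5 in ["sort", "rand"]
          for a6 in ["1", "2"]
          if !(a4 == "uniq" && a3 in ["pool", "cat"])]

println(length(combos))  # 4 * 3 * 2 * 2 = 48 in total, minus 8 skipped
```

Each of the two lines in `run.sh` therefore starts 40 separate Julia processes, one per combination.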