add a keyword to allow specifying target row order in joins #2753

bkamins · 2021-05-07T21:34:08Z

This issue is meant to gather requirements for all join* operations regarding the resulting order of rows after the join operation. After we finish the discussion I will implement it via kwarg to appropriate functions (the default will be the fastest way to produce a result, passing a kwarg will allow user to request a specific order but we need to learn what orders for what joins would be desirable).

The text was updated successfully, but these errors were encountered:

Byrth · 2021-05-08T13:22:43Z

I recognize that performance concerns are likely what drove the change, but here is what I was expecting:

df1.__id1 = Base.OneTo(nrow(df1))
df2.__id2 = Base.OneTo(nrow(df2))
df3 = ___join(df1,df2,on=keys)

Use left table's row order and then right

left: sort!(df3, ["__id1","__id2"])
inner: sort!(df3, ["__id1","__id2"])
cross: sort!(df3, ["__id1","__id2"])
outer: sort!(df3, ["__id1","__id2"])

Use left table's row order only:

semi: sort!(df3, ["__id1"])
anti: sort!(df3, ["__id1"])

Use right table's row order and then left

right: sort!(df3, ["__id2","__id1"])

Argument in favor of ordering

leftjoin, semijoin, antijoin, and rightjoin all have a clearly implied direction where one table is joined onto the other table (or only one table remains). A naive implementation of these would simply iterate row-wise over the primary table and look for hash matches in the secondary table. That's where their ordering proposals above come from.

innerjoin, outerjoin, and crossjoin are ambiguous because they consider rows for both sides. Without rightjoin, table primacy could be consistently given to the first argument and the intuition for these would be obvious. In fact, the above proposal for the join sort order would actually be the same for all joins if only rightjoin didn't exist. So I guess I went with the leftjoin ordering for inner/outer/cross because I don't use rightjoin, but I admit that it's ambiguous.

Argument against ordering

The ANSI SQL standard does not define a row order resulting from joins. I can only find a free draft from 1992 but section 7.5 doesn't seem to mention it. All join implementation documentation that I have seen is also silent on the issue.

It's also obviously more computationally expensive to do ordering and offers no practical benefit if rows are replicated in both tables (like in a crossjoin).

But what if rows are replicated in only one table? (i.e. the argument for `leftjoin!`)

Maintaining row order in one table would allow for a copycols=false-like keyword argument to joins where you re-use that table's vectors if possible. This can save a bunch of allocations if one is, as I typically am, joining a very large table to smaller ones with no row replication in the very large table.

Joins in DataFrames.jl copy all vectors, it used to be much more efficient to do something like what is in the leftjoin!() below because you avoid copying columns <y.

using DataFrames # 0.22.7

nr = 1000000

df1 = DataFrame(a = rand(nr), b = rand(nr), c = rand(nr), d = rand(nr),
    e = rand(nr), f = rand(nr), g = rand(nr), h = rand(nr), i = rand(nr),
     j = rand(nr), k = rand(nr), l = rand(nr), m = rand(nr), n = rand(nr),
    o = rand(nr), p = rand(nr), q = rand(nr), r = rand(nr), s = rand(nr),
    t = rand(nr), u = rand(nr), v = rand(nr), w = rand(nr), x = rand(nr),
    y = rand(['a','b','c'], nr));
df2 = DataFrame(y = ['a','b','c'], z = [1,2,3]);

function leftjoin!(df1, df2; on)
    df_tmp = leftjoin(select(df1, on; copycols=false), df2, on=on); # order of df3 matches df1 if the join is unique
    df1.z = df_tmp.z
    df1
end

df0 = @time copy(df1; copycols=true); # 0.204061 seconds (94 allocations: 186.927 MiB) # only necessary for testing in my case
df3 = @time leftjoin!(df0,df2;on="y"); # 0.360305 seconds (246 allocations: 91.449 MiB)
df3 = @time leftjoin(df1, df2, on="y"); # 0.691530 seconds (589 allocations: 274.568 MiB)

Upgrading to DataFrames 1.1.1, the order of df1 is not preserved in the join so the leftjoin! has to get more complicated:

function leftjoin!(df1, df2; on)
    df1.__id = Base.OneTo(UInt32(nrow(df1)))
    df_tmp = leftjoin(select(df1, union([on],["__id"]); copycols=false), df2, on=on); # order of df3 matches df1 if the join is unique
    sort!(df_tmp, "__id")
    df1.z = df_tmp.z
    select!(df1, Not("__id"))
end

df0 = @time copy(df1; copycols=true); # 0.053088 seconds (322 allocations: 186.940 MiB) # only necessary for testing in my case
df3 = @time leftjoin!(df0,df2;on="y"); # 0.088398 seconds (513 allocations: 53.436 MiB)
df3 = @time leftjoin(df1, df2, on="y"); # 0.123257 seconds (854 allocations: 213.671 MiB)

So the order in particular doesn't matter much to me and even the more complicated wrapper is still way more efficient/faster than it was in 0.22.7, but it used to make it a lot easier to avoid allocations.

If you imagine:

df2 = leftjoin(leftjoin(leftjoin(df, model1, on=keys1), model2, on=keys2), model3, on=keys3)

then only the inner-most leftjoin ever needs to copy columns and the outer two could be leftjoin!(), about a 50% decrease in amount of RAM allocated using the above totals.

Summary

In the course of writing this post, I realized that instead of needing the order to be maintained, I really just need to write the leftjoin!() wrapper for myself and import it. It isn't that much more complicated than before and it is still way more efficient.

bkamins · 2021-05-08T15:51:08Z

the ! version of join is also on a to-do list, see #1334 and #2259.

bkamins · 2021-06-03T18:56:48Z

@pstorozenko - this is the next thing to think of in joins if you want 😄. There are two layers:

thinking of useful API; currently I think of order kwarg
designing an efficient implementation

Even starting with the minimal (but most common case), i.e. adding it only to leftjoin with order=:left would probably cover most of the use cases (and if we had this then we could think of extensions which should follow naturally).

pstorozenko · 2021-06-03T19:53:13Z

Thanks, but I'd like to understand better how to perform efficient joins in the first place. After looking at h2o benchmarks I want to take a deep dive into polars and data.table way of joining and compare it with DataFrames.jl. This might be very beneficial in the future :)

bkamins · 2021-06-03T19:56:08Z

Indeed this also makes a lot of sense. Could you please share a way to contact you directly (my e-mail is [email protected])? I think it would be good to discuss it and could share you my experience.

bkamins · 2021-06-03T19:57:29Z

BTW: the benchmarks are currently re-calculated and I expect improvements for joins in 5GB case (currently we are bogged down by GC). Hopefully tomorrow we will have an update on H2O website (the solution I proposed now does not resolve all problems but it should improve things).

bkamins · 2021-07-29T20:35:29Z

TODO: if we add this then we should also add leftjoin! and rightjoin! that would update left/right table respectively.

pdeffebach · 2022-07-21T20:39:52Z

Bumping this, I ran into this today.

I think we can add a keyword argument for preserving order without adding leftjoin!.

bkamins · 2022-07-21T21:44:33Z

Well this issue is super old. We added leftjoin! in the mean time that preserves order. I will mark it for 1.5 release.

pdeffebach · 2022-07-21T21:48:17Z

Ahh I forgot about that. Thanks.

bkamins added feature joins priority labels May 7, 2021

bkamins added this to the 1.x milestone May 7, 2021

bkamins modified the milestones: 1.x, 1.3 Aug 25, 2021

bkamins changed the title ~~add a keyword to allow specifying target column order in joins~~ add a keyword to allow specifying target row order in joins Jul 21, 2022

bkamins modified the milestones: 1.x, 1.5 Jul 21, 2022

bkamins mentioned this issue Nov 26, 2022

Add an option in joins to specify row order #3233

Merged

bkamins closed this as completed in #3233 Dec 24, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add a keyword to allow specifying target row order in joins #2753

add a keyword to allow specifying target row order in joins #2753

bkamins commented May 7, 2021

Byrth commented May 8, 2021

bkamins commented May 8, 2021

bkamins commented Jun 3, 2021

pstorozenko commented Jun 3, 2021

bkamins commented Jun 3, 2021

bkamins commented Jun 3, 2021

bkamins commented Jul 29, 2021

pdeffebach commented Jul 21, 2022

bkamins commented Jul 21, 2022

pdeffebach commented Jul 21, 2022

add a keyword to allow specifying target row order in joins #2753

add a keyword to allow specifying target row order in joins #2753

Comments

bkamins commented May 7, 2021

Byrth commented May 8, 2021

Use left table's row order and then right

Use left table's row order only:

Use right table's row order and then left

Argument in favor of ordering

Argument against ordering

But what if rows are replicated in only one table? (i.e. the argument for leftjoin!)

Summary

bkamins commented May 8, 2021

bkamins commented Jun 3, 2021

pstorozenko commented Jun 3, 2021

bkamins commented Jun 3, 2021

bkamins commented Jun 3, 2021

bkamins commented Jul 29, 2021

pdeffebach commented Jul 21, 2022

bkamins commented Jul 21, 2022

pdeffebach commented Jul 21, 2022

But what if rows are replicated in only one table? (i.e. the argument for `leftjoin!`)