
Memory efficiency of join #1334

Closed
bkamins opened this issue Dec 29, 2017 · 11 comments
Labels
joins non-breaking The proposed change is not breaking performance priority

Comments

@bkamins
Member

bkamins commented Dec 29, 2017

Currently join copies all data to a new table (please correct me if I am wrong here).

This can be very expensive for huge DataFrames. E.g. data.table in R allows updating a table in place with columns from another table.

Any opinion on whether something like this would be a desirable option for :left and :right joins?
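To make the idea concrete, here is a minimal sketch of an in-place left join in plain Julia. It is not the DataFrames.jl implementation: Dicts of Vectors stand in for data frames, and the function name leftjoin_inplace! is made up for illustration. The point is only that the left table's existing columns are never copied; a single new column is attached to it.

```julia
# Hypothetical sketch of an in-place left join: attach a column from a
# "right" table to a "left" table without copying the left table's data.
# Plain Dicts of Vectors stand in for DataFrame columns; names are made up.
function leftjoin_inplace!(left::Dict{Symbol,Vector}, right::Dict{Symbol,Vector},
                           on::Symbol, newcol::Symbol)
    # Build a lookup from key value -> row index in the right table.
    lookup = Dict(k => i for (i, k) in enumerate(right[on]))
    rvals = right[newcol]
    # For each left row, fetch the matching right value (or missing).
    left[newcol] = [haskey(lookup, k) ? rvals[lookup[k]] : missing
                    for k in left[on]]
    return left
end

left  = Dict{Symbol,Vector}(:id => [1, 2, 3], :x => [10, 20, 30])
right = Dict{Symbol,Vector}(:id => [3, 1], :y => ["c", "a"])
leftjoin_inplace!(left, right, :id, :y)
# left[:y] is now ["a", missing, "c"]; :id and :x were not copied.
```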

@nalimilan
Member

Why not. I imagine we could call this join!?

(Though I think improving the efficiency of join by using optimized hashing algorithms for each combination of column types (in particular CategoricalArray) should be a higher priority.)

@bkamins bkamins added non-breaking The proposed change is not breaking performance labels Feb 12, 2020
@bkamins
Member Author

bkamins commented May 9, 2020

I have just investigated the h2o benchmarks - I think improving join performance should be a priority for the near future, as we are really slow here.

@anandijain
Contributor

A couple of notes on my usage of join, which you all probably know about:

It frequently causes REPL crashes: once it eats all my RAM, it keeps trying to perform the join.

I don't have a great fix, but I have to work around it by joining subsets of the original array of dataframes, and then joining those.

When I am able to interrupt the join, it doesn't seem like the memory is freed until you kill the REPL.

@bkamins
Member Author

bkamins commented May 12, 2020

Yes - this is the reality: https://h2oai.github.io/db-benchmark/ (click on join - the 50GB tests all fail, and the 5GB results are very bad).

If no one else works on it, I will probably have a look at the code some time in the future, as this is the major thing that requires fixing in terms of performance in DataFrames.jl.

@anandijain
Contributor

anandijain commented May 12, 2020

Since you probably know a lot more about the implementation, what is causing the problem?
Hopefully I can start work on a PR; join.jl is a pretty long file.

I often want to join with a splat, join(dfs..., on=:x1), but know that it will crash, even for relatively small dataframes, so there are probably optimization opportunities for this case.
edit: join! makes a lot of sense

@bkamins
Member Author

bkamins commented May 12, 2020

Actually it is the only part of the source code that I have never touched before.
For sure innerjoin(dfs..., on=:x1) can be optimized (currently we just perform pairwise joins recursively), but I would start with a single join.
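The recursive pairwise strategy can be illustrated with a toy model in plain Julia (not the DataFrames.jl code): each "table" is just a vector of keys, and the n-way join is a fold of two-way joins, so every step materializes an intermediate result.

```julia
# Toy illustration of the pairwise strategy: an n-way inner join done as a
# left-fold of two-way joins. Each "table" is just a vector of keys, and a
# pairwise join keeps the keys present in both inputs.
function innerjoin_keys(a::Vector, b::Vector)
    s = Set(b)                       # hash the right side once
    return [k for k in a if k in s]  # keep left keys that also appear on the right
end

# n-way join by folding pairwise joins; each step allocates an intermediate
# result, which is why a dedicated n-way join could be cheaper for many tables.
multijoin(tables...) = reduce(innerjoin_keys, tables)

multijoin([1, 2, 3, 4], [2, 3, 4], [3, 4, 5])  # -> [3, 4]
```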

@piever

piever commented May 14, 2020

I'm not sure if this helps, but StructArrays has a utility function to iterate over pairs of ranges i0:i1, j0:j1 such that lkeys[sortperm(lkeys)[i0:i1]] and rkeys[sortperm(rkeys)[j0:j1]] are all the entries with a given key value. It works best if lkeys and rkeys are StructArrays whose columns all have isbits elements (i.e. after pooling).

You would call it with:

StructArrays.GroupJoinPerm(lkeys::StructArray, rkeys::StructArray,
    lperm=sortperm(lkeys), rperm=sortperm(rkeys))

Again, sortperm(lkeys) should be pretty efficient if all columns are isbits, and better still if they are integers. For string arrays with many unique entries, it is probably better to call a radix sort, I imagine.
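The range-pairing idea can be sketched in self-contained Julia (this is not StructArrays' actual implementation, just an illustration of what GroupJoinPerm-style iteration produces): walk the two sort permutations in lockstep and emit, for each shared key, the ranges i0:i1 and j0:j1 into the permutations.

```julia
# Sketch of the range-pairing idea: for each key present on both sides,
# emit ranges (i0:i1, j0:j1) such that lkeys[lp[i0:i1]] and rkeys[rp[j0:j1]]
# are exactly the entries with that key. A sort-merge walk, not hashing.
function group_join_ranges(lkeys, rkeys)
    lp, rp = sortperm(lkeys), sortperm(rkeys)
    out = Tuple{UnitRange{Int},UnitRange{Int}}[]
    i = j = 1
    while i <= length(lp) && j <= length(rp)
        lk, rk = lkeys[lp[i]], rkeys[rp[j]]
        if lk < rk
            i += 1
        elseif rk < lk
            j += 1
        else
            # Extend both runs of the shared key, then emit the pair of ranges.
            i1 = i; while i1 < length(lp) && lkeys[lp[i1+1]] == lk; i1 += 1; end
            j1 = j; while j1 < length(rp) && rkeys[rp[j1+1]] == rk; j1 += 1; end
            push!(out, (i:i1, j:j1))
            i, j = i1 + 1, j1 + 1
        end
    end
    return out
end

lkeys = [2, 1, 2, 3]   # sorted order: 1, 2, 2, 3
rkeys = [3, 2, 2]      # sorted order: 2, 2, 3
group_join_ranges(lkeys, rkeys)  # -> [(2:3, 1:2), (4:4, 3:3)]
```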

@tkf
Contributor

tkf commented May 22, 2020

If you are going in the sorting direction rather than hashing, I guess using galloping search could be beneficial? When I tried it with simple "dot products" sum(x1 .* x2 .* ... .* xn) on sparse vectors (which is like a fused reduce-innerjoin), it was much faster than linear search (which is what GroupJoinPerm does?) when there were more than two vectors. I guess it depends on the distribution of the keys, though.
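For reference, a minimal sketch of galloping (exponential) search in Julia, used here to intersect two sorted key vectors: instead of scanning linearly, the step size doubles until it overshoots, then a binary search finishes inside the bracketed range. The function names are made up for this sketch.

```julia
# Galloping (exponential) search: find the first index >= lo in sorted
# vector v whose value is >= x, doubling the step until we overshoot,
# then binary-searching the bracketed range via searchsortedfirst.
function gallop(v, x, lo)
    hi, step = lo, 1
    while hi <= length(v) && v[hi] < x
        lo = hi + 1
        hi += step
        step *= 2        # exponential step growth: 1, 2, 4, 8, ...
    end
    return lo - 1 + searchsortedfirst(view(v, lo:min(hi, length(v))), x)
end

# Intersect two sorted vectors, galloping through b for each element of a;
# fast when matches are sparse, since long gaps are skipped in O(log gap).
function gallop_intersect(a, b)
    out = eltype(a)[]
    i = j = 1
    while i <= length(a) && j <= length(b)
        j = gallop(b, a[i], j)
        j > length(b) && break
        a[i] == b[j] && push!(out, a[i])
        i += 1
    end
    return out
end

gallop_intersect([1, 5, 9], collect(1:10))  # -> [1, 5, 9]
```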

@anandijain
Contributor

This may be an aside, but an InterruptException during a join prevents the memory from being freed, essentially forcing a REPL restart. Would there be any way to deallocate on interrupt?

@bkamins
Member Author

bkamins commented May 22, 2020

If this happens it should probably be reported to Julia, as it seems to be a bug in the GC.

@bkamins
Member Author

bkamins commented Oct 26, 2021

closing as we now have leftjoin!

@bkamins bkamins closed this as completed Oct 26, 2021