
Memory efficiency of join #1334

Closed
bkamins opened this issue Dec 29, 2017 · 11 comments
Labels
joins non-breaking The proposed change is not breaking performance priority

Comments

@bkamins
Member

bkamins commented Dec 29, 2017

Currently join copies all data to a new table (please correct me if I am wrong here).

This can be very expensive for huge DataFrames. E.g. data.table in R allows updating a table in place with columns from another table.

Any opinion on whether something like this would be a desirable option for :left and :right joins?
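To make the idea concrete, here is a minimal sketch of an in-place left join in plain Julia. It is not the DataFrames.jl implementation: Dicts of Vectors stand in for data frames, and the function name leftjoin_inplace! is made up for illustration. The point is only that the left table's existing columns are never copied; a single new column is attached to it.

```julia
# Hypothetical sketch of an in-place left join: attach a column from a
# "right" table to a "left" table without copying the left table's data.
# Plain Dicts of Vectors stand in for DataFrame columns; names are made up.
function leftjoin_inplace!(left::Dict{Symbol,Vector}, right::Dict{Symbol,Vector},
                           on::Symbol, newcol::Symbol)
    # Build a lookup from key value -> row index in the right table.
    lookup = Dict(k => i for (i, k) in enumerate(right[on]))
    rvals = right[newcol]
    # For each left row, fetch the matching right value (or missing).
    left[newcol] = [haskey(lookup, k) ? rvals[lookup[k]] : missing
                    for k in left[on]]
    return left
end

left  = Dict{Symbol,Vector}(:id => [1, 2, 3], :x => [10, 20, 30])
right = Dict{Symbol,Vector}(:id => [3, 1], :y => ["c", "a"])
leftjoin_inplace!(left, right, :id, :y)
# left[:y] is now ["a", missing, "c"]; :id and :x were not copied.
```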

@nalimilan
Member

Why not. I imagine we could call this join!?

(Though I think improving the efficiency of join by using optimized hashing algorithms for each combination of column types (in particular CategoricalArray) should be a higher priority.)

@bkamins bkamins added non-breaking The proposed change is not breaking performance labels Feb 12, 2020
@bkamins
Member Author

bkamins commented May 9, 2020

I have just investigated the h2o benchmarks - I think improving join performance should be a priority for the near future, as we are really slow here.

@anandijain
Contributor

A couple of notes on my usage of join, which you all probably know about:

It frequently causes REPL crashes: once it eats all my RAM, it keeps trying to perform the join.

I don't have a great fix, but I have to work around it by joining subsets of the original array of dataframes, and then joining those.

When I am able to interrupt the join, it doesn't seem like the memory is freed until you kill the REPL.

@bkamins
Member Author

bkamins commented May 12, 2020

Yes - this is the reality: https://h2oai.github.io/db-benchmark/ (click on join - the 50GB tests all fail, and the 5GB results are very bad).

If no one else works on it, I will probably have a look at the code some time in the future, as this is the major thing that requires fixing in terms of performance in DataFrames.jl.

@anandijain
Contributor

anandijain commented May 12, 2020

Since you probably know a lot more about the implementation, what is causing the problem?
Hopefully I can start work on a PR; join.jl is a pretty long file.

I often want to join with a splat, join(dfs..., on=:x1), but know that it will crash, even for relatively small dataframes, so there are probably optimization opportunities for this case.
edit: join! makes a lot of sense

@bkamins
Member Author

bkamins commented May 12, 2020

Actually it is the only part of the source code that I have never touched before.
For sure innerjoin(dfs..., on=:x1) can be optimized (currently we just perform pairwise joins recursively), but I would start with a single join.
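The recursive pairwise strategy can be illustrated with a toy model in plain Julia (not the DataFrames.jl code): each "table" is just a vector of keys, and the n-way join is a fold of two-way joins, so every step materializes an intermediate result.

```julia
# Toy illustration of the pairwise strategy: an n-way inner join done as a
# left-fold of two-way joins. Each "table" is just a vector of keys, and a
# pairwise join keeps the keys present in both inputs.
function innerjoin_keys(a::Vector, b::Vector)
    s = Set(b)                       # hash the right side once
    return [k for k in a if k in s]  # keep left keys that also appear on the right
end

# n-way join by folding pairwise joins; each step allocates an intermediate
# result, which is why a dedicated n-way join could be cheaper for many tables.
multijoin(tables...) = reduce(innerjoin_keys, tables)

multijoin([1, 2, 3, 4], [2, 3, 4], [3, 4, 5])  # -> [3, 4]
```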

@piever

piever commented May 14, 2020

I'm not sure if this helps, but StructArrays has a utility function to iterate over pairs of ranges i0:i1, j0:j1 such that lkeys[sortperm(lkeys)[i0:i1]] and rkeys[sortperm(rkeys)[j0:j1]] are all the entries with a given key value. It works best if lkeys and rkeys are StructArrays whose columns all have isbits elements (i.e. after pooling).

You would call it with:

StructArrays.GroupJoinPerm(lkeys::StructArray, rkeys::StructArray,
    lperm=sortperm(lkeys), rperm=sortperm(rkeys))

Again, sortperm(lkeys) should be pretty efficient if all columns are isbits, and better still if they are integers. For string arrays with many unique entries, it is probably better to call a radix sort, I imagine.
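The range-pairing idea can be sketched in self-contained Julia (this is not StructArrays' actual implementation, just an illustration of what GroupJoinPerm-style iteration produces): walk the two sort permutations in lockstep and emit, for each shared key, the ranges i0:i1 and j0:j1 into the permutations.

```julia
# Sketch of the range-pairing idea: for each key present on both sides,
# emit ranges (i0:i1, j0:j1) such that lkeys[lp[i0:i1]] and rkeys[rp[j0:j1]]
# are exactly the entries with that key. A sort-merge walk, not hashing.
function group_join_ranges(lkeys, rkeys)
    lp, rp = sortperm(lkeys), sortperm(rkeys)
    out = Tuple{UnitRange{Int},UnitRange{Int}}[]
    i = j = 1
    while i <= length(lp) && j <= length(rp)
        lk, rk = lkeys[lp[i]], rkeys[rp[j]]
        if lk < rk
            i += 1
        elseif rk < lk
            j += 1
        else
            # Extend both runs of the shared key, then emit the pair of ranges.
            i1 = i; while i1 < length(lp) && lkeys[lp[i1+1]] == lk; i1 += 1; end
            j1 = j; while j1 < length(rp) && rkeys[rp[j1+1]] == rk; j1 += 1; end
            push!(out, (i:i1, j:j1))
            i, j = i1 + 1, j1 + 1
        end
    end
    return out
end

lkeys = [2, 1, 2, 3]   # sorted order: 1, 2, 2, 3
rkeys = [3, 2, 2]      # sorted order: 2, 2, 3
group_join_ranges(lkeys, rkeys)  # -> [(2:3, 1:2), (4:4, 3:3)]
```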

@tkf
Contributor

tkf commented May 22, 2020

If you are going in the sorting direction rather than hashing, I guess using galloping search could be beneficial? When I tried it with simple "dot products" sum(x1 .* x2 .* ... .* xn) on sparse vectors (which is like a fused reduce-innerjoin), it was much faster than linear search (which is what GroupJoinPerm does?) when there were more than two vectors. I guess it depends on the distribution of the keys, though.
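For reference, a minimal sketch of galloping (exponential) search in Julia, used here to intersect two sorted key vectors: instead of scanning linearly, the step size doubles until it overshoots, then a binary search finishes inside the bracketed range. The function names are made up for this sketch.

```julia
# Galloping (exponential) search: find the first index >= lo in sorted
# vector v whose value is >= x, doubling the step until we overshoot,
# then binary-searching the bracketed range via searchsortedfirst.
function gallop(v, x, lo)
    hi, step = lo, 1
    while hi <= length(v) && v[hi] < x
        lo = hi + 1
        hi += step
        step *= 2        # exponential step growth: 1, 2, 4, 8, ...
    end
    return lo - 1 + searchsortedfirst(view(v, lo:min(hi, length(v))), x)
end

# Intersect two sorted vectors, galloping through b for each element of a;
# fast when matches are sparse, since long gaps are skipped in O(log gap).
function gallop_intersect(a, b)
    out = eltype(a)[]
    i = j = 1
    while i <= length(a) && j <= length(b)
        j = gallop(b, a[i], j)
        j > length(b) && break
        a[i] == b[j] && push!(out, a[i])
        i += 1
    end
    return out
end

gallop_intersect([1, 5, 9], collect(1:10))  # -> [1, 5, 9]
```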

@anandijain
Contributor

This may be an aside, but an InterruptException during a join prevents the memory from being freed, essentially forcing a REPL restart. Would there be any way to deallocate on interrupt?

@bkamins
Member Author

bkamins commented May 22, 2020

If this happens it should probably be reported to Julia, as it seems to be a bug in the GC.

@bkamins
Member Author

bkamins commented Oct 26, 2021

closing as we now have leftjoin!

@bkamins bkamins closed this as completed Oct 26, 2021