Memory efficiency of join #1334
Comments
Why not. I imagine we could call this (Though I think improving the efficiency of |
I have just investigated the h2o benchmarks. I think improving join performance should be a priority for the near future, as we are really slow here.
Couple of notes on my usage of It causes REPL crashes frequently: once it eats all my RAM, it keeps trying to perform the join. I don't have a great fix, so I work around it by joining subsets of the original array of data frames and then joining the results. When I am able to interrupt the join, the memory doesn't seem to be freed until I kill the REPL.
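The workaround described above (join subsets first, then join the partial results) can be sketched as a pairwise reduction. This is only an illustration in plain Julia: `pairwise_reduce` is a hypothetical helper, not part of DataFrames.jl, and `op` stands in for whatever two-argument join is being used. Whether it lowers peak memory depends on how much the intermediate results shrink or grow.

```julia
# Hypothetical helper: combine neighbouring items pairwise, level by level,
# instead of folding everything into a single call.
# `op` is assumed to be an associative two-argument function.
function pairwise_reduce(op, items::AbstractVector)
    isempty(items) && throw(ArgumentError("items must be non-empty"))
    work = collect(Any, items)
    while length(work) > 1
        next = Any[]
        for i in 1:2:length(work)
            # Combine a pair, or carry the last item over if the count is odd.
            push!(next, i < length(work) ? op(work[i], work[i + 1]) : work[i])
        end
        work = next
    end
    return work[1]
end

# Stand-in example with dictionaries instead of data frames.
pairwise_reduce(merge, [Dict(:a => 1), Dict(:b => 2), Dict(:c => 3)])
```

With data frames, `op` could be something like `(x, y) -> innerjoin(x, y; on = :id)`, where the key column `:id` is an assumption made for the example.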
Yes - this is the reality: https://h2oai.github.io/db-benchmark/ (click on If no one else works on it, I will probably have a look at the code some time in the future, as this is the major thing that requires fixing in terms of performance in DataFrames.jl.
Since you probably know a lot more about the implementation, what is causing the problem? I often want to join with splat
Actually, it is the only part of the source code that I have never touched before.
I'm not sure if this helps, but StructArrays has a utility function to iterate pairs of ranges You would call it with:
    StructArrays.GroupJoinPerm(lkeys::StructArray, rkeys::StructArray,
                               lperm=sortperm(lkeys), rperm=sortperm(rkeys))
Again,
If you are going in the sorting direction rather than hashing, I guess galloping search could be beneficial? When I tried it with simple "dot products"
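For readers unfamiliar with the idea, here is a minimal sketch of galloping (exponential) search applied to matching two sorted key vectors. It is plain Julia written for illustration, not code from DataFrames.jl or StructArrays; the names `gallop` and `sorted_matches` are made up, and keys are assumed to be sorted ascending and unique within each vector.

```julia
# Galloping (exponential) search: first index i >= lo with v[i] >= x,
# or length(v) + 1 if there is none. Assumes v is sorted ascending.
function gallop(v::AbstractVector, x, lo::Int=1)
    n = length(v)
    lo > n && return n + 1
    v[lo] >= x && return lo
    # From here on v[lo] < x. Double the step until we overshoot.
    step = 1
    hi = lo + step
    while hi <= n && v[hi] < x
        lo = hi
        step *= 2
        hi = lo + step
    end
    hi = min(hi, n + 1)
    # Binary search between lo (v[lo] < x) and hi (v[hi] >= x or past the end).
    while hi - lo > 1
        mid = (lo + hi) >>> 1
        if v[mid] < x
            lo = mid
        else
            hi = mid
        end
    end
    return hi
end

# Index pairs (i, j) with a[i] == b[j], skipping runs of non-matching keys
# by galloping instead of advancing one element at a time.
function sorted_matches(a::AbstractVector, b::AbstractVector)
    pairs = Tuple{Int,Int}[]
    i, j = 1, 1
    while i <= length(a) && j <= length(b)
        if a[i] < b[j]
            i = gallop(a, b[j], i)
        elseif b[j] < a[i]
            j = gallop(b, a[i], j)
        else
            push!(pairs, (i, j))
            i += 1
            j += 1
        end
    end
    return pairs
end

sorted_matches([1, 3, 5, 7, 9, 11], [5, 6, 7, 100])  # [(3, 1), (4, 3)]
```

Galloping tends to pay off when one side has long stretches of keys that are absent from the other; when matches are dense, a plain linear merge does about the same work.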
This may be a side issue, but an InterruptException during a join prevents the memory from being freed, essentially forcing a REPL restart. Would there be any way to deallocate on interrupt?
If this happens, it should probably be reported to Julia, as it seems to be a bug in the GC.
closing as we now have |
Currently join copies all data to a new table (please correct me if I am wrong here). This can be very expensive for huge DataFrames. E.g. data.table in R allows updating a table in place with columns from another table. Any opinion on whether something like this would be a desirable option for :left and :right joins?
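To make the request concrete, here is a rough sketch of what an in-place left-join-style update could look like. It is plain Julia and does not reflect how DataFrames.jl implements anything: tables are modeled as a `Dict{Symbol,Vector}`, `add_column_inplace!` is a hypothetical name, and keys in the right table are assumed to be unique.

```julia
# Hypothetical helper, not a DataFrames.jl function: add column `col` from
# `right` to `left` in place, matching rows on the key column `on`.
# Assumes the keys in right[on] are unique.
function add_column_inplace!(left::Dict{Symbol,Vector}, right::Dict{Symbol,Vector},
                             on::Symbol, col::Symbol)
    # Map each key on the right to its row number.
    lookup = Dict(k => i for (i, k) in enumerate(right[on]))
    src = right[col]
    # Only the new column is allocated; existing columns of `left` are untouched.
    newcol = Vector{Union{Missing,eltype(src)}}(missing, length(left[on]))
    for (row, k) in enumerate(left[on])
        i = get(lookup, k, 0)
        i != 0 && (newcol[row] = src[i])
    end
    left[col] = newcol
    return left
end

left  = Dict{Symbol,Vector}(:id => [1, 2, 3], :a => ["x", "y", "z"])
right = Dict{Symbol,Vector}(:id => [3, 1], :b => [30.0, 10.0])
a_before = left[:a]
add_column_inplace!(left, right, :id, :b)
left[:b]               # 10.0, missing, 30.0
left[:a] === a_before  # true: the existing column was not copied
```

The point of the sketch is that row order and the existing columns of the left table stay as they are, so the cost is one lookup table plus one new column rather than a copy of every column.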