Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use MemoryManager in Join operators, and return errors if the memory budget is exceeded #5220

Closed
4 tasks done
Tracked by #3941
alamb opened this issue Feb 8, 2023 · 4 comments · Fixed by #5632
Closed
4 tasks done
Tracked by #3941

Comments

@alamb
Copy link
Contributor

alamb commented Feb 8, 2023

See description on #3941

The idea is to apply the same treatment to Join operators that was done to GroupBy in #3940 and throw an error if their memory budget is exceeded rather than unlimited allocating.

All the operators in https://github.com/apache/arrow-datafusion/tree/master/datafusion/core/src/physical_plan/joins need to be evaluated.

Items to do:

cc @DDtKey and @crepererum who have expressed interest in this project in the past

@DDtKey
Copy link
Contributor

DDtKey commented Mar 14, 2023

Amazing, it looks like only sort-merge join is left!
Thanks @korowa 🚀

@korowa
Copy link
Contributor

korowa commented Mar 17, 2023

Actually there is also SymmetricHashJoinExec -- but, I suppose, its memory management is a bit more complicated and may be behind the scope of this issue:

  • It's, sort of, isolated -- at this moment there are no options for planning this operator with DataFusion planner
  • I don't think that simply throwing error and aborting execution is acceptable for SymmetricHashJoinExec -- if my understanding is correct -- main use case for this operator is joining two unbounded sources (streaming jobs), and from this point of view it doesn't make much sense to limit memory without any spilling fallbacks (subjectively, it doesn't seem correct to fail data streaming job in case of memory overallocation attempt)

My proposal here would be to file separate issue for SymmetricHashJoinExec memory management, and (as I see it) implement memory limitation along with data spilling. Maybe, we can go for it when we have reliable spilling for HashJoinExec, however, prior to it is also an option.

@ozankabak, @metesynnada it would be great to hear your thoughts on it.

@ozankabak
Copy link
Contributor

My proposal here would be to file separate issue for SymmetricHashJoinExec memory management, and (as I see it) implement memory limitation along with data spilling. Maybe, we can go for it when we have reliable spilling for HashJoinExec, however, prior to it is also an option.

At a quick glance, this sounds reasonable. But we will do some deeper thinking on this and come back to share our thoughts.

@alamb
Copy link
Contributor Author

alamb commented Mar 18, 2023

Filed #5636 for memory management of symmetric hash join

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants