-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: track memory usage for recursive CTE, enable recursive CTEs by default #9619
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice -- thank you @jonahgao
cc @matthewgapp
) | ||
SELECT * FROM nodes;", | ||
) | ||
.with_expected_errors(vec![ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -236,6 +237,8 @@ impl MemoryExec { | |||
pub struct MemoryStream { | |||
/// Vector of record batches | |||
data: Vec<RecordBatch>, | |||
/// Optional memory reservation bound to the data, freed on drop |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was a little worried at first that this optional API makes it easy to forget to provide reservation. However I see now that the reservation is used only with the recursive CTE case
(not for this PR) In general I wondered if we should always have a memory reservation to MemoryStream
🤔 I think that would double count batches from a MemTable however, so it isn't an obviously good improvement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This concern is quite reasonable; previously there was no need to add a MemoryReservation
to the MemoryStream
, and the with_reservation
function is a bit confused.
I plan to no longer use MemoryStream
for future recursive CTEs and instead use a new stream class that can read the WorkTable
multiple times, in order to support
https://github.com/apache/arrow-datafusion/blob/81b0a011705b17a09f494f550a5382b0c3414597/datafusion/physical-plan/src/recursive_query.rs#L316
So that the MemoryStream::with_reservation
API can be removed.
Looks great, @jonahgao—thanks for taking this on. Would it be sensible to have the worktable's update method take a And apologies that I've been a bit radio silent on this front @alamb! Been heads down on other stuff :/ |
No worries @matthewgapp -- I totally understand. Good luck and thank you for contributing the original PR 🙏 |
Make sense to me! Updated and thank you @matthewgapp |
Thanks @jonahgao and @matthewgapp 🙏 |
Which issue does this PR close?
Closes #9554.
Rationale for this change
MemoryReservation
to track the memory usage of the RecursiveQuery's buffer.MemoryReservation
are moved together to theWorkTable
, and then passed on to theMemoryStream
in the next iteration.MemoryReservation
will finally be freed after theMemoryStream
is dropped.What changes are included in this PR?
Are these changes tested?
Yes
Are there any user-facing changes?
No. Enable a new feature by default.