-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improvements in HashAggregationExec when spilling #7858
Comments
Maybe @kazuyukitanimura can help me with this issue |
Other potential "strategies" to avoid the memory overshoots might be reserve 2x the memory needed for the hash table, to account for sorting on spill,Pros: ensures the memory budget is preserved Cons: will cause more smaller spill files, and will cause some queries to spill even if they could have fit entirely in memory (more than half) write unsorted data to spill fileThe process would look like
Pros: ensures we don't spill unless the memory reservation is actually exhausted Cons: Each row is now read/rewritten twice rather than just once |
Thank you @milenkovicm for stress-testing and the improvement ideas. For 1 & 2, I think we can break down the By this comment For 3, if Cheers! |
Thanks @kazuyukitanimura , @alamb for your comments, I guess there is no perfect way to address 1. Another alternative for 1 might be to sort data in this would make state sorted by key, thus no need for There should be conversion from I get your idea regarding point 3, @kazuyukitanimura. I've tried to tune memory and batch size, what I observed that once the system is in border area this |
I find it very hard to reason about |
@milenkovicm is there anything we can do to help make it easier to reason about? Specifically maybe I can try and improve the documentation if you can explain what specifically you find hard to reason about |
Thanks @alamb, leave it with me for now. |
Thanks for the additional context @milenkovicm |
Yes that is right. I don't see avalanche of small spills and system does not get out of memory. Will try to put it under more test, and capture spill sizes |
I've done some more testing with spilling and as I mentioned in my previous message I don't see "small" spills.
If we take simple physical plan:
I have noticed that My guess is that unbounded channel in Maybe blocking until there is some free space is available would make sense instead of aggressively returning out of memory error, but not quite sure how would be implemented. |
I think another issue is that the RepartitionExec reserves memory on demand, so if memory is almost out (as it has been consumed by the AggregateExec) it may not be able to get even a small amount. The way to get the query to run is to restrict the memory used by the AggregateExec so there is still reasonable memory available for other operators One potential thing would be to use a different memory strategy (e.g. a |
|
Memory management is fun, still looking at it :) Two questions, first one should be: let reservation = MemoryConsumer::new(name).with_can_spill(true).register(context.memory_pool()); Second question, can impl MemoryReservation {
pub fn consumer(&self) -> &MemoryConsumer {
&self.registration.consumer
}
} |
I think this is a good point |
It might explain why |
@alamb & @kazuyukitanimura |
Hi @alamb & @kazuyukitanimura, |
I think tracking outstanding bugs / known potential deficiencies would be a good idea. Thank you for your diligence @milenkovicm |
Thank you @milenkovicm I agree that we can create future work tickets and close this one. |
as per our discussion, closing this task as I've captured leftovers in #8428 |
Thank you @milenkovicm ❤️ |
Is your feature request related to a problem or challenge?
I'd like to share few observations after putting
hash aggregation
( #7400) on a stress test. Datafusion v32 used.Before I start, I apologise if I did not get something correctly, aggregation code has changed a lot since last time I had a look.
I've created a test case which should put under pressure hash aggregation, forcing it to spill, so I can observe query behaviour. Data contains about 250M rows, with 50M unique
uid
s, which are going to be used to as aggregation key. It is around 3GB parquet file.Memory pool is configured as follows:
with 4 target partitions.
Query is rather simple:
I was expecting that query will not fail with
ResourcesExhausted
as aggregation would spill if under memory pressure, this was not the case.Describe the solution you'd like
Few low hanging fruits which can be addressed:
sort_batch
copies data, if I'm not mistaken. As spill is usually triggered under memory pressure, in most cases for all partitions around same time, it effectively doubles memory needed (in most cases I've observed ~2.5x more memory used than set up for memory pool).https://github.com/apache/arrow-datafusion/blob/7acd8833cc5d03ba7643d4ae424553c7681ccce8/datafusion/physical-plan/src/aggregates/row_hash.rs#L672
https://github.com/apache/arrow-datafusion/blob/7acd8833cc5d03ba7643d4ae424553c7681ccce8/datafusion/physical-plan/src/aggregates/row_hash.rs#L676
https://github.com/apache/arrow-datafusion/blob/7acd8833cc5d03ba7643d4ae424553c7681ccce8/datafusion/physical-plan/src/aggregates/row_hash.rs#L591
and
https://github.com/apache/arrow-datafusion/blob/7acd8833cc5d03ba7643d4ae424553c7681ccce8/datafusion/physical-plan/src/aggregates/row_hash.rs#L699
Available improvements, IMHO:
before spill, we could split batch into smaller blocks, sort those smaller blocks and write spill file per block, at the moment we write single file. Not sure what would be strategy for splitting batch into smaller blocks, we should take into account not to have to many open files as well.
Write more than one batch per spill
if
at ?https://github.com/apache/arrow-datafusion/blob/7acd8833cc5d03ba7643d4ae424553c7681ccce8/datafusion/physical-plan/src/aggregates/row_hash.rs#L591
and change
if
athttps://github.com/apache/arrow-datafusion/blob/7acd8833cc5d03ba7643d4ae424553c7681ccce8/datafusion/physical-plan/src/aggregates/row_hash.rs#L699
to
self.group_values.len() > 0
it would make more sense to "send" smaller batch than fail with
ResourcesExhausted
Describe alternatives you've considered
No other alternatives considered at the moment
Additional context
I have disabled (commented out) resource accounting in
RepartitionExec
as it would be the first one to freak out withResourcesExhausted
. From what I observed,RepartitionExec
would hold memory for a few batches of data when it raisesResourcesExhausted
. Change made in #4816 make sense, but in my test they were the first one to give up, before aggregation spill can occur.The text was updated successfully, but these errors were encountered: