
Some memory reservations of GroupedHashAggregateStream seem to be mis-tagged as spillable while they do not allow spilling #11390

Open
Ablu opened this issue Jul 10, 2024 · 3 comments
Labels
bug Something isn't working

Comments

Ablu commented Jul 10, 2024

Describe the bug

When using grouping (I tested with DISTINCT ON) in combination with a FairSpillPool, not all memory seems to be tagged correctly. While the grouping itself gracefully handles allocation errors by spilling, the same reservation is also shared with a BatchBuilder that does not.
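To illustrate why the tagging matters, here is a simplified, self-contained model of a fair spill pool (this is a sketch of the general idea, not DataFusion's actual implementation): unspillable reservations may use everything that is free, while each spillable reservation is capped at an even share of what the unspillable ones leave behind, so that every spillable consumer keeps room to spill. A non-spilling allocation routed through a spillable reservation therefore hits a cap well before the pool is actually full.

```rust
// Simplified model of a fair spill pool (NOT DataFusion's real code).
struct FairPool {
    capacity: usize,
    unspillable_used: usize,
    spillable: Vec<usize>, // bytes used per spillable consumer
}

impl FairPool {
    fn new(capacity: usize) -> Self {
        Self { capacity, unspillable_used: 0, spillable: Vec::new() }
    }

    fn register_spillable(&mut self) -> usize {
        self.spillable.push(0);
        self.spillable.len() - 1
    }

    /// A spillable consumer only gets a fair share of the remainder.
    fn try_grow_spillable(&mut self, id: usize, bytes: usize) -> Result<(), String> {
        let share = (self.capacity - self.unspillable_used) / self.spillable.len();
        if self.spillable[id] + bytes > share {
            return Err(format!(
                "Failed to allocate additional {bytes} bytes (fair share is {share} bytes)"
            ));
        }
        self.spillable[id] += bytes;
        Ok(())
    }

    /// An unspillable consumer may use everything that is still free.
    fn try_grow_unspillable(&mut self, bytes: usize) -> Result<(), String> {
        let used = self.unspillable_used + self.spillable.iter().sum::<usize>();
        if used + bytes > self.capacity {
            return Err("pool exhausted".to_string());
        }
        self.unspillable_used += bytes;
        Ok(())
    }
}

fn main() {
    let mut pool = FairPool::new(100);
    let a = pool.register_spillable();
    let _b = pool.register_spillable();
    // Two spillable consumers: each is capped at 50 even though all
    // 100 bytes are still free.
    assert!(pool.try_grow_spillable(a, 60).is_err());
    // The same 60 bytes succeed through an unspillable reservation,
    // which is why mis-tagging a non-spilling allocation as spillable
    // surfaces as a spurious ResourcesExhausted error.
    assert!(pool.try_grow_unspillable(60).is_ok());
    println!("ok");
}
```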

To Reproduce

(Tested on Linux, Fedora 40)

  1. Clone my reproducer: https://github.com/Ablu/datafusion-repro-resource-allocation (chances are that this can be simplified further)
  2. Generate the test file: cargo run --release -- generate out.parquet
  3. Increase the open file limit: ulimit -nS 102400
  4. Attempt to run the deduplication query: cargo run --release -- deduplicate out.parquet dedup.parquet
Failure to write back final result: ResourcesExhausted("Failed to allocate additional 8939616 bytes for GroupedHashAggregateStream[10] with 58373092 bytes already allocated - maximum available is 62500000")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

It looks like it tries to acquire from the smaller "spillable" portion of the pool, while it should probably allocate through a non-spillable reservation.

As a hack, I removed the offending reservation calls:

diff --git a/datafusion/physical-plan/src/sorts/builder.rs b/datafusion/physical-plan/src/sorts/builder.rs
index d32c60697..d0305c0bf 100644
--- a/datafusion/physical-plan/src/sorts/builder.rs
+++ b/datafusion/physical-plan/src/sorts/builder.rs
@@ -65,15 +65,15 @@ impl BatchBuilder {
             indices: Vec::with_capacity(batch_size),
             reservation,
         }
     }
 
     /// Append a new batch in `stream_idx`
     pub fn push_batch(&mut self, stream_idx: usize, batch: RecordBatch) -> Result<()> {
-        self.reservation.try_grow(batch.get_array_memory_size())?;
+        // dbg!(self.reservation.try_grow(batch.get_array_memory_size()))?;
         let batch_idx = self.batches.len();
         self.batches.push((stream_idx, batch));
         self.cursors[stream_idx] = BatchCursor {
             batch_idx,
             row_idx: 0,
         };
         Ok(())
@@ -137,15 +137,15 @@ impl BatchBuilder {
             let retain = stream_cursor.batch_idx == batch_idx;
             batch_idx += 1;
 
             if retain {
                 stream_cursor.batch_idx = retained;
                 retained += 1;
             } else {
-                self.reservation.shrink(batch.get_array_memory_size());
+                // self.reservation.shrink(batch.get_array_memory_size());
             }
             retain
         });
 
         Ok(Some(RecordBatch::try_new(
             Arc::clone(&self.schema),
             columns,

That avoids the immediate error. It looks like we progress further, but eventually end up in some kind of deadlock (which I have not fully understood yet, though it does not seem to be related to this hack?)

Expected behavior

The query and write-back should succeed after reading back the spilled data.

Additional context

Originally discussed/analyzed on Discord: https://discord.com/channels/885562378132000778/1166447479609376850/1260302583637999756

@Ablu Ablu added the bug Something isn't working label Jul 10, 2024

alamb commented Jul 10, 2024

cc @kazuyukitanimura


Ablu commented Aug 5, 2024

Simply sorting seems to run into the same kind of problem:

External(External(ResourcesExhausted("Failed to allocate additional 2470912 bytes for ExternalSorterMerge[0] with 530680832 bytes already allocated for this reservation - 2079445 bytes remain available for the total pool")))

EDIT: that is probably #5108?


alamb commented Aug 8, 2024

Hi @Ablu -- it could be.

In general, I think our support for ensuring a sort can spill needs some tweaking / help: once the system is under memory pressure (i.e. the pool is almost exhausted), several of the operations required to actually spill need more memory themselves (such as merging the in-memory buffers).

In other words, the process of trying to spill can itself trigger the "resources exhausted" error.

One approach I can think of is to turn off resource checking while spilling. This would be simple to implement, but would result in a memory overshoot.

An alternative is to pre-reserve memory for spilling before memory pressure gets too high, but this would likely result in memory that is almost never used (held in reserve only in case it is needed).
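The pre-reserve idea could be sketched like this (an illustration only, with made-up `Pool`/`Sorter` types, not DataFusion code): the sorter claims a fixed merge budget at construction, so the memory needed to merge spilled runs is guaranteed even once the pool is otherwise exhausted. The cost is exactly the downside mentioned above: those bytes sit idle unless a spill actually happens.

```rust
// Sketch of pre-reserving merge memory before pressure builds up.
struct Pool { capacity: usize, used: usize }

impl Pool {
    fn try_grow(&mut self, bytes: usize) -> Result<(), String> {
        if self.used + bytes > self.capacity {
            return Err("ResourcesExhausted".into());
        }
        self.used += bytes;
        Ok(())
    }
}

struct Sorter {
    merge_budget: usize, // reserved up front, untouched until spill time
    buffered: usize,
}

impl Sorter {
    fn new(pool: &mut Pool, merge_budget: usize) -> Result<Self, String> {
        // Claim the merge memory while the pool still has headroom.
        pool.try_grow(merge_budget)?;
        Ok(Self { merge_budget, buffered: 0 })
    }

    fn insert(&mut self, pool: &mut Pool, bytes: usize) -> Result<(), String> {
        pool.try_grow(bytes)?;
        self.buffered += bytes;
        Ok(())
    }

    /// Merging spilled runs draws on the pre-reserved budget instead of
    /// asking the (exhausted) pool for more memory.
    fn merge_spilled(&self, needed: usize) -> Result<(), String> {
        if needed > self.merge_budget {
            return Err("merge needs more than the reserved budget".into());
        }
        Ok(())
    }
}

fn main() {
    let mut pool = Pool { capacity: 100, used: 0 };
    let mut sorter = Sorter::new(&mut pool, 20).unwrap();
    // Fill the rest of the pool with buffered batches.
    sorter.insert(&mut pool, 80).unwrap();
    assert!(pool.try_grow(1).is_err()); // the pool itself is now full...
    assert!(sorter.merge_spilled(15).is_ok()); // ...but merging still fits
    println!("ok");
}
```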
