Memory account not adding up in SortExec #10073

westonpace · 2024-04-14T02:45:54Z

Describe the bug

This is related / tangential to #9359

My primary problem is that I am trying to sort 100 million strings and always getting errors like:

Resources exhausted: Failed to allocate additional 39698712 bytes for ExternalSorterMerge[0] with 4053021824 bytes already allocated - maximum available is 26915520

After reading through the sort impl a bit I have noticed a few concerns (recorded in additional context)

To Reproduce

I don't have a df reproduction but this reproduces it for me in lance:

import pyarrow as pa
import lance

# Once the dataset has been generated you can comment out this generation code and reproduce the issue very quickly
print("Generating data")
my_strings = [f"string-{i}" * 3 for i in range(100 * 1024 * 1024)]                                                                                                                                               
my_table = pa.table({"my_strings": my_strings})                                                                                                                                                                  

print("Writing dataset")
ds = lance.write_dataset(                                                                                                                                                                                        
    my_table, "/tmp/big_strings.lance", mode="overwrite", schema=my_table.schema                                                                                                                                 
)                                                                                                                                                                                                                
del my_table                                                                                                                                                                                                     
# End of generation code

ds = lance.dataset("/tmp/big_strings.lance")
print("Training scalar index")
# To create a scalar index we must sort the column, this is where the error occurs
ds.create_scalar_index("my_strings", "BTREE")

Expected behavior

I can sort any number of strings, as long as I don't overflow the disk

Additional context

Here is how I understand memory accounting in the sort today:

As batches arrive, they are placed in an accumulation queue, and the size of the spillable reservation grows
Once the pool is out of space we begin the spill process
- The first part of the spill process is to sort the accumulation queue (which, at this point, has many batches in it)
- Each batch becomes an input stream for a SortPreservingMergeStream (this is a LOT of inputs).
- When the batch input stream is polled the batch is sorted, the unsorted batch is dropped (and removed from the spillable reservation) and the sorted batch is returned. The SortPreservingMergeStream then puts this batch in the batch builder (which adds it to the non-spillable reservation). This is a problem, as described below
- As the sort preserving merge stream is polled it polls the various inputs, fills up the batch builder, and then starts to emit output batches
Back in the sort exec the sort preserving merge stream is fully drained (try_collect). This is a problem, as described below
These collected and sorted batches are then written to the spill file.

The first problem (and the one causing my error) is that a sorted batch of strings (the output of sort_batch) is occupying 25% more memory than the unsorted batch of strings. I'm not sure if this buffer alignment, padding, or some kind of 2x allocation strategy used by the sort, but it seems reasonable something like this could happen. Unfortunately, this is a problem. We are spilling because we have used up the entire memory pool. We now take X bytes from the memory pool, convert it into 1.25 * X bytes, and try to put it back in the memory pool. This fails with the error listed above.

The second problem is that we are not accounting for the output of the sort perserving merge stream. Each output batch from the sort preserving merge stream is made up of rows from the various input batches. In the degenerate case, where the input data is fully random, this means we will probably require 2 * X bytes. This is because each output batch is made up of 1 batch from each input stream. We can't release any of the input batches until we emit the final output batch.

The solution to this second problem is that we should be streaming into the spill file. We should not collect from the sort preserving merge stream and then write the collected batches into the spill file. This problem is a bit less concerning for me at the moment because it is "datafusion uses more memory than it should" and not "datafusion is failing the plan with an error". We don't do a lot of sorting in lance and so we can work around it reasonably well by halving the size of the spill pool.

The text was updated successfully, but these errors were encountered:

alamb · 2024-04-15T11:11:25Z

It sounds like the issue, at a (really) high level is "additional buffer space is required to actually implement the spill"

And since during spill the plan is under memory pressure, getting this additional memory can and does fail

Some strategies I can think of are:

Simply turn off the memory accounting of intermediate results (String batches in your example) above during the spilling process (pro: simpler to implement I think, con: overshoots limits)
Reserve additional buffer space up front to be used during spill (e.g. set aside 50MB). (pro: won't overshoot, cons: not clear how much is "enough" and will reduce amount of memory that can be reserved
Reduce the memory required for intermediate spilling (e.g. maybe use a batch size 1/2 the size)

Maybe we can do 1 in the sort term while figuring out a more sophisticated strategy for 2 or 3

comphead · 2024-04-15T19:51:48Z

I'm investigating this as part of #9359

yjshen · 2024-08-25T19:17:48Z

Through examining the current implementation of multi-column sort's spill-to-disk strategies, I find we are asking for more memory during spill, which I think is worth discussing:
During the spill, Rows are created for comparison efficiency for each in-memory RecordBatch. Considering why we spill in the first place, does this Rows optimization increase the possibility of execution failure due to memory shortage?

I find this also related. #9528 (comment)

yjshen · 2024-08-25T20:37:35Z

Another point of code worth noticing is inside the current sort_batch implementation:

datafusion/datafusion/physical-plan/src/sorts/sort.rs

Lines 601 to 607 in 79fa6f9

    
           let indices = if is_multi_column_with_lists(&sort_columns) { 
        
               // lex_sort_to_indices doesn't support List with more than one column 
        
               // https://github.com/apache/arrow-rs/issues/5454 
        
               lexsort_to_indices_multi_columns(sort_columns, fetch)? 
        
           } else { 
        
               lexsort_to_indices(&sort_columns, fetch)? 
        
           };

Performance-wise, I think it's beneficial to apply the row format comparison to all multi-column cases, however, while considering sort_batch is used in multiple places where spill is being called; creating rows before comparing would introduce more memory pressure.

BTW, I think we should report memory usage inside lexsort_to_indices_multi_columns:

datafusion/datafusion/physical-plan/src/sorts/sort.rs

Lines 650 to 652 in 79fa6f9

    
           let converter = RowConverter::new(fields)?; 
        
           let rows = converter.convert_columns(&columns)?; 
        
           let mut sort: Vec<_> = rows.iter().enumerate().collect();

alamb · 2024-11-21T14:29:16Z

I think @2010YOUY01 may have fixed this recently 🤔

bfcrampton · 2025-01-06T05:12:54Z

FWIW I'm still seeing the same issue through LanceDB (lancedb/lance#2119 (comment)).

westonpace · 2025-01-06T15:04:19Z

FWIW I'm still seeing the same issue through LanceDB (lancedb/lance#2119 (comment)).

This isn't necessarily indicative as Lance lags behind Datafusion (currently we are at 42 which is 4 months behind). However, I just updated my local lance to release 44 (which should contain the potential fix @alamb is alluding to) and confirmed that the issue is still not fixed.

This also doesn't surprise me. I think the issue here is not double-counting but rather is dealing with the fact that a string array uses more memory after sorting than it was using before sorting (and so we run out of memory trying to spill).

I'll try and find some time today to create a pure datafusion reproducer.

westonpace · 2025-01-06T17:10:34Z

Here's a pure-rust datafusion-only example: westonpace@26ed75c

It takes a bit of time the first run to generate the strings test file (it probably doesn't need to be so big). After that it reproduces the issue quickly.

I've also added some prints that hopefully highlight the issue. Before we do an in-memory sort we have ~5MB of unsorted string data. After sorting we have 8MB of sorted string data.

This is not surprising to me. During the sort we are probably building a string array and probably using some kind of resize-on-append string building that is doubling and we end up with ~8MB because the amount we need is between 4MB and 8MB.

Unfortunately, this leads to a failure which it probably should not do. I think @alamb had some good suggestions in this comment

alamb · 2025-01-07T11:34:35Z

This is not surprising to me. During the sort we are probably building a string array and probably using some kind of resize-on-append string building that is doubling and we end up with ~8MB because the amount we need is between 4MB and 8MB.

This is a good call - maybe there is a factor of 2 less memory if we allocated the correct capacity up front somehow.

westonpace added the bug Something isn't working label Apr 14, 2024

comphead mentioned this issue May 4, 2024

[Epic] Remove Sort Merge Join Experimental status #9846

Open

19 tasks

westonpace mentioned this issue May 29, 2024

Expose spill configuration to users lancedb/lance#2119

Open

alamb mentioned this issue Aug 23, 2024

External sorting not working for (maybe only for string columns??) #12136

Open

cjolowicz mentioned this issue Sep 2, 2024

Fix allocation failures during sorting with spill-to-disk #12288

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Memory account not adding up in SortExec #10073

Memory account not adding up in SortExec #10073

westonpace commented Apr 14, 2024 •

edited

Loading

alamb commented Apr 15, 2024

comphead commented Apr 15, 2024

yjshen commented Aug 25, 2024

yjshen commented Aug 25, 2024

alamb commented Nov 21, 2024

bfcrampton commented Jan 6, 2025

westonpace commented Jan 6, 2025

westonpace commented Jan 6, 2025 •

edited

Loading

alamb commented Jan 7, 2025

Memory account not adding up in SortExec #10073

Memory account not adding up in SortExec #10073

Comments

westonpace commented Apr 14, 2024 • edited Loading

Describe the bug

To Reproduce

Expected behavior

Additional context

alamb commented Apr 15, 2024

comphead commented Apr 15, 2024

yjshen commented Aug 25, 2024

yjshen commented Aug 25, 2024

alamb commented Nov 21, 2024

bfcrampton commented Jan 6, 2025

westonpace commented Jan 6, 2025

westonpace commented Jan 6, 2025 • edited Loading

alamb commented Jan 7, 2025

westonpace commented Apr 14, 2024 •

edited

Loading

westonpace commented Jan 6, 2025 •

edited

Loading