Initial MemoryManager and DiskManager APIs for query execution + External Sort implementation #1526
Conversation
Thanks @yjshen -- I'll try and give this a good look over the weekend
Very exciting stuff, thanks @yjshen :)
```rust
/// Initialize
pub(crate) fn initialize(self: &Arc<Self>) {
    let manager = self.clone();
    let handle = task::spawn(async move {
```
Am I correct that this background refresh process is needed because tracking consumers' memory updates are managed internally rather than reported through the MemoryManager?
Yes, I'm creating a background task that runs periodically to update the tracking consumers' total memory usage, so that controlling consumers don't need to ask for available memory as frequently.
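For illustration only, here is a minimal sketch of this push-style design, assuming a tokio runtime; the trait shape and field names (`trackers`, `tracker_total`) are hypothetical, not the PR's exact API:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::time::Duration;
use tokio::task;

/// Hypothetical minimal model of the design; not the PR's exact API.
trait MemoryConsumer: Send + Sync {
    fn mem_used(&self) -> usize;
}

struct MemoryManager {
    trackers: Mutex<Vec<Arc<dyn MemoryConsumer>>>,
    tracker_total: AtomicUsize,
}

impl MemoryManager {
    fn initialize(self: &Arc<Self>) {
        let manager = self.clone();
        // Detached task: periodically re-sum the tracking consumers' usage so
        // controlling consumers can read a cached total instead of polling.
        task::spawn(async move {
            let mut interval = tokio::time::interval(Duration::from_millis(100));
            loop {
                interval.tick().await;
                let total: usize = manager
                    .trackers
                    .lock()
                    .unwrap()
                    .iter()
                    .map(|c| c.mem_used())
                    .sum();
                manager.tracker_total.store(total, Ordering::Relaxed);
            }
        });
    }
}
```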
I haven't put much thought into this yet, but I am curious what your thoughts are on having tracking consumers also report memory usage updates directly to the memory manager? Basically similar to what we have with the controlling consumers, but without the capability to force them to spill.
The main reason is to reduce interaction with the memory manager during a consumer's execution, both to reduce complexity and to eliminate synchronization needs.
- For tracking consumers that were converted from controlling consumers: for example, the hashtable size / partial sort in-memory size is already known when the tracking consumer is created or transformed to, so there is no further need for them to acquire memory or interact with the memory manager.
- For other tracking consumers with internal computational buffers: one can report usage by simply updating the internal `mem_used` state, with no extra function calls or interaction with the memory manager during execution.
I think the idea of periodically polling tracking consumers is reasonable.
I am a little worried about a task that polls based on some clock interval, however -- it is likely that the frequency will be too fast or too slow.
What about updating tracking consumers on every call to try_grow, or on every query to the memory manager for total memory used?
There is no longer a maintained `tracker_total` and no more background maintenance task. The memory manager now computes the total tracker memory each time its `can_grow` is called.
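Continuing the hypothetical sketch above, the pull-based alternative might look roughly like this (again, the signature is illustrative, not the PR's actual `can_grow`):

```rust
impl MemoryManager {
    /// Decide whether a controlling consumer may grow by `additional` bytes.
    /// Trackers' usage is summed on demand, so no cached total can go stale.
    fn can_grow(&self, additional: usize, num_controlling: usize, pool_size: usize) -> bool {
        let tracker_total: usize = self
            .trackers
            .lock()
            .unwrap()
            .iter()
            .map(|c| c.mem_used())
            .sum();
        // Remaining pool is split evenly among active controlling consumers.
        let per_consumer_cap =
            pool_size.saturating_sub(tracker_total) / num_controlling.max(1);
        additional <= per_consumer_cap
    }
}
```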
@yjshen thanks, this is a milestone PR for the memory controller in DataFusion.
Thank you so much @yjshen
I think the memory manager interface in this PR (`MemoryManager` / `MemoryConsumer`) is a good foundation going forward.
Prior to merging this PR I would like to see:
- The ref count cycle between the memory manager and execution plans -- though I think this PR could be merged into DataFusion as is and we could iterate from there
- Some tests for `MemoryManager` and `ExternalSorter` (as suggested in the PR comments here)

I also think it is worth removing / reconsidering the background loop for tracked memory consumers, though since it isn't used yet I don't think it is critical to remove prior to merging this PR.
But again, really nice and thank you for this contribution
```rust
}

/// Register a new memory consumer for memory usage tracking
pub(crate) fn register_consumer(self: &Arc<Self>, consumer: Arc<dyn MemoryConsumer>) {
```
I didn't see any code that registered any Tracking consumers yet.
In terms of plumbing, what do you think about:
- making all `ExecutionPlan`s `MemoryConsumer`s and providing default implementations (that report 0 usage)
- registering all `ExecutionPlan`s somehow as `MemoryConsumer`s as part of physical plan creation?

That way all implementations of `ExecutionPlan` could report their usage without having to explicitly register themselves with the memory manager. Also the manager could report on how many operators were not providing any statistics, etc.
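A hedged sketch of what that plumbing could look like (trait shapes are hypothetical, not DataFusion's actual definitions):

```rust
/// Every plan is a consumer; the default reports zero (i.e. "untracked"),
/// so individual operators only override `mem_used` when they can do better.
trait MemoryConsumer {
    fn mem_used(&self) -> usize {
        0
    }
}

trait ExecutionPlan: MemoryConsumer {
    fn name(&self) -> &str;
}

struct FilterExec;
impl MemoryConsumer for FilterExec {} // accepts the 0-usage default
impl ExecutionPlan for FilterExec {
    fn name(&self) -> &str {
        "FilterExec"
    }
}
```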
@houqp @alamb Thanks for your detailed and insightful review! Resolved:

To discuss:

I think there is a gap between `ExecutionPlan` and `MemoryConsumer`, since an `ExecutionPlan` is executed per partition and the per-partition streams are what actually hold memory:

```rust
/// Trait for types that stream [arrow::record_batch::RecordBatch]
pub trait RecordBatchStream: Stream<Item = ArrowResult<RecordBatch>> + MemoryConsumer {
    /// Returns the schema of this `RecordBatchStream`.
    ///
    /// Implementation of this trait should guarantee that all `RecordBatch`'s returned by this
    /// stream should have the same schema as returned from this method.
    fn schema(&self) -> SchemaRef;
}

/// Trait for a stream of record batches.
pub type SendableRecordBatchStream = Pin<Arc<dyn RecordBatchStream + Send + Sync>>;
```

Should I make `SendableRecordBatchStream` a `Pin<Arc>` instead of a `Pin<Box>` and register each stream `Arc` to the runtime at each `execute()`? The registration API

```rust
pub fn register_consumer(&self, memory_consumer: &Arc<dyn MemoryConsumer>) {
```

may sometimes be awkward to call:

```rust
runtime.register_consumer(&(streams.clone() as Arc<dyn MemoryConsumer>));
```

Any thoughts?
This is a good point (that the memory management is done on a per-partition basis rather than a per-plan basis). I would recommend we don't change `SendableRecordBatchStream`. I will make time today to review this PR again thoroughly -- thank you @yjshen, I think we are close.
Not fully caught up, but how would you consume from such a thing? You need a mutable reference to poll a stream; streams, like iterators, are not meant to be shared. As an aside, the `Sync` constraint on `SendableRecordBatchStream` is potentially extraneous for this reason: you can't do much with a shared stream anyway, so requiring share-ability between threads imposes unnecessary implementation constraints.
@tustvold Thanks for bringing it up. I see the stream as a single place where all runtime entities could be auto-registered to the memory manager at once. Maybe a wrapper over the stream could achieve the goal?
My instinct would be to suggest having the shared ref internal to the stream implementation, instead of a wrapper; otherwise I suspect you will run into borrow checker, pinning, and async pain. This would also avoid needing to make breaking changes to `SendableRecordBatchStream`. Another thing to potentially think about is that many of the operators aren't actually streams; rather, they spawn a tokio task and then return an mpsc queue. There will need to be some accounting of both the data buffered in the queue and the data in the operator's "task". My gut feeling is this is going to require adding some sort of RAII tracking field to …
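As a rough illustration of the "shared ref inside the stream" idea (all names here are hypothetical, not the PR's API), the stream could own an `Arc`'d usage cell that the memory manager also holds:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Shared usage cell, handed to the memory manager once at registration,
/// while the stream itself stays uniquely owned (a plain Pin<Box<...>>).
struct StreamUsage(AtomicUsize);

struct SortStream {
    usage: Arc<StreamUsage>, // the manager holds another Arc to this cell
    // ... buffered batches, sort cursors, etc.
}

impl SortStream {
    fn on_batch_buffered(&mut self, bytes: usize) {
        // Updating internal state is all the stream needs to do; the manager
        // reads the shared cell whenever it wants a fresh total.
        self.usage.0.fetch_add(bytes, Ordering::Relaxed);
    }
}
```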
I reviewed the memory manager changes -- I think it is good enough to start with and we can iterate from there.
I didn't get a chance to fully review the changes to `sort_preserving_merge` -- will keep at it tomorrow.
cc @tustvold
```rust
/// Merge buffered, self-sorted record batches to get an order.
///
/// Internally, it uses MinHeap to reduce extra memory consumption
/// by not concatenating all batches into one and sorting it as done by `SortExec`.
```
I need to study the connection between `SortExec`, `SortPreservingMergeStream` and `InMemSortStream` some more to fully understand this. I can't help but think that `InMemSortStream` is doing the same thing as `SortPreservingMergeStream` -- and I wonder if we can reuse the same code.
The main difference between `InMemSortStream` and `SortPreservingMergeStream` lies in the number of "entities" each assumes it will merge (batches for IMSS, streams for SPMS). Since `InMemSort` is meant to merge many more partially ordered "entities", the sorter should reduce the number of comparisons needed to pop each item; hence a MinHeap was introduced. On the other hand, `InMemSort` is more specialized in that each "entity" is a single record batch, so its logic is simpler than that of `SortPreservingMergeStream`, which must also consider stream continuation.

Currently, the parts common to both sorts, `SortKeyCursor` and `RowIndex`, are extracted to `sorts/mod.rs` to reduce duplication.
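To make the MinHeap argument concrete, here is a toy k-way merge over sorted `i32` vectors standing in for self-sorted record batches; it pops each item with an O(log k) heap operation instead of comparing all k batch heads (illustrative only, not the PR's code):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

fn merge_sorted(batches: Vec<Vec<i32>>) -> Vec<i32> {
    // Each heap entry is (next value, batch index, offset). `Reverse` turns
    // Rust's max-heap into a min-heap so `pop` yields the global minimum.
    let mut heap = BinaryHeap::new();
    for (i, b) in batches.iter().enumerate() {
        if let Some(&v) = b.first() {
            heap.push(Reverse((v, i, 0usize)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((v, i, off))) = heap.pop() {
        out.push(v);
        // Refill the heap from the batch we just consumed from.
        if let Some(&next) = batches[i].get(off + 1) {
            heap.push(Reverse((next, i, off + 1)));
        }
    }
    out
}
```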
I also don't have a good solution to the `SendableRecordBatchStream` problem off the top of my head, and will need to think more about it. Other than that, I think the change looks good as a first iteration 👍
Thank you again @yjshen for this great contribution.
I have some concerns about parts of this PR (listed below), but I still think we should merge this PR as is and handle the concerns as follow-ons.
Thus my plan is to send a note to the dev mailing list and slack channel asking for comments (on the API specifically) and if we don't hear any major concerns I suggest we merge this PR tomorrow.
My rationale for merging despite these concerns is:
- This PR gets the necessary foundations in for limiting resources at runtime: specifically the `RuntimeEnv`, `MemoryConsumer`, `MemoryManager`, and `DiskManager` APIs.
- It is backwards compatible (e.g. external sort is not connected to anything, so there should be no performance regressions).

I think it would be good to file a follow-on PR marking the `MemoryManager`, `DiskManager`, and `MemoryConsumer` APIs as experimental, and I will prepare such a PR.
My concerns (aka major follow-on work):
- External sorting is not connected to anything -- aka it is code that isn't used (yet).
- `InMemSortStream` and `SortPreservingMergeStream` are doing very much the same thing -- consolidating the code will be important as we move to optimize them.
- As most of the rest of the system isn't connected to the memory system, the APIs may not be fully adequate (but we can iterate on that).

This PR also unlocks some cool follow-on projects (like supporting external group by / group by hash / spill to disk) 🚗
FWIW I also plan to run the TPCH benchmarks on this PR and will post the results (I don't expect any changes).
Thanks, @yjshen, great work! I went through it briefly. BTW, some of the details are well handled, such as `pool_size`.
```rust
let path = tmp_dir.path().to_str().unwrap().to_string();
std::mem::forget(tmp_dir);

Self {
```
It is better to define constants for the default values of `batch_size` and `memory_fraction` than to use bare numbers.
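A minimal sketch of that suggestion (the names and values below are placeholders, not the crate's actual defaults):

```rust
/// Named defaults instead of bare numbers at the construction site.
const DEFAULT_BATCH_SIZE: usize = 8192;
const DEFAULT_MEMORY_FRACTION: f64 = 0.7;

struct RuntimeConfig {
    batch_size: usize,
    memory_fraction: f64,
}

impl Default for RuntimeConfig {
    fn default() -> Self {
        Self {
            batch_size: DEFAULT_BATCH_SIZE,
            memory_fraction: DEFAULT_MEMORY_FRACTION,
        }
    }
}
```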
I think these are already the defaults of the config and are meant to be overridden if needed. And a single place for these defaults?

BTW, I think it is worth reconsidering and restructuring the multiple configs and their usages: `ExecutionConfig`, `PhysicalPlanConfig` and `RuntimeConfig`. At the least, we should not pass `target_batch_size` during query planning, since we already have `RuntimeEnv` plumbed through the `execute()` API now; I will create a follow-up PR once we've merged this one.
Makes sense to me. We can deal with it uniformly in the following PR.
Will you post the results in this review?
Yes, I will do so.
OK, I am going to fire up my benchmark machine, get some numbers, and assuming they look good, merge this PR.
Here are the results of my comparison.

Setup:

Benchmark command:

```shell
cd benchmarks
cargo run --release --bin tpch -- benchmark datafusion --partitions 16 -m --iterations 10 --path /data/tpch_data_10G/ --format tbl --query 1
```
I think that, while not perfect, this PR is a step in the right direction towards being able to handle queries that need to spill. 🚀 Thank you @yjshen. Shall I file follow-on tickets for the next step? I am particularly interested in ensuring we consolidate the Sort code (so there is only a single sort operator that sorts in memory if it has enough memory budget but spills to disk if needed). I would enjoy helping make this happen (perhaps by writing some tests?)
Thank you all again for helping me with the initial document proposal as well as insightful reviews in this PR ❤️
Yes, that would be great! Please open the issues that you have in mind; I have a bunch of ideas for follow-ups as well. I think we already have the foundation for many exciting features to come. How about opening an umbrella issue? I could file sub-task issues under it as well.
Thanks, I can do the initial consolidation; please join at any time or just take it over, depending on your schedule.
Which issue does this PR close?
Closes #587.
Rationale for this change
When DataFusion processes a single partition, it will keep allocating memory until the OS or the container system kills it. To make it worse, concurrently executing partitions, or even simultaneously running plans, compete for the available memory until it is all exhausted, making it even harder to meet the memory requirements of every operator in each running partition. In that case, none of the partitions or plans would run to completion.
Therefore, the ability to control the total memory usage of the process as a whole, and at the same time allocate the available memory to each executing partition, is extremely important. Under this guarantee, when memory is sufficient, an operator can acquire as much memory as it needs for its computation; when memory is tight, the operator can downgrade to using the disk to store some intermediate results (spilling to disk) and run with limited memory.
What changes are included in this PR?
The proposed memory management architecture is the following:
- `RuntimeConfig.max_memory` and `RuntimeConfig.memory_fraction` (float64 between 0..1) bound the total usage; the actual max memory DataFusion could use is `pool_size = max_memory * memory_fraction`.
- The memory manager keeps a set of `Memory Consumers`. Operators or others are encouraged to register themselves to the memory manager and report their usage through `mem_used()`.
- `Controlling` consumers acquire memory during their execution and release memory through `spill` if no more memory is available.
- `Tracking` consumers exist for reporting purposes, to provide a more accurate memory usage estimation for memory consumers.
- Each active controlling consumer can use up to `(pool_size - all_tracking_used) / active_num_controlling_consumers`.
.Are there any user-facing changes?
Users could limit the max memory used by DataFusion through `RuntimeConfig::max_memory` and `RuntimeConfig::memory_fraction`.
Note
In addition to the proposed memory manager, as well as the runtime plumbed through the execute API, an `ExternalSortExec` is implemented to illustrate the API usage.