Add design doc for memory tracking in the plugin #2628

jihoonson · 2024-11-26T21:42:58Z

This PR adds a new doc explaining the design of the memory tracking system in the plugin. The new doc is based on an existing doc written by @revans2, which is no longer publicly accessible, and has been updated to reflect the current state. The main purpose of this PR is to have a pointer to the design doc in the code, so that people who are interested in this area can easily find it.

Signed-off-by: Jihoon Son <[email protected]>

revans2 · 2024-12-02T19:48:10Z

docs/memory_management.md

+For memory management, the plugin tracks every device memory allocation and de-allocation request during processing.
+While there is enough memory available, the allocation request succeeds and the task continues processing.
+However, when the allocation request cannot succeed due to lack of memory, the plugin pauses that thread. When all of the active tasks have at least one thread paused, the plugin starts to roll back some of those paused threads to points where all of their input data is spillable, and lets the other threads try to complete. If every thread except one has been rolled back and the one remaining thread still cannot make progress, then the plugin picks one thread to split its input and try again.
+


I would prefer it if we reword things a bit

Suggested change

For memory management, the plugin uses RMM and wraps it to provide the ability to recover from out-of-memory errors. The first line of defense is spilling, which is provided in the spark-rapids plugin itself. The second line of defense is described in this document and is implemented in [SparkResourceAdaptorJni.cpp](../src/main/cpp/src/SparkResourceAdaptorJni.cpp). This code keeps track of each task thread and the state of those threads.

While RMM allocation requests succeed, this code does not interfere with the running threads.

However, when an allocation request fails, even after spilling, this code tries to pause or roll back threads to free up memory and allow other threads/tasks to succeed.

I overhauled this part. Please have another look.
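
For readers following this thread, the escalation discussed here (pause, then roll back, then split and retry) can be sketched as a small decision function. This is only an illustrative Python model, not the C++ code in SparkResourceAdaptorJni.cpp: the `State` enum, the `resolve_deadlock` name, the one-thread-per-task simplification, and the rule that a lower task id means higher priority are all assumptions made for the example.

```python
from enum import Enum


class State(Enum):
    RUNNING = "THREAD_RUNNING"
    BLOCKED = "THREAD_BLOCKED"
    BUFN = "THREAD_BUFN"


def resolve_deadlock(states):
    """Return (action, task_id) for a snapshot mapping task id -> State.

    Assumptions for illustration only: one thread per task, and a lower
    task id means a higher priority.
    """
    if not states or any(s is State.RUNNING for s in states.values()):
        # At least one thread can still make progress, so nothing to do.
        return ("none", None)
    blocked = [t for t, s in states.items() if s is State.BLOCKED]
    if blocked:
        # Every thread is blocked or BUFN: roll back the lowest-priority
        # blocked thread so that its input becomes spillable.
        return ("roll_back", max(blocked))
    # Every thread is BUFN: rolling back is no longer enough, so one thread
    # is asked to split its input and retry.
    return ("split_and_retry", min(states))
```

With two blocked tasks, `resolve_deadlock({1: State.BLOCKED, 2: State.BLOCKED})` picks task 2 for rollback; once every task is `BUFN`, the function escalates to a split-and-retry request instead.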

revans2 · 2024-12-02T19:49:03Z

docs/memory_management.md

+
+The thread can have one of these states at a time:
+
+- `UNKNOWN`: the thread has not been registered with the tracking system.


Suggested change

- `UNKNOWN`: the thread has not been registered with the tracking system. In this state the thread will not be messed with. It is here as a fail-safe.

I'm not sure if the additional comment is necessary. Is it not clear that the system will do nothing with the unregistered threads?

revans2 · 2024-12-02T20:07:04Z

docs/memory_management.md

+- `THREAD_BLOCKED`: the allocation is blocked due to lack of memory. The thread is waiting for enough memory to be available.
+- `THREAD_BUFN_THROW`: a deadlock has been detected as all threads are blocked, and this thread has been selected to roll back to the point where all its data is spillable.
+- `THREAD_BUFN_WAIT`: the thread has initiated the rollback.
+- `THREAD_BUFN`: the thread has rolled back and is now blocked until further notice (BUFN). The task will be unblocked once high priority tasks release enough memory.


Suggested change

- `THREAD_BUFN`: the thread has rolled back and is now blocked until further notice (BUFN). The task will be unblocked once another task completes. In this case, "completes" may mean that the other task releases the GPU semaphore rather than fully completing.

I'm not sure whether I understand the code correctly for when the state transitions out of `THREAD_BUFN`. AFAICT, the state transitions from `THREAD_BUFN` to `THREAD_RUNNING` after a free is called (`do_deallocate()` -> `dealloc_core()` -> `wake_next_highest_priority_blocked()`). Another place where the state transitions out of `THREAD_BUFN` is `pool_thread_finished_for_tasks()`, which is called when a data receive is completed during shuffle. I don't see releasing the semaphore directly triggering the unblocking of a task. Can you give me some pointers to where it happens in the code?

Another question for this state: why is `wake_next_highest_priority_blocked()` called in `post_alloc_success_core()`?

After thinking about your comment, I think it would be better for this part to just explain what each state is. Perhaps I should add another section to explain when the state transitions from one to another? I don't think every transition is worth explaining, such as `THREAD_RUNNING` -> `UNKNOWN`, so we can probably explain only the important ones.
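
As a rough illustration of what such a transitions section could cover, the sketch below encodes the transitions that the state descriptions in this thread imply, plus a wake-on-free step. It is a simplified Python model under stated assumptions, not the actual implementation: the `TRANSITIONS` table, `transition`, and `wake_highest_priority_waiter` are hypothetical names, a larger priority number is assumed to win, and treating `THREAD_BLOCKED` and `THREAD_BUFN` waiters alike when waking is an assumption rather than something confirmed here.

```python
# Transitions implied by the state descriptions (illustrative, not exhaustive).
TRANSITIONS = {
    ("THREAD_RUNNING", "THREAD_BLOCKED"): "allocation failed for lack of memory",
    ("THREAD_BLOCKED", "THREAD_RUNNING"): "woken after memory was freed",
    ("THREAD_BLOCKED", "THREAD_BUFN_THROW"): "all threads blocked; selected to roll back",
    ("THREAD_BUFN_THROW", "THREAD_BUFN_WAIT"): "thread initiated the rollback",
    ("THREAD_BUFN_WAIT", "THREAD_BUFN"): "rolled back; blocked until further notice",
    ("THREAD_BUFN", "THREAD_RUNNING"): "woken after memory was freed",
    ("THREAD_BUFN", "THREAD_SPLIT_THROW"): "all threads BUFN; selected to split and retry",
}


def transition(states, tid, new_state):
    """Move thread `tid` to `new_state`, rejecting transitions not in the table."""
    old_state = states[tid]
    if (old_state, new_state) not in TRANSITIONS:
        raise ValueError(f"unexpected transition {old_state} -> {new_state}")
    states[tid] = new_state


def wake_highest_priority_waiter(states, priority):
    """On a free, wake the waiting thread with the highest priority.

    Assumption: both THREAD_BLOCKED and THREAD_BUFN threads count as waiters,
    and a larger priority value wins. Returns the woken thread id or None.
    """
    waiting = [t for t, s in states.items()
               if s in ("THREAD_BLOCKED", "THREAD_BUFN")]
    if not waiting:
        return None
    tid = max(waiting, key=lambda t: priority[t])
    transition(states, tid, "THREAD_RUNNING")
    return tid
```

Keeping the table separate from the wake logic makes it easy to document only the important transitions while still rejecting unexpected ones in the model.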

revans2 · 2024-12-02T20:08:18Z

docs/memory_management.md

+- `THREAD_BUFN_THROW`: a deadlock has been detected as all threads are blocked, and this thread has been selected to roll back to the point where all its data is spillable.
+- `THREAD_BUFN_WAIT`: the thread has initiated the rollback.
+- `THREAD_BUFN`: the thread has rolled back and is now blocked until further notice (BUFN). The task will be unblocked once high priority tasks release enough memory.
+- `THREAD_SPLIT_THROW`: a deadlock has been detected as all threads are BUFN, and this thread has been selected to roll back, split its input, and retry.


Suggested change

- `THREAD_SPLIT_THROW`: a deadlock has been detected as all threads are BUFN, and this thread has been selected to roll back, split its input, and retry. Not all code is guaranteed to support splitting its input to try again.

Rephrased as

A deadlock has been detected as all threads are BUFN, and this thread has been selected to roll back, split its input, and retry. Note that the processing will fail without retrying if the input cannot be further split.
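
To illustrate the caller-side behavior this rephrasing describes, here is a minimal Python sketch of a split-and-retry loop that fails once the input cannot be split any further. Everything in it is hypothetical and for illustration only: `SplitAndRetryError`, `process_with_split_retry`, and the halving strategy are assumptions, not the plugin's actual API.

```python
class SplitAndRetryError(Exception):
    """Hypothetical signal that the work must be retried on smaller input."""


def process_with_split_retry(batch, process):
    """Run `process` over `batch`, halving the input whenever it asks for a split.

    Re-raises the error when a chunk can no longer be split, which models
    "the processing will fail without retrying".
    """
    pending = [batch]
    results = []
    while pending:
        chunk = pending.pop(0)
        try:
            results.append(process(chunk))
        except SplitAndRetryError:
            if len(chunk) < 2:
                raise  # a single-row chunk cannot be split further
            mid = len(chunk) // 2
            # Retry both halves before any later chunks, preserving order.
            pending[:0] = [chunk[:mid], chunk[mid:]]
    return results
```

A caller would pass its own `process` function; code paths that cannot split their input correspond to the case where the error is simply propagated.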

jihoonson · 2024-12-10T19:27:45Z

@revans2 would you have another look?

Add doc for memory management

74febef

Signed-off-by: Jihoon Son <[email protected]>

jihoonson force-pushed the memory-management-doc branch from 2ed37f0 to 74febef November 26, 2024 21:44

jihoonson requested a review from revans2 November 26, 2024 21:56

revans2 reviewed Dec 2, 2024

address comments and revise

d61032c

jihoonson requested a review from revans2 December 5, 2024 18:50

jihoonson changed the base branch from branch-24.12 to branch-25.02 December 12, 2024 17:18

jihoonson marked this pull request as ready for review December 12, 2024 17:19


