-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add design doc for memory tracking in the plugin #2628
base: branch-25.02
Are you sure you want to change the base?
Conversation
Signed-off-by: Jihoon Son <[email protected]>
2ed37f0
to
74febef
Compare
docs/memory_management.md
Outdated
For memory management, the plugin tracks every device memory allocation and de-allocation request during processing. | ||
While there is enough memory available, the allocation request succeeds and the task continues processing. | ||
However, when the allocation request cannot succeed due to lack of memory, the plugin pauses that thread. When all of the active tasks have at least one thread paused, the plugin starts to roll back some of those paused threads to points where all of their input data is spillable, and let the other threads try to complete. If every thread except one has been rolled back and the one remaining thread cannot still make progress, then pluging picks up one thread to split its input and try again. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer it if we reword things a bit
For memory management, the plugin tracks every device memory allocation and de-allocation request during processing. | |
While there is enough memory available, the allocation request succeeds and the task continues processing. | |
However, when the allocation request cannot succeed due to lack of memory, the plugin pauses that thread. When all of the active tasks have at least one thread paused, the plugin starts to roll back some of those paused threads to points where all of their input data is spillable, and let the other threads try to complete. If every thread except one has been rolled back and the one remaining thread cannot still make progress, then pluging picks up one thread to split its input and try again. | |
For memory management, the plugin uses RMM and wraps it to provide the ability to recover from out of memory errors. The first line of defense is spilling which is provided in the spark-rapids plugin itself. The second line of defense is described in this document and is implemented in [SparkResourceAdaptorJni.cpp](../src/main/cpp/src/SparkResourceAdaptorJni.cpp). This code keeps track of each task thread and tracks the state of those threads. | |
While RMM allocation requests succeed this will not interfere with the running threads. | |
However, when the allocation request fails, even after spilling, this code will try and pause or roll back threads to free up memory and allow other threads/tasks to succeed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I overhauled this part. Please have another look.
docs/memory_management.md
Outdated
|
||
The thread can have one of these states at a time: | ||
|
||
- `UNKNOWN`: the thread has not been registered with the tracking system. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- `UNKNOWN`: the thread has not been registered with the tracking system. | |
- `UNKNOWN`: the thread has not been registered with the tracking system. In this state the thread will not be messed with. It is here as a fail-safe. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if the additional comment is necessary. Is it not clear that the system will do nothing with the unregistered threads?
docs/memory_management.md
Outdated
- `THREAD_BLOCKED`: the allocation is blocked due to lack of memory. The thread is waiting for enough memory to be available. | ||
- `THREAD_BUFN_THROW`: a deadlock has been detected as all threads are blocked, and this thread has been selected to roll back to the point where all its data is spillable. | ||
- `THREAD_BUFN_WAIT`: the thread has initiated the rollback. | ||
- `THREAD_BUFN`: the thread has rolled back and is now blocked until further notice (BUFN). The task will be unblocked once high priority tasks release enough memory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- `THREAD_BUFN`: the thread has rolled back and is now blocked until further notice (BUFN). The task will be unblocked once high priority tasks release enough memory. | |
- `THREAD_BUFN`: the thread has rolled back and is now blocked until further notice (BUFN). The task will be unblocked once another task completes. In this case completes may mean that it releases the GPU semaphore instead of fully completing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure whether I understand the code correctly for when the state transitions from THREAD_BUFN
. AFAIT, the state transitions from THREAD_BUFN
to THREAD_RUNNING
after a free is called (do_deallocate()
-> dealloc_core()
-> wake_next_highest_priority_blocked()
). Another place where the state is transitioned from THREAD_BUFN
is pool_thread_finished_for_tasks()
, which is called when a data receive is completed during shuffle. I don't seem to see that releasing the semaphore directly triggers unblocking a task. Can you give me some pointers where it happens in the code?
Another question for this state: why is wake_next_highest_priority_blocked()
called in post_alloc_success_core()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After thinking about your comment, I think this part should be better to just explain what each state is. Perhaps I should add another section to explain when the state transition happens from one to another? I don't think every transition is worth to explain, such as THREAD_RUNNING
-> UNKNOWN
, so we can probably explain only those important ones.
docs/memory_management.md
Outdated
- `THREAD_BUFN_THROW`: a deadlock has been detected as all threads are blocked, and this thread has been selected to roll back to the point where all its data is spillable. | ||
- `THREAD_BUFN_WAIT`: the thread has initiated the rollback. | ||
- `THREAD_BUFN`: the thread has rolled back and is now blocked until further notice (BUFN). The task will be unblocked once high priority tasks release enough memory. | ||
- `THREAD_SPLIT_THROW`: a deadlock has been detected as all threads are BUFN, and this thread has been selected to roll back, split its input, and retry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- `THREAD_SPLIT_THROW`: a deadlock has been detected as all threads are BUFN, and this thread has been selected to roll back, split its input, and retry. | |
- `THREAD_SPLIT_THROW`: a deadlock has been detected as all threads are BUFN, and this thread has been selected to roll back, split its input, and retry. Not all code is guaranteed to support splitting its input to try again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rephrased as
A deadlock has been detected as all threads are BUFN, and this thread has been selected to roll back, split its input, and retry. Note that the processing will fail without retrying if the input cannot be further split.
@revans2 would you have another look? |
This PR adds a new doc explaining the design of the memory tracking system in the plugin. The new doc is based on the existing doc written by @revans2, which is no longer publicly accessible, and fixed to be up-to-date. The main purpose of this PR is to have a pointer to the design doc in the code as here, so that people who are interested in this area can easily access to the design doc.