Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add design doc for memory tracking in the plugin #2628

Open
wants to merge 2 commits into
base: branch-25.02
Choose a base branch
from

Conversation

jihoonson
Copy link
Collaborator

@jihoonson jihoonson commented Nov 26, 2024

This PR adds a new doc explaining the design of the memory tracking system in the plugin. The new doc is based on the existing doc written by @revans2, which is no longer publicly accessible, and fixed to be up-to-date. The main purpose of this PR is to have a pointer to the design doc in the code as here, so that people who are interested in this area can easily access to the design doc.

@jihoonson jihoonson force-pushed the memory-management-doc branch from 2ed37f0 to 74febef Compare November 26, 2024 21:44
@jihoonson jihoonson requested a review from revans2 November 26, 2024 21:56
Comment on lines 4 to 7
For memory management, the plugin tracks every device memory allocation and de-allocation request during processing.
While there is enough memory available, the allocation request succeeds and the task continues processing.
However, when the allocation request cannot succeed due to lack of memory, the plugin pauses that thread. When all of the active tasks have at least one thread paused, the plugin starts to roll back some of those paused threads to points where all of their input data is spillable, and let the other threads try to complete. If every thread except one has been rolled back and the one remaining thread cannot still make progress, then pluging picks up one thread to split its input and try again.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer it if we reword things a bit

Suggested change
For memory management, the plugin tracks every device memory allocation and de-allocation request during processing.
While there is enough memory available, the allocation request succeeds and the task continues processing.
However, when the allocation request cannot succeed due to lack of memory, the plugin pauses that thread. When all of the active tasks have at least one thread paused, the plugin starts to roll back some of those paused threads to points where all of their input data is spillable, and let the other threads try to complete. If every thread except one has been rolled back and the one remaining thread cannot still make progress, then pluging picks up one thread to split its input and try again.
For memory management, the plugin uses RMM and wraps it to provide the ability to recover from out of memory errors. The first line of defense is spilling which is provided in the spark-rapids plugin itself. The second line of defense is described in this document and is implemented in [SparkResourceAdaptorJni.cpp](../src/main/cpp/src/SparkResourceAdaptorJni.cpp). This code keeps track of each task thread and tracks the state of those threads.
While RMM allocation requests succeed this will not interfere with the running threads.
However, when the allocation request fails, even after spilling, this code will try and pause or roll back threads to free up memory and allow other threads/tasks to succeed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I overhauled this part. Please have another look.


The thread can have one of these states at a time:

- `UNKNOWN`: the thread has not been registered with the tracking system.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- `UNKNOWN`: the thread has not been registered with the tracking system.
- `UNKNOWN`: the thread has not been registered with the tracking system. In this state the thread will not be messed with. It is here as a fail-safe.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if the additional comment is necessary. Is it not clear that the system will do nothing with the unregistered threads?

docs/memory_management.md Outdated Show resolved Hide resolved
docs/memory_management.md Outdated Show resolved Hide resolved
docs/memory_management.md Outdated Show resolved Hide resolved
docs/memory_management.md Outdated Show resolved Hide resolved
- `THREAD_BLOCKED`: the allocation is blocked due to lack of memory. The thread is waiting for enough memory to be available.
- `THREAD_BUFN_THROW`: a deadlock has been detected as all threads are blocked, and this thread has been selected to roll back to the point where all its data is spillable.
- `THREAD_BUFN_WAIT`: the thread has initiated the rollback.
- `THREAD_BUFN`: the thread has rolled back and is now blocked until further notice (BUFN). The task will be unblocked once high priority tasks release enough memory.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- `THREAD_BUFN`: the thread has rolled back and is now blocked until further notice (BUFN). The task will be unblocked once high priority tasks release enough memory.
- `THREAD_BUFN`: the thread has rolled back and is now blocked until further notice (BUFN). The task will be unblocked once another task completes. In this case completes may mean that it releases the GPU semaphore instead of fully completing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether I understand the code correctly for when the state transitions from THREAD_BUFN. AFAIT, the state transitions from THREAD_BUFN to THREAD_RUNNING after a free is called (do_deallocate() -> dealloc_core() -> wake_next_highest_priority_blocked()). Another place where the state is transitioned from THREAD_BUFN is pool_thread_finished_for_tasks(), which is called when a data receive is completed during shuffle. I don't seem to see that releasing the semaphore directly triggers unblocking a task. Can you give me some pointers where it happens in the code?

Another question for this state: why is wake_next_highest_priority_blocked() called in post_alloc_success_core()?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After thinking about your comment, I think this part should be better to just explain what each state is. Perhaps I should add another section to explain when the state transition happens from one to another? I don't think every transition is worth to explain, such as THREAD_RUNNING -> UNKNOWN, so we can probably explain only those important ones.

- `THREAD_BUFN_THROW`: a deadlock has been detected as all threads are blocked, and this thread has been selected to roll back to the point where all its data is spillable.
- `THREAD_BUFN_WAIT`: the thread has initiated the rollback.
- `THREAD_BUFN`: the thread has rolled back and is now blocked until further notice (BUFN). The task will be unblocked once high priority tasks release enough memory.
- `THREAD_SPLIT_THROW`: a deadlock has been detected as all threads are BUFN, and this thread has been selected to roll back, split its input, and retry.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- `THREAD_SPLIT_THROW`: a deadlock has been detected as all threads are BUFN, and this thread has been selected to roll back, split its input, and retry.
- `THREAD_SPLIT_THROW`: a deadlock has been detected as all threads are BUFN, and this thread has been selected to roll back, split its input, and retry. Not all code is guaranteed to support splitting its input to try again.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rephrased as

A deadlock has been detected as all threads are BUFN, and this thread has been selected to roll back, split its input, and retry. Note that the processing will fail without retrying if the input cannot be further split.

docs/memory_management.md Outdated Show resolved Hide resolved
docs/memory_management.md Outdated Show resolved Hide resolved
@jihoonson jihoonson requested a review from revans2 December 5, 2024 18:50
@jihoonson
Copy link
Collaborator Author

@revans2 would you have another look?

@jihoonson jihoonson changed the base branch from branch-24.12 to branch-25.02 December 12, 2024 17:18
@jihoonson jihoonson marked this pull request as ready for review December 12, 2024 17:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants