Potential group\collective life-time management issue in profiler plugin. #1569
Comments
As you noted, the example plugin uses child events (associated with proxy ops) to decide when the collective event can be released, since completion of network communication indicates completion of the collective. This is because it is impossible for NCCL to tell the profiler precisely when the collective completes (this would require a stream synchronization and would not be accurate for group operations, where multiple collectives are executed by the same kernel). Instead, NCCL stops the collective event after the collective has been launched. There are at least a couple of ways to fix the event leak you noted:
Thanks for the answer! However, the current approach, in which children release the parent, cannot handle the case where there is no child. This (no child operation) happens in some simple cases, such as when only SHM communication is used. Regarding the timing of the release, what I mean is that the collective gets released at its own end event. A child operation created later would then receive a monotonically increasing ID of the parent collective instead of a pointer.
Glad to hear the issue is being fixed. It sounds like my complicated approach is no longer needed. I am looking forward to your next release!
Thank you for your excellent work on version 2.23.4, which has made the profiler plugin available. However, I have encountered some issues while testing the example profiler plugin, specifically related to the handling of group (or collective) records when NET (both IB and ETH) is not utilized.
Specifically, when NET is not used, group (or collective) records stop being output after 16 entries, which indicates that events are leaking.
nccl/ext-profiler/example/plugin.c
Line 25 in dcdc67c
As the code shows, a group is released only when its ref-count drops to 0, and the ref-count is modified only when child operations are released in updateEvent:
nccl/ext-profiler/example/plugin.c
Lines 394 to 404 in dcdc67c
So if there is no child operation, the collective and the group are never released.
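To make the failure mode concrete, here is a minimal sketch of child-driven release, not the plugin's actual code. The struct, the function names, and the pool capacity of 16 (chosen to match the 16 records observed above) are all assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define COLL_POOL_SIZE 16 /* assumed capacity, chosen to match the 16 records reported above */

/* Hypothetical, simplified collective record (not the plugin's real struct). */
struct collEvent {
  int inUse;
  int refCount; /* outstanding child (proxy op) events */
};

static struct collEvent collPool[COLL_POOL_SIZE];

/* Mark every slot as free. */
static void poolReset(void) {
  for (int i = 0; i < COLL_POOL_SIZE; i++) {
    collPool[i].inUse = 0;
    collPool[i].refCount = 0;
  }
}

/* Take a free slot, or return NULL when the pool is exhausted. */
static struct collEvent* collAlloc(int numChildren) {
  for (int i = 0; i < COLL_POOL_SIZE; i++) {
    if (!collPool[i].inUse) {
      collPool[i].inUse = 1;
      collPool[i].refCount = numChildren;
      return &collPool[i];
    }
  }
  return NULL;
}

/* Child-driven release: the parent slot is freed only by its last child. */
static void childDone(struct collEvent* parent) {
  assert(parent->refCount > 0);
  if (--parent->refCount == 0) parent->inUse = 0;
}

/* Run n collectives back to back; return how many obtained a record. */
static int runCollectives(int n, int childrenPerColl) {
  int recorded = 0;
  for (int i = 0; i < n; i++) {
    struct collEvent* coll = collAlloc(childrenPerColl);
    if (coll == NULL) continue; /* pool exhausted: this event is dropped */
    recorded++;
    for (int c = 0; c < childrenPerColl; c++) childDone(coll);
  }
  return recorded;
}
```

Under these assumptions, runCollectives(100, 2) records all 100 collectives because each slot is recycled by its children, whereas runCollectives(100, 0) on a fresh pool records only 16: nothing ever frees a slot that has no children.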
I understand that operations may not have been created by the time the group ends, so the life-time management here can get really weird. One way around this is to turn the event handle into an ID, decoupling the dependency between operations and collectives. This approach, however, requires adding a void* context parameter to interface functions such as stopEvent. Or is there a better way to ensure that groups and collectives are released correctly?
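The ID-based idea could be sketched roughly as follows. This is only an illustration under stated assumptions: every name here (collStart, collStop, collLookup, proxyOpStart, the ring-indexed pool) is hypothetical and not part of the NCCL profiler interface. The collective frees its slot at its own stop, and a child keeps the parent's ID rather than a pointer, so a late child can never touch a recycled record.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COLL_POOL_SIZE 16 /* assumed capacity */

/* Hypothetical collective slot; id == 0 means the slot is free. */
struct collSlot {
  uint64_t id;
};

/* Hypothetical child record: it stores the parent's ID, not a pointer. */
struct proxyOp {
  uint64_t parentId;
};

static struct collSlot collPool[COLL_POOL_SIZE];
static uint64_t nextCollId = 1;

/* Start a collective; returns its ID, or 0 if the ring slot is still busy. */
static uint64_t collStart(void) {
  uint64_t id = nextCollId++;
  assert(id != 0); /* 0 is reserved for "no event" */
  struct collSlot* slot = &collPool[id % COLL_POOL_SIZE];
  if (slot->id != 0) return 0; /* slot busy: drop this event */
  slot->id = id;
  return id;
}

/* Stop a collective: it releases its own slot, independent of any child. */
static void collStop(uint64_t id) {
  if (id == 0) return;
  struct collSlot* slot = &collPool[id % COLL_POOL_SIZE];
  if (slot->id == id) slot->id = 0;
}

/* Resolve an ID to a live slot, or NULL if the parent is already gone. */
static struct collSlot* collLookup(uint64_t id) {
  if (id == 0) return NULL;
  struct collSlot* slot = &collPool[id % COLL_POOL_SIZE];
  return slot->id == id ? slot : NULL;
}

/* A child created after the parent stopped still gets a valid parent ID. */
static struct proxyOp proxyOpStart(uint64_t parentId) {
  struct proxyOp op = { parentId };
  return op;
}
```

The sketch keeps its pool in a global, so no extra argument is needed to find it. A plugin that keeps its pools per context could not resolve an ID without that context, which is why this approach would require a void* context parameter on calls such as stopEvent.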