Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential group\collective life-time management issue in profiler plugin. #1569

Open
wiryls opened this issue Jan 9, 2025 · 2 comments
Open

Comments

@wiryls
Copy link

wiryls commented Jan 9, 2025

Thank you for your excellent work on version 2.23.4, which has made the profiler plugin available. However, I have encountered some issues while testing the example profiler plugin, specifically related to the handling of group (or collective) records when NET (both IB and ETH) is not utilized.

The case is that when we don't use NET, the group (or collective) record stops after 16 outputs (leak happens).

static const int defaultGroupPoolSize = 16;

As code shows, the only time group released is when its ref-count goes to 0. And ref-count is modified only when children operations get released in updateEvent:

void updateEvent(void* handle) {
uint8_t type = *(uint8_t *)handle;
if (type == ncclProfileGroup) {
struct group* event = (struct group *)handle;
if (__atomic_sub_fetch(&event->refCount, 1, __ATOMIC_RELAXED) == 0) {
event->stopTs = gettime() - startTime;
// return group event to the pool
__atomic_fetch_add(&event->ctx->groupPoolBase, 1, __ATOMIC_RELAXED);
}
debugEvent(event, "GroupStop");
} else if (type == ncclProfileColl) {

So if there is no child operation, collective and group won't be released.

I understand that operations may not have been created by the time the group ends, so the life-time management here can get really weird. One way around this is to turn the event handle into an ID, decoupling the dependency between operations and collectives. This way, however, requires adding a void* context parameter into interface such as stopEvent.

Or is there a better way to avoid groups or collectives not releasing correctly?

@gcongiu
Copy link
Contributor

gcongiu commented Jan 9, 2025

As you noted, the example plugin uses children events (associated to proxy ops) to decide when the collective event can be released, as completion of network communication gives an indication about completion of the collective. This is because for NCCL is impossible to precisely inform the profiler when the collective completes (this would require a stream synchronization and would not be accurate for group operations, where multiple collectives are executed by the same kernel). Instead NCCL stops the collective event after the collective has been launched.

There are at least a couple of ways to fix the leak of events you noted:

  1. Unconditionally recycle collective events after the event window is exhausted. This means that the profiler could add a unique id (tag) to the event handle and reflect that in the collective event as well. Every time an update for a children event happens the id in the handle is checked against the id in the event (I think this is similar to what you were suggesting?). If the id in the handle and the event don't match, it means the update refers to a previous collective that fell out the window and should be dropped. This can be implemented by a third party plugin without any changes in NCCL.

  2. Rely on children events that are always present, even when there is no network communication. This can be done by adding kernel events and requires profiler plugin interface changes and additional infrastructure inside NCCL. This is something we are currently working on.

  3. A combination of 1 and 2.

@wiryls
Copy link
Author

wiryls commented Jan 9, 2025

  1. Unconditionally recycle collective events after the event window is exhausted. This means that the profiler could add a unique id (tag) to the event handle and reflect that in the collective event as well. Every time an update for a children event happens the id in the handle is checked against the id in the event (I think this is similar to what you were suggesting?). If the id in the handle and the event don't match, it means the update refers to a previous collective that fell out the window and should be dropped. This can be implemented by a third party plugin without any changes in NCCL.

Thanks for the answer! But it seems current child-release-parent approach cannot handle the situation when there is no child. This (no child operation) can be found in some simple cases such as using only SHM communication.

Regarding the timing of the release, what I mean is that the collective gets released at its own end event. Later, the child operation will get a monotonically increasing id of the parent collective instead of the pointer void* parentObj. This way, the life-time of operation and collective will be completely decoupled. But of course, this also brings new problems, such as the operation no longer has a direct nest relationship with the collective, and the original chrome trace style printing no longer works.

  1. Rely on children events that are always present, even when there is no network communication. This can be done by adding kernel events and requires profiler plugin interface changes and additional infrastructure inside NCCL. This is something we are currently working on.

Glad to hear the issue is being fixed. It sounds like my complicated approach is no longer needed. Am looking forward to your next release!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants