Potential group\collective life-time management issue in profiler plugin. #1569

wiryls · 2025-01-09T05:24:25Z

Thank you for your excellent work on version 2.23.4, which has made the profiler plugin available. However, I have encountered some issues while testing the example profiler plugin, specifically related to the handling of group (or collective) records when NET (both IB and ETH) is not utilized.

The case is that when we don't use NET, the group (or collective) record stops after 16 outputs (leak happens).

nccl/ext-profiler/example/plugin.c

Line 25 in dcdc67c

static const int defaultGroupPoolSize = 16;

As code shows, the only time group released is when its ref-count goes to 0. And ref-count is modified only when children operations get released in updateEvent:

nccl/ext-profiler/example/plugin.c

Lines 394 to 404 in dcdc67c

    
           void updateEvent(void* handle) { 
        
             uint8_t type = *(uint8_t *)handle; 
        
             if (type == ncclProfileGroup) { 
        
               struct group* event = (struct group *)handle; 
        
               if (__atomic_sub_fetch(&event->refCount, 1, __ATOMIC_RELAXED) == 0) { 
        
                 event->stopTs = gettime() - startTime; 
        
                 // return group event to the pool 
        
                 __atomic_fetch_add(&event->ctx->groupPoolBase, 1, __ATOMIC_RELAXED); 
        
               } 
        
               debugEvent(event, "GroupStop"); 
        
             } else if (type == ncclProfileColl) {

So if there is no child operation, collective and group won't be released.

I understand that operations may not have been created by the time the group ends, so the life-time management here can get really weird. One way around this is to turn the event handle into an ID, decoupling the dependency between operations and collectives. This way, however, requires adding a void* context parameter into interface such as stopEvent.

Or is there a better way to avoid groups or collectives not releasing correctly?

The text was updated successfully, but these errors were encountered:

gcongiu · 2025-01-09T10:05:09Z

As you noted, the example plugin uses children events (associated to proxy ops) to decide when the collective event can be released, as completion of network communication gives an indication about completion of the collective. This is because for NCCL is impossible to precisely inform the profiler when the collective completes (this would require a stream synchronization and would not be accurate for group operations, where multiple collectives are executed by the same kernel). Instead NCCL stops the collective event after the collective has been launched.

There are at least a couple of ways to fix the leak of events you noted:

Unconditionally recycle collective events after the event window is exhausted. This means that the profiler could add a unique id (tag) to the event handle and reflect that in the collective event as well. Every time an update for a children event happens the id in the handle is checked against the id in the event (I think this is similar to what you were suggesting?). If the id in the handle and the event don't match, it means the update refers to a previous collective that fell out the window and should be dropped. This can be implemented by a third party plugin without any changes in NCCL.
Rely on children events that are always present, even when there is no network communication. This can be done by adding kernel events and requires profiler plugin interface changes and additional infrastructure inside NCCL. This is something we are currently working on.
A combination of 1 and 2.

wiryls · 2025-01-09T11:49:56Z

Unconditionally recycle collective events after the event window is exhausted. This means that the profiler could add a unique id (tag) to the event handle and reflect that in the collective event as well. Every time an update for a children event happens the id in the handle is checked against the id in the event (I think this is similar to what you were suggesting?). If the id in the handle and the event don't match, it means the update refers to a previous collective that fell out the window and should be dropped. This can be implemented by a third party plugin without any changes in NCCL.

Thanks for the answer! But it seems current child-release-parent approach cannot handle the situation when there is no child. This (no child operation) can be found in some simple cases such as using only SHM communication.

Regarding the timing of the release, what I mean is that the collective gets released at its own end event. Later, the child operation will get a monotonically increasing id of the parent collective instead of the pointer void* parentObj. This way, the life-time of operation and collective will be completely decoupled. But of course, this also brings new problems, such as the operation no longer has a direct nest relationship with the collective, and the original chrome trace style printing no longer works.

Rely on children events that are always present, even when there is no network communication. This can be done by adding kernel events and requires profiler plugin interface changes and additional infrastructure inside NCCL. This is something we are currently working on.

Glad to hear the issue is being fixed. It sounds like my complicated approach is no longer needed. Am looking forward to your next release!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Potential group\collective life-time management issue in profiler plugin. #1569

Potential group\collective life-time management issue in profiler plugin. #1569

wiryls commented Jan 9, 2025 •

edited

Loading

gcongiu commented Jan 9, 2025

wiryls commented Jan 9, 2025

Potential group\collective life-time management issue in profiler plugin. #1569

Potential group\collective life-time management issue in profiler plugin. #1569

Comments

wiryls commented Jan 9, 2025 • edited Loading

gcongiu commented Jan 9, 2025

wiryls commented Jan 9, 2025

wiryls commented Jan 9, 2025 •

edited

Loading