feat: expose function for listening to policy violations on a specific GPU group (#73)

== Motivation ==

Enable finer-grained tracking of GPU policy violations.

== Details ==

The go-dcgm library currently exposes a way to listen for policy violations across
all GPUs. While useful, this does not let users determine exactly which GPUs are
experiencing issues. Ideally, users could also listen for policy violations on
specific groups, which can be created on a per-GPU basis, so that a violation can
be traced to the affected GPU.

This change exposes a new function, ListenForPolicyViolationsForGroup, which takes a
GroupHandle supplied by the caller and listens for policy violations on that group. It
also changes ListenForPolicyViolations to call the new function with the group
containing all GPUs, so there is no net change in behavior.

Co-authored-by: sanjams <[email protected]>
sanjams2 and sanjams2 authored Sep 10, 2024
1 parent f83cdef commit 85ceb31
Showing 1 changed file with 7 additions and 2 deletions.
9 changes: 7 additions & 2 deletions pkg/dcgm/api.go
@@ -104,10 +104,15 @@ func HealthCheckByGpuId(gpuId uint) (DeviceHealth, error) {
 	return healthCheckByGpuId(gpuId)
 }
 
-// ListenForPolicyViolations sets GPU usage and error policies and notifies in case of any violations
+// ListenForPolicyViolations sets GPU usage and error policies and notifies in case of any violations on all GPUs
 func ListenForPolicyViolations(ctx context.Context, typ ...policyCondition) (<-chan PolicyViolation, error) {
 	groupId := GroupAllGPUs()
-	return registerPolicy(ctx, groupId, typ...)
+	return ListenForPolicyViolationsForGroup(ctx, groupId, typ...)
+}
+
+// ListenForPolicyViolations sets GPU usage and error policies and notifies in case of any violations on GPUs within a specific group
+func ListenForPolicyViolationsForGroup(ctx context.Context, group GroupHandle, typ ...policyCondition) (<-chan PolicyViolation, error) {
+	return registerPolicy(ctx, group, typ...)
 }
 
 // Introspect returns DCGM hostengine memory and CPU usage