feat: expose function for listening to policy violations on a specific GPU group #73
== Motivation ==
Enable finer-grained tracking of GPU policy violations.
== Details ==
The current go-dcgm library exposes a way to listen for policy violations across all GPUs. While this is useful, it does not help identify exactly which GPUs are experiencing issues. Ideally, the policy violation itself would carry identifying GPU information, but today it does not appear to (see the struct definitions). Instead, it would be useful if users could listen for policy violations on groups they create for specific GPUs, so that they know when those specific GPUs are experiencing issues.
This change exposes a new function, ListenForPolicyViolationsForGroup, which takes a GroupHandle passed by the user and listens for policy violations on that group. It also modifies ListenForPolicyViolations to use this new function, passing the group for all GPUs, so there is no net change in behavior.
Signed-off-by: sanjams2 [email protected]