Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: allow watch pid with pre-created group. #60

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

rootfs
Copy link

@rootfs rootfs commented Feb 23, 2024

Fix #59

The existing WatchPidFieldsEx API always creates device group. If the MIG is enabled, the device group doesn't capture the MIG device and may fail to watch processes there.

The PR allows a pre-created device to pass into the WatchPidFieldsWithGroup API.

I am also adding a sample test file here, but for certain reason I am not able to run it successfully:

# go run samples/processInfo/watchPidWithGroup/main.go 
2024/02/23 16:24:11 Created device group "dg-20240223162411"
2024/02/23 16:24:11 found 2 supported devices
2024/02/23 16:24:11 Adding device 0 to group "dg-20240223162411"
2024/02/23 16:24:11 Adding device 1 to group "dg-20240223162411"
2024/02/23 16:24:11 Adding GPU ID[0] MIG Device[0] GPU Device Index[0] Instance ID[13]: err <nil>
2024/02/23 16:24:11 Adding GPU ID[0] MIG Device[1] GPU Device Index[0] Instance ID[5]: err <nil>
2024/02/23 16:24:11 Adding GPU ID[0] MIG Device[2] GPU Device Index[0] Instance ID[1]: err <nil>
2024/02/23 16:24:12 DCGM pid watcher set up successfully
2024/02/23 16:24:12 DCGM initialized successfully
2024/02/23 16:24:17 failed to get process info: Bad parameter passed to function
2024/02/23 16:24:22 failed to get process info: Bad parameter passed to function

In fact, the existing sample also failed with the same error:

# go run samples/processInfo/main.go 
2024/02/23 16:23:17 Enabling DCGM watches to start collecting process stats. This may take a few seconds....
2024/02/23 16:23:20 Bad parameter passed to function
panic: Bad parameter passed to function

@rohit-arora-dev
Copy link
Collaborator

Fix #59

The existing WatchPidFieldsEx API always creates device group. If the MIG is enabled, the device group doesn't capture the MIG device and may fail to watch processes there.

The PR allows a pre-created device to pass into the WatchPidFieldsWithGroup API.

I am also adding a sample test file here, but for certain reason I am not able to run it successfully:

# go run samples/processInfo/watchPidWithGroup/main.go 
2024/02/23 16:24:11 Created device group "dg-20240223162411"
2024/02/23 16:24:11 found 2 supported devices
2024/02/23 16:24:11 Adding device 0 to group "dg-20240223162411"
2024/02/23 16:24:11 Adding device 1 to group "dg-20240223162411"
2024/02/23 16:24:11 Adding GPU ID[0] MIG Device[0] GPU Device Index[0] Instance ID[13]: err <nil>
2024/02/23 16:24:11 Adding GPU ID[0] MIG Device[1] GPU Device Index[0] Instance ID[5]: err <nil>
2024/02/23 16:24:11 Adding GPU ID[0] MIG Device[2] GPU Device Index[0] Instance ID[1]: err <nil>
2024/02/23 16:24:12 DCGM pid watcher set up successfully
2024/02/23 16:24:12 DCGM initialized successfully
2024/02/23 16:24:17 failed to get process info: Bad parameter passed to function
2024/02/23 16:24:22 failed to get process info: Bad parameter passed to function

In fact, the existing sample also failed with the same error:

# go run samples/processInfo/main.go 
2024/02/23 16:23:17 Enabling DCGM watches to start collecting process stats. This may take a few seconds....
2024/02/23 16:23:20 Bad parameter passed to function
panic: Bad parameter passed to function

@rootfs

Thank you for your contribution.

The reason sample test is failing because the go run argument for existing sample in samples/processInfo/main.go and new sample in samples/processInfo/watchPidWithGroup/main.go expects PID to be supplied e.g. go run main.go --pid 1234. Please add a similar note like this in your sample as well.

Please share output with a successful run on a MIG device and suggested changes.

Comment on lines +139 to +141
for _, gpu := range pidInfo {
log.Printf("gpu %d, process info: %+v\n", gpu.GPU, gpu)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider using template for output same as samples/processInfo/main.go for consistency.

// create a tick to watch the process every 5 seconds
ticker := time.NewTicker(5 * time.Second)
defer ticker.Stop()
for range ticker.C {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider running it once or twice but not every 5 seconds.

if err != nil {
return fmt.Errorf("failed to find supported devices: %v", err)
}
log.Printf("found %d supported devices\n", len(supportedDeviceIndices))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:
s/found/Found

as it is not an error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

provide more flexible WatchPidFields API
2 participants