Add support for MIG and vGPUs in exporter #193

Merged 4 commits on Oct 14, 2024


mahendrapaipuri
Owner

@mahendrapaipuri mahendrapaipuri commented Oct 14, 2024

  • The exporter estimates a coefficient for each MIG instance from the relative number of SMs in its MIG profile. Combined with dcgm-exporter metrics, this coefficient can be used to estimate the power consumption of the MIG instance.

  • Similarly, for vGPUs, the exporter keeps track of the number of active instances scheduled on either a physical GPU or a MIG instance and estimates a coefficient that can be used to estimate the power consumption of each vGPU.

  • Support defining the GPU ordering for the SLURM collector, as the ordering can be undefined when a mix of MIG and full GPUs is used on a compute node.

  • Split all GPU-related functions into a separate file and add more unit tests.

  • Modify mocked resources appropriately to test different scenarios in unit and e2e tests.

  • Update docs and add a new section on GPU power estimation when MIG and vGPUs are used on compute nodes.
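
To illustrate the coefficient idea from the first two points, here is a minimal Python sketch. It is not the exporter's actual code. It assumes that a MIG instance's coefficient is its SM count divided by the total SM count of all MIG instances on the same physical GPU, and that vGPUs sharing a parent device split its coefficient equally. The names `MigInstance`, `mig_coefficients`, and `vgpu_coefficient`, the SM counts, and the 300 W power reading are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MigInstance:
    name: str      # hypothetical label for the MIG instance
    sm_count: int  # number of SMs in the instance's MIG profile


def mig_coefficients(instances):
    """Assumed rule: coefficient = instance SMs / total SMs of all instances."""
    total = sum(i.sm_count for i in instances)
    if total == 0:
        return {}
    return {i.name: i.sm_count / total for i in instances}


def vgpu_coefficient(parent_coefficient, active_vgpus):
    """Assumed rule: active vGPUs share the parent's coefficient equally."""
    if active_vgpus <= 0:
        return 0.0
    return parent_coefficient / active_vgpus


# Illustrative SM counts for three MIG instances on one physical GPU.
instances = [
    MigInstance("mig-0", 14),
    MigInstance("mig-1", 28),
    MigInstance("mig-2", 42),
]
coeffs = mig_coefficients(instances)

# Hypothetical power reading for the whole physical GPU, e.g. as reported
# by dcgm-exporter.
gpu_power_watts = 300.0
mig_power = {name: c * gpu_power_watts for name, c in coeffs.items()}

# Two active vGPUs scheduled on "mig-2" each get half of its share.
vgpu_power = vgpu_coefficient(coeffs["mig-2"], 2) * gpu_power_watts
```

Under these assumptions, multiplying the physical GPU's measured power by a coefficient gives the estimated power for the corresponding MIG instance or vGPU. The coefficients of all MIG instances on one GPU sum to 1.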

Closes #187


Signed-off-by: Mahendra Paipuri <[email protected]>
@mahendrapaipuri mahendrapaipuri added the enhancement New feature or request label Oct 14, 2024
@mahendrapaipuri mahendrapaipuri merged commit 588cd9e into main Oct 14, 2024
15 checks passed
@mahendrapaipuri mahendrapaipuri deleted the support_mig_vgpu branch October 14, 2024 10:33
Development

Successfully merging this pull request may close these issues.

Documentation details on MIG and vGPU limitations