This is an implementation of per-pod GPU monitoring from #1.
The example in example/kubernetes assumes that rdc already contains the rdc_prometheus_py patch (if you want to test it right now, you can of course just ADD a prepatched rdc_prometheus_py in the Dockerfile).
You need to build the container image, push it to your container image repository, and modify a few things in the rdc.yaml file: the location of both container images, the nodeSelector (to match the label of the worker nodes that contain AMD GPUs), and the podresources-api volume location (in my case it was on the host machine).
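As a rough illustration of the fields mentioned above, here is a minimal pod-spec fragment. It is only a sketch and not a copy of the actual rdc.yaml: the registry, image names, container names, and node label are hypothetical placeholders, and the hostPath assumes the kubelet's default pod-resources directory, which may differ on your nodes.

```yaml
# Hypothetical fragment; adapt every value to your cluster.
spec:
  nodeSelector:
    example.com/amd-gpu: "true"              # placeholder label of your AMD GPU worker nodes
  containers:
    - name: rdc                              # placeholder container name
      image: registry.example.com/rdc:latest # placeholder image location
    - name: rdc-prometheus                   # placeholder container name
      image: registry.example.com/rdc-prometheus:latest  # placeholder image location
      volumeMounts:
        - name: pod-resources
          mountPath: /var/lib/kubelet/pod-resources
  volumes:
    - name: pod-resources
      hostPath:
        # Assumed default kubelet pod-resources directory on the host
        path: /var/lib/kubelet/pod-resources
```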
rdc and rdc_prometheus.py don't have to run inside Kubernetes for this to work; it is just easier to put together an example that way.
Tested and working in production on Kubernetes 1.21. Example output:
https://gist.github.com/boniek83/7eaefe7f46edad1ef28046118c354c17