AMD Support #142

anilmurty · 2023-10-16T20:38:45Z

Support for AMD GPUs on Akash Network. There may not be any significant work necessary but first step is to test with an AMD GPU(s). This is very important because AMD is working on the MI 250 chipset which is expected to be a serious contender to Nvidia A100 and H100 chips. Here is a blog from MosaicML benchmarking and comparing its performance with Nvidia's chips: https://www.mosaicml.com/blog/amd-mi250

It seems like the initial work is validating whether the kubernetes device plugin for AMD can work for us (the way the Nvidia one has) https://github.com/RadeonOpenCompute/k8s-device-plugin#deployment

Is this something that a community person can help with?

brewsterdrinkwater · 2023-10-24T17:31:24Z

Oct 24 sync:

Want to add AMD support as a follow on to Nvidia.
Anil is getting access to AMD 210 very soon. Need to make sure AMD plugin works.
Artur mentioned that nothing needs to be done with the Network. Kubernetes device installation is something to work on.
Need to update SDL parser.
Need to update provider to accept AMD as provider.

anilmurty · 2023-11-07T16:59:09Z

We have access to a cluster that includes a Nvidia L40 and an AMD 210 GPU. @andy108369 is working on testing out setting up a provider with them.

Current status: L40 works out of the box (as expected), AMD does not. Per @troian , we filter on "Nvidia" GPUs in nodes and providers. Artur needs to work on removing this filter and setting up a testnet for Andrey to test with. Removing this filtering likely shouldn't need a network upgrade

brewsterdrinkwater · 2023-12-05T18:05:01Z

December 5th, 2023:

Filter is there right now. Client needs to be updated.
Artur looking to work on that this week.
will test releases when they come out; first on Sandbox.
DOES NOT need network upgrade.

refs akash-network/support#142 Signed-off-by: Artur Troian <[email protected]>

brewsterdrinkwater · 2023-12-12T19:01:36Z

December 12th, 2023

SDL part is done.
Working on provider part next
This will be tested by core team over the next couple of days

brewsterdrinkwater · 2023-12-19T18:13:19Z

December 19th, 2023

RC for provider cut this morning.
Testing is going on right now for AMD GPU support.
WIll test on SDXL app, as well.

Next Steps:

Suggested Path Forward for Validation of AMD GPU Support (can discuss further in eng sync today)
Andy and Scott will spin up providers and test the build process and deployment process.
Documentation will be created after testing is complete.

andy108369 · 2023-12-20T10:04:25Z

Updates:

provider 0.4.9-rc0 doesn't seem to register the AMD MI 210; despite MI 210 working in a K8s Pod. Details https://github.com/ovrclk/engineering/issues/810#issuecomment-1864190645

andy108369 · 2023-12-22T12:57:35Z

Test run results

🟢 provider version 0.4.9-rc0 (both provider & client): AMD GPU MI210 deployment works! (evidence in private repo atm)
🔴 provider: issue with the aggregated GPU count when mixed GPU Vendors (e.g. NVIDIA & AMD) are present on the same worker node (kubectl describe node <node-name> should only report 'nvidia.com/gpuORamd.com/gpuK8s node attribute, otherwise it will only and always see a single GPU or no GPU / Limbo [flapping between0/1 gpu count]; and you cannot easily remove K8s node attributes such as 'nvidia.com/gpu, amd.com/gpu as they get stuck in etcd [K8s DB] - the only way is to reinstantiate the node.)

Next steps:

akash-provider: address the akash-provider issue with the wrong GPU count when mixed GPU Vendors on a node are present;
security/helm-chart: see whether we can deploy that amd-gpu-helm/amd-gpu helm-chart in its own namespace instead of kube-system for better security;
security/helm-chart: make sure one cannot request more AMD GPU than he should.. refs. HIP_VISIBLE_DEVICES
/ ROCR_VISIBLE_DEVICES (similarly to how it was possible with NVIDIA GPU via NVIDIA_VISIBLE_DEVICES=all env variable, which was addressed here ) ; refs ; Update (Jan/08/2024): - raised a question GPU isolation options ROCm/k8s-device-plugin#45
usability/helm-chart: see if we can have rocm-smi tool by default in the AMD GPU Pod (just like we get nvidia-smi tool in NVIDIA GPU Pod - which is done by the nvidia device plugin by mounting the necessary host paths and is controlled by environment variables such as NVIDIA_DRIVER_CAPABILITIES - more examples/info here) ; Update (Jan/08/2024) - raised a question Libraries/binaries mounted in the container (analogous to NVIDIA_DRIVER_CAPABILITIES) ROCm/k8s-device-plugin#44
docs: document AMD GPU Akash Provider enablement (preliminary test version of the doc Docs: How to enable AMD GPU support in Akash Provider is in the private repo atm)

andy108369 · 2024-01-08T12:41:35Z

security/helm-chart: see whether we can deploy that amd-gpu-helm/amd-gpu helm-chart in its own namespace instead of kube-system for better security;

This is possible - requires --create-namespace --namespace amd-device-plugin --set namespace=amd-device-plugin flags to be specified as follows:

helm install --create-namespace --namespace amd-device-plugin --set namespace=amd-device-plugin my-amd-gpu amd-gpu-helm/amd-gpu --version 0.10.0

Verification:

root@node1:~# helm install --create-namespace --namespace amd-device-plugin --set namespace=amd-device-plugin my-amd-gpu amd-gpu-helm/amd-gpu --version 0.10.0
NAME: my-amd-gpu
LAST DEPLOYED: Mon Jan  8 12:38:25 2024
NAMESPACE: amd-device-plugin
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
amd-gpu-device-plugin-daemonset deployed in namespace 'amd-device-plugin'

brewsterdrinkwater · 2024-01-16T17:17:12Z

January 16th, 2024:

Need to update documentation.

anilmurty · 2024-01-16T20:23:11Z

Additional notes: We currently have a limitation (applies to both Nvidia and AMD) where we (K8s) cannot allow mixing of models on the same node. it is fine to mix models on the provider (accross) as long as each node only has GPUs of same model.

brewsterdrinkwater · 2024-01-23T17:06:36Z

January 23rd:

documentation being worked on this week.

andy108369 · 2024-01-25T11:05:00Z

pushed the AMD GPU support doc, now available at https://docs.akash.network/other-resources/experimental/amd-gpu-support

I'll go through it once more once I get the access to the AMD GPU box.

anilmurty assigned troian and andy108369 Nov 7, 2023

anilmurty transferred this issue from akash-network/community Nov 7, 2023

troian unassigned andy108369 Dec 6, 2023

troian added repo/node Akash node repo issues repo/provider Akash provider-services repo issues repo/akash-api P1 labels Dec 6, 2023

troian added a commit to akash-network/node that referenced this issue Dec 8, 2023

feat(sdl): parse amd gpu

31734b6

refs akash-network/support#142 Signed-off-by: Artur Troian <[email protected]>

troian mentioned this issue Dec 8, 2023

feat(sdl): parse amd gpu akash-network/node#1907

Closed

troian added a commit to akash-network/node that referenced this issue Dec 9, 2023

feat(sdl): parse amd gpu

120c27a

refs akash-network/support#142 Signed-off-by: Artur Troian <[email protected]>

troian added a commit to akash-network/node that referenced this issue Dec 9, 2023

feat(sdl): parse amd gpu

169a1e0

refs akash-network/support#142 Signed-off-by: Artur Troian <[email protected]>

troian added a commit to akash-network/node that referenced this issue Dec 9, 2023

feat(sdl): parse amd gpu

13d117b

refs akash-network/support#142 Signed-off-by: Artur Troian <[email protected]>

troian added a commit to akash-network/node that referenced this issue Dec 9, 2023

feat(sdl): parse amd gpu

6fc5780

refs akash-network/support#142 Signed-off-by: Artur Troian <[email protected]>

troian added a commit to akash-network/node that referenced this issue Dec 9, 2023

feat(sdl): parse amd gpu

7e1c3a6

refs akash-network/support#142 Signed-off-by: Artur Troian <[email protected]>

troian added a commit to akash-network/node that referenced this issue Dec 9, 2023

feat(sdl): parse amd gpu

0ab7d5c

refs akash-network/support#142 Signed-off-by: Artur Troian <[email protected]>

troian moved this from In Progress (prioritized) to In Test (or staging) in Core Product and Engineering Roadmap Jan 9, 2024

andy108369 self-assigned this Jan 24, 2024

brewsterdrinkwater moved this from In Test (or staging) to Released (in Prod) in Core Product and Engineering Roadmap Feb 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AMD Support #142

AMD Support #142

anilmurty commented Oct 16, 2023 •

edited

Loading

brewsterdrinkwater commented Oct 24, 2023

anilmurty commented Nov 7, 2023 •

edited

Loading

brewsterdrinkwater commented Dec 5, 2023

brewsterdrinkwater commented Dec 12, 2023

brewsterdrinkwater commented Dec 19, 2023 •

edited

Loading

andy108369 commented Dec 20, 2023

andy108369 commented Dec 22, 2023 •

edited

Loading

andy108369 commented Jan 8, 2024

brewsterdrinkwater commented Jan 16, 2024

anilmurty commented Jan 16, 2024

brewsterdrinkwater commented Jan 23, 2024

andy108369 commented Jan 25, 2024 •

edited

Loading

AMD Support #142

AMD Support #142

Comments

anilmurty commented Oct 16, 2023 • edited Loading

brewsterdrinkwater commented Oct 24, 2023

anilmurty commented Nov 7, 2023 • edited Loading

brewsterdrinkwater commented Dec 5, 2023

brewsterdrinkwater commented Dec 12, 2023

brewsterdrinkwater commented Dec 19, 2023 • edited Loading

andy108369 commented Dec 20, 2023

andy108369 commented Dec 22, 2023 • edited Loading

Test run results

Next steps:

andy108369 commented Jan 8, 2024

brewsterdrinkwater commented Jan 16, 2024

anilmurty commented Jan 16, 2024

brewsterdrinkwater commented Jan 23, 2024

andy108369 commented Jan 25, 2024 • edited Loading

anilmurty commented Oct 16, 2023 •

edited

Loading

anilmurty commented Nov 7, 2023 •

edited

Loading

brewsterdrinkwater commented Dec 19, 2023 •

edited

Loading

andy108369 commented Dec 22, 2023 •

edited

Loading

andy108369 commented Jan 25, 2024 •

edited

Loading