-
I have an AMD card (RX 580). I see that it uses an NVIDIA model, but can I use AMD?
-
It can't, unfortunately. The NVIDIA FasterTransformer library only supports NVIDIA GPUs. The Feature Roadmap includes supporting models from Hugging Face Transformers, which I believe should enable AMD GPU support as well.
-
If you're remotely interested in ML, just buy an NVIDIA card. This will not be the only AI application where you run into this problem in the near future.
-
I was able to run fauxpilot on an AMD GPU (RX 6700 XT) by modifying docker-compose.yaml and triton.Dockerfile.

Modified docker-compose.yaml:

```yaml
version: '3.3'
services:
  triton:
    build:
      context: .
      dockerfile: triton.Dockerfile
    # HSA_OVERRIDE_GFX_VERSION=10.3.0 forces the GFX version to 10.3.0
    # (this version appears to work with many AMD GPUs to run ROCm)
    command: bash -c "HSA_OVERRIDE_GFX_VERSION=10.3.0 CUDA_VISIBLE_DEVICES=${GPUS} mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver --model-repository=/model"
    shm_size: '2gb'
    volumes:
      - ${MODEL_DIR}:/model
      - ${HF_CACHE_DIR}:/root/.cache/huggingface
    ports:
      - "8000:8000"
      - "${TRITON_PORT}:8001"
      - "8002:8002"
    # Since the NVIDIA driver passthrough doesn't work for AMD GPUs, we run the
    # container in privileged mode (not optimal security-wise, but working)
    privileged: true
    #deploy:
    #  resources:
    #    reservations:
    #      devices:
    #        - driver: nvidia
    #          count: all
    #          capabilities: [gpu]
  copilot_proxy:
    # For the Docker Hub version
    # image: moyix/copilot_proxy:latest
    # For a local build
    build:
      context: .
      dockerfile: proxy.Dockerfile
    command: uvicorn app:app --host 0.0.0.0 --port 5000
    env_file:
      # Automatically created via ./setup.sh
      - .env
    ports:
      - "${API_EXTERNAL_PORT}:5000"
```

Modified triton.Dockerfile:

```dockerfile
FROM moyix/triton_with_ft:22.09
# Install dependencies: torch
# Here we use the ROCm build of PyTorch (see https://pytorch.org/get-started/locally/)
RUN python3 -m pip install --disable-pip-version-check -U torch --extra-index-url https://download.pytorch.org/whl/rocm5.4.2
RUN python3 -m pip install --disable-pip-version-check -U transformers bitsandbytes accelerate
```

With only those modifications, you should be able to run fauxpilot on AMD GPUs.
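As a quick sanity check, you can confirm from the host that the ROCm build of PyTorch is active inside the running Triton container. This is a minimal sketch: the container name `fauxpilot-triton-1` is an assumption (yours may differ; check `docker ps`).

```shell
# Assumption: the triton service container is named fauxpilot-triton-1
# (check the actual name with `docker ps`).
# torch.version.hip is a string on ROCm builds of PyTorch and None on CUDA builds.
docker exec fauxpilot-triton-1 python3 -c "import torch; print(torch.version.hip)"

# The GFX override can also be tested interactively; on ROCm, torch.cuda.*
# reports the HIP device, so is_available() should print True if the GPU works.
docker exec -e HSA_OVERRIDE_GFX_VERSION=10.3.0 fauxpilot-triton-1 \
  python3 -c "import torch; print(torch.cuda.is_available())"
```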