Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ROCm support #993

Merged
merged 19 commits into from
Jan 3, 2025
Merged

Add ROCm support #993

merged 19 commits into from
Jan 3, 2025

Conversation

CSY-ModelCloud
Copy link
Member

No description provided.

@Qubitium Qubitium changed the title add rocm support Add ROCm support Jan 2, 2025
@Qubitium
Copy link
Collaborator

Qubitium commented Jan 2, 2025

Checklist

  • Validate all kernels. Find out which works and which does not.

Marlin is not supported, exllama v1/v2 are supported

  • Detect fake cuda (rocm) from real cuda:0 device.

check torch.version.hip is not None

  • What happens if both real and fake cuda exists? Both 4090 and 7900 are active at the same time? Is this possible?

torch rocm can't visit nvidia devices and torch cuda version can't visit amd

  • Modify SUPPORTS_DEVICE property for kernels to handle ROCm compat
  • Setup should skip download for nvidia/cuda precompiled whl

@CSY-ModelCloud
Copy link
Member Author

CSY-ModelCloud commented Jan 3, 2025

https://rocm.docs.amd.com/projects/HIPIFY/en/docs-6.1.0/tables/CUDA_Device_API_supported_by_HIP.html

nv_bfloat16 has no implementation in rocm

marlin kernel

      In file included from gptqmodel_ext/marlin/marlin_hip_kernel.hip:14:
      gptqmodel_ext/marlin/marlin_hip.cuh:60:41: error: use of undeclared identifier '__cvta_generic_to_shared'
         60 |   uint32_t smem = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
            |                                         ^
      gptqmodel_ext/marlin/marlin_hip.cuh:67:18: error: invalid input constraint 'l' in asm
         67 |       "r"(smem), "l"(glob_ptr), "n"(BYTES));
            |                  ^
      gptqmodel_ext/marlin/marlin_hip.cuh:72:41: error: use of undeclared identifier '__cvta_generic_to_shared'
         72 |   uint32_t smem = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
            |                                         ^
      gptqmodel_ext/marlin/marlin_hip.cuh:77:7: error: invalid input constraint 'l' in asm
         77 |       "l"(glob_ptr), "n"(BYTES));
            |       ^
      In file included from gptqmodel_ext/marlin/marlin_hip_kernel.hip:15:
      gptqmodel_ext/marlin/marlin_dtypes_hip.cuh:49:18: error: unknown type name 'nv_bfloat16'; did you mean 'hip_bfloat16'?
         49 | class ScalarType<nv_bfloat16> {
            |                  ^~~~~~~~~~~
            |                  hip_bfloat16
      /opt/rocm-6.3.1/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_bfloat16.h:57:8: note: 'hip_bfloat16' declared here
         57 | struct hip_bfloat16
            |        ^
      In file included from gptqmodel_ext/marlin/marlin_hip_kernel.hip:15:
      gptqmodel_ext/marlin/marlin_dtypes_hip.cuh:51:20: error: unknown type name 'nv_bfloat16'; did you mean 'hip_bfloat16'?
         51 |   using scalar_t = nv_bfloat16;
            |                    ^~~~~~~~~~~
            |                    hip_bfloat16
      /opt/rocm-6.3.1/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_bfloat16.h:57:8: note: 'hip_bfloat16' declared here
         57 | struct hip_bfloat16
            |        ^
      In file included from gptqmodel_ext/marlin/marlin_hip_kernel.hip:15:
      gptqmodel_ext/marlin/marlin_dtypes_hip.cuh:52:21: error: unknown type name 'nv_bfloat162'; did you mean 'hip_bfloat16'?
         52 |   using scalar_t2 = nv_bfloat162;
            |                     ^~~~~~~~~~~~
            |                     hip_bfloat16
      /opt/rocm-6.3.1/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_bfloat16.h:57:8: note: 'hip_bfloat16' declared here
         57 | struct hip_bfloat16
            |        ^
      In file included from gptqmodel_ext/marlin/marlin_hip_kernel.hip:15:
      gptqmodel_ext/marlin/marlin_dtypes_hip.cuh:54:21: error: unknown type name 'nv_bfloat162'; did you mean 'hip_bfloat16'?
         54 |   using FragA = Vec<nv_bfloat162, 4>;
            |                     ^~~~~~~~~~~~
            |                     hip_bfloat16
      /opt/rocm-6.3.1/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_bfloat16.h:57:8: note: 'hip_bfloat16' declared here
         57 | struct hip_bfloat16
            |        ^
      In file included from gptqmodel_ext/marlin/marlin_hip_kernel.hip:15:
      gptqmodel_ext/marlin/marlin_dtypes_hip.cuh:55:21: error: unknown type name 'nv_bfloat162'; did you mean 'hip_bfloat16'?
         55 |   using FragB = Vec<nv_bfloat162, 2>;
            |                     ^~~~~~~~~~~~
            |                     hip_bfloat16
      /opt/rocm-6.3.1/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_bfloat16.h:57:8: note: 'hip_bfloat16' declared here
         57 | struct hip_bfloat16
            |        ^
      In file included from gptqmodel_ext/marlin/marlin_hip_kernel.hip:15:
      gptqmodel_ext/marlin/marlin_dtypes_hip.cuh:57:21: error: unknown type name 'nv_bfloat162'; did you mean 'hip_bfloat16'?
         57 |   using FragS = Vec<nv_bfloat162, 1>;
            |                     ^~~~~~~~~~~~
            |                     hip_bfloat16
      /opt/rocm-6.3.1/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_bfloat16.h:57:8: note: 'hip_bfloat16' declared here
         57 | struct hip_bfloat16
            |        ^
      In file included from gptqmodel_ext/marlin/marlin_hip_kernel.hip:15:
      gptqmodel_ext/marlin/marlin_dtypes_hip.cuh:58:22: error: unknown type name 'nv_bfloat162'; did you mean 'hip_bfloat16'?
         58 |   using FragZP = Vec<nv_bfloat162, 4>;
            |                      ^~~~~~~~~~~~
            |                      hip_bfloat16
      /opt/rocm-6.3.1/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_bfloat16.h:57:8: note: 'hip_bfloat16' declared here
         57 | struct hip_bfloat16
            |        ^
      gptqmodel_ext/marlin/marlin_hip_kernel.hip:92:11: error: invalid output constraint '=f' in asm
         92 |         : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3])
            |           ^
      gptqmodel_ext/marlin/marlin_hip_kernel.hip:95:47: error: unknown type name 'nv_bfloat16'; did you mean 'hip_bfloat16'?
         95 |   } else if constexpr (std::is_same<scalar_t, nv_bfloat16>::value) {
            |                                               ^~~~~~~~~~~
            |                                               hip_bfloat16
      /opt/rocm-6.3.1/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_bfloat16.h:57:8: note: 'hip_bfloat16' declared here
         57 | struct hip_bfloat16
            |        ^
      gptqmodel_ext/marlin/marlin_hip_kernel.hip:99:11: error: invalid output constraint '=f' in asm
         99 |         : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3])
            |           ^
      gptqmodel_ext/marlin/marlin_hip_kernel.hip:103:5: error: unknown type name 'nv_bfloat16'; did you mean 'hip_bfloat16'?
        103 |     STATIC_ASSERT_SCALAR_TYPE_VALID(scalar_t);
            |     ^
      gptqmodel_ext/marlin/marlin_hip_kernel.hip:19:44: note: expanded from macro 'STATIC_ASSERT_SCALAR_TYPE_VALID'
         19 |                     std::is_same<scalar_t, nv_bfloat16>::value, \
            |                                            ^
      /opt/rocm-6.3.1/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_bfloat16.h:57:8: note: 'hip_bfloat16' declared here
         57 | struct hip_bfloat16
            |        ^
      gptqmodel_ext/marlin/marlin_hip_kernel.hip:113:41: error: use of undeclared identifier '__cvta_generic_to_shared'
        113 |   uint32_t smem = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
            |                                         ^
      gptqmodel_ext/marlin/marlin_hip_kernel.hip:151:3: error: unknown type name 'nv_bfloat16'; did you mean 'hip_bfloat16'?
        151 |   STATIC_ASSERT_SCALAR_TYPE_VALID(scalar_t);
            |   ^
      gptqmodel_ext/marlin/marlin_hip_kernel.hip:19:44: note: expanded from macro 'STATIC_ASSERT_SCALAR_TYPE_VALID'
         19 |                     std::is_same<scalar_t, nv_bfloat16>::value, \
            |                                            ^
      /opt/rocm-6.3.1/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_bfloat16.h:57:8: note: 'hip_bfloat16' declared here
         57 | struct hip_bfloat16
            |        ^
      gptqmodel_ext/marlin/marlin_hip_kernel.hip:177:39: error: unknown type name 'nv_bfloat16'; did you mean 'hip_bfloat16'?
        177 | __device__ inline typename ScalarType<nv_bfloat16>::FragB
            |                                       ^~~~~~~~~~~
            |                                       hip_bfloat16
      /opt/rocm-6.3.1/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_bfloat16.h:57:8: note: 'hip_bfloat16' declared here
         57 | struct hip_bfloat16
            |        ^
      gptqmodel_ext/marlin/marlin_hip_kernel.hip:178:14: error: unknown type name 'nv_bfloat16'; did you mean 'hip_bfloat16'?
        178 | dequant_4bit<nv_bfloat16>(int q) {
            |              ^~~~~~~~~~~
            |              hip_bfloat16
      /opt/rocm-6.3.1/lib/llvm/bin/../../../include/hip/amd_detail/amd_hip_bfloat16.h:57:8: note: 'hip_bfloat16' declared here
         57 | struct hip_bfloat16
            |        ^
      fatal error: too many errors emitted, stopping now [-ferror-limit=]
      20 errors generated when compiling for gfx1030.
      failed to execute:/opt/rocm-6.3.1/lib/llvm/bin/clang++  --offload-arch=gfx900 --offload-arch=gfx906 --offload-arch=gfx908 --offload-arch=gfx90a --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx942  -I/home/work/GPTQModel/venv/lib/python3.12/site-packages/torch/include -I/home/work/GPTQModel/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/work/GPTQModel/venv/lib/python3.12/site-packages/torch/include/TH -I/home/work/GPTQModel/venv/lib/python3.12/site-packages/torch/include/THC -I/home/work/GPTQModel/venv/lib/python3.12/site-packages/torch/include/THH -I/opt/rocm-6.3.1/include -Igptqmodel_cuda -I/home/work/GPTQModel/venv/lib/python3.12/site-packages/nvidia/cuda_runtime/include -I/home/work/GPTQModel/venv/include -I/usr/include/python3.12 -c -x hip gptqmodel_ext/marlin/marlin_hip_kernel.hip -o "build/temp.linux-x86_64-cpython-312/gptqmodel_ext/marlin/marlin_hip_kernel.o" -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++17 -DENABLE_BF16 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ -diag-suppress=179,39,186 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=gptqmodel_marlin_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      error: command '/opt/rocm-6.3.1/bin/hipcc' failed with exit code 1

@Qubitium Qubitium marked this pull request as ready for review January 3, 2025 05:08
@Qubitium
Copy link
Collaborator

Qubitium commented Jan 3, 2025

AMD is genius, NOT!, for not having a rocm device.

@Qubitium Qubitium merged commit 0b6924e into main Jan 3, 2025
4 checks passed
@CSY-ModelCloud CSY-ModelCloud deleted the CSY/rocm branch January 4, 2025 02:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants