Add support for AMDGPU #6434
It seems better to add cmake at the end (after step 5).
Issue: #6434
### Brief Summary
Add PR tag for amdgpu
Here is the task list.
Amazing work. I'm also considering this task since our new HPC uses the Sugon gtx106 (known as Vega 20). I am quite interested: is there any possibility to cooperate? Besides,
@ITCJ Certainly! Now, I have implemented a prototype for
@galeselee yeah, it's
AND,
Issue: #6434
### Brief Summary
This Dockerfile is based on Dockerfile.ubuntu.20.04. The differences are the following:
1. The base docker image.
2. Remove python3-pip and mesa-common-dev, which are already installed in the base image.
3. Remove the Vulkan module (amdgpu is not yet adapted to Vulkan).
4. Add the user to the `video` group. (Some Ubuntu 20.04 setups also require the `render` group.)
Issue: #6434
### Brief Summary
Add the Docker logic for AMDGPU CI:
1. The LLVM used, which only contains the AMDGPU and X86 targets, comes from the docker image.
2. Using AMDGPU in Docker requires that `/dev/kfd` and the `/dev/dri` directory be mounted. The `dev` user has no permission to access these character devices by default.
3. `TI_WITH_CUDA=OFF`
4. `TI_RUN_RELEASE_TESTS=OFF`
5. Currently only CPU-related tests are run.
Issue: #6434
### Brief Summary
1. This is a special part of the Taichi runtime module for the `AMDGPU` backend. The `LLVM IR` that Taichi's runtime module generates with `clang++` for `AMDGPU` differs from the CPU-generated `LLVM IR` in how memory allocation is handled. The following is an example.
```
C/C++ code
void func(int *a, int *b) {
    *a = *b;
}

x86_64 backend LLVM IR
define dso_local void @cpu_func(i32* %0, i32* %1) #2 {
  %3 = alloca i32*, align 8
  %4 = alloca i32*, align 8
  store i32* %0, i32** %3, align 8
  store i32* %1, i32** %4, align 8
  %5 = load i32*, i32** %4, align 8
  %6 = load i32, i32* %5, align 4
  %7 = load i32*, i32** %3, align 8
  store i32 %6, i32* %7, align 4
  ret void
}

__global__ function on AMDGPU
define protected amdgpu_kernel void @global_func(i32 addrspace(1)* %0, i32 addrspace(1)* %1) #4 {
  %3 = alloca i32*, align 8, addrspace(5)
  %4 = alloca i32*, align 8, addrspace(5)
  %5 = alloca i32*, align 8, addrspace(5)
  %6 = alloca i32*, align 8, addrspace(5)
  %7 = addrspacecast i32* addrspace(5)* %3 to i32**
  %8 = addrspacecast i32* addrspace(5)* %4 to i32**
  %9 = addrspacecast i32* addrspace(5)* %5 to i32**
  %10 = addrspacecast i32* addrspace(5)* %6 to i32**
  %11 = addrspacecast i32 addrspace(1)* %0 to i32*
  store i32* %11, i32** %7, align 8
  %12 = load i32*, i32** %7, align 8
  %13 = addrspacecast i32 addrspace(1)* %1 to i32*
  store i32* %13, i32** %8, align 8
  %14 = load i32*, i32** %8, align 8
  store i32* %12, i32** %9, align 8
  store i32* %14, i32** %10, align 8
  %15 = load i32*, i32** %10, align 8
  %16 = load i32, i32* %15, align 4
  %17 = load i32*, i32** %9, align 8
  store i32 %16, i32* %17, align 4
  ret void
}

__device__ function on AMDGPU
define hidden void @device_func(i32* %0, i32* %1) #2 {
  %3 = alloca i32*, align 8, addrspace(5)
  %4 = alloca i32*, align 8, addrspace(5)
  %5 = addrspacecast i32* addrspace(5)* %3 to i32**
  %6 = addrspacecast i32* addrspace(5)* %4 to i32**
  store i32* %0, i32** %5, align 8
  store i32* %1, i32** %6, align 8
  %7 = load i32*, i32** %6, align 8
  %8 = load i32, i32* %7, align 4
  %9 = load i32*, i32** %5, align 8
  store i32 %8, i32* %9, align 4
  ret void
}
```
2. The differences center on the `allocainst`, specifically its addrspace (for `AMDGPU`, [this](https://llvm.org/docs/AMDGPUUsage.html#address-spaces) will be helpful). I have not found documentation describing how to write correct `LLVM IR` for `AMDGPU`, so this is based on my observation of the `LLVM IR` generated by `clang++/hipcc`. For the `__global__` function we need to deal with both the function arguments and the `allocainst` (specifying the addrspace of the `allocainst` and performing the addrspace cast), while for the `__device__` function the arguments do not need special handling.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
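For readers unfamiliar with how such IR is produced programmatically, the following is a minimal sketch (not the actual Taichi code; the helper name is illustrative) of creating an alloca in the AMDGPU private address space and casting it back to the generic address space, mirroring what `clang++/hipcc` emits above:

```cpp
// Minimal sketch: create an alloca in addrspace(5) (AMDGPU private/scratch)
// and addrspacecast it to a generic (addrspace 0) pointer so the rest of the
// CPU-style IR can keep using plain pointers.
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Module.h"

llvm::Value *create_private_alloca(llvm::IRBuilder<> &builder,
                                   llvm::Type *elem_ty) {
  // alloca placed in the AMDGPU private address space.
  auto *alloca = builder.CreateAlloca(elem_ty, /*AddrSpace=*/5);
  // Cast the addrspace(5) pointer to a generic pointer type.
  auto *generic_ptr_ty = llvm::PointerType::get(elem_ty, /*AddressSpace=*/0);
  return builder.CreateAddrSpaceCast(alloca, generic_ptr_ty);
}
```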
Issue: #6434
Add some unit tests at the AMDGPU RHI level.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Issue: #6434
### Brief Summary
It contains four parts (`driver`, `context`, `device` and `caching_allocator`). The code is similar to `cuda/rhi`; however, there are still some differences between `amdgpu/rhi` and `cuda/rhi`.
#### context
1. The method of obtaining the hardware version
2. `Context::launch`
#### driver
1. ROCm/HIP internal functions
#### cmake
The current cmake build system is sufficient to support the unit tests in #6597

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
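As a rough illustration of how a driver layer like this typically works (a minimal sketch under stated assumptions, not the actual `amdgpu/rhi` code), the HIP runtime can be loaded dynamically and its module-launch entry point resolved at runtime, analogous to resolving symbols from `libcuda.so` in the CUDA driver wrapper:

```cpp
// Minimal sketch: dynamically load libamdhip64.so and resolve
// hipModuleLaunchKernel. The typedef below only mirrors the general shape of
// the call (same layout as cuLaunchKernel); it is not invoked here.
#include <dlfcn.h>
#include <cstdio>

using hipModuleLaunchKernel_t =
    int (*)(void *func, unsigned grid_x, unsigned grid_y, unsigned grid_z,
            unsigned block_x, unsigned block_y, unsigned block_z,
            unsigned shared_mem_bytes, void *stream, void **kernel_params,
            void **extra);

int main() {
  void *lib = dlopen("libamdhip64.so", RTLD_NOW | RTLD_GLOBAL);
  if (!lib) {
    std::fprintf(stderr, "failed to load HIP runtime: %s\n", dlerror());
    return 1;
  }
  auto launch = reinterpret_cast<hipModuleLaunchKernel_t>(
      dlsym(lib, "hipModuleLaunchKernel"));
  std::printf("hipModuleLaunchKernel %s\n", launch ? "resolved" : "missing");
  dlclose(lib);
  return 0;
}
```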
…7023) Issue: taichi-dev#6434
### Brief Summary
These unit tests are for taichi-dev#6486

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Issue: taichi-dev#6434
### Brief Summary
1. Add a dedicated way to process kernel arguments.
2. Update the `launch` API in `AMDGPUContext`.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
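For context on what a launch path in `AMDGPUContext` can look like, the HIP module API accepts a packed argument buffer through the `extra` parameter of `hipModuleLaunchKernel`. The sketch below shows that packing pattern assuming a single contiguous argument buffer; it is illustrative, not the actual Taichi code:

```cpp
// Minimal sketch: pass a packed argument buffer to a HIP module kernel via
// the `extra` parameter, which HIP inherits from the CUDA driver API.
#include <hip/hip_runtime.h>
#include <cstddef>

hipError_t launch_packed(hipFunction_t func, void *arg_buffer,
                         std::size_t arg_buffer_size, unsigned grid_dim,
                         unsigned block_dim, hipStream_t stream) {
  void *config[] = {
      HIP_LAUNCH_PARAM_BUFFER_POINTER, arg_buffer,
      HIP_LAUNCH_PARAM_BUFFER_SIZE,    &arg_buffer_size,
      HIP_LAUNCH_PARAM_END};
  // kernelParams is null because all arguments travel through `config`.
  return hipModuleLaunchKernel(func, grid_dim, 1, 1, block_dim, 1, 1,
                               /*sharedMemBytes=*/0, stream,
                               /*kernelParams=*/nullptr, config);
}
```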
Issue: taichi-dev#6434
### Brief Summary
This part contains the `CHI IR -> LLVM IR` stage. Similar to part0, the skeleton of this part is close to `cuda`. Some important notes follow.
1. `ocml` is the math library, like `libdevice` on `nvidia`. It provides most of the functions we need; the remainder, such as `sgn` and `abs`, has been implemented by hand (see the sketch below).
2. Kernel parameter passing differs from `CUDA`: `extra_args` is the only way arguments can be accepted by the device kernel.
3. It still uses jargon from `nvidia` (e.g. `sm`, `block` and `grid`).

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
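To illustrate what calling into `ocml` from generated IR looks like, here is a minimal sketch; the wrapper name and the choice of `__ocml_sin_f32` are illustrative assumptions, not the actual Taichi codegen:

```cpp
// Minimal sketch: declare an ocml math function in the module being built and
// emit a call to it, the way libdevice intrinsics are used on CUDA. The body
// becomes available once the ocml bitcode is linked into the module.
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Module.h"

llvm::Value *emit_ocml_sin_f32(llvm::Module &module,
                               llvm::IRBuilder<> &builder, llvm::Value *x) {
  auto &ctx = module.getContext();
  auto *fn_ty = llvm::FunctionType::get(llvm::Type::getFloatTy(ctx),
                                        {llvm::Type::getFloatTy(ctx)},
                                        /*isVarArg=*/false);
  // Returns the existing declaration or inserts a new one.
  auto callee = module.getOrInsertFunction("__ocml_sin_f32", fn_ty);
  return builder.CreateCall(callee, {x});
}
```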
Issue: taichi-dev#6434
### Brief Summary
1. This is the third part of adding the amdgpu backend: the runtime part of the implementation. The main job of the runtime is turning LLVM IR into GCN ISA/object code and then into hsaco (a file format accepted by the module-launch API provided by HIP).
2. After calling the relevant API to generate the GCN ISA/object, the LLVM linker (`ld.lld`) needs to be called to produce the hsaco file, so there is a command-line call `ld.lld -shared xxx.o -o xxx.hsaco` in the code, and the temporarily generated files are stored in the `/tmp/taichi_hsaco/` folder (see the sketch after this list).
3. To deal with multiple `hsaco` files being generated at the same time, a random number is used to name the generated files, as follows: `JITSessionAMDGPU` has a `random_num_` and a `tmp_dir_` which are assigned when the `JITSessionAMDGPU` instance is created. Each `ti.kernel` is divided into offload tasks, each of which is compiled into a separate `hsaco` file. A random number bound to the `hsaco` file is obtained when the `hsaco` file is generated. Here is an example of the files after running `ti example mpm128`:
```
taichi_hsaco/
└── 4858208420434830779
    ├── taichi_amdgcn_10476395765980093855.hsaco
    ├── taichi_amdgcn_10476395765980093855.o
    ├── taichi_amdgcn_11369096326162657620.hsaco
    ├── taichi_amdgcn_11369096326162657620.o
    ├── taichi_amdgcn_11700031850871498261.hsaco
    ├── taichi_amdgcn_11700031850871498261.o
    ├── taichi_amdgcn_14803499569653867868.hsaco
    ├── taichi_amdgcn_14803499569653867868.o
    ├── taichi_amdgcn_14949458395707884954.hsaco
    ├── taichi_amdgcn_14949458395707884954.o
    ├── taichi_amdgcn_15955762247261446379.hsaco
    ├── taichi_amdgcn_15955762247261446379.o
    ├── taichi_amdgcn_16891452471041191610.hsaco
    ├── taichi_amdgcn_16891452471041191610.o
    ├── taichi_amdgcn_17615766226135707772.hsaco
    ├── taichi_amdgcn_17615766226135707772.o
    ├── taichi_amdgcn_18033844193337069056.hsaco
    ├── taichi_amdgcn_18033844193337069056.o
    ├── taichi_amdgcn_5951151729973841331.hsaco
    ├── taichi_amdgcn_5951151729973841331.o
    ├── taichi_amdgcn_6012043323411824926.hsaco
    ├── taichi_amdgcn_6012043323411824926.o
    ├── taichi_amdgcn_6796840558965541322.hsaco
    ├── taichi_amdgcn_6796840558965541322.o
    ├── taichi_amdgcn_6835984424286808860.hsaco
    ├── taichi_amdgcn_6835984424286808860.o
    ├── taichi_amdgcn_7872622170129629907.hsaco
    ├── taichi_amdgcn_7872622170129629907.o
    ├── taichi_amdgcn_8760441738982760858.hsaco
    ├── taichi_amdgcn_8760441738982760858.o
    ├── taichi_amdgcn_9006625347419529255.hsaco
    └── taichi_amdgcn_9006625347419529255.o
```

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
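A minimal sketch of the object-to-hsaco step described above, assuming the object file has already been emitted to disk; the helper name and the use of `std::system` are illustrative, not the actual `JITSessionAMDGPU` code:

```cpp
// Minimal sketch: link an AMDGPU object file into an hsaco shared object with
// ld.lld, using the same command shape quoted in the summary above.
#include <cstdlib>
#include <string>

bool link_object_to_hsaco(const std::string &obj_path,
                          const std::string &hsaco_path) {
  // e.g. obj_path   = "/tmp/taichi_hsaco/<random>/taichi_amdgcn_<id>.o"
  //      hsaco_path = "/tmp/taichi_hsaco/<random>/taichi_amdgcn_<id>.hsaco"
  const std::string cmd = "ld.lld -shared " + obj_path + " -o " + hsaco_path;
  return std::system(cmd.c_str()) == 0;
}
```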
Issue: taichi-dev#6434
### Brief Summary
1. Put the amdgpu-related bitcode files in `external/amdgpu_libdevice` (thus Taichi just needs `libamdhip64.so`).
2. Link the amdgpu-related bc files and the Taichi module together (precisely, clone the bc files into the Taichi module).

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
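As an illustration of what linking bitcode files into a module can look like with LLVM's C++ API (a minimal sketch assuming LLVM 14-style APIs; the function name and file path are illustrative):

```cpp
// Minimal sketch: parse a libdevice-style bitcode file and link it into the
// destination module so its definitions become available to generated code.
#include <memory>
#include <string>
#include "llvm/IR/Module.h"
#include "llvm/IRReader/IRReader.h"
#include "llvm/Linker/Linker.h"
#include "llvm/Support/SourceMgr.h"

bool link_bitcode_into(llvm::Module &dest, const std::string &bc_path) {
  llvm::SMDiagnostic err;
  std::unique_ptr<llvm::Module> src =
      llvm::parseIRFile(bc_path, err, dest.getContext());
  if (!src)
    return false;
  // linkModules returns true on error; the source module is consumed.
  return !llvm::Linker::linkModules(dest, std::move(src));
}
```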
Issue: taichi-dev#6434
### Brief Summary
1. Update `runtime/llvm/runtime_module/runtime.cpp` to enable the AMDGPU backend.
Issue: taichi-dev#6434
### Brief Summary
1. Refactor `add_struct_for_func` into an LLVM pass.
2. Replace `bool spmd` with `string spmd` to distinguish `amdgpu` and `nvgpu`.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Issue: taichi-dev#6434
### Brief Summary
1. Enable the amdgpu API in Taichi (except struct-for).

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Issue: taichi-dev#6434
### Brief Summary
1. Fix amdgpu backend bugs:
   a. codegen typos;
   b. add support for more types in `sgn`;
   c. use a temporary method to handle the `runtime.cpp` error.
2. Enable the amdgpu backend Python unit tests:
   a. because of the lack of `print` support, temporarily disable all `print`-related tests on the amdgpu backend;
   b. there is still something wrong with the `gdar_mpm` and `ad_if` tests.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Issue: taichi-dev#6434
### Brief Summary
1. Currently only the default (event) profiler is available on AMDGPU.
2. Here is a screenshot: ![image](https://user-images.githubusercontent.com/47965866/217734581-4c7f7fa7-4d17-4243-b4bd-0a70d1c88f4a.png)

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Issue: taichi-dev#6434
### Brief Summary

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Is there any plan to support the CDNA or CDNA2 architecture? I hope to run my Taichi code on MI-series GPUs.
Actually, you can run your Taichi code on a GPU with the CDNA or CDNA2 architecture with a little modification. (I have tested some cases on an MI210.)
How? I thought the current Taichi doesn't support AMD GPUs.
It looks like the amdgpu backend hasn't been released yet; you can either do the compilation yourself (I can sort out how to compile it if you need help) or try using the Vulkan backend.
I've managed to build Taichi for AMDGPU, but one problem I ran into is that recent (5.3+) versions of ROCm require PCIe atomics, which are not available on my setup. The most recent version, 5.7, includes a workaround for setups without PCIe atomics: https://rocm.docs.amd.com/en/latest/release.html#non-hostcall-hip-printf. I had a go at building Taichi against LLVM 17, which is the version of LLVM that ROCm 5.7 works with. I haven't been able to test those changes because they require the ROCm 5.7 runtime to be installed, and it isn't in the Arch Linux repositories yet. Here are the changes I've had to make: expenses@800c693. I'm not sure if this information is super useful, but it might help some people.
I'd just like to note that the AMDGPU backend ( Currently the latest LLVM version Taichi officially supports is LLVM 15. I have LLVM 16 working locally, but LLVM 17+ needs some thought because of the need (or not?) to migrate to LLVM's new pass manager infrastructure (e.g.
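For reference, migrating to the new pass manager generally means replacing legacy `PassManager` setup with `PassBuilder`-driven pipelines. Below is a minimal sketch of running a default optimization pipeline under the new infrastructure; it is illustrative only, not Taichi's actual setup:

```cpp
// Minimal sketch: run LLVM's default O2 pipeline through the new pass manager
// (PassBuilder), which newer LLVM releases push codebases toward.
#include "llvm/IR/Module.h"
#include "llvm/Passes/PassBuilder.h"

void run_default_o2(llvm::Module &module) {
  llvm::LoopAnalysisManager lam;
  llvm::FunctionAnalysisManager fam;
  llvm::CGSCCAnalysisManager cgam;
  llvm::ModuleAnalysisManager mam;

  llvm::PassBuilder pb;
  // Register all analyses and wire the managers together.
  pb.registerModuleAnalyses(mam);
  pb.registerCGSCCAnalyses(cgam);
  pb.registerFunctionAnalyses(fam);
  pb.registerLoopAnalyses(lam);
  pb.crossRegisterProxies(lam, fam, cgam, mam);

  llvm::ModulePassManager mpm =
      pb.buildPerModuleDefaultPipeline(llvm::OptimizationLevel::O2);
  mpm.run(module, mam);
}
```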
@GZGavinZhao Thanks for sharing; I'm very glad to know this. I have previous experience migrating from a lower version of LLVM to 14. Some of the LLVM code needs to be changed according to the LLVM docs because of changes in the LLVM API, like
Purpose
I find that Taichi users have a demand for an AMDGPU backend (related issue: #4586).
There is also a discussion of the AMD GPU backend here: #412
Thus, I would like to add an AMDGPU backend to the compiler so that my 6900XT can be utilized.
Solution
The goal is to give Taichi preliminary support for AMDGPU via LLVM 14 and ROCm.
I will complete the following steps one by one: