Support for ROCM 6 #82

Open
jalberto opened this issue May 13, 2024 · 14 comments

@jalberto

jalberto commented May 13, 2024

ROCm 5.6 seems to kind of work, but it takes far too much back and forth to get everything working. The new Fedora 40 brings official ROCm support, but only starting with ROCm 6.

I am using this config from #63

Mix.install(
  [
    {:web_driver_client, "~> 0.2.0"},
    {:kino, "~> 0.12.3"},
    {:req, "~> 0.4.14"},
    {:erlexec, "~> 2.0"},
    {:nx, github: "elixir-nx/nx", sparse: "nx", override: true},
    {:exla, github: "elixir-nx/nx", sparse: "exla", override: true}
  ],
  system_env: %{
    "XLA_ARCHIVE_URL" =>
      "https://static.jonatanklosko.com/builds/0.6.0/xla_extension-x86_64-linux-gnu-rocm.tar.gz",
    "ROCM_PATH" => "/usr/lib64/rocm/"
  },
  config: [nx: [default_backend: {EXLA.Backend, client: :host}]]
)

I managed to find every package it was asking for (this took a while of back and forth) until I reached this:

18:36:37.767 [warning] The on_load function for module Elixir.EXLA.NIF returned:
{:error,
 {:load_failed,
  ~c"Failed to load NIF library /home/ja/.cache/mix/installs/elixir-1.16.2-erts-14.2.5/f3927a87654a1bf097d7e31b6277a9f8/_build/dev/lib/exla/priv/libexla: 'librocblas.so.3: cannot open shared object file: No such file or directory'"}}

My guess is that xla_extension needs to be built for ROCm 6 (librocblas.so.4). I tried to build it myself, but the requirements are too far off from the current system (gcc versions and so on).
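One way to check this kind of soname mismatch is to list the rocBLAS libraries that are actually installed and compare them with the name in the load error. The following is a minimal sketch, not part of the original report; the library directories are assumptions and will differ between distributions.

```elixir
# Assumption: ROCm libraries live under one of these directories; adjust for your distro.
lib_dirs = ["/usr/lib64", "/opt/rocm/lib"]

# Collect every installed librocblas file name (Path.wildcard returns [] for missing dirs).
installed =
  lib_dirs
  |> Enum.flat_map(&Path.wildcard(Path.join(&1, "librocblas.so*")))
  |> Enum.map(&Path.basename/1)
  |> Enum.uniq()

IO.inspect(installed, label: "installed librocblas files")

# The soname the precompiled libexla asked for in the error above.
required = "librocblas.so.3"
IO.puts("#{required} present? #{required in installed}")
```

If the required name is missing while a different major version is listed, the archive was most likely built against another ROCm release than the one installed.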

It would be great if there were official XLA binaries for different ROCm versions, as there are for CUDA.

I understand ROCm support is low priority, but it is a really nice way to get started in AI, as it works well on Linux.

@jalberto
Author

I am also trying to reproduce the build by using the provided dockerfiles, but I always get errors:

[3,765 / 6,478] Compiling mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp; 22s local ... (16 actions, 15 running)
ERROR: /app/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/xla/service/gpu/BUILD:1158:23: Compiling xla/service/gpu/cub_sort_kernel.cu.cc failed: (Exit 1): crosstool_wrapper_driver_is_not_gcc failed: error executing command (from target //xla/service/gpu:cub_sort_kernel_u32) external/local_config_rocm/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer ... (remaining 100 arguments skipped)
clang++: warning: argument unused during compilation: '-fcuda-flush-denormals-to-zero' [-Wunused-command-line-argument]
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr41 = V_MOV_B32_dpp undef $vgpr41(tied-def 0), $vgpr4, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr4 = V_MOV_B32_dpp undef $vgpr4(tied-def 0), killed $vgpr3, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr3 = V_MOV_B32_dpp undef $vgpr3(tied-def 0), $vgpr2, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr101 = V_MOV_B32_dpp undef $vgpr101(tied-def 0), $vgpr99, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr98 = V_MOV_B32_dpp undef $vgpr98(tied-def 0), $vgpr96, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr101 = V_MOV_B32_dpp undef $vgpr101(tied-def 0), $vgpr99, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr98 = V_MOV_B32_dpp undef $vgpr98(tied-def 0), $vgpr96, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr42 = V_MOV_B32_dpp undef $vgpr42(tied-def 0), $vgpr8, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr101 = V_MOV_B32_dpp undef $vgpr101(tied-def 0), $vgpr99, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr98 = V_MOV_B32_dpp undef $vgpr98(tied-def 0), $vgpr96, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr101 = V_MOV_B32_dpp undef $vgpr101(tied-def 0), $vgpr99, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr98 = V_MOV_B32_dpp undef $vgpr98(tied-def 0), $vgpr96, 322, 15, 15, 0, implicit $exec
12 errors generated when compiling for gfx1036.
Target //xla/extension:xla_extension failed to build

@jonatanklosko
Member

Did you try building by setting the XLA revision as in #63 (comment)?

Setting up the right environment for building was an issue before, that's why we have the Dockerfile. I don't know about ROCm 6; my best bet is that updating to a newer XLA could fix the build, but that usually involves changes to EXLA too. I think it would be a good idea to update sometime soon anyway, but no guarantees.

You could perhaps use Docker with 5.6 for computations/experimentation altogether, though I get it's not very convenient.

@jonatanklosko
Member

@jalberto I updated to the latest XLA revision and EXLA main already uses that. I tried building with ROCm 5.7, but there were errors indicating that XLA already assumes 6.0 (using symbols defined in 6.0+). So I updated the Docker image and managed to successfully build with ROCm 6.0.

Please try XLA_ARCHIVE_URL=https://static.jonatanklosko.com/builds/0.7.0/xla_extension-x86_64-linux-gnu-rocm.tar.gz and nx/exla main. If it doesn't work, you can also try building locally.
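As a rough sketch of what that suggestion might look like in a notebook setup cell, the configuration from the first comment can be reused with only the archive URL changed. The dependency list is trimmed to nx and exla for brevity, and the `ROCM_PATH` value is the Fedora location from the original report; both are assumptions to adapt.

```elixir
Mix.install(
  [
    # nx/exla from main, as suggested above
    {:nx, github: "elixir-nx/nx", sparse: "nx", override: true},
    {:exla, github: "elixir-nx/nx", sparse: "exla", override: true}
  ],
  system_env: %{
    # Archive built against ROCm 6.0
    "XLA_ARCHIVE_URL" =>
      "https://static.jonatanklosko.com/builds/0.7.0/xla_extension-x86_64-linux-gnu-rocm.tar.gz",
    # Assumption: ROCm location on Fedora, taken from the original report
    "ROCM_PATH" => "/usr/lib64/rocm/"
  },
  # Backend config copied unchanged from the original report
  config: [nx: [default_backend: {EXLA.Backend, client: :host}]]
)
```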

@jalberto
Author

Thanks @jonatanklosko, I will test and report back.

@jalberto
Author

jalberto commented Jun 7, 2024

@jonatanklosko sorry for the delay. Now I have a different error:

: CommandLine Error: Option 'x86-disable-avoid-SFB' registered more than once!
LLVM ERROR: inconsistency in registered CommandLine options

@jonatanklosko
Member

@jalberto does it happen when loading the precompiled binary or during the build?

@jalberto
Author

jalberto commented Jun 7, 2024

image

That is what happens when I try to rebuild without cache. The LLVM error shows up in the console when I start the Livebook server.

@jalberto
Author

jalberto commented Jun 7, 2024

@jonatanklosko in case it helps:

* Getting nx (https://github.com/elixir-nx/nx.git - origin/main)
remote: Enumerating objects: 22709, done.        
remote: Counting objects: 100% (4025/4025), done.        
remote: Compressing objects: 100% (780/780), done.        
remote: Total 22709 (delta 3456), reused 3661 (delta 3202), pack-reused 18684        
* Getting exla (https://github.com/elixir-nx/nx.git - origin/main)
remote: Enumerating objects: 22709, done.        
remote: Counting objects: 100% (4047/4047), done.        
remote: Compressing objects: 100% (776/776), done.        
remote: Total 22709 (delta 3480), reused 3687 (delta 3228), pack-reused 18662        
Resolving Hex dependencies...
Resolution completed in 0.126s
New:
  castore 1.0.7
  certifi 2.12.0
  complex 0.5.0
  elixir_make 0.8.4
  erlexec 2.0.6
  finch 0.18.0
  fss 0.1.1
  hackney 1.20.1
  hpax 0.2.0
  idna 6.1.1
  jason 1.4.1
  kino 0.12.3
  metrics 1.0.1
  mime 2.0.5
  mimerl 1.3.0
  mint 1.6.0
  nimble_options 1.1.1
  nimble_ownership 0.3.1
  nimble_pool 1.1.0
  parse_trans 3.4.1
  req 0.4.14
  ssl_verify_fun 1.1.7
  table 0.1.2
  telemetry 1.2.1
  tesla 1.9.0
  unicode_util_compat 0.7.0
  web_driver_client 0.2.0
  xla 0.7.0
* Getting web_driver_client (Hex package)
* Getting kino (Hex package)
* Getting req (Hex package)
* Getting erlexec (Hex package)
* Getting telemetry (Hex package)
* Getting xla (Hex package)
* Getting elixir_make (Hex package)
* Getting nimble_pool (Hex package)
* Getting complex (Hex package)
* Getting finch (Hex package)
* Getting jason (Hex package)
* Getting mime (Hex package)
* Getting nimble_ownership (Hex package)
* Getting castore (Hex package)
* Getting mint (Hex package)
* Getting nimble_options (Hex package)
* Getting hpax (Hex package)
* Getting fss (Hex package)
* Getting table (Hex package)
* Getting hackney (Hex package)
* Getting tesla (Hex package)
* Getting certifi (Hex package)
* Getting idna (Hex package)
* Getting metrics (Hex package)
* Getting mimerl (Hex package)
* Getting parse_trans (Hex package)
* Getting ssl_verify_fun (Hex package)
* Getting unicode_util_compat (Hex package)
==> table
Compiling 5 files (.ex)
Generated table app
==> mime
Compiling 1 file (.ex)
Generated mime app
==> nimble_options
Compiling 3 files (.ex)
Generated nimble_options app
===> Analyzing applications...
===> Compiling unicode_util_compat
===> Analyzing applications...
===> Compiling idna
===> Analyzing applications...
===> Compiling telemetry
==> jason
Compiling 10 files (.ex)
Generated jason app
==> hpax
Compiling 4 files (.ex)
Generated hpax app
===> Analyzing applications...
===> Compiling mimerl
==> ssl_verify_fun
Compiling 7 files (.erl)
Generated ssl_verify_fun app
==> fss
Compiling 4 files (.ex)
Generated fss app
==> complex
Compiling 2 files (.ex)
Generated complex app
==> nx
Compiling 35 files (.ex)
Generated nx app
==> kino
Compiling 47 files (.ex)
Generated kino app
===> Analyzing applications...
===> Compiling certifi
===> Analyzing applications...
===> Compiling parse_trans
==> nimble_pool
Compiling 2 files (.ex)
Generated nimble_pool app
===> Fetching rebar3_hex v7.0.7
===> Fetching hex_core v0.8.4
===> Fetching verl v1.1.1
===> Analyzing applications...
===> Compiling hex_core
===> Compiling verl
===> Compiling rebar3_hex
===> Fetching rebar3_ex_doc v0.2.22
===> Analyzing applications...
===> Compiling rebar3_ex_doc
make: Entering directory '/home/ja/.cache/mix/installs/elixir-1.16.2-erts-14.2.5/946037843196e7227084dde47bdabba6/deps/erlexec/c_src'
g++ -g -std=c++11 -finline-functions -Wall -DHAVE_PTRACE -MMD -DUSE_POLL=1 -O3 -DNDEBUG -DHAVE_SETRESUID -DHAVE_PIPE2   -I/home/ja/.local/share/mise/installs/erlang/26.2.5/erts-14.2.5/include -I/home/ja/.local/share/mise/installs/erlang/26.2.5/lib/erl_interface-5.5.1/include  -c -o ei++.o ei++.cpp
g++ -g -std=c++11 -finline-functions -Wall -DHAVE_PTRACE -MMD -DUSE_POLL=1 -O3 -DNDEBUG -DHAVE_SETRESUID -DHAVE_PIPE2   -I/home/ja/.local/share/mise/installs/erlang/26.2.5/erts-14.2.5/include -I/home/ja/.local/share/mise/installs/erlang/26.2.5/lib/erl_interface-5.5.1/include  -c -o exec.o exec.cpp
g++ -g -std=c++11 -finline-functions -Wall -DHAVE_PTRACE -MMD -DUSE_POLL=1 -O3 -DNDEBUG -DHAVE_SETRESUID -DHAVE_PIPE2   -I/home/ja/.local/share/mise/installs/erlang/26.2.5/erts-14.2.5/include -I/home/ja/.local/share/mise/installs/erlang/26.2.5/lib/erl_interface-5.5.1/include  -c -o exec_impl.o exec_impl.cpp
mkdir -p /home/ja/.cache/mix/installs/elixir-1.16.2-erts-14.2.5/946037843196e7227084dde47bdabba6/deps/erlexec/priv/x86_64-redhat-linux/
mkdir -p "/home/ja/.cache/mix/installs/elixir-1.16.2-erts-14.2.5/946037843196e7227084dde47bdabba6/deps/erlexec/priv/x86_64-redhat-linux/"
g++ ei++.o exec.o exec_impl.o -L/home/ja/.local/share/mise/installs/erlang/26.2.5/lib/erl_interface-5.5.1/lib -lei -o /home/ja/.cache/mix/installs/elixir-1.16.2-erts-14.2.5/946037843196e7227084dde47bdabba6/deps/erlexec/priv/x86_64-redhat-linux/exec-port
make: Leaving directory '/home/ja/.cache/mix/installs/elixir-1.16.2-erts-14.2.5/946037843196e7227084dde47bdabba6/deps/erlexec/c_src'
===> Analyzing applications...
===> Compiling erlexec
===> Analyzing applications...
===> Compiling metrics
===> Analyzing applications...
===> Compiling hackney
==> castore
Compiling 1 file (.ex)
Generated castore app
==> elixir_make
Compiling 8 files (.ex)
Generated elixir_make app
==> xla
Compiling 2 files (.ex)
Generated xla app
==> exla
Unpacking /home/ja/.cache/xla/0.7.0/cache/external/xla_extension-4j534fd5eueir3oelhrj2pvadm.tar.gz into /home/ja/.cache/mix/installs/elixir-1.16.2-erts-14.2.5/946037843196e7227084dde47bdabba6/deps/exla/exla/cache
Using libexla.so from /home/ja/.cache/xla/exla/elixir-1.16.2-erts-14.2.5-xla-0.7.0-exla-0.7.1-4hm2i3sdtzvi2nwhnlfl4jx27u/libexla.so
g++ -fPIC -I/home/ja/.local/share/mise/installs/erlang/26.2.5/erts-14.2.5/include -Icache/xla_extension/include -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -std=c++17 -w -DLLVM_VERSION_STRING= -O3 -c c_src/exla/exla.cc -o cache/objs/exla.o
g++ -fPIC -I/home/ja/.local/share/mise/installs/erlang/26.2.5/erts-14.2.5/include -Icache/xla_extension/include -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -std=c++17 -w -DLLVM_VERSION_STRING= -O3 -c c_src/exla/exla_mlir.cc -o cache/objs/exla_mlir.o
g++ -fPIC -I/home/ja/.local/share/mise/installs/erlang/26.2.5/erts-14.2.5/include -Icache/xla_extension/include -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -std=c++17 -w -DLLVM_VERSION_STRING= -O3 -c c_src/exla/custom_calls.cc -o cache/objs/custom_calls.o
g++ -fPIC -I/home/ja/.local/share/mise/installs/erlang/26.2.5/erts-14.2.5/include -Icache/xla_extension/include -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -std=c++17 -w -DLLVM_VERSION_STRING= -O3 -c c_src/exla/exla_client.cc -o cache/objs/exla_client.o
g++ -fPIC -I/home/ja/.local/share/mise/installs/erlang/26.2.5/erts-14.2.5/include -Icache/xla_extension/include -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -std=c++17 -w -DLLVM_VERSION_STRING= -O3 -c c_src/exla/exla_cuda.cc -o cache/objs/exla_cuda.o
g++ -fPIC -I/home/ja/.local/share/mise/installs/erlang/26.2.5/erts-14.2.5/include -Icache/xla_extension/include -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -std=c++17 -w -DLLVM_VERSION_STRING= -O3 -c c_src/exla/exla_nif_util.cc -o cache/objs/exla_nif_util.o
g++ cache/objs/exla.o cache/objs/exla_mlir.o cache/objs/custom_calls.o cache/objs/exla_client.o cache/objs/exla_nif_util.o cache/objs/exla_cuda.o -o cache/libexla.so -Lcache/xla_extension/lib -lxla_extension -shared -Wl,-rpath,'$ORIGIN/xla_extension/lib'
Compiling 23 files (.ex)

@jonatanklosko
Member

As a sanity check, try without XLA_ARCHIVE_URL, which by default should download just the CPU-enabled binary. This way we will know if it is specific to the ROCm binary. Make sure to reinstall without cache.
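The sanity check described here could look roughly like the sketch below: the same dependencies, no `XLA_ARCHIVE_URL`, and `force: true` so that `Mix.install` ignores its install cache. The trimmed dependency list is an assumption made for brevity.

```elixir
Mix.install(
  [
    {:nx, github: "elixir-nx/nx", sparse: "nx", override: true},
    {:exla, github: "elixir-nx/nx", sparse: "exla", override: true}
  ],
  # No XLA_ARCHIVE_URL here, so the default precompiled archive is used.
  # force: true makes Mix.install start from an empty install cache.
  force: true,
  config: [nx: [default_backend: {EXLA.Backend, client: :host}]]
)
```

Note that the log above also shows archives cached under `~/.cache/xla`; whether that directory needs clearing as well is not confirmed in this thread.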

@jalberto
Author

jalberto commented Jun 7, 2024

Yes, that worked as expected, no issues.

As a side note: I have the same issues building with the new Dockerfile.

@jonatanklosko
Member

I see. I have no idea where this LLVM error is coming from; I couldn't find x86-disable-avoid-SFB or X86AvoidStoreForwardingBlocks mentioned explicitly in the openxla/xla source. You can try building it yourself with XLA_BUILD=1 just in case, but that's a long shot (and assumes it builds without issues) :<

@monorkin

monorkin commented Nov 6, 2024

Not sure if this is completely related, but I'm trying to get ROCm 6 working too, and I built XLA with the Dockerized build.sh script, which gave me a tarball.

When I set XLA_ARCHIVE_URL to https://s3.fr-par.scw.cloud/assets.stanko.io/hex/xla/rocm/0.8.0/xla_extension-x86_64-linux-gnu-rocm.tar.gz (which is the tarball I built), Mix.install passes, but when something calls EXLA I get an "unable to find ld.lld in PATH: No such file or directory" error. I tried adding both the ROCm and LLVM bin directories to my PATH (export PATH="/opt/rocm/llvm/bin:/opt/rocm/bin:$PATH"), but I still get the same error, even though I can invoke ld.lld from my shell and I run livebook server from that same shell.

image

Just wanted to ask whether this is a known problem, or whether someone has pointers for debugging it.
Is there another way to use the built tarball besides uploading it somewhere and setting XLA_ARCHIVE_URL?

And are there plans to provide pre-built ROCm packages like for CUDA?
I know, AMD GFX isn't popular in data centers, but from my experience it's fairly common on desktops and development machines.

@jonatanklosko
Member

@monorkin it may not be related, but the only thing I can think of is to also set export ROCM_PATH="/opt/rocm-6.0" (or whatever the version is).
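Because the Livebook runtime may not see the same environment as the interactive shell, it can help to inspect the values from inside a notebook cell. This is only a diagnostic sketch; the assumption that ROCm keeps its linker under `<ROCM_PATH>/llvm/bin` is inferred from the `/opt/rocm/llvm/bin` path mentioned above, and the `/opt/rocm` fallback is hypothetical.

```elixir
# Run inside the notebook, so the values reflect the runtime's environment, not the shell's.
IO.inspect(System.get_env("ROCM_PATH"), label: "ROCM_PATH")
IO.inspect(System.find_executable("ld.lld"), label: "ld.lld on PATH")

# Assumption: the linker is bundled under <ROCM_PATH>/llvm/bin; "/opt/rocm" is a hypothetical fallback.
rocm_path = System.get_env("ROCM_PATH", "/opt/rocm")
lld_path = Path.join([rocm_path, "llvm", "bin", "ld.lld"])
IO.inspect(File.exists?(lld_path), label: "ld.lld under ROCM_PATH")
```

A `nil` for `ROCM_PATH` here, despite it being exported in the shell, would suggest the variable is not reaching the runtime.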

Is there another way to use the built tarball besides uploading it somewhere and setting XLA_ARCHIVE_URL?

I've just added support for XLA_ARCHIVE_PATH (#99), but that's going to be applicable only from the next release, so for now you need to use XLA_ARCHIVE_URL.

And are there plans to provide pre-built ROCm packages like for CUDA?

Not at the moment. The ROCm support is somewhat experimental, in the sense that we don't have the capacity to test it on every release and maintain possibly multiple precompiled builds. JAX (the Python library using XLA) also considers it experimental. This may change in the future, depending on how ROCm's prominence evolves upstream.

@monorkin

monorkin commented Nov 7, 2024

@jonatanklosko that did the trick! Thank you!

Now I have a different problem: after creating a serving, the runtime crashes.

image

I added LIVEBOOK_DEBUG=true before running the server, but the log just stops before the crash.

image

Is there a way to increase the verbosity? Or another way to check why the runtime crashed?

UPDATE:
It seems I've run into an OOM issue with my graphics card, similar to this issue
