Add XLA flag workaround for CUDA Capability 7.x GPUs
PiperOrigin-RevId: 705559274
jacobjinkelly committed Dec 13, 2024
1 parent 2e56235 commit 781d8d0
Showing 6 changed files with 41 additions and 52 deletions.
5 changes: 4 additions & 1 deletion docker/Dockerfile
@@ -55,7 +55,10 @@ RUN build_data

# To work around a known XLA issue causing the compilation time to greatly
# increase, the following environment variable setting XLA flags must be enabled
-# when running AlphaFold 3:
+# when running AlphaFold 3. Note that if using CUDA Capability 7.x GPUs, it is
+# necessary to set the following XLA_FLAGS value instead:
+# ENV XLA_FLAGS="--xla_disable_hlo_passes=custom-kernel-fusion-rewriter"
+# (no need to disable Triton GEMM in that case, as it is not supported on such GPUs).
ENV XLA_FLAGS="--xla_gpu_enable_triton_gemm=false"
# Memory settings used for folding up to 5,120 tokens on A100 80 GB.
ENV XLA_PYTHON_CLIENT_PREALLOCATE=true
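With this change the image still defaults to the Triton GEMM flag, so users on CUDA Capability 7.x GPUs need to override it at run time or edit the `Dockerfile`. A minimal sketch of a run-time override, assuming the image is tagged `alphafold3` (the tag and `docker run` options here are illustrative, not prescribed by this commit):

```sh
# Override the baked-in XLA_FLAGS when starting the container on a
# CUDA Capability 7.x GPU (e.g. V100); -e replaces the ENV value from the image.
docker run -it --gpus all \
    -e XLA_FLAGS="--xla_disable_hlo_passes=custom-kernel-fusion-rewriter" \
    alphafold3
```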
36 changes: 4 additions & 32 deletions docs/known_issues.md
@@ -1,39 +1,11 @@
# Known Issues

-## Numerical performance for different GPU devices
-
-There are numerical performance issues with some GPU types that are under
-investigation, see
-[this Issue](https://github.com/google-deepmind/alphafold3/issues/59) for
-tracking.
-
-### Verified devices
-
-We have run successful large-scale numerical tests for the following devices and
-maximum number of tokens:
-
-- H100 80 GB: up to 5,120 tokens.
-- A100 80 GB: up to 5,120 tokens.
-- A100 40 GB: up to 4,352 tokens with
-  [unified memory configuration](https://github.com/google-deepmind/alphafold3/blob/main/docs/performance.md#nvidia-a100-40-gb).
-- P100 16 GB: up to 1,024 tokens.
-
-Note that the 80 GB devices can run larger targets using unified memory, but
-outputs have only been verified on particular examples rather than a large-scale
-test set.
-
-#### CUDA Capability 7.x GPUs: known issues
+## Numerical performance for CUDA Capability 7.x GPUs

All CUDA Capability 7.x GPUs (e.g. V100) produce obviously bad output, with lots
-of clashing residues (the clashes cause a ranking score of -99 or lower). With a
-small fix relating to `bfloat16` conversion to `float32` outputs look normal,
-but there are numerical performance regressions for some bucket sizes (tested on
-V100 devices).
-
-#### CUDA Capability 6.x GPUs: no known issues
-
-CUDA Capability 6.x GPUs give reasonable output, but large scale numerical
-testing has only been done for P100.
+of clashing residues (the clashes cause a ranking score of -99 or lower), unless
+the environment variable `XLA_FLAGS` is set to include
+`--xla_disable_hlo_passes=custom-kernel-fusion-rewriter`.

## Incorrect handling of two-letter atoms in SMILES ligands

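For runs outside Docker, the same workaround amounts to exporting the variable before invoking the pipeline. A minimal sketch, assuming a standard `run_alphafold.py` invocation (the paths are placeholders):

```sh
# Workaround for CUDA Capability 7.x GPUs (e.g. V100): set the flag in the
# environment before starting AlphaFold 3.
export XLA_FLAGS="--xla_disable_hlo_passes=custom-kernel-fusion-rewriter"
python run_alphafold.py \
    --json_path=/path/to/fold_input.json \
    --model_dir=/path/to/models \
    --output_dir=/path/to/output
```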
37 changes: 25 additions & 12 deletions docs/performance.md
@@ -98,26 +98,28 @@ AlphaFold 3 can run on inputs of size up to 4,352 tokens on a single NVIDIA A100
While numerically accurate, this configuration will have lower throughput
compared to the set up on the NVIDIA A100 (80 GB), due to less available memory.

+#### NVIDIA V100
+
+There are known numerical issues with CUDA Capability 7.x devices. To work
+around the issue, set the environment variable `XLA_FLAGS` to include
+`--xla_disable_hlo_passes=custom-kernel-fusion-rewriter`.
+
+With the above flag set, AlphaFold 3 can run on inputs of size up to 1,280
+tokens on a single NVIDIA V100 using [unified memory](#unified-memory).
+
#### NVIDIA P100

AlphaFold 3 can run on inputs of size up to 1,024 tokens on a single NVIDIA P100
with no configuration changes needed.

-#### NVIDIA V100
-
-There are known issues with V100 devices. See
-[this Issue](https://github.com/google-deepmind/alphafold3/issues/59) for
-tracking.
-
#### Other devices

-There are known issues with CUDA Capability 7.x devices. See
-[this Issue](https://github.com/google-deepmind/alphafold3/issues/59) for
-tracking.
+Large-scale numerical tests have not been performed on any other devices, but
+they are believed to be numerically accurate.

-CUDA Capability 6.x and 8.x devices other than those listed explicitly here are
-believed to work for AlphaFold 3, but large-scale testing has only been
-performed for the devices mentioned above.
+There are known numerical issues with CUDA Capability 7.x devices. To work
+around the issue, set the environment variable `XLA_FLAGS` to include
+`--xla_disable_hlo_passes=custom-kernel-fusion-rewriter`.

## Compilation Buckets

@@ -166,6 +168,17 @@ in the provided `Dockerfile`).
ENV XLA_FLAGS="--xla_gpu_enable_triton_gemm=false"
```

+### CUDA Capability 7.x GPUs
+
+For all CUDA Capability 7.x GPUs (e.g. V100), the environment variable
+`XLA_FLAGS` must be changed to include
+`--xla_disable_hlo_passes=custom-kernel-fusion-rewriter`. Disabling the Triton
+GEMM kernels is not necessary, as they are not supported on such GPUs.
+
+```sh
+ENV XLA_FLAGS="--xla_disable_hlo_passes=custom-kernel-fusion-rewriter"
+```

### GPU Memory

The following environment variables (set by default in the `Dockerfile`) enable
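Since the new V100 entry relies on unified memory, the 7.x flag will typically be combined with the unified-memory settings described elsewhere in this document. A sketch of the combined `Dockerfile` configuration; the unified-memory values mirror those documented for the A100 (40 GB), and the memory fraction in particular is an assumption to tune per device:

```sh
# Combined sketch for a V100 (16 GB): XLA workaround plus unified memory.
ENV XLA_FLAGS="--xla_disable_hlo_passes=custom-kernel-fusion-rewriter"
ENV XLA_PYTHON_CLIENT_PREALLOCATE=false
ENV TF_FORCE_UNIFIED_MEMORY=true
ENV XLA_CLIENT_MEM_FRACTION=3.2
```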
15 changes: 8 additions & 7 deletions run_alphafold.py
@@ -645,13 +645,14 @@ def main(_):
            ' https://developer.nvidia.com/cuda-gpus).'
        )
      elif 7.0 <= compute_capability < 8.0:
-        raise ValueError(
-            'There are currently known unresolved numerical issues with using'
-            ' devices with GPU compute capability 7.x (see'
-            ' https://developer.nvidia.com/cuda-gpus). Follow'
-            ' https://github.com/google-deepmind/alphafold3/issues/59 for'
-            ' tracking.'
-        )
+        xla_flags = os.environ.get('XLA_FLAGS')
+        required_flag = '--xla_disable_hlo_passes=custom-kernel-fusion-rewriter'
+        if not xla_flags or required_flag not in xla_flags:
+          raise ValueError(
+              'For devices with GPU compute capability 7.x (see'
+              ' https://developer.nvidia.com/cuda-gpus) the environment'
+              f' variable XLA_FLAGS must include "{required_flag}".'
+          )

  notice = textwrap.wrap(
      'Running AlphaFold 3. Please note that standard AlphaFold 3 model'
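The new check reads `XLA_FLAGS` from the process environment, so it can be probed from the shell before launching a long run. A quick sanity check, as a sketch:

```sh
# Confirm the required flag is visible to the Python process before a run;
# prints "ok" when the workaround flag is present in XLA_FLAGS.
python -c 'import os; f = os.environ.get("XLA_FLAGS", ""); print("ok" if "--xla_disable_hlo_passes=custom-kernel-fusion-rewriter" in f else "missing")'
```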
Binary file not shown.
Binary file not shown.
