CMake build support #1

joker-eph · 2022-08-09T16:32:52Z

XLA builds with Bazel at the moment, is it desirable to also have a CMake build?
Who would benefit from this and what workflow will this enable that aren't doable or easy with the current Bazel configuration?

etherzhhb · 2022-08-09T17:18:03Z

May be not a strong reason but: In some teams all of they software packages are built with CMake, except TF/XLA. It will make it easier for the team manage the build process of its software packages, if XLA is able to also provide CMake build support.

stevecapperarm · 2022-08-11T13:50:27Z

Hello,
Bazel has made things tricky for folk in a couple of ways:

Assumption of network access. Distro build daemons typically do not build with network access and some corporate systems have restricted network access, working around this typically involves fetching assets separately then "preloading" them in a cache just before build,
Multiple versions of Bazel needed. Distros have to carry around a few different versions of Bazel, each one to match up with the particular projects (newer versions of Bazel have not always been backwards compatible with older projects).

It would be incredibly helpful for distros (and others I suspect) to have cmake support, to fall back to.

bhack · 2022-09-10T13:30:10Z

Is this still actively maintained?

https://github.com/google/bazel-to-cmake

makslevental · 2023-03-14T23:53:18Z

Is this still actively maintained?

https://github.com/google/bazel-to-cmake

fwiw such a script exists inside iree https://github.com/openxla/iree/tree/main/build_tools/bazel_to_cmake

Imported from GitHub PR #6599 FP8 cublasLt matmul uses fast accumulation when both operands' precision are DEFAULT. Otherwise fall back to high precision acuumulation. Issue##6168 This PR is closely related to Flax PR-![3416](google/flax#3416). Copybara import of the project: -- a4140da by shuw <[email protected]>: Add FP8 fast accumulation support for cublasLt. -- 9684568 by shuw <[email protected]>: Improve based on review #1 -- e906d76 by shuw <[email protected]>: Improve based on review #2 Merging this change closes #6599 FUTURE_COPYBARA_INTEGRATE_REVIEW=#6599 from wenscarl:fp8_fast_accumulation e906d76 PiperOrigin-RevId: 578904075

Imported from GitHub PR #6599 FP8 cublasLt matmul uses fast accumulation when both operands' precision are DEFAULT. Otherwise fall back to high precision acuumulation. Issue##6168 This PR is closely related to Flax PR-![3416](google/flax#3416). Copybara import of the project: -- a4140da by shuw <[email protected]>: Add FP8 fast accumulation support for cublasLt. -- 9684568 by shuw <[email protected]>: Improve based on review #1 -- e906d76 by shuw <[email protected]>: Improve based on review #2 Merging this change closes #6599 COPYBARA_INTEGRATE_REVIEW=#6599 from wenscarl:fp8_fast_accumulation e906d76 PiperOrigin-RevId: 578948593

Imported from GitHub PR #7751 Due to fast accumulation being turned on in the forward mode, the cublasLt fp8 gemm with gelu epilogue can efficiently operate with a fused kernel. Compared against the XLA-generated gelu kernel on H100, the performance demonstrates some improvement for size of [8192, 4096] x [4096, 16384] + gelu: Execution time for matmul using cublasLt and gelu (XLA): 1.28ms Execution time for matmul_gelu using cublasLt: 1.25ms Copybara import of the project: -- e8abce3 by Shu Wang <[email protected]>: Support cublasLt Fp8 Approx Gelu epilogue fusion. -- 818127c by shuw <[email protected]>: Remove F32 check -- 5ce3108 by shuw <[email protected]>: Improve based on review #1 Merging this change closes #7751 COPYBARA_INTEGRATE_REVIEW=#7751 from wenscarl:cublaslt_fp8_gelu 5ce3108 PiperOrigin-RevId: 591236441

SomeoneSerge · 2024-02-14T20:00:16Z

I recall being very happy to see this to be the issue #1 🙃

is it desirable to also have a CMake build?

In short, I'd suggest "yes, from the point of view of (package) distributions", because Bazel doesn't support/maybe even hinders dependency injection (in the broad sense, thinking e.g. of cmake's find_package), which is important for achieving consistent package sets/applying patches consistently. For this to be useful in practice, however, one'd also need downstream projects like tensorflow to use xla in a way that permits substitution

joker-eph · 2024-02-14T21:09:20Z

Another thing is the efficiency of the build: Bazel is building the code twice (or thrice sometimes) because of host/target split model. CMake does not suffer from this restriction, so a CMake build on my workstation could be easily twice as fast.

It is also much easier to setup things like "use my prebuilt LLVM" which helps again having much fast build since you only build the XLA code and not "the world".

(Docker containers can be used for dependency and controlling the environment, instead of relying on Bazel for making everything hermetic, at a high cost)

…execution scope Instead of always constructing while operation conditional in the default scope use the scope of a while operation itself. This generates correct CUDA graph: https://gist.github.com/ezhulenev/a84192fe8b46a4bf1a934a8baa08ea60 Memeset operation launched in a scope #1 is not synchronized with initial condition handle update PiperOrigin-RevId: 609475974

…execution scope Instead of always constructing while operation conditional in the default scope use the scope of a while operation itself. This generates correct CUDA graph: https://gist.github.com/ezhulenev/a84192fe8b46a4bf1a934a8baa08ea60 Memeset operation launched in a scope #1 is not synchronized with initial condition handle update PiperOrigin-RevId: 609742672

joelberkeley · 2024-03-24T20:56:13Z

fwiw i'm finding bazel extremely difficult to use, and i'm not totally inept. A look on the internet suggests it's a common problem. I'm very very much in favour of a simple build tool. I don't know if CMake is that, never used it. I imagine this would be appreciated more widely for an opensource project.

PiperOrigin-RevId: 629795760

PiperOrigin-RevId: 629853006

… to Initialize() Imported from GitHub PR #12228 The first time that a NormThunk is executed, it will build a cudnn execution plan. This build step can hang if a NCCL collective is running at the same time. To fix this, I've moved the build step to take place during thunk initialization. We only observe this hang when using cudnn 9. Here's a backtrace from the hang that will be fixed: ``` Thread 585 (Thread 0x7fb9391ff640 (LWP 41364) "main.py"): #0 0x00007fd3d17cffd9 in ?? () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x00007fd3d17da24f in pthread_rwlock_wrlock () from /lib/x86_64-linux-gnu/libc.so.6 #2 0x00007fd070967dfe in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1 #3 0x00007fd0709c928a in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1 #4 0x00007f1970d76102 in ?? () from /lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0 #5 0x00007f1970f2c999 in ?? () from /lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0 #6 0x00007f1970a7d4ab in ?? () from /lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0 #7 0x00007f1970d0a9cb in ?? () from /lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0 #8 0x00007fce60b2a98c in cudnn::backend::ExecutionPlan::finalize_internal() () from /lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0 #9 0x00007fce60aefbb1 in cudnn::backend::Descriptor::finalize() () from /lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0 #10 0x00007fce60b15bec in cudnnBackendFinalize () from /lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0 #11 0x00007fd2521b8f39 in cudnn_frontend::ExecutionPlanBuilder_v8::build() () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so #12 0x00007fd2521734ba in stream_executor::gpu::(anonymous namespace)::GetExecPlanFromHeuristics(cudnn_frontend::OperationGraph_v8&&, stream_executor::gpu::(anonymous namespace)::CudnnHandle const&, bool) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so #13 0x00007fd25216ff9b in stream_executor::gpu::CudnnSupport::NormRunnerFromDesc(stream_executor::Stream*, stream_executor::dnn::AlgorithmDesc const&, stream_executor::dnn::NormKind, double, stream_executor::dnn::TensorDescriptor const&, stream_executor::dnn::TensorDescriptor const&, stream_executor::dnn::TensorDescriptor const&, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so #14 0x00007fd24e36b88b in stream_executor::dnn::NormOp::RunnerFromAlgorithmDesc(stream_executor::dnn::AlgorithmDesc const&, stream_executor::dnn::NormOp::Config, stream_executor::Stream*) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so #15 0x00007fd24e36ae37 in stream_executor::dnn::LazyOpRunner<stream_executor::dnn::NormOp>::GetOrCreateRunner(stream_executor::dnn::NormOp::Config, stream_executor::Stream*)::{lambda()#1}::operator()() const () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so #16 0x00007fd24e36adbc in void absl::lts_20230802::base_internal::CallOnceImpl<stream_executor::dnn::LazyOpRunner<stream_executor::dnn::NormOp>::GetOrCreateRunner(stream_executor::dnn::NormOp::Config, stream_executor::Stream*)::{lambda()#1}>(std::atomic<unsigned int>*, absl::lts_20230802::base_internal::SchedulingMode, stream_executor::dnn::LazyOpRunner<stream_executor::dnn::NormOp>::GetOrCreateRunner(stream_executor::dnn::NormOp::Config, stream_executor::Stream*)::{lambda()#1}&&) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so #17 0x00007fd24e36a9bd in stream_executor::dnn::LazyOpRunner<stream_executor::dnn::NormOp>::GetOrCreateRunner(stream_executor::dnn::NormOp::Config, stream_executor::Stream*) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so #18 0x00007fd24e369d29 in xla::gpu::RunGpuNorm(xla::gpu::GpuNormConfig const&, stream_executor::DeviceMemoryBase const&, stream_executor::DeviceMemoryBase const&, stream_executor::DeviceMemoryBase const&, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, stream_executor::DeviceMemoryBase const&, stream_executor::Stream*, xla::gpu::RunNormOptions) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so #19 0x00007fd24e368be6 in xla::gpu::NormThunk::ExecuteOnStream(xla::gpu::Thunk::ExecuteParams const&) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so ``` Copybara import of the project: -- f535330 by Trevor Morris <[email protected]>: Fix hang with cudnn layer norm by moving cudnn init to Initialize() Merging this change closes #12228 COPYBARA_INTEGRATE_REVIEW=#12228 from trevor-m:tmorris-norm-init f535330 PiperOrigin-RevId: 633220207

name old cpu/op new cpu/op delta BM_SelectAndScatterF32/128/process_time 889µs ± 1% 740µs ± 3% -16.70% BM_SelectAndScatterF32/256/process_time 3.64ms ± 2% 3.00ms ± 1% -17.64% BM_SelectAndScatterF32/512/process_time 15.3ms ± 1% 13.1ms ± 3% -14.61% PiperOrigin-RevId: 657693426

name old cpu/op new cpu/op delta BM_SelectAndScatterF32/128/process_time 889µs ± 1% 740µs ± 3% -16.70% BM_SelectAndScatterF32/256/process_time 3.64ms ± 2% 3.00ms ± 1% -17.64% BM_SelectAndScatterF32/512/process_time 15.3ms ± 1% 13.1ms ± 3% -14.61% Reverts 9bb1871 PiperOrigin-RevId: 657693426

name old cpu/op new cpu/op delta BM_SelectAndScatterF32/128/process_time 889µs ± 1% 740µs ± 3% -16.70% BM_SelectAndScatterF32/256/process_time 3.64ms ± 2% 3.00ms ± 1% -17.64% BM_SelectAndScatterF32/512/process_time 15.3ms ± 1% 13.1ms ± 3% -14.61% PiperOrigin-RevId: 657693426

name old cpu/op new cpu/op delta BM_SelectAndScatterF32/128/process_time 889µs ± 1% 740µs ± 3% -16.70% BM_SelectAndScatterF32/256/process_time 3.64ms ± 2% 3.00ms ± 1% -17.64% BM_SelectAndScatterF32/512/process_time 15.3ms ± 1% 13.1ms ± 3% -14.61% PiperOrigin-RevId: 658063846

joker-eph added the enhancement New feature or request label Aug 9, 2022

SomeoneSerge mentioned this issue Apr 2, 2023

Package request: python3Packages.openxla NixOS/nixpkgs#224354

Open

wenscarl added a commit to wenscarl/xla that referenced this issue Oct 30, 2023

Improve based on review openxla#1

9684568

copybara-service bot mentioned this issue Nov 2, 2023

PR #6599: Fp8 Fast Accumulation support for cublasLt #6730

Closed

wenscarl added a commit to wenscarl/xla that referenced this issue Nov 15, 2023

Improve based on review openxla#1

4fd004f

wenscarl added a commit to wenscarl/xla that referenced this issue Nov 15, 2023

Improve based on review openxla#1

a4a2a8e

wenscarl added a commit to wenscarl/xla that referenced this issue Dec 14, 2023

Improve based on review openxla#1

5ce3108

copybara-service bot mentioned this issue Feb 23, 2024

[stream_executor] Construct while operation conditional in the right execution scope #9895

Merged

MuYu-zhi mentioned this issue Mar 15, 2024

build failed with undefined reference problem #10592

Open

wenscarl added a commit to wenscarl/xla that referenced this issue Apr 1, 2024

Improve based on review openxla#1.

30ad869

wenscarl added a commit to wenscarl/xla that referenced this issue Apr 1, 2024

Improve based on review openxla#1.

2284394

wenscarl added a commit to wenscarl/xla that referenced this issue Apr 1, 2024

Improve based on review openxla#1.

41c5f64

wenscarl added a commit to wenscarl/xla that referenced this issue Apr 1, 2024

Improve based on review openxla#1.

6d389dc

copybara-service bot pushed a commit that referenced this issue May 1, 2024

[xla:cpu] NFC: Remove deprecated XLA:CPU mlir based codegen part #1

0fae1d8

PiperOrigin-RevId: 629795760

copybara-service bot mentioned this issue May 1, 2024

[xla:cpu] NFC: Remove deprecated XLA:CPU mlir based codegen part #1 #12029

Merged

copybara-service bot pushed a commit that referenced this issue May 1, 2024

[xla:cpu] NFC: Remove deprecated XLA:CPU mlir based codegen part #1

c3366f8

PiperOrigin-RevId: 629853006

copybara-service bot mentioned this issue Jul 31, 2024

[xla:cpu] Optimize ThunkExecutor::Execute part #1 #15546

Closed

johnnynunez mentioned this issue Sep 13, 2024

add jetson packages #16905

Closed

akuegel mentioned this issue Oct 17, 2024

[NVIDIA] Complete the optimization of deterministic scatter operations #18326

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CMake build support #1

CMake build support #1

joker-eph commented Aug 9, 2022

etherzhhb commented Aug 9, 2022 •

edited

Loading

stevecapperarm commented Aug 11, 2022

bhack commented Sep 10, 2022

makslevental commented Mar 14, 2023

SomeoneSerge commented Feb 14, 2024

joker-eph commented Feb 14, 2024 •

edited

Loading

joelberkeley commented Mar 24, 2024

CMake build support #1

CMake build support #1

Comments

joker-eph commented Aug 9, 2022

etherzhhb commented Aug 9, 2022 • edited Loading

stevecapperarm commented Aug 11, 2022

bhack commented Sep 10, 2022

makslevental commented Mar 14, 2023

SomeoneSerge commented Feb 14, 2024

joker-eph commented Feb 14, 2024 • edited Loading

joelberkeley commented Mar 24, 2024

etherzhhb commented Aug 9, 2022 •

edited

Loading

joker-eph commented Feb 14, 2024 •

edited

Loading