
Develop upstream sync 241210 #2783

Merged: 1,426 commits from develop-upstream-sync-241210 into develop-upstream on Jan 7, 2025
Conversation

cj401-amd
Copy link

Weekly sync with upstream as of 12.10.

Aliia Khasanova and others added 30 commits December 5, 2024 00:47
Adds a new boolean `xla_dump_hlo_unoptimized_snapshots` to the `DebugOptions` protobuf. When enabled, we'll dump an `HloUnoptimizedSnapshot` for each execution of an HLO module. This option only affects GPU targets for now.

PiperOrigin-RevId: 703009410
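
For context, a minimal sketch of how this new boolean could be enabled from client code, assuming the generated protobuf setter follows the usual `set_<field>()` naming and that `xla_dump_to` is also set so the snapshots have somewhere to go; this is illustrative, not code from this commit:

```cpp
#include "xla/debug_options_flags.h"  // GetDebugOptionsFromFlags()
#include "xla/xla.pb.h"               // xla::DebugOptions

// Illustrative only: enable dumping of unoptimized HLO snapshots.
// The setter name is assumed from the protobuf field added by this commit.
xla::DebugOptions MakeDebugOptionsWithUnoptimizedSnapshots() {
  xla::DebugOptions opts = xla::GetDebugOptionsFromFlags();
  opts.set_xla_dump_to("/tmp/xla_dump");               // dump directory
  opts.set_xla_dump_hlo_unoptimized_snapshots(true);   // new flag
  return opts;
}
```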
PiperOrigin-RevId: 703011765
PiperOrigin-RevId: 703016890
PiperOrigin-RevId: 703025710
PiperOrigin-RevId: 703028747
PiperOrigin-RevId: 703031030
…en HostOffloadLegalize moves copies out of host-memory-only offloading.

PiperOrigin-RevId: 703033475
PiperOrigin-RevId: 703033719
… log

Imported from GitHub PR openxla/xla#19913

The error started occurring for exp with commit openxla/xla@6e9eefe (originally introduced in openxla/xla@9b19353#diff-61ab646c9c3b8b0fc5ed1e9a62f535e9df5843adddd071250343f3bec48eacb6) and for log with openxla/xla@53d5338.

Trying to compile the following HLO module:
```
HloModule module

ENTRY main {
  p0 = bf16[4] parameter(0)
  ROOT exp = bf16[4] exp(p0)
}
```
would result in:
```
UNKNOWN: <unknown>:0: error: loc(callsite("wrapped_exponential" at "wrapped_exponential")): failed to legalize operation 'math.exp'
<unknown>:0: note: loc("wrapped_exponential"): called from
<unknown>:0: note: loc(callsite("wrapped_exponential" at "wrapped_exponential")): see current operation: %7 = "math.exp"(%6) <{fastmath = #arith.fastmath<afn>}> : (bf16) -> bf16
```

Copybara import of the project:

--
616c10b5308cb827c593a89455fea4b772d6e870 by Milica Makevic <[email protected]>:

Do not use fast approximation for exp and log for ROCm

--
3fa4914f90458a0285deb8801c5689421f945fe4 by Milica Makevic <[email protected]>:

Add unit test for log and exp lowering on ROCm

Merging this change closes tensorflow#19913

PiperOrigin-RevId: 703035402
…utions.

This extends the custom algorithm to cover 2D cases. Benchmarks show roughly 50x better performance than the generic algorithm; detailed results:

```
name                                      old cpu/op   new cpu/op   delta
BM_Conv2DStrided/process_time             35.2ms ± 9%  34.3ms ± 6%     ~     (p=0.690 n=5+5)
BM_Conv2DTransposedStrided/process_time    8.25s ± 8%   0.03s ± 3%  -99.62%  (p=0.008 n=5+5)

name                                      old time/op  new time/op  delta
BM_Conv2DStrided/process_time             3.06ms ±19%  2.88ms ± 6%     ~     (p=0.421 n=5+5)
BM_Conv2DTransposedStrided/process_time    415ms ±12%     9ms ± 4%  -97.93%  (p=0.008 n=5+5)
```

Planned improvements of this algorithm:
- support feature_group_size > 1 (grouped convolution),
- parallel packing of the patches (second algorithm step),
- support the case with multiple input channels and output channels at the same time,
- explore input kernel rotation possibilities & perf impact.

PiperOrigin-RevId: 703036601
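
For reference, this is the operation the transposed-convolution benchmark above exercises: each input element scatters a scaled copy of the kernel into the output at stride offsets. A naive single-channel, no-padding sketch (not the XLA implementation, just a baseline definition of the computation the custom algorithm speeds up):

```cpp
#include <vector>

// Naive strided 2D transposed convolution, single input/output channel,
// no padding. Output size: ((H - 1) * stride + KH) x ((W - 1) * stride + KW).
std::vector<std::vector<float>> TransposedConv2D(
    const std::vector<std::vector<float>>& input,
    const std::vector<std::vector<float>>& kernel, int stride) {
  const int h = input.size(), w = input[0].size();
  const int kh = kernel.size(), kw = kernel[0].size();
  const int oh = (h - 1) * stride + kh;
  const int ow = (w - 1) * stride + kw;
  std::vector<std::vector<float>> out(oh, std::vector<float>(ow, 0.0f));
  for (int i = 0; i < h; ++i)        // each input element...
    for (int j = 0; j < w; ++j)
      for (int a = 0; a < kh; ++a)   // ...scatters the kernel into the output
        for (int b = 0; b < kw; ++b)
          out[i * stride + a][j * stride + b] += input[i][j] * kernel[a][b];
  return out;
}
```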
Updates LLVM usage to match
[71ac1eb50955](llvm/llvm-project@71ac1eb50955)

PiperOrigin-RevId: 703048823
PiperOrigin-RevId: 703050199
…n in GEMM Rewriter

Imported from GitHub PR openxla/xla#20153

Removes collectives from the set of ops that can be exchanged with dequantization in the GEMM rewriter.

Copybara import of the project:

--
e2efa84143fe30c5c6b25132831a62707c2a8f75 by Philipp Hack <[email protected]>:

Removes collectives from the set of ops exchanged with dequantization in the GEMM rewriter.

Merging this change closes tensorflow#20153

PiperOrigin-RevId: 703051850
PiperOrigin-RevId: 703052110
…peration with oneDNN primitives

Imported from GitHub PR openxla/xla#18616

This PR refactors the code that fuses the add operation into matmul / convolution primitives. It removes the use of macros and the separate templatized handlers for the matmul and convolution cases.

Copybara import of the project:

--
68bcdf81a47fb0f753d837c034931094c5cd8017 by Akhil Goel <[email protected]>:

Refactor Add Handler

--
462890bb75f2fcea3fdc5966bfa7a2b8f94b255a by Akhil Goel <[email protected]>:

Address review comments

Merging this change closes tensorflow#18616

PiperOrigin-RevId: 703054496
PiperOrigin-RevId: 703063087
Add the dtypes for which a CUB kernel is unavailable to the log output.

PiperOrigin-RevId: 703067645
PiperOrigin-RevId: 703074301
`xla::Compiler` maintains a registry of `Compiler` instances that are registered statically.

`StreamExecutorGpuClient` used to obtain its `Compiler` instance during static initialization, which could fail if the `Compiler` instance gets registered later.

As a fix, we now obtain the needed `Compiler` instance on every compilation call.

PiperOrigin-RevId: 703087406
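
A minimal sketch of the described fix, assuming the registry-style `xla::Compiler::GetForPlatform` lookup; the surrounding names are illustrative and simplified compared to the real `StreamExecutorGpuClient` code:

```cpp
#include "absl/status/statusor.h"
#include "xla/service/compiler.h"
#include "xla/stream_executor/platform.h"

// Resolve the Compiler from the static registry at compile time instead of
// caching it during static initialization, so a Compiler that is registered
// later is still found. Simplified stand-in, not the actual client code.
absl::StatusOr<xla::Compiler*> ResolveCompilerPerCall(
    const stream_executor::Platform* platform) {
  return xla::Compiler::GetForPlatform(platform);
}
```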
Updates LLVM usage to match
[dd7a3d4d798e](llvm/llvm-project@dd7a3d4d798e)

PiperOrigin-RevId: 703100529
PiperOrigin-RevId: 703109195
@i-chaochen

Is this weekly sync okay now?


@i-chaochen left a comment


Have you followed the instructions at https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/rocm_docs/SYNC_UPSTREAM.md to do this check?

○ When all merge conflicts are resolved, do a `grep -rn "<<<<<<"` to make sure no diff symbols exist in the source.

@i-chaochen

It seems there are still 9 failed tests on gpu-pycpp; could you reproduce them?

```
14:47:24  //tensorflow/compiler/tests:matrix_diag_ops_test_gpu                     FAILED in 374.3s
14:47:24    /root/.cache/bazel/_bazel_root/fbac33eb30dbfb6b11b15a7ff5ac830d/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/compiler/tests/matrix_diag_ops_test_gpu/test.log
14:47:24  //tensorflow/examples/adding_an_op:cuda_op_test                          FAILED in 5.5s
14:47:24    /root/.cache/bazel/_bazel_root/fbac33eb30dbfb6b11b15a7ff5ac830d/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/examples/adding_an_op/cuda_op_test/test.log
14:47:24  //tensorflow/examples/custom_ops_doc/multiplex_2:multiplex_2_test_gpu    FAILED in 5.9s
14:47:24    /root/.cache/bazel/_bazel_root/fbac33eb30dbfb6b11b15a7ff5ac830d/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/examples/custom_ops_doc/multiplex_2/multiplex_2_test_gpu/test.log
14:47:24  //tensorflow/python/distribute/integration_test:saved_model_test_gpu     FAILED in 6.1s
14:47:24    /root/.cache/bazel/_bazel_root/fbac33eb30dbfb6b11b15a7ff5ac830d/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/python/distribute/integration_test/saved_model_test_gpu/test.log
14:47:24  //tensorflow/python/ops/numpy_ops:np_interop_test_gpu                    FAILED in 5.9s
14:47:24    /root/.cache/bazel/_bazel_root/fbac33eb30dbfb6b11b15a7ff5ac830d/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/python/ops/numpy_ops/np_interop_test_gpu/test.log
14:47:24  //tensorflow/python/ops/numpy_ops/integration_test:np_config_test_gpu    FAILED in 6.1s
14:47:24    /root/.cache/bazel/_bazel_root/fbac33eb30dbfb6b11b15a7ff5ac830d/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/python/ops/numpy_ops/integration_test/np_config_test_gpu/test.log
14:47:24  //tensorflow/python/ops/numpy_ops/tests:np_einsum_test                   FAILED in 5.3s
14:47:24    /root/.cache/bazel/_bazel_root/fbac33eb30dbfb6b11b15a7ff5ac830d/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/python/ops/numpy_ops/tests/np_einsum_test/test.log
14:47:24  //tensorflow/python/profiler:profiler_client_test_gpu                    FAILED in 3.0s
14:47:24    /root/.cache/bazel/_bazel_root/fbac33eb30dbfb6b11b15a7ff5ac830d/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/python/profiler/profiler_client_test_gpu/test.log
14:47:24  //tensorflow/tools/pip_package:import_api_packages_test                  FAILED in 5.2s
14:47:24    /root/.cache/bazel/_bazel_root/fbac33eb30dbfb6b11b15a7ff5ac830d/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/tools/pip_package/import_api_packages_test/test.log
14:47:24
14:47:24  Executed 739 out of 739 tests: 730 tests pass and 9 fail locally.
```

@cj401-amd

```
INFO: Analyzed target //tensorflow/compiler/tests:matrix_diag_ops_test_gpu (658 packages loaded, 49514 targets configured).
INFO: Found 1 test target...
Target //tensorflow/compiler/tests:matrix_diag_ops_test_gpu up-to-date:
  bazel-bin/tensorflow/compiler/tests/matrix_diag_ops_test_gpu
INFO: Elapsed time: 4621.159s, Critical Path: 3560.52s
INFO: 20345 processes: 434 internal, 19911 local.
INFO: Build completed successfully, 20345 total actions
//tensorflow/compiler/tests:matrix_diag_ops_test_gpu                     PASSED in 628.6s
```


@i-chaochen left a comment


This SKIPs the subtests rather than FIXing them. Could you create an issue to track these skipped ones?

@pemeliya

Retest cpu-pycpp please

@pemeliya

Rerun cpu-pycpp please

@pemeliya

Rebuild cpu-pycpp please

mmakevic-amd force-pushed the develop-upstream-sync-241210 branch from 2277278 to b1e817d on December 30, 2024 04:24

cj401-amd commented Jan 3, 2025

```
//tensorflow/compiler/mlir/quantization/tensorflow/python:quantize_model_test FAILED in 49 out of 50 in 302.4s
//tensorflow/python/distribute/failure_handling:gce_failure_handler_test FAILED in 25 out of 32 in 301.5s
//tensorflow/python/data/experimental/kernel_tests/service:data_service_ops_test FAILED in 26 out of 32 in 73.0s
//tensorflow/python/data/kernel_tests:sample_from_datasets_test FAILED in 13 out of 24 in 99.3s
//tensorflow/python/data/kernel_tests:interleave_test FAILED in 9 out of 24 in 222.2s
//tensorflow/python/kernel_tests/nn_ops:embedding_ops_test_cpu FAILED in 13 out of 20 in 301.0s
//tensorflow/python/kernel_tests/math_ops:matmul_op_test_cpu FAILED in 19 out of 20 in 301.9s
//tensorflow/python/kernel_tests/math_ops:batch_matmul_op_test_cpu FAILED in 14 out of 20 in 301.0s
//tensorflow/python/kernel_tests/linalg:self_adjoint_eig_op_test_cpu FAILED in 14 out of 20 in 300.9s
//tensorflow/python/kernel_tests/linalg:qr_op_test_cpu FAILED in 20 out of 20 in 302.0s
//tensorflow/python/kernel_tests/linalg:normalize_op_test_cpu FAILED in 12 out of 20 in 301.3s
//tensorflow/python/kernel_tests/linalg:linalg_grad_test_cpu FAILED in 20 out of 20 in 301.4s
...
```

locally:

```
bazel --bazelrc=tensorflow/tools/tf_sig_build_dockerfiles/devel.usertools/cpu.bazelrc test \
 --config=sigbuild_local_cache \
 --verbose_failures \
 --disk_cache=/tf/cache \
 --config=pycpp \
 --config=rocm \
 --action_env=TF_PYTHON_VERSION=3.10 \
 --test_env=HIP_VISIBLE_DEVICES="" \
 --test_timeout=600 \
 --test_tag_filters=-no_cuda_on_cpu_tap,-no-gpu,-optimize.mlir.test,-requires-gpu-nvidia,-tpu,-v1only,-oss_serial,-no_windows,-no_oss \
 --local_test_jobs=256
```

```
//tensorflow/tools/graph_transforms:transform_graph_test        (cached) PASSED in 12.8s
//tensorflow/tools/graph_transforms:transform_utils_test        (cached) PASSED in 12.9s
//tensorflow/tools/graph_transforms:transforms_test             (cached) PASSED in 16.1s
//tensorflow/tools/pip_package:import_api_packages_test         (cached) PASSED in 5.9s
//tensorflow/tools/pip_package:prebuilt_wheel_import_api_packages_test (cached) PASSED in 0.2s
//tensorflow/tools/proto_splitter:util_test                     (cached) PASSED in 12.3s
//tensorflow/tools/proto_splitter/python:test_util_test         (cached) PASSED in 10.3s
//tensorflow/tools/proto_text:gen_proto_text_functions_lib_test (cached) PASSED in 2.6s
//tensorflow/tools/tensorflow_builder/compat_checker:compat_checker_test (cached) PASSED in 0.4s

Executed 0 out of 2126 tests: 2126 tests pass.
There were tests whose specified size is too big. Use the --test_verbose_timeout_warnings command line option to see which ones these are.
```

@i-chaochen

Hi @cj401-ai, please create a task to record these skipped unit tests.

02686ea
256194f

cj401-amd merged commit 03daf31 into develop-upstream on Jan 7, 2025
5 checks passed