
[MIGraphX EP] Set External Data Path #21598

Merged

Conversation

@TedThemistokleous (Contributor) commented Aug 2, 2024

Description

Changes to add setting of the external data path for model weight files. Additional fixes ensure this compiles against the latest ONNX Runtime v1.19.

Motivation and Context

Supporting separate weight files for larger models (like Stable Diffusion) is the motivation for this change set.
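
For context, a minimal sketch of what the new option enables at the MIGraphX API level, assuming a MIGraphX release that ships onnx_options::set_external_data_path (ROCm >= 6.2, per the follow-up fix below); the model paths are illustrative, not from this PR:

```cpp
#include <migraphx/migraphx.hpp>

int main() {
  // Point MIGraphX at the directory holding the tensors referenced by the
  // model's external_data fields (e.g. separately stored diffusion weights).
  migraphx::onnx_options options;
  options.set_external_data_path("/models/sd_unet");

  // Parse and compile as usual; the external weights are resolved against
  // the path set above rather than the process working directory.
  migraphx::program prog = migraphx::parse_onnx("/models/sd_unet/model.onnx", options);
  prog.compile(migraphx::target("gpu"));
  return 0;
}
```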

@TedThemistokleous (Contributor, Author)

ping @PeixuanZuo @cloudhan @ytaous @snnn

@TedThemistokleous changed the title from "MIGraphX Set External Data Path" to "[MIGraphX EP] Set External Data Path" on Aug 2, 2024
@cloudhan (Contributor) commented Aug 2, 2024

/azp run Big Models,Linux Android Emulator QNN CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline

@cloudhan (Contributor) commented Aug 2, 2024

/azp run Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU TensorRT CI Pipeline,Windows x64 QNN CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed

Azure Pipelines successfully started running 9 pipeline(s).

Azure Pipelines successfully started running 8 pipeline(s).

@TedThemistokleous force-pushed the migx_ep_set_ext_data_path branch from 00e594f to 742140c on August 2, 2024 19:47
@tianleiwu (Contributor)

/azp run Big Models,Linux Android Emulator QNN CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline

@tianleiwu (Contributor)

/azp run Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU TensorRT CI Pipeline,Windows x64 QNN CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed

Azure Pipelines successfully started running 9 pipeline(s).

Azure Pipelines successfully started running 8 pipeline(s).

@tianleiwu merged commit 45b7c41 into microsoft:main on Aug 2, 2024
62 checks passed
@TedThemistokleous deleted the migx_ep_set_ext_data_path branch on August 3, 2024 03:40
@tianleiwu (Contributor)

@TedThemistokleous,

I saw a build error in a CI pipeline:

https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1453986&view=logs&j=7536d2cd-87d4-54fe-4891-bfbbf2741d83&t=66420422-c7d6-5f71-625c-4b7851c9b9ba&l=2591

/onnxruntime_src/onnxruntime/core/providers/migraphx/migraphx_execution_provider.cc:1302:23: error: ‘struct migraphx::api::onnx_options’ has no member named ‘set_external_data_path’

@TedThemistokleous (Contributor, Author)

You need to update to the latest MIGraphX that's been released.

PeixuanZuo added a commit that referenced this pull request Aug 7, 2024
tianleiwu added a commit that referenced this pull request Aug 9, 2024
### Description
* Fix the MIGraphX build error caused by #21598: add a conditional compile around the code block that depends on ROCm >= 6.2 (a sketch of the shape of this guard follows these notes). Note that the pipeline uses ROCm 6.0.

Unblock the orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed, and orttraining-amd-gpu-ci-pipeline pipelines:
* Disable a model test in the Linux GPU training CI pipelines caused by #19470: sometimes the cuDNN frontend throws an exception that the cuDNN graph does not support a Conv node of the keras_lotus_resnet3D model on V100 GPUs. Note that the same test does not throw an exception in other GPU pipelines. The failure might be related to the cuDNN 8.9 and V100 GPUs used in the pipeline (Ampere GPUs and cuDNN 9.x do not have the issue). The actual fix requires fallback logic, which will take time to implement, so we temporarily disable the test in the training pipelines.
* Force-install torch for CUDA 11.8. (The docker image has torch 2.4.0 for CUDA 12.1 to build the torch extension, which is not compatible with CUDA 11.8.) Note that this is a temporary workaround. A more elegant fix is to ensure the right torch version in the docker build step, which might require updating install_python_deps.sh and the corresponding requirements.txt.
* Skip test_gradient_correctness_conv1d since it causes a segmentation fault. The root cause needs more investigation (maybe the cuDNN frontend as well).
* Skip test_aten_attention since it causes an assert failure. The root cause needs more investigation (maybe the torch version).
* Skip orttraining_ortmodule_distributed_tests.py since it fails with an error that the compiler for the torch extension does not support C++17. One possible fix is to set the following compile argument inside the setup.py of the fused_adam extension: extra_compile_args['cxx'] = ['-std=c++17']. However, due to the urgency of unblocking the pipelines, we just disable the test for now.
* Skip test_softmax_bf16_large. For some reason, torch.cuda.is_bf16_supported() returns True on V100 with torch 2.3.1, so the test was run in CI, but V100 does not support bf16 natively.
* Fix a typo of "deterministic".
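A hedged sketch of the shape of the conditional-compile fix mentioned in the first bullet, assuming HIP's version macros (from hip/hip_version.h) stand in for "ROCm >= 6.2"; the helper name and call site are hypothetical, not the actual EP change:

```cpp
#include <hip/hip_version.h>
#include <migraphx/migraphx.hpp>
#include <string>

// Hypothetical helper: only emit the call to the new MIGraphX API when the
// toolchain is new enough to provide it, so the EP still builds on ROCm 6.0.
static void set_external_data_path_if_supported(migraphx::onnx_options& opts,
                                                const std::string& model_dir) {
#if (HIP_VERSION_MAJOR > 6) || (HIP_VERSION_MAJOR == 6 && HIP_VERSION_MINOR >= 2)
  opts.set_external_data_path(model_dir.c_str());  // available from ROCm 6.2
#else
  (void)opts;       // older ROCm: the member does not exist, so compile it out;
  (void)model_dir;  // external weights must then be found by default lookup
#endif
}
```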
prathikr pushed a commit that referenced this pull request Aug 9, 2024
sumitsays pushed a commit that referenced this pull request Aug 9, 2024