forked from microsoft/onnxruntime
-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change --disable_nccl to --enable_nccl. #10
Merged
xinyazhang
merged 1 commit into
rocm5.5_internal_testing
from
xinyazhang/rocm5.5_internal_testing-argparse_enable_nccl
Jun 9, 2023
Merged
Change --disable_nccl to --enable_nccl. #10
xinyazhang
merged 1 commit into
rocm5.5_internal_testing
from
xinyazhang/rocm5.5_internal_testing-argparse_enable_nccl
Jun 9, 2023
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
The Upstream disabled NCCL by default and this is to match upstream's behavior in case we failed to enable NCCL in CI. The change of the default build option is the potential causation of SWDEV-396251 and SWDEV-396644.
amathews-amd
approved these changes
Jun 8, 2023
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
jeffdaily
approved these changes
Jun 8, 2023
groenenboomj
pushed a commit
that referenced
this pull request
Sep 26, 2023
### Description Release OrtEnv before main function returns. Before this change, OrtEnv is deleted when C/C++ runtime destructs all global variables in ONNX Runtime's core framework. The callstack is like this: ``` * frame #0: 0x00007fffee39f5a6 libonnxruntime.so.1.16.0`onnxruntime::Environment::~Environment(this=0x00007fffee39fbf2) at environment.h:20:7 frame #1: 0x00007fffee39f614 libonnxruntime.so.1.16.0`std::default_delete<onnxruntime::Environment>::operator()(this=0x00007ffff4c30e50, __ptr=0x0000000005404b00) const at unique_ptr.h:85:2 frame #2: 0x00007fffee39edca libonnxruntime.so.1.16.0`std::unique_ptr<onnxruntime::Environment, std::default_delete<onnxruntime::Environment>>::~unique_ptr(this=0x5404b00) at unique_ptr.h:361:17 frame #3: 0x00007fffee39e2ab libonnxruntime.so.1.16.0`OrtEnv::~OrtEnv(this=0x00007ffff4c30e50) at ort_env.cc:43:1 frame #4: 0x00007fffee39fa96 libonnxruntime.so.1.16.0`std::default_delete<OrtEnv>::operator()(this=0x00007fffefff8f78, __ptr=0x00007ffff4c30e50) const at unique_ptr.h:85:2 frame #5: 0x00007fffee39f394 libonnxruntime.so.1.16.0`std::unique_ptr<OrtEnv, std::default_delete<OrtEnv>>::~unique_ptr(this=0x7ffff4c30e50) at unique_ptr.h:361:17 frame #6: 0x00007ffff78574b5 libc.so.6`__run_exit_handlers + 261 frame #7: 0x00007ffff7857630 libc.so.6`exit + 32 frame #8: 0x00007ffff783feb7 libc.so.6`__libc_start_call_main + 135 frame #9: 0x00007ffff783ff60 libc.so.6`__libc_start_main@@GLIBC_2.34 + 128 frame #10: 0x0000000000abbdee node`_start + 46 ``` After this change, OrtEnv will be deleted before the main function returns and nodejs is still alive.
jagadish-amd
pushed a commit
to jagadish-amd/onnxruntime
that referenced
this pull request
Sep 10, 2024
… transient connection exceptions. (microsoft#21612) ### Description Improve docker commands to make docker image layer caching works. It can make docker building faster and more stable. So far, A100 pool's system disk is too small to use docker cache. We won't use pipeline cache for docker image and remove some legacy code. ### Motivation and Context There are often an exception of ``` 64.58 + curl https://nodejs.org/dist/v18.17.1/node-v18.17.1-linux-x64.tar.gz -sSL --retry 5 --retry-delay 30 --create-dirs -o /tmp/src/node-v18.17.1-linux-x64.tar.gz --fail 286.4 curl: (92) HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2) ``` Because Onnxruntime pipeline have been sending too many requests to download Nodejs in docker building. Which is the major reason of pipeline failing now In fact, docker image layer caching never works. We can always see the scrips are still running ``` ROCm#9 [3/5] RUN cd /tmp/scripts && /tmp/scripts/install_centos.sh && /tmp/scripts/install_deps.sh && rm -rf /tmp/scripts ROCm#9 0.234 /bin/sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8) ROCm#9 0.235 /bin/sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8) ROCm#9 0.235 /tmp/scripts/install_centos.sh: line 1: !/bin/bash: No such file or directory ROCm#9 0.235 ++ '[' '!' -f /etc/yum.repos.d/microsoft-prod.repo ']' ROCm#9 0.236 +++ tr -dc 0-9. ROCm#9 0.236 +++ cut -d . -f1 ROCm#9 0.238 ++ os_major_version=8 .... ROCm#9 60.41 + curl https://nodejs.org/dist/v18.17.1/node-v18.17.1-linux-x64.tar.gz -sSL --retry 5 --retry-delay 30 --create-dirs -o /tmp/src/node-v18.17.1-linux-x64.tar.gz --fail ROCm#9 60.59 + return 0 ... ``` This PR is improving the docker command to make image layer caching work. Thus, CI won't send so many redundant request of downloading NodeJS. ``` ROCm#9 [2/5] ADD scripts /tmp/scripts ROCm#9 CACHED ROCm#10 [3/5] RUN cd /tmp/scripts && /tmp/scripts/install_centos.sh && /tmp/scripts/install_deps.sh && rm -rf /tmp/scripts ROCm#10 CACHED ROCm#11 [4/5] RUN adduser --uid 1000 onnxruntimedev ROCm#11 CACHED ROCm#12 [5/5] WORKDIR /home/onnxruntimedev ROCm#12 CACHED ``` ###Reference https://docs.docker.com/build/drivers/ --------- Co-authored-by: Yi Zhang <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description
Change --disable_nccl to --enable_nccl, and disable NCCL build by default
Motivation and Context
The Upstream disabled NCCL by default and this is to match upstream's
behavior in case we failed to enable NCCL in CI. See https://github.com/ROCmSoftwarePlatform/DeepLearningModels/pull/838#issuecomment-1583204610 for more details.
The change of the default build option is the potential causation of SWDEV-396251 and SWDEV-396644.
CAVEAT: this may break quite a few CIs that use if they are using --disable_nccl OR expecting NCCL being built by default