Change --disable_nccl to --enable_nccl. #10

xinyazhang
Description

Change --disable_nccl to --enable_nccl, and disable the NCCL build by default.

Motivation and Context

Upstream disables NCCL by default. This change matches upstream's
behavior, in case we fail to enable NCCL in CI. See https://github.com/ROCmSoftwarePlatform/DeepLearningModels/pull/838#issuecomment-1583204610 for more details.

The change in the default build option is the potential cause of SWDEV-396251 and SWDEV-396644.

CAVEAT: this may break quite a few CIs if they pass --disable_nccl or expect NCCL to be built by default.
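
To illustrate the caveat, here is a minimal sketch of an opt-in flag, assuming the build script parses its options with Python's `argparse`. The helper names and the `USE_NCCL` define are hypothetical and are not taken from this PR's diff.

```python
import argparse


def make_parser():
    """Hypothetical parser that only knows the new opt-in flag."""
    parser = argparse.ArgumentParser(description="Illustrative build options")
    # store_true defaults to False, so NCCL is not built unless requested.
    parser.add_argument(
        "--enable_nccl",
        action="store_true",
        help="Build with NCCL support (off by default).",
    )
    return parser


def nccl_cmake_define(argv):
    """Map the parsed flag to a CMake-style define (name is an assumption)."""
    args = make_parser().parse_args(argv)
    return "-DUSE_NCCL=" + ("ON" if args.enable_nccl else "OFF")


if __name__ == "__main__":
    print(nccl_cmake_define([]))                 # no flag: NCCL stays off
    print(nccl_cmake_define(["--enable_nccl"]))  # explicit opt-in
```

With a parser like this, a CI job that still passes `--disable_nccl` exits with an "unrecognized arguments" error, and a job that passes no flag gets a build without NCCL.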

@xinyazhang xinyazhang requested a review from amathews-amd June 8, 2023 20:45
@amathews-amd amathews-amd requested a review from jeffdaily June 8, 2023 21:33
@amathews-amd amathews-amd left a comment

LGTM

@xinyazhang xinyazhang merged commit 5810fad into rocm5.5_internal_testing Jun 9, 2023
groenenboomj pushed a commit that referenced this pull request Sep 26, 2023
### Description
Release OrtEnv before the main function returns. Before this change, OrtEnv
was deleted when the C/C++ runtime destroyed all global variables in ONNX
Runtime's core framework.
The call stack looks like this:
```
  * frame #0: 0x00007fffee39f5a6 libonnxruntime.so.1.16.0`onnxruntime::Environment::~Environment(this=0x00007fffee39fbf2) at environment.h:20:7
    frame #1: 0x00007fffee39f614 libonnxruntime.so.1.16.0`std::default_delete<onnxruntime::Environment>::operator()(this=0x00007ffff4c30e50, __ptr=0x0000000005404b00) const at unique_ptr.h:85:2
    frame #2: 0x00007fffee39edca libonnxruntime.so.1.16.0`std::unique_ptr<onnxruntime::Environment, std::default_delete<onnxruntime::Environment>>::~unique_ptr(this=0x5404b00) at unique_ptr.h:361:17
    frame #3: 0x00007fffee39e2ab libonnxruntime.so.1.16.0`OrtEnv::~OrtEnv(this=0x00007ffff4c30e50) at ort_env.cc:43:1
    frame #4: 0x00007fffee39fa96 libonnxruntime.so.1.16.0`std::default_delete<OrtEnv>::operator()(this=0x00007fffefff8f78, __ptr=0x00007ffff4c30e50) const at unique_ptr.h:85:2
    frame #5: 0x00007fffee39f394 libonnxruntime.so.1.16.0`std::unique_ptr<OrtEnv, std::default_delete<OrtEnv>>::~unique_ptr(this=0x7ffff4c30e50) at unique_ptr.h:361:17
    frame #6: 0x00007ffff78574b5 libc.so.6`__run_exit_handlers + 261
    frame #7: 0x00007ffff7857630 libc.so.6`exit + 32
    frame #8: 0x00007ffff783feb7 libc.so.6`__libc_start_call_main + 135
    frame #9: 0x00007ffff783ff60 libc.so.6`__libc_start_main@@GLIBC_2.34 + 128
    frame #10: 0x0000000000abbdee node`_start + 46
```
After this change, OrtEnv is deleted before the main function
returns, while Node.js is still alive.
jagadish-amd pushed a commit to jagadish-amd/onnxruntime that referenced this pull request Sep 10, 2024
… transient connection exceptions. (microsoft#21612)

### Description
Improve the docker commands so that docker image layer caching works.
This makes docker builds faster and more stable.
So far, the A100 pool's system disk is too small to use the docker cache.
We no longer use the pipeline cache for docker images, and some legacy
code is removed.

### Motivation and Context
Builds often fail with an exception like this:
```
64.58 + curl https://nodejs.org/dist/v18.17.1/node-v18.17.1-linux-x64.tar.gz -sSL --retry 5 --retry-delay 30 --create-dirs -o /tmp/src/node-v18.17.1-linux-x64.tar.gz --fail
286.4 curl: (92) HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)
```
This happens because the ONNX Runtime pipelines have been sending too many
requests to download Node.js during docker builds, which is currently the
major cause of pipeline failures.

In fact, docker image layer caching never worked.
We can always see that the scripts are still running:
```
#9 [3/5] RUN cd /tmp/scripts && /tmp/scripts/install_centos.sh && /tmp/scripts/install_deps.sh && rm -rf /tmp/scripts
#9 0.234 /bin/sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
#9 0.235 /bin/sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
#9 0.235 /tmp/scripts/install_centos.sh: line 1: !/bin/bash: No such file or directory
#9 0.235 ++ '[' '!' -f /etc/yum.repos.d/microsoft-prod.repo ']'
#9 0.236 +++ tr -dc 0-9.
#9 0.236 +++ cut -d . -f1
#9 0.238 ++ os_major_version=8
....
#9 60.41 + curl https://nodejs.org/dist/v18.17.1/node-v18.17.1-linux-x64.tar.gz -sSL --retry 5 --retry-delay 30 --create-dirs -o /tmp/src/node-v18.17.1-linux-x64.tar.gz --fail
#9 60.59 + return 0
...
```

This PR improves the docker command so that image layer caching
works.
As a result, CI no longer sends so many redundant requests to download Node.js.
```
#9 [2/5] ADD scripts /tmp/scripts
#9 CACHED

#10 [3/5] RUN cd /tmp/scripts && /tmp/scripts/install_centos.sh && /tmp/scripts/install_deps.sh && rm -rf /tmp/scripts
#10 CACHED

#11 [4/5] RUN adduser --uid 1000 onnxruntimedev
#11 CACHED

#12 [5/5] WORKDIR /home/onnxruntimedev
#12 CACHED
```
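
The `CACHED` lines above follow from how layer caching chains build steps together. Below is a toy model of that idea, not Docker's actual implementation: each step's cache key is assumed to depend on the parent key, the instruction text, and (for `ADD`) the content of the added files. The function names and the hashing scheme are illustrative assumptions.

```python
import hashlib


def layer_key(parent_key, instruction, input_bytes=b""):
    """Toy cache key: hash of parent key, instruction text, and input content."""
    h = hashlib.sha256()
    h.update(parent_key.encode())
    h.update(instruction.encode())
    h.update(input_bytes)
    return h.hexdigest()


def build(steps, cache):
    """Return 'CACHED' or 'RUN' per step; `cache` is a set of known keys."""
    key = ""
    statuses = []
    for instruction, data in steps:
        # The key chains from the previous step, so a miss propagates downstream.
        key = layer_key(key, instruction, data)
        statuses.append("CACHED" if key in cache else "RUN")
        cache.add(key)
    return statuses


if __name__ == "__main__":
    cache = set()
    steps = [
        ("ADD scripts /tmp/scripts", b"script contents v1"),
        ("RUN /tmp/scripts/install_deps.sh", b""),
        ("WORKDIR /home/onnxruntimedev", b""),
    ]
    print(build(steps, cache))  # first build: nothing is cached yet
    print(build(steps, cache))  # identical rebuild: every step hits the cache
```

In this model, changing the script contents in the `ADD` step changes its key and therefore every later key, so the `RUN` step that downloads Node.js runs again only when its inputs actually change.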

### Reference
https://docs.docker.com/build/drivers/

---------

Co-authored-by: Yi Zhang <[email protected]>