
DEADLINE_EXCEEDED when running train.py on GPU node #20

Open
zigzagcai opened this issue Nov 24, 2023 · 0 comments
zigzagcai commented Nov 24, 2023

Hi Ayaka,

Very excited to see llama2 built with JAX!
I am trying to run llama2-7b-hf with the provided framework on an A800 cluster. However, when I follow the README and run `python train.py`, it reports `jaxlib.xla_extension.XlaRuntimeError: DEADLINE_EXCEEDED`. Could you please provide some insight into how to fix this? Thanks!

Error message:

Traceback (most recent call last):
  File "/root/llama-2-jax/train.py", line 146, in <module>
    main()
  File "/root/llama-2-jax/train.py", line 89, in main
    jax.distributed.initialize(coordinator_address="localhost", num_processes=8, process_id=0)
  File "/root/miniconda3/envs/jax/lib/python3.11/site-packages/jax/_src/distributed.py", line 180, in initialize
    global_state.initialize(coordinator_address, num_processes, process_id,
  File "/root/miniconda3/envs/jax/lib/python3.11/site-packages/jax/_src/distributed.py", line 95, in initialize
    self.client.connect()
jaxlib.xla_extension.XlaRuntimeError: DEADLINE_EXCEEDED: Barrier timed out. Barrier_id: PjRT_Client_Connect. Timed out task names:
/job:jax_worker/replica:0/task:1
/job:jax_worker/replica:0/task:6
/job:jax_worker/replica:0/task:3
/job:jax_worker/replica:0/task:2
/job:jax_worker/replica:0/task:5
/job:jax_worker/replica:0/task:7
/job:jax_worker/replica:0/task:4

Additional GRPC error information from remote target unknown_target_for_coordination_leader while calling /tensorflow.CoordinationService/Barrier:
:{"created":"@1700792772.792587791","description":"Error received from peer ipv4:127.0.0.1:443","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Barrier timed out. Barrier_id: PjRT_Client_Connect. Timed out task names:\n/job:jax_worker/replica:0/task:1\n/job:jax_worker/replica:0/task:6\n/job:jax_worker/replica:0/task:3\n/job:jax_worker/replica:0/task:2\n/job:jax_worker/replica:0/task:5\n/job:jax_worker/replica:0/task:7\n/job:jax_worker/replica:0/task:4\n","grpc_status":4}
2023-11-24 02:26:13.167769: E external/tsl/tsl/distributed_runtime/coordination/coordination_service_agent.cc:494] Failed to disconnect from coordination service with status: UNAVAILABLE: failed to connect to all addresses
Additional GRPC error information from remote target unknown_target_for_coordination_leader while calling /tensorflow.CoordinationService/ShutdownTask:
:{"created":"@1700792773.167722780","description":"Failed to pick subchannel","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3940,"referenced_errors":[{"created":"@1700792773.167720845","description":"failed to connect to all addresses","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":392,"grpc_status":14}]}
Proceeding with agent shutdown anyway. This is usually caused by an earlier error during execution. Check the logs (this task or the leader) for an earlier error to debug further.
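A note on what the traceback suggests (my reading, not a confirmed fix): `jax.distributed.initialize(..., num_processes=8, process_id=0)` makes this process wait at the `PjRT_Client_Connect` barrier for seven other worker processes (tasks 1-7, the ones listed as timed out) that were apparently never launched, so the barrier times out. Also, `coordinator_address="localhost"` has no port; the gRPC error shows a connection attempt to `127.0.0.1:443`, so an explicit `host:port` is probably needed. A minimal sketch of the pattern I believe is intended, with hypothetical environment-variable names (adapt to your launcher, e.g. `SLURM_PROCID`/`SLURM_NTASKS` under SLURM):

```python
import os


def distributed_args_from_env():
    """Derive jax.distributed.initialize() arguments from environment
    variables set per process by a launcher. The variable names here
    (COORDINATOR_ADDRESS, NUM_PROCESSES, PROCESS_ID) are placeholders,
    not anything JAX reads automatically."""
    coordinator = os.environ.get("COORDINATOR_ADDRESS", "localhost:1234")
    num_processes = int(os.environ.get("NUM_PROCESSES", "1"))
    process_id = int(os.environ.get("PROCESS_ID", "0"))
    return coordinator, num_processes, process_id


if __name__ == "__main__":
    import jax

    coordinator, num_processes, process_id = distributed_args_from_env()
    if num_processes > 1:
        # With num_processes=N, N separate processes must each make this
        # call with a distinct process_id in [0, N); otherwise the
        # PjRT_Client_Connect barrier times out exactly as in the log above.
        jax.distributed.initialize(
            coordinator_address=coordinator,  # must include a port
            num_processes=num_processes,
            process_id=process_id,
        )
    # On a single host where one Python process drives all 8 GPUs,
    # skipping jax.distributed.initialize() entirely may also be correct.
    print(jax.device_count())
```

In other words, for a single node with 8 GPUs driven by one process, the distributed initialization may be unnecessary; if multi-process mode is intended, each of the 8 processes needs its own `process_id`.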

Software details:

  • CUDA 11.8
  • NVIDIA Driver 470.161.03
Package                  Version
------------------------ ---------------------
absl-py                  2.0.0
accelerate               0.24.1
appdirs                  1.4.4
certifi                  2022.12.7
charset-normalizer       2.1.1
chex                     0.1.84
click                    8.1.7
contourpy                1.2.0
cycler                   0.12.1
docker-pycreds           0.4.0
einops                   0.7.0
etils                    1.5.2
filelock                 3.9.0
fire                     0.5.0
flax                     0.7.5
fonttools                4.45.1
fsspec                   2023.10.0
gitdb                    4.0.11
GitPython                3.1.40
huggingface-hub          0.19.4
idna                     3.4
importlib-resources      6.1.1
jax                      0.4.20
jax-smi                  1.0.3
jaxlib                   0.4.20+cuda11.cudnn86
Jinja2                   3.1.2
kiwisolver               1.4.5
Mako                     1.3.0
Markdown                 3.5.1
markdown-it-py           3.0.0
MarkupSafe               2.1.3
matplotlib               3.8.2
mdurl                    0.1.2
ml-dtypes                0.3.1
mpmath                   1.3.0
msgpack                  1.0.7
mypy                     1.7.0
mypy-extensions          1.0.0
nest-asyncio             1.5.8
networkx                 3.0
numpy                    1.26.2
nvidia-cublas-cu11       11.11.3.6
nvidia-cuda-cupti-cu11   11.8.87
nvidia-cuda-nvcc-cu11    11.8.89
nvidia-cuda-nvrtc-cu11   11.8.89
nvidia-cuda-runtime-cu11 11.8.89
nvidia-cudnn-cu11        8.9.6.50
nvidia-cufft-cu11        10.9.0.58
nvidia-cusolver-cu11     11.4.1.48
nvidia-cusparse-cu11     11.7.5.86
nvidia-nccl-cu11         2.19.3
opt-einsum               3.3.0
optax                    0.1.7
orbax-checkpoint         0.4.3
packaging                23.2
pdoc3                    0.10.0
Pillow                   9.3.0
pip                      23.3.1
protobuf                 3.20.3
psutil                   5.9.6
Pygments                 2.17.2
pyparsing                3.1.1
python-dateutil          2.8.2
PyYAML                   6.0.1
regex                    2023.10.3
requests                 2.28.1
rich                     13.7.0
safetensors              0.4.0
scipy                    1.11.4
sentencepiece            0.1.99
sentry-sdk               1.36.0
setproctitle             1.3.3
setuptools               68.0.0
six                      1.16.0
smmap                    5.0.1
sympy                    1.12
tensorstore              0.1.50
termcolor                2.3.0
tokenizers               0.15.0
toolz                    0.12.0
torch                    2.1.1+cpu
torchaudio               2.1.1+cpu
torchvision              0.16.1+cpu
tqdm                     4.66.1
transformers             4.35.2
types-tqdm               4.66.0.4
typing_extensions        4.4.0
urllib3                  1.26.13
wandb                    0.16.0
wheel                    0.41.2
zipp                     3.17.0

Hardware details:

  • single node with 8 GPUs (8 × NVIDIA A800)