Confused: How to deploy a training task on multiple nodes with accelerate and deepspeed? #1288
Unanswered
xiechengmude asked this question in Q&A
Replies: 3 comments
-
Potentially related to this: but I have experienced similar problems when trying to run axolotl training with a very large dataset (ca. 100B tokens at 8k context length).
-
Hey, could you check the https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/docs/nccl.md docs to see if they help solve that NCCL issue? Alternatively, can you try running multi-node with accelerate as described in https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/docs/multi-node.md?
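As a rough illustration of what such a multi-node launch involves, here is a minimal accelerate + DeepSpeed config sketch. Every value below is a placeholder chosen for the example (4 nodes with 8 GPUs each, a made-up main-node address, an assumed DeepSpeed JSON path); none of it comes from this thread or from the linked docs:

```yaml
# accelerate_multinode.yaml -- illustrative sketch only, all values are placeholders
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_config_file: deepspeed/zero3.json   # assumed path to a DeepSpeed JSON config
machine_rank: 0             # 0 on the main node, 1..3 on the other nodes
main_process_ip: 10.0.0.1   # placeholder: reachable address of the rank-0 node
main_process_port: 29500
num_machines: 4
num_processes: 32           # total number of GPUs across all nodes
mixed_precision: bf16
```

Each node then runs the same command, e.g. `accelerate launch --config_file accelerate_multinode.yaml -m axolotl.cli.train your_axolotl_config.yml`, changing only `machine_rank`; exporting `NCCL_DEBUG=INFO` (and `NCCL_SOCKET_IFNAME` if the wrong interface gets picked) before launching can help surface NCCL setup failures like the ones in the log below.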
-
## Here's the accelerate config:
## Here's the log record:
"""
gpu004: [2024-02-13 02:00:39,717] [DEBUG] [axolotl.load_tokenizer:248] [PID:3056709] [RANK:7] UNK: 0 /
gpu004: [2024-02-13 02:00:39,717] [INFO] [axolotl.load_tokenizer:259] [PID:3056709] [RANK:7] No Chat template selected. Consider adding a chat template for easier inference.
Downloading readme: 591B [00:00, 802kB/s]
Downloading data: 100%|██████████| 11.7M/11.7M [00:05<00:00, 2.20MB/s]
Generating train split: 100%|██████████| 10000/10000 [00:00<00:00, 121481.07 examples/s]
Tokenizing Prompts (num_proc=64): 4%|▍ | 394/10000 [00:00<00:18, 532.87 examples/s]
gpu003: multiprocess.pool.RemoteTraceback:
gpu003: """
gpu003: Traceback (most recent call last):
gpu003: File "/data/vayu/train/axolotl/src/axolotl/prompt_tokenizers.py", line 356, in tokenize_prompt
gpu003: for _, part in enumerate(
gpu003: File "/data/vayu/train/axolotl/src/axolotl/prompters.py", line 328, in build_prompt
gpu003: turns = self._build_result(source)
gpu003: File "/data/vayu/train/axolotl/src/axolotl/prompters.py", line 318, in _build_result
gpu003: role = roles[sentence["from"]]
gpu003: KeyError: None
gpu003:
gpu003: The above exception was the direct cause of the following exception:
gpu003:
gpu003: Traceback (most recent call last):
gpu003: File "/home/vayu/.local/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker
gpu003: result = (True, func(*args, **kwds))
gpu003: File "/home/vayu/.local/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 625, in _write_generator_to_queue
gpu003: for i, result in enumerate(func(**kwargs)):
gpu003: File "/home/vayu/.local/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3458, in _map_single
gpu003: example = apply_function_on_filtered_inputs(example, i, offset=offset)
gpu003: File "/home/vayu/.local/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3361, in apply_function_on_filtered_inputs
gpu003: processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
gpu003: File "/data/vayu/train/axolotl/src/axolotl/prompt_tokenizers.py", line 443, in tokenize_prompt
gpu003: raise InvalidDataException(str(err)) from err
gpu003: axolotl.prompt_tokenizers.InvalidDataException: None
gpu003: """
gpu003:
gpu003: The above exception was the direct cause of the following exception:
gpu003:
gpu003: Traceback (most recent call last):
gpu003: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 196, in _run_module_as_main
gpu003: return _run_code(code, main_globals, None,
gpu003: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 86, in _run_code
gpu003: exec(code, run_globals)
gpu003: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 59, in
gpu003: fire.Fire(do_cli)
gpu003: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
gpu003: component_trace = _Fire(component, args, parsed_flag_args, context, name)
gpu003: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
gpu003: component, remaining_args = _CallAndUpdateTrace(
gpu003: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
gpu003: component = fn(*varargs, **kwargs)
gpu003: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 35, in do_cli
gpu003: return do_train(parsed_cfg, parsed_cli_args)
gpu003: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 53, in do_train
gpu003: dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
gpu003: File "/data/vayu/train/axolotl/src/axolotl/cli/init.py", line 366, in load_datasets
gpu003: train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
gpu003: File "/data/vayu/train/axolotl/src/axolotl/utils/data.py", line 79, in prepare_dataset
gpu003: train_dataset, eval_dataset, prompters = load_prepare_datasets(
gpu003: File "/data/vayu/train/axolotl/src/axolotl/utils/data.py", line 461, in load_prepare_datasets
gpu003: dataset, prompters = load_tokenized_prepared_datasets(
gpu003: File "/data/vayu/train/axolotl/src/axolotl/utils/data.py", line 403, in load_tokenized_prepared_datasets
gpu003: dataset_wrapper, dataset_prompter = get_dataset_wrapper(
gpu003: File "/data/vayu/train/axolotl/src/axolotl/utils/data.py", line 554, in get_dataset_wrapper
gpu003: dataset_wrapper = TokenizedPromptDataset(
gpu003: File "/data/vayu/train/axolotl/src/axolotl/datasets.py", line 43, in init
gpu003: self.process(dataset).data,
gpu003: File "/data/vayu/train/axolotl/src/axolotl/datasets.py", line 55, in process
gpu003: return dataset.map(
gpu003: File "/home/vayu/.local/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 593, in wrapper
gpu003: out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
gpu003: File "/home/vayu/.local/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 558, in wrapper
gpu003: out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
gpu003: File "/home/vayu/.local/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3197, in map
gpu003: for rank, done, content in iflatmap_unordered(
gpu003: File "/home/vayu/.local/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 665, in iflatmap_unordered
gpu003: [async_result.get(timeout=0.05) for async_result in async_results]
gpu003: File "/home/vayu/.local/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 665, in
gpu003: [async_result.get(timeout=0.05) for async_result in async_results]
gpu003: File "/home/vayu/.local/lib/python3.10/site-packages/multiprocess/pool.py", line 774, in get
gpu003: raise self._value
gpu003: axolotl.prompt_tokenizers.InvalidDataException: None
gpu003: Traceback (most recent call last):
gpu003: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 196, in _run_module_as_main
gpu003: return _run_code(code, main_globals, None,
gpu003: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 86, in _run_code
gpu003: exec(code, run_globals)
gpu003: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 59, in
gpu003: fire.Fire(do_cli)
gpu003: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
gpu003: component_trace = _Fire(component, args, parsed_flag_args, context, name)
gpu003: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
gpu003: component, remaining_args = _CallAndUpdateTrace(
gpu003: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
gpu003: component = fn(*varargs, **kwargs)
gpu003: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 35, in do_cli
gpu003: return do_train(parsed_cfg, parsed_cli_args)
gpu003: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 53, in do_train
gpu003: dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
gpu003: File "/data/vayu/train/axolotl/src/axolotl/cli/init.py", line 366, in load_datasets
gpu003: train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
gpu003: File "/data/vayu/train/axolotl/src/axolotl/utils/data.py", line 70, in prepare_dataset
gpu003: with zero_first(is_main_process()):
gpu003: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/contextlib.py", line 135, in enter
gpu003: return next(self.gen)
gpu003: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 70, in zero_first
gpu003: barrier()
gpu003: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 36, in barrier
gpu003: dist.barrier()
gpu003: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
gpu003: return func(*args, **kwargs)
gpu003: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
gpu003: work = default_pg.barrier(opts=opts)
gpu003: RuntimeError: [6] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.
gpu003/gpu004/gpu005/gpu006: [the same dist.barrier() traceback and "RuntimeError: [N] is setting up NCCL communicator and retrieving ncclUniqueId from [0] ... Connection reset by peer ..." repeat for the remaining ranks (among them 1-5, 7, 10-18, 20, 24, 27 and 31), interleaved across the four nodes; the log is truncated mid-traceback]
"""
gpu004: File "/data/vayu/train/axolotl/src/axolotl/utils/data.py", line 70, in prepare_dataset
gpu005: component = fn(*varargs, **kwargs)
gpu006: return do_train(parsed_cfg, parsed_cli_args)
gpu004: with zero_first(is_main_process()):
gpu005: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 35, in do_cli
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 53, in do_train
gpu004: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/contextlib.py", line 135, in enter
gpu004: return next(self.gen)
gpu004: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 70, in zero_first
gpu004: barrier()
gpu004: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 36, in barrier
gpu005: return do_train(parsed_cfg, parsed_cli_args)
gpu005: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 53, in do_train
gpu005: dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
gpu005: File "/data/vayu/train/axolotl/src/axolotl/cli/init.py", line 366, in load_datasets
gpu006: dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
gpu005: train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
gpu004: dist.barrier()
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/init.py", line 366, in load_datasets
gpu005: File "/data/vayu/train/axolotl/src/axolotl/utils/data.py", line 70, in prepare_dataset
gpu004: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
gpu006: train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
gpu005: with zero_first(is_main_process()):
gpu004: return func(*args, **kwargs)
gpu006: File "/data/vayu/train/axolotl/src/axolotl/utils/data.py", line 70, in prepare_dataset
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/contextlib.py", line 135, in enter
gpu004: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
gpu006: with zero_first(is_main_process()):
gpu005: return next(self.gen)
gpu004: work = default_pg.barrier(opts=opts)
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/contextlib.py", line 135, in enter
gpu005: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 70, in zero_first
gpu004: RuntimeError: [8] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.
gpu005: barrier()
gpu006: return next(self.gen)
gpu004: Traceback (most recent call last):
gpu005: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 36, in barrier
gpu006: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 70, in zero_first
gpu004: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 196, in _run_module_as_main
gpu005: dist.barrier()
gpu006: barrier()
gpu004: return _run_code(code, main_globals, None,
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
gpu006: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 36, in barrier
gpu004: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 86, in _run_code
gpu005: return func(*args, **kwargs)
gpu006: dist.barrier()
gpu004: exec(code, run_globals)
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
gpu004: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 59, in
gpu005: work = default_pg.barrier(opts=opts)
gpu006: return func(*args, **kwargs)
gpu004: fire.Fire(do_cli)
gpu005: RuntimeError: [23] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
gpu004: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
gpu005: Traceback (most recent call last):
gpu006: work = default_pg.barrier(opts=opts)
gpu004: component_trace = _Fire(component, args, parsed_flag_args, context, name)
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 196, in _run_module_as_main
gpu006: RuntimeError: [26] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.
gpu004: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
gpu005: return _run_code(code, main_globals, None,
gpu006: Traceback (most recent call last):
gpu004: component, remaining_args = _CallAndUpdateTrace(
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 86, in _run_code
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 196, in _run_module_as_main
gpu004: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
gpu005: exec(code, run_globals)
gpu006: return _run_code(code, main_globals, None,
gpu004: component = fn(*varargs, **kwargs)
gpu005: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 59, in
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 86, in _run_code
gpu004: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 35, in do_cli
gpu005: fire.Fire(do_cli)
gpu006: exec(code, run_globals)
gpu004: return do_train(parsed_cfg, parsed_cli_args)
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 59, in
gpu004: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 53, in do_train
gpu005: component_trace = _Fire(component, args, parsed_flag_args, context, name)
gpu006: fire.Fire(do_cli)
gpu004: dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
gpu004: File "/data/vayu/train/axolotl/src/axolotl/cli/init.py", line 366, in load_datasets
gpu005: component, remaining_args = _CallAndUpdateTrace(
gpu006: component_trace = _Fire(component, args, parsed_flag_args, context, name)
gpu004: train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
gpu004: File "/data/vayu/train/axolotl/src/axolotl/utils/data.py", line 70, in prepare_dataset
gpu005: component = fn(*varargs, **kwargs)
gpu006: component, remaining_args = _CallAndUpdateTrace(
gpu004: with zero_first(is_main_process()):
gpu005: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 35, in do_cli
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
gpu004: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/contextlib.py", line 135, in enter
gpu005: return do_train(parsed_cfg, parsed_cli_args)
gpu006: component = fn(*varargs, **kwargs)
gpu004: return next(self.gen)
gpu005: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 53, in do_train
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 35, in do_cli
gpu004: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 70, in zero_first
gpu005: dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
gpu006: return do_train(parsed_cfg, parsed_cli_args)
gpu004: barrier()
gpu005: File "/data/vayu/train/axolotl/src/axolotl/cli/init.py", line 366, in load_datasets
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 53, in do_train
gpu004: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 36, in barrier
gpu005: train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
gpu004: dist.barrier()
gpu006: dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
gpu005: File "/data/vayu/train/axolotl/src/axolotl/utils/data.py", line 70, in prepare_dataset
gpu004: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/init.py", line 366, in load_datasets
gpu005: with zero_first(is_main_process()):
gpu004: return func(*args, **kwargs)
gpu006: train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/contextlib.py", line 135, in enter
gpu004: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
gpu006: File "/data/vayu/train/axolotl/src/axolotl/utils/data.py", line 70, in prepare_dataset
gpu005: return next(self.gen)
gpu004: work = default_pg.barrier(opts=opts)
gpu006: with zero_first(is_main_process()):
gpu005: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 70, in zero_first
gpu004: RuntimeError: [9] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/contextlib.py", line 135, in enter
gpu005: barrier()
gpu006: return next(self.gen)
gpu005: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 36, in barrier
gpu006: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 70, in zero_first
gpu005: dist.barrier()
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
gpu005: return func(*args, **kwargs)
gpu006: barrier()
gpu006: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 36, in barrier
gpu006: dist.barrier()
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
gpu006: return func(*args, **kwargs)
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
gpu006: work = default_pg.barrier(opts=opts)
gpu006: RuntimeError: [28] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.
gpu006: Traceback (most recent call last):
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 196, in _run_module_as_main
gpu006: return _run_code(code, main_globals, None,
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 86, in _run_code
gpu006: exec(code, run_globals)
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 59, in
gpu006: fire.Fire(do_cli)
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
gpu006: component_trace = _Fire(component, args, parsed_flag_args, context, name)
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
gpu005: work = default_pg.barrier(opts=opts)
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
gpu006: component, remaining_args = _CallAndUpdateTrace(
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
gpu006: component = fn(*varargs, **kwargs)
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 35, in do_cli
gpu006: return do_train(parsed_cfg, parsed_cli_args)
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 53, in do_train
gpu006: dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/init.py", line 366, in load_datasets
gpu006: train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
gpu006: File "/data/vayu/train/axolotl/src/axolotl/utils/data.py", line 70, in prepare_dataset
gpu006: with zero_first(is_main_process()):
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/contextlib.py", line 135, in enter
gpu005: RuntimeError: [22] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.
gpu005: Traceback (most recent call last):
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 196, in _run_module_as_main
gpu005: return _run_code(code, main_globals, None,
gpu006: return next(self.gen)
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 86, in _run_code
gpu006: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 70, in zero_first
gpu005: exec(code, run_globals)
gpu006: barrier()
gpu005: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 59, in
gpu006: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 36, in barrier
gpu006: dist.barrier()
gpu005: fire.Fire(do_cli)
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
gpu006: return func(*args, **kwargs)
gpu005: component_trace = _Fire(component, args, parsed_flag_args, context, name)
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
gpu006: work = default_pg.barrier(opts=opts)
gpu005: component, remaining_args = _CallAndUpdateTrace(
gpu006: RuntimeError: [30] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
gpu006: Traceback (most recent call last):
gpu005: component = fn(*varargs, **kwargs)
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 196, in _run_module_as_main
gpu005: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 35, in do_cli
gpu006: return _run_code(code, main_globals, None,
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 86, in _run_code
gpu006: exec(code, run_globals)
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 59, in
gpu006: fire.Fire(do_cli)
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
gpu006: component_trace = _Fire(component, args, parsed_flag_args, context, name)
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
gpu006: component, remaining_args = _CallAndUpdateTrace(
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
gpu006: component = fn(*varargs, **kwargs)
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 35, in do_cli
gpu006: return do_train(parsed_cfg, parsed_cli_args)
gpu005: return do_train(parsed_cfg, parsed_cli_args)
gpu005: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 53, in do_train
gpu005: dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
gpu005: File "/data/vayu/train/axolotl/src/axolotl/cli/init.py", line 366, in load_datasets
gpu005: train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
gpu005: File "/data/vayu/train/axolotl/src/axolotl/utils/data.py", line 70, in prepare_dataset
gpu005: with zero_first(is_main_process()):
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 53, in do_train
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/contextlib.py", line 135, in enter
gpu006: dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
gpu005: return next(self.gen)
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/init.py", line 366, in load_datasets
gpu005: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 70, in zero_first
gpu006: train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
gpu005: barrier()
gpu006: File "/data/vayu/train/axolotl/src/axolotl/utils/data.py", line 70, in prepare_dataset
gpu005: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 36, in barrier
gpu006: with zero_first(is_main_process()):
gpu005: dist.barrier()
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/contextlib.py", line 135, in enter
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
gpu006: return next(self.gen)
gpu005: return func(*args, **kwargs)
gpu006: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 70, in zero_first
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
gpu006: barrier()
gpu005: work = default_pg.barrier(opts=opts)
gpu006: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 36, in barrier
gpu005: RuntimeError: [19] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.
gpu006: dist.barrier()
gpu005: Traceback (most recent call last):
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 196, in _run_module_as_main
gpu005: return _run_code(code, main_globals, None,
gpu006: return func(*args, **kwargs)
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
gpu006: work = default_pg.barrier(opts=opts)
gpu006: RuntimeError: [25] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.
gpu006: Traceback (most recent call last):
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 196, in _run_module_as_main
gpu006: return _run_code(code, main_globals, None,
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 86, in _run_code
gpu006: exec(code, run_globals)
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 59, in
gpu006: fire.Fire(do_cli)
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
gpu006: component_trace = _Fire(component, args, parsed_flag_args, context, name)
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 86, in _run_code
gpu005: exec(code, run_globals)
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
gpu006: component, remaining_args = _CallAndUpdateTrace(
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
gpu006: component = fn(*varargs, **kwargs)
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 35, in do_cli
gpu006: return do_train(parsed_cfg, parsed_cli_args)
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 53, in do_train
gpu006: dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/init.py", line 366, in load_datasets
gpu006: train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
gpu006: File "/data/vayu/train/axolotl/src/axolotl/utils/data.py", line 70, in prepare_dataset
gpu006: with zero_first(is_main_process()):
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/contextlib.py", line 135, in enter
gpu005: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 59, in
gpu005: fire.Fire(do_cli)
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
gpu005: component_trace = _Fire(component, args, parsed_flag_args, context, name)
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
gpu005: component, remaining_args = _CallAndUpdateTrace(
gpu006: return next(self.gen)
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
gpu006: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 70, in zero_first
gpu005: component = fn(*varargs, **kwargs)
gpu006: barrier()
gpu005: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 35, in do_cli
gpu006: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 36, in barrier
gpu005: return do_train(parsed_cfg, parsed_cli_args)
gpu006: dist.barrier()
gpu005: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 53, in do_train
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
gpu005: dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
gpu006: return func(*args, **kwargs)
gpu005: File "/data/vayu/train/axolotl/src/axolotl/cli/init.py", line 366, in load_datasets
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
gpu005: train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
gpu006: work = default_pg.barrier(opts=opts)
gpu005: File "/data/vayu/train/axolotl/src/axolotl/utils/data.py", line 70, in prepare_dataset
gpu006: RuntimeError: [29] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.
gpu005: with zero_first(is_main_process()):
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/contextlib.py", line 135, in enter
gpu005: return next(self.gen)
gpu005: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 70, in zero_first
gpu005: barrier()
gpu005: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 36, in barrier
gpu005: dist.barrier()
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
gpu005: return func(*args, **kwargs)
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
gpu005: work = default_pg.barrier(opts=opts)
gpu005: RuntimeError: [21] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.
gpu003: [2024-02-13 02:01:06,917] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 739792
gpu003: [2024-02-13 02:01:06,917] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 739793
gpu005: [2024-02-13 02:01:07,233] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1663086
gpu005: [2024-02-13 02:01:07,414] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1663087
gpu003: [2024-02-13 02:01:07,548] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 739794
gpu005: [2024-02-13 02:01:07,709] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1663088
gpu005: [2024-02-13 02:01:07,710] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1663089
gpu003: [2024-02-13 02:01:07,770] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 739795
gpu004: [2024-02-13 02:01:07,779] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3056702
gpu004: [2024-02-13 02:01:07,780] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3056703
gpu003: [2024-02-13 02:01:07,818] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 739796
gpu004: [2024-02-13 02:01:07,839] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3056704
gpu003: [2024-02-13 02:01:07,842] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 739797
gpu003: [2024-02-13 02:01:07,866] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 739798
gpu004: [2024-02-13 02:01:07,875] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3056705
gpu005: [2024-02-13 02:01:07,889] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1663090
gpu003: [2024-02-13 02:01:07,890] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 739799
gpu004: [2024-02-13 02:01:07,909] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3056706
gpu003: [2024-02-13 02:01:07,914] [ERROR] [launch.py:321:sigkill_handler] ['/data/vayu/miniconda3/envs/axo/bin/python', '-u', '-m', 'axolotl.cli.train', 'Task-34b-Chat-test.yaml'] exits with return code = 1
gpu005: [2024-02-13 02:01:07,934] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1663091
gpu004: [2024-02-13 02:01:07,940] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3056707
gpu005: [2024-02-13 02:01:07,958] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1663092
gpu004: [2024-02-13 02:01:07,972] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3056708
gpu005: [2024-02-13 02:01:07,982] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1663093
gpu004: [2024-02-13 02:01:08,002] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3056709
gpu005: [2024-02-13 02:01:08,005] [ERROR] [launch.py:321:sigkill_handler] ['/data/vayu/miniconda3/envs/axo/bin/python', '-u', '-m', 'axolotl.cli.train', 'Task-34b-Chat-test.yaml'] exits with return code = 1
gpu004: [2024-02-13 02:01:08,032] [ERROR] [launch.py:321:sigkill_handler] ['/data/vayu/miniconda3/envs/axo/bin/python', '-u', '-m', 'axolotl.cli.train', 'Task-34b-Chat-test.yaml'] exits with return code = 1
gpu006: [2024-02-13 02:01:08,205] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2690903
gpu006: [2024-02-13 02:01:08,205] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2690904
gpu006: [2024-02-13 02:01:08,264] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2690905
gpu006: [2024-02-13 02:01:08,301] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2690906
gpu006: [2024-02-13 02:01:08,329] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2690907
gpu006: [2024-02-13 02:01:08,354] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2690908
gpu006: [2024-02-13 02:01:08,379] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2690909
gpu006: [2024-02-13 02:01:08,403] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2690910
gpu006: [2024-02-13 02:01:08,427] [ERROR] [launch.py:321:sigkill_handler] ['/data/vayu/miniconda3/envs/axo/bin/python', '-u', '-m', 'axolotl.cli.train', 'Task-34b-Chat-test.yaml'] exits with return code = 1
pdsh@gpu003: gpu003: ssh exited with exit code 1
pdsh@gpu003: gpu004: ssh exited with exit code 1
pdsh@gpu003: gpu005: ssh exited with exit code 1
pdsh@gpu003: gpu006: ssh exited with exit code 1
Traceback (most recent call last):
File "/data/vayu/miniconda3/envs/axo/bin/accelerate", line 8, in
sys.exit(main())
File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
deepspeed_launcher(args)
File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/accelerate/commands/launch.py", line 712, in deepspeed_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['deepspeed', '--no_local_rank', '--hostfile', '/data/vayu/train/config/hostfile', '--launcher', 'pdsh', '--num_gpus', '8', '--master_port', '9901', '--module', 'axolotl.cli.train', 'Task-34b-Chat-test.yaml']' returned non-zero exit status 1.
"""
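
The tail of the log also shows the launch path: accelerate launch delegated to deepspeed with --launcher pdsh, --hostfile /data/vayu/train/config/hostfile and --num_gpus 8, so any rank exiting non-zero only surfaces as the generic subprocess.CalledProcessError at the bottom. The hostfile itself is not included in the log; a hypothetical DeepSpeed-style hostfile for the four nodes seen in this run (8 GPUs each) would look like the sketch below, with hostnames and slot counts assumed from the log rather than taken from the actual file:

gpu003 slots=8
gpu004 slots=8
gpu005 slots=8
gpu006 slots=8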