-
Here's an example with DeepSpeed https://wandb.ai/axolotl-ai/fft-70b-w-liger/runs/wtexza0s and the yaml to reproduce it https://wandb.ai/axolotl-ai/fft-70b-w-liger/runs/wtexza0s/files/tmp/axolotl_config_70oa5qvt.yml
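In general, pointing axolotl at one of the bundled DeepSpeed configs looks roughly like this (a sketch only, not necessarily the exact command from that run; the yaml path is a placeholder and the json path assumes the repo's deepspeed_configs/ directory):

```bash
# Sketch: the linked wandb yaml and the json under deepspeed_configs/ are the real inputs.
accelerate launch -m axolotl.cli.train your_config.yml --deepspeed deepspeed_configs/zero3_bf16.json
```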
-
@winglian Also, I would like to make a donation for your work. However, my circumstances do not allow me to use Visa or MasterCard. Do you have a crypto wallet?
-
@winglian I'm also very interested in this configuration and can't access it via the links. Any chance you could add it to examples?
-
@sklyar61 @tcleberg Sorry about that! The wandb project is public now!
-
@winglian I'm trying and failing to reproduce that run's success. Could you share the commands you used to run it and the image version?
-
Greetings, Axolotl community members!
I recently attempted a full-precision fine-tune of the llama-3-70B model using the docker image winglian/axolotl-cloud
on runpod.io's cloud service. The experience left me wondering: is it even feasible to run a full-precision fine-tune on a single machine with 640 GB, or even 754 GB, of total GPU memory? Unfortunately, I was unable to get the training started :(
Initially, I worked out the FSDP settings below on a machine equipped with 4*A100 for the llama-3-8B model; that training started and completed successfully.
```yaml
fsdp:
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: true
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
```
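For reference, a run with this config can be launched along these lines (a sketch only; the config filename is just a placeholder):

```bash
# Sketch: swap in the real yaml filename for the 8B full fine-tune.
accelerate launch --num_processes 4 -m axolotl.cli.train llama3_8b_fft.yml
```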
According to the llama repository, full-precision tuning requires 500 GB of memory, so I opted for a machine with 8*A100 (640 GB total) and tried to start a full-precision run for the 70B model. I encountered an OOM error: the system reported a shortage of 1.6 GB of memory on one card. I then tried training on 8*H100 NVL with 754 GB of total GPU memory. This scenario also ended in an error:
File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/launcher/a
pi.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
axolotl.cli.train FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-09-05_11:17:50
host : d5a636c8b4c6
rank : 7 (local_rank: 7)
exitcode : 1 (pid: 17103)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
In this case, the error again indicated a shortage of less than 1 GB of memory.
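As a rough sanity check (my own back-of-the-envelope numbers, assuming bf16 weights and gradients plus fp32 AdamW state), the model and optimizer state alone for a 70B full fine-tune is already larger than 640 GB unless most of it is offloaded to CPU RAM:

```bash
# Rough estimate only: 70B params, bf16 weights + bf16 grads + fp32 AdamW (master copy, m, v).
python3 -c '
params = 70e9
bytes_per_param = 2 + 2 + 4 + 4 + 4   # bf16 weights, bf16 grads, fp32 master, fp32 exp_avg, fp32 exp_avg_sq
total_gb = params * bytes_per_param / 1e9
print(f"model + optimizer state: {total_gb:.0f} GB")             # ~1120 GB
print(f"per GPU, FULL_SHARD over 8 GPUs: {total_gb / 8:.0f} GB")  # ~140 GB, before activations
'
```

With fsdp_offload_params: true most of that should sit in CPU RAM rather than on the GPUs, which is presumably why the runs come so close to fitting.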
Here are my key settings:
```yaml
base_model: failspy/Meta-Llama-3-70B-Instruct-abliterated-v3.5
sequence_len: 2048
sample_packing: false
pad_to_sequence_len: true
gradient_accumulation_steps: 16
micro_batch_size: 1
num_epochs: 2
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 1e-5
bf16: auto
fp16:
tf32: false
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
flash_attention: true
fsdp:
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: true
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
```
I have also tried DeepSpeed with the zero2.json and zero3.json configs.
I also tried changing the precision to fp16. Moreover, I initiated training using the command:
```bash
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True accelerate launch --mixed-precision bf16 -m axolotl.cli.train config.yml
```
I also attempted to switch the CUDA version to 12.4 and PyTorch to 2.4.0, but to no avail.
I also tried setting these environment variables:
```bash
export CUDA_HOME=/usr/local/cuda-12.1
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
export NCCL_P2P_DISABLE=1
export NCCL_TIMEOUT=600
```
This also did not help.
This brings me to my question: has anyone managed to run a full fine-tune of the 70B model on a single machine, or is this mission impossible just for me?