-
Here's an example with DeepSpeed https://wandb.ai/axolotl-ai/fft-70b-w-liger/runs/wtexza0s and the yaml to reproduce it https://wandb.ai/axolotl-ai/fft-70b-w-liger/runs/wtexza0s/files/tmp/axolotl_config_70oa5qvt.yml
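In general, pointing axolotl at one of the bundled DeepSpeed configs looks roughly like this (a sketch only, not necessarily the exact command from that run; the yaml path is a placeholder and the json path assumes the repo's deepspeed_configs/ directory):

```bash
# Sketch: the linked wandb yaml and the json under deepspeed_configs/ are the real inputs.
accelerate launch -m axolotl.cli.train your_config.yml --deepspeed deepspeed_configs/zero3_bf16.json
```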
-
@winglian Also, I would like to make a donation for your work. However, my circumstances do not allow me to use Visa or MasterCard. Do you have a crypto wallet?
-
@winglian I'm also very interested in this configuration and can't access it via the links. Any chance you could add it to examples?
-
@sklyar61 @tcleberg Sorry about that! The wandb project is public now!
-
@winglian I'm trying and failing to reproduce that run's success. Could you share the commands you used to run it and the image version?
-
Greetings, Axolotl community members!
I recently attempted a full-precision fine-tune of the llama-3-70B model using the docker image winglian/axolotl-cloud
on runpod.io's cloud service. The experience left me wondering: is it even feasible to run a full-precision fine-tune on a single machine with 640 GB, or even 754 GB, of total GPU memory? Unfortunately, I was unable to get the training started :(
Initially, I worked out the FSDP settings below on a machine equipped with 4*A100 for the llama-3-8B model; that training started and completed successfully.
```yaml
fsdp:
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: true
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
```
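For reference, a run with this config can be launched along these lines (a sketch only; the config filename is just a placeholder):

```bash
# Sketch: swap in the real yaml filename for the 8B full fine-tune.
accelerate launch --num_processes 4 -m axolotl.cli.train llama3_8b_fft.yml
```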
According to the llama repository, full-precision tuning requires 500 GB of memory, so I opted for a machine with 8*A100 (640 GB total) and tried to start a full-precision run for the 70B model. I encountered an OOM error: the system reported a shortage of 1.6 GB of memory on one card. I then tried training on 8*H100 NVL with 754 GB of total GPU memory. This scenario also ended in an error:
File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/launcher/a
pi.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
axolotl.cli.train FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-09-05_11:17:50
host : d5a636c8b4c6
rank : 7 (local_rank: 7)
exitcode : 1 (pid: 17103)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
In this case, the error again indicated a shortage of less than 1 GB of memory.
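As a rough sanity check (my own back-of-the-envelope numbers, assuming bf16 weights and gradients plus fp32 AdamW state), the model and optimizer state alone for a 70B full fine-tune is already larger than 640 GB unless most of it is offloaded to CPU RAM:

```bash
# Rough estimate only: 70B params, bf16 weights + bf16 grads + fp32 AdamW (master copy, m, v).
python3 -c '
params = 70e9
bytes_per_param = 2 + 2 + 4 + 4 + 4   # bf16 weights, bf16 grads, fp32 master, fp32 exp_avg, fp32 exp_avg_sq
total_gb = params * bytes_per_param / 1e9
print(f"model + optimizer state: {total_gb:.0f} GB")             # ~1120 GB
print(f"per GPU, FULL_SHARD over 8 GPUs: {total_gb / 8:.0f} GB")  # ~140 GB, before activations
'
```

With fsdp_offload_params: true most of that should sit in CPU RAM rather than on the GPUs, which is presumably why the runs come so close to fitting.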
Here are my key settings:
```yaml
base_model: failspy/Meta-Llama-3-70B-Instruct-abliterated-v3.5
sequence_len: 2048
sample_packing: false
pad_to_sequence_len: true
gradient_accumulation_steps: 16
micro_batch_size: 1
num_epochs: 2
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 1e-5
bf16: auto
fp16:
tf32: false
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
flash_attention: true
fsdp:
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: true
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
```
I have also tried DeepSpeed with the zero2.json and zero3.json configs.
I also tried changing the precision to fp16. Moreover, I initiated training using the command:
```bash
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True accelerate launch --mixed-precision bf16 -m axolotl.cli.train config.yml
```
I also attempted to switch the CUDA version to 12.4 and PyTorch to 2.4.0, but to no avail.
I also tried setting these environment variables:
```bash
export CUDA_HOME=/usr/local/cuda-12.1
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
export NCCL_P2P_DISABLE=1
export NCCL_TIMEOUT=600
```
This also did not help.
This brings me to my question: has anyone managed to run a full fine-tune of the 70B model on a single machine, or is this mission impossible just for me?