Gradient accumulation with DeepSpeed misbehaves if not set during configuration #3369

Open
khalil-Hennara opened this issue Jan 27, 2025 · 1 comment


System Info

- `Accelerate` version: 1.2.1
- Platform: Linux-6.8.0-45-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /opt/conda/bin/accelerate
- Python version: 3.11.10
- Numpy version: 2.1.2
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 503.46 GB
- GPU type: NVIDIA A100-SXM4-80GB
- `Accelerate` default config:
	- compute_environment: LOCAL_MACHINE
	- distributed_type: DEEPSPEED
	- mixed_precision: no
	- use_cpu: False
	- debug: False
	- num_processes: 2
	- machine_rank: 0
	- num_machines: 1
	- rdzv_backend: static
	- same_network: True
	- main_training_function: main
	- enable_cpu_affinity: False
	- deepspeed_config: {'gradient_accumulation_steps': 1, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': False, 'zero_stage': 2}
	- downcast_bf16: no
	- tpu_use_cluster: False
	- tpu_use_sudo: False
	- tpu_env: []

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Following this script https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm_no_trainer.py and training with DeepSpeed ZeRO-2: if we don't set the gradient accumulation steps in the DeepSpeed config but do set GAS on the Accelerator, a strange behavior happens. The training time decreases linearly with GAS; for example, with GAS=2 the run takes half the original time, and the time keeps dropping as GAS increases. This behavior only happens when using DeepSpeed without setting GAS in the DeepSpeed config. If this is fine and is how it is supposed to work, please add a note or hint on the usage of GAS with the Accelerator, because it might cause a wrong training loop.
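A minimal sketch of the setup I mean (not the exact run_clm_no_trainer.py script; the toy model and data below are only for illustration). GAS is passed to the Accelerator only, while the DeepSpeed config generated by `accelerate config` (shown above) still has `gradient_accumulation_steps: 1`:

```python
# Toy stand-in for the real training script, for illustration only.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# GAS is set here on the Accelerator only; the DeepSpeed config above still has
# gradient_accumulation_steps: 1.
accelerator = Accelerator(gradient_accumulation_steps=2)

model = torch.nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataloader = DataLoader(TensorDataset(torch.randn(64, 16), torch.randn(64, 1)), batch_size=8)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, targets in dataloader:
    with accelerator.accumulate(model):
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```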

Expected behavior

Setting the GAS on the Accelerator should be passed through to the DeepSpeed config and work correctly, but that does not happen: the training time decreases linearly with GAS, which means batches are being skipped during training.
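For comparison, a sketch of the workaround I would expect not to be necessary: declaring the same GAS directly on the `DeepSpeedPlugin` so the Accelerator and the DeepSpeed engine cannot disagree. The plugin arguments below are based on `accelerate.utils.DeepSpeedPlugin` and are meant as a sketch, not a confirmed fix:

```python
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# Declare GAS on the DeepSpeed plugin itself so the generated DeepSpeed config
# and the Accelerator use the same value.
ds_plugin = DeepSpeedPlugin(zero_stage=2, gradient_accumulation_steps=2)
accelerator = Accelerator(deepspeed_plugin=ds_plugin)
```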


khalil-Hennara commented Jan 27, 2025

I have a question also for this script, https://github.com/huggingface/accelerate/blob/main/examples/by_feature/gradient_accumulation_for_autoregressive_models.py if I am using DeepSpeed should I update my code like this or deepspeed will mange that,
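To help narrow this down, here is a small check I can run after `accelerator.prepare(...)` to see which GAS each side actually ends up with. The attribute and method names below are assumptions based on the public `Accelerator` and DeepSpeed engine APIs:

```python
# After accelerator.prepare(...), `model` is a DeepSpeed engine when DeepSpeed is enabled.
# Compare the GAS the Accelerator uses with the one DeepSpeed was configured with
# (the engine method name is an assumption, guarded with hasattr just in case).
print("Accelerator GAS:", accelerator.gradient_accumulation_steps)
if hasattr(model, "gradient_accumulation_steps"):
    print("DeepSpeed engine GAS:", model.gradient_accumulation_steps())
```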
