
RuntimeError: unscale_() has already been called on this optimizer since the last update(). #24449

Closed · 2 of 4 tasks

kunaldeo opened this issue Jun 23, 2023 · 3 comments

@kunaldeo

System Info

  • transformers version: 4.31.0.dev0
  • accelerate version: 0.21.0.dev0
  • peft version: 0.4.0.dev0
  • Platform: Linux-6.3.9-zen1-1-zen-x86_64-with-glibc2.37
  • Python version: 3.10.11
  • Huggingface_hub version: 0.15.1
  • Safetensors version: 0.3.1
  • PyTorch version (GPU?): 2.0.1+cu118 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Run LoRA training:
# imports assumed by this snippet; model, tokenizer, train_data, val_data, ddp and the
# UPPER_CASE constants are defined earlier in the script
import sys

import torch
import transformers
from peft import get_peft_model_state_dict
from transformers import Trainer, TrainingArguments

trainer = Trainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=val_data,
    args=TrainingArguments(
        per_device_train_batch_size=4,
        auto_find_batch_size=True,
        gradient_accumulation_steps=32,
        warmup_steps=100,
        num_train_epochs=EPOCHS,
        learning_rate=LEARNING_RATE,
        fp16=True,
        logging_steps=1,
        evaluation_strategy="steps" if VAL_SET_SIZE > 0 else "no",
        save_strategy="steps",
        eval_steps=50 if VAL_SET_SIZE > 0 else None,
        save_steps=500,
        output_dir=OUTPUT_DIR, #output_dir=repository_id,
        save_total_limit=3,
        load_best_model_at_end=True if VAL_SET_SIZE > 0 else False,
        ddp_find_unused_parameters=False if ddp else None,
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False

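# Patch state_dict so that Trainer checkpoints save only the PEFT (LoRA) adapter weights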
old_state_dict = model.state_dict
model.state_dict = (
    lambda self, *_, **__: get_peft_model_state_dict(self, old_state_dict())
).__get__(model, type(model))

if torch.__version__ >= "2" and sys.platform != 'win32':
    model = torch.compile(model)

trainer.train(resume_from_checkpoint = False)
  2. Some time after the first epoch, I run into the following error:
{'loss': 2.2014, 'learning_rate': 0.0002538461538461538, 'epoch': 0.99}
{'loss': 2.24, 'learning_rate': 0.0002492307692307692, 'epoch': 1.0}
{'loss': 2.2383, 'learning_rate': 0.0002446153846153846, 'epoch': 1.01}
112/333 [42:21<1:21:32, 22.14s/it]
Traceback (most recent call last):
  File "/home/kunal/ml/train.py", line 234, in <module>                                                                                                                                            
    trainer.train(resume_from_checkpoint = False)                                                                                                                                                                                    
  File "/home/kunal/miniconda3/envs/lora/lib/python3.10/site-packages/transformers/trainer.py", line 1530, in train                                                                                                                  
    return inner_training_loop(                                                                                                                                                                                                      
  File "/home/kunal/miniconda3/envs/lora/lib/python3.10/site-packages/accelerate/utils/memory.py", line 132, in decorator                                                                                                            
    return function(batch_size, *args, **kwargs)                                                                                                                                                                                     
  File "/home/kunal/miniconda3/envs/lora/lib/python3.10/site-packages/transformers/trainer.py", line 1843, in _inner_training_loop                                                                                                   
    self.accelerator.clip_grad_norm_(                                                                                                                                                                                                
  File "/home/kunal/miniconda3/envs/lora/lib/python3.10/site-packages/accelerate/accelerator.py", line 1913, in clip_grad_norm_                                                                                                      
    self.unscale_gradients()                                                                                                                                                                                                         
  File "/home/kunal/miniconda3/envs/lora/lib/python3.10/site-packages/accelerate/accelerator.py", line 1876, in unscale_gradients                                                                                                    
    self.scaler.unscale_(opt)                                                                                                                                                                                                        
  File "/home/kunal/miniconda3/envs/lora/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 275, in unscale_                                                                                                          
    raise RuntimeError("unscale_() has already been called on this optimizer since the last update().")                                                                                                                              
RuntimeError: unscale_() has already been called on this optimizer since the last update().  
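
For context, unscale_() is only allowed to run once per optimizer between backward() and scaler.update(); a second call for the same step raises exactly this error. Below is a minimal sketch of the intended AMP ordering in plain PyTorch, outside of Trainer; the model, optimizer and loader are placeholders, and a CUDA device is assumed.

import torch

model = torch.nn.Linear(10, 2).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()
accum_steps = 32

for step, (inputs, labels) in enumerate(loader):  # `loader` is a placeholder DataLoader
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.cross_entropy(model(inputs.cuda()), labels.cuda())
    scaler.scale(loss / accum_steps).backward()

    if (step + 1) % accum_steps == 0:
        # unscale_() may run at most once per optimizer before update();
        # calling it again for the same step raises the RuntimeError above.
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        scaler.step(optimizer)   # skipped internally if inf/nan gradients are found
        scaler.update()          # resets the "already unscaled" state for the next step
        optimizer.zero_grad()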

This training works fine on transformers@de9255de27abfcae4a1f816b904915f0b1e23cd9.
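(For reference, one way to pin a setup to that commit is pip install "git+https://github.com/huggingface/transformers@de9255de27abfcae4a1f816b904915f0b1e23cd9".)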

Expected behavior

Training should succeed.

@sgugger (Collaborator) commented Jun 23, 2023

cc @pacman100 and @muellerzr

@pacman100 (Contributor) commented

Hello, this is a duplicate issue; please search the existing issues before opening a new one. This was fixed by PR #24415.

@kunaldeo (Author) commented

Yes, this is fixed. Thanks.
