
RuntimeError: unscale_() has already been called on this optimizer since the last update(). #24449

Closed · 2 of 4 tasks

kunaldeo opened this issue Jun 23, 2023 · 3 comments

@kunaldeo

System Info

  • transformers version: 4.31.0.dev0
  • accelerate version: 0.21.0.dev0
  • peft version: 0.4.0.dev0
  • Platform: Linux-6.3.9-zen1-1-zen-x86_64-with-glibc2.37
  • Python version: 3.10.11
  • Huggingface_hub version: 0.15.1
  • Safetensors version: 0.3.1
  • PyTorch version (GPU?): 2.0.1+cu118 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Run LoRA training:
# imports assumed by this snippet; model, tokenizer, train_data, val_data, ddp and the
# UPPER_CASE constants are defined earlier in the script
import sys

import torch
import transformers
from peft import get_peft_model_state_dict
from transformers import Trainer, TrainingArguments

trainer = Trainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=val_data,
    args=TrainingArguments(
        per_device_train_batch_size=4,
        auto_find_batch_size=True,
        gradient_accumulation_steps=32,
        warmup_steps=100,
        num_train_epochs=EPOCHS,
        learning_rate=LEARNING_RATE,
        fp16=True,
        logging_steps=1,
        evaluation_strategy="steps" if VAL_SET_SIZE > 0 else "no",
        save_strategy="steps",
        eval_steps=50 if VAL_SET_SIZE > 0 else None,
        save_steps=500,
        output_dir=OUTPUT_DIR, #output_dir=repository_id,
        save_total_limit=3,
        load_best_model_at_end=True if VAL_SET_SIZE > 0 else False,
        ddp_find_unused_parameters=False if ddp else None,
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False

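# Patch state_dict so that Trainer checkpoints save only the PEFT (LoRA) adapter weights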
old_state_dict = model.state_dict
model.state_dict = (
    lambda self, *_, **__: get_peft_model_state_dict(self, old_state_dict())
).__get__(model, type(model))

if torch.__version__ >= "2" and sys.platform != 'win32':
    model = torch.compile(model)

trainer.train(resume_from_checkpoint = False)
  2. Some time after the first epoch, I run into the following error:
{'loss': 2.2014, 'learning_rate': 0.0002538461538461538, 'epoch': 0.99}
{'loss': 2.24, 'learning_rate': 0.0002492307692307692, 'epoch': 1.0}
{'loss': 2.2383, 'learning_rate': 0.0002446153846153846, 'epoch': 1.01}
112/333 [42:21<1:21:32, 22.14s/it]
Traceback (most recent call last):
  File "/home/kunal/ml/train.py", line 234, in <module>                                                                                                                                            
    trainer.train(resume_from_checkpoint = False)                                                                                                                                                                                    
  File "/home/kunal/miniconda3/envs/lora/lib/python3.10/site-packages/transformers/trainer.py", line 1530, in train                                                                                                                  
    return inner_training_loop(                                                                                                                                                                                                      
  File "/home/kunal/miniconda3/envs/lora/lib/python3.10/site-packages/accelerate/utils/memory.py", line 132, in decorator                                                                                                            
    return function(batch_size, *args, **kwargs)                                                                                                                                                                                     
  File "/home/kunal/miniconda3/envs/lora/lib/python3.10/site-packages/transformers/trainer.py", line 1843, in _inner_training_loop                                                                                                   
    self.accelerator.clip_grad_norm_(                                                                                                                                                                                                
  File "/home/kunal/miniconda3/envs/lora/lib/python3.10/site-packages/accelerate/accelerator.py", line 1913, in clip_grad_norm_                                                                                                      
    self.unscale_gradients()                                                                                                                                                                                                         
  File "/home/kunal/miniconda3/envs/lora/lib/python3.10/site-packages/accelerate/accelerator.py", line 1876, in unscale_gradients                                                                                                    
    self.scaler.unscale_(opt)                                                                                                                                                                                                        
  File "/home/kunal/miniconda3/envs/lora/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 275, in unscale_                                                                                                          
    raise RuntimeError("unscale_() has already been called on this optimizer since the last update().")                                                                                                                              
RuntimeError: unscale_() has already been called on this optimizer since the last update().  
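
For context, unscale_() is only allowed to run once per optimizer between backward() and scaler.update(); a second call for the same step raises exactly this error. Below is a minimal sketch of the intended AMP ordering in plain PyTorch, outside of Trainer; the model, optimizer and loader are placeholders, and a CUDA device is assumed.

import torch

model = torch.nn.Linear(10, 2).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()
accum_steps = 32

for step, (inputs, labels) in enumerate(loader):  # `loader` is a placeholder DataLoader
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.cross_entropy(model(inputs.cuda()), labels.cuda())
    scaler.scale(loss / accum_steps).backward()

    if (step + 1) % accum_steps == 0:
        # unscale_() may run at most once per optimizer before update();
        # calling it again for the same step raises the RuntimeError above.
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        scaler.step(optimizer)   # skipped internally if inf/nan gradients are found
        scaler.update()          # resets the "already unscaled" state for the next step
        optimizer.zero_grad()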

This training works fine on transformers@de9255de27abfcae4a1f816b904915f0b1e23cd9.
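(For reference, one way to pin a setup to that commit is pip install "git+https://github.com/huggingface/transformers@de9255de27abfcae4a1f816b904915f0b1e23cd9".)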

Expected behavior

Training should succeed.

@sgugger (Collaborator) commented Jun 23, 2023

cc @pacman100 and @muellerzr

@pacman100 (Contributor) commented

Hello, this is a duplicate issue; please search the existing issues before opening a new one. This was fixed by PR #24415.

@kunaldeo (Author) commented

Yes, this is fixed. Thanks.
