
Qlora on open llama 13b fails #24245

Closed
2 of 4 tasks
nivibilla opened this issue Jun 13, 2023 · 18 comments
Comments

@nivibilla

nivibilla commented Jun 13, 2023

System Info

Installed with !pip install -q -U git+https://github.com/huggingface/transformers.git on Databricks.

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import transformers

trainer = transformers.Trainer(
    model=peft_model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        save_steps=250,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=5,
        # max_steps=5,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir=models[model_name]['folder_name'],
        optim="paged_adamw_8bit"
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File <command-412498178049036>:21
      3 trainer = transformers.Trainer(
      4     model=peft_model,
      5     train_dataset=data["train"],
   (...)
     18     data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
     19 )
     20 model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
---> 21 trainer.train()

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35a3008b-a999-41db-a8be-1e0597d78a6b/lib/python3.10/site-packages/transformers/trainer.py:1537, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1532     self.model_wrapped = self.model
   1534 inner_training_loop = find_executable_batch_size(
   1535     self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
   1536 )
-> 1537 return inner_training_loop(
   1538     args=args,
   1539     resume_from_checkpoint=resume_from_checkpoint,
   1540     trial=trial,
   1541     ignore_keys_for_eval=ignore_keys_for_eval,
   1542 )

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35a3008b-a999-41db-a8be-1e0597d78a6b/lib/python3.10/site-packages/transformers/trainer.py:1860, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1855         nn.utils.clip_grad_norm_(
   1856             amp.master_params(self.optimizer),
   1857             args.max_grad_norm,
   1858         )
   1859     else:
-> 1860         self.accelerator.clip_grad_norm_(
   1861             model.parameters(),
   1862             args.max_grad_norm,
   1863         )
   1865 # Optimizer step
   1866 optimizer_was_run = True

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35a3008b-a999-41db-a8be-1e0597d78a6b/lib/python3.10/site-packages/accelerate/accelerator.py:1908, in Accelerator.clip_grad_norm_(self, parameters, max_norm, norm_type)
   1904 elif self.distributed_type == DistributedType.DEEPSPEED:
   1905     # `accelerator.backward(loss)` is doing that automatically. Therefore, its implementation is not needed
   1906     # We cannot return the gradient norm because DeepSpeed does it.
   1907     return None
-> 1908 self.unscale_gradients()
   1909 return torch.nn.utils.clip_grad_norm_(parameters, max_norm, norm_type=norm_type)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35a3008b-a999-41db-a8be-1e0597d78a6b/lib/python3.10/site-packages/accelerate/accelerator.py:1871, in Accelerator.unscale_gradients(self, optimizer)
   1869 while isinstance(opt, AcceleratedOptimizer):
   1870     opt = opt.optimizer
-> 1871 self.scaler.unscale_(opt)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35a3008b-a999-41db-a8be-1e0597d78a6b/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:275, in GradScaler.unscale_(self, optimizer)
    272 optimizer_state = self._per_optimizer_states[id(optimizer)]
    274 if optimizer_state["stage"] is OptState.UNSCALED:
--> 275     raise RuntimeError("unscale_() has already been called on this optimizer since the last update().")
    276 elif optimizer_state["stage"] is OptState.STEPPED:
    277     raise RuntimeError("unscale_() is being called after step().")

RuntimeError: unscale_() has already been called on this optimizer since the last update().

Interestingly, it failed at exactly 1 epoch.

Expected behavior

Training should run normally to completion.

@amyeroberts
Collaborator

Hi @nivibilla,

Please make sure to search the issues first, as it's possible the problem has previously been reported and resolved, e.g.:
#24050
#23935

Could you try installing accelerate, peft and transformers from source, and rerunning your script?

pip install git+https://github.com/huggingface/peft.git git+https://github.com/huggingface/transformers.git git+https://github.com/huggingface/accelerate.git
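
To confirm that the from-source installs are the ones actually being picked up, a quick sanity check like the following can help (an illustrative snippet, not an official diagnostic step):

import transformers, peft, accelerate

# Source installs typically report a ".dev0" suffix in their version strings
print("transformers:", transformers.__version__)
print("peft:", peft.__version__)
print("accelerate:", accelerate.__version__)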

@nivibilla
Author

Sorry, my bad. I am already installing from source, so I'm not sure what went wrong. In any case, I will test again and let you know.

@nivibilla
Author

nivibilla commented Jun 13, 2023

I did as you asked, @amyeroberts, and installed from source, but I still get the same error.

@nivibilla
Author

!pip install -q torch==2.0.1 torchvision torchaudio
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q -U git+https://github.com/huggingface/datasets.git
!pip install -q -U einops
!pip install -q -U sentencepiece

@nivibilla
Author

nivibilla commented Jun 14, 2023

It was fixed when I used this particular commit:

!pip install git+https://github.com/huggingface/transformers@de9255de27abfcae4a1f816b904915f0b1e23cd9

Will this commit be merged?

@nivibilla
Author

Note: I am using 4-bit quantisation in training, which may be the cause of the issue, as mentioned in #23935.
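
For context, a 4-bit QLoRA setup like the one described here typically looks roughly like the sketch below; the model id, LoRA hyperparameters and other values are illustrative assumptions, not the exact ones used in this report:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model with 4-bit (NF4) quantisation
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "openlm-research/open_llama_13b",  # assumed base model id
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_13b")

# Make the quantised model trainable and wrap it with LoRA adapters
model = prepare_model_for_kbit_training(model)
peft_model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM"),
)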

@nivibilla
Author

Another issue I have encountered with the commit I tested is that it doesn't save an adapter_config.json for the checkpoints.

@nivibilla
Author

nivibilla commented Jun 14, 2023

Update:

Fixed the adapter_config.json saving issue with the following callback:

import os

from transformers import TrainerCallback

class PeftSavingCallback(TrainerCallback):
    def on_save(self, args, state, control, **kwargs):
        # Save the PEFT adapter (adapter_config.json + adapter weights) into the checkpoint folder
        checkpoint_path = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
        kwargs["model"].save_pretrained(checkpoint_path)

        # Drop the full model weights; only the adapter files are needed
        if "pytorch_model.bin" in os.listdir(checkpoint_path):
            os.remove(os.path.join(checkpoint_path, "pytorch_model.bin"))

However, the original unscale_() error still remains when using the normal installation instead of the particular commit mentioned above.
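
For reference, a callback like the one above is passed to the Trainer through its callbacks argument; a minimal sketch reusing the setup from the reproduction, with placeholder training arguments rather than the exact values used:

trainer = transformers.Trainer(
    model=peft_model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        output_dir="outputs",            # placeholder output directory
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        save_steps=250,
        fp16=True,
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    callbacks=[PeftSavingCallback()],    # writes the adapter files at every checkpoint
)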

@amyeroberts
Collaborator

It was fixed when I used this particular commit

That's great to hear! Peculiar that it didn't work from source though 🤔

Will this commit be merged?

This commit has already been merged, I believe, and is part of the latest release. Could you confirm the version of transformers that was installed when the problem was happening initially?

Another issue I have encountered with the commit I tested is that it doesn't save an adapter_config.json for the checkpoints.

Hmmm... I have no idea about this one. cc @pacman100, who knows a lot more about PEFT and Trainer :)

@nivibilla
Author

Could you confirm the version?

I checked transformers.__version__ and got 4.31.0.dev0.

@richardr1126

I had this same issue today; it always stopped around 1 epoch with the same error. I was trying to fine-tune llama-13b as well, on my own dataset, which I know is correctly formatted.

@richardr1126

I'm using the git source pip install too. Trying !pip install git+https://github.com/huggingface/transformers@de9255de27abfcae4a1f816b904915f0b1e23cd9 is currently working; it's on the second epoch. Thank you @nivibilla!

@amyeroberts
Collaborator

cc @younesbelkada, as you've been working on the related issue.

@nivibilla
Author

@richardr1126 are your checkpoints saving properly? I had to write a custom callback as the adapter_config.json wasn't being written.

@richardr1126

richardr1126 commented Jun 16, 2023

@richardr1126 are your checkpoints saving properly? I had to write a custom callback as the adapter_config.json wasn't being written.

Yeah, I used your PeftSavingCallback (below) and added it to the callbacks param in the Trainer. It created the adapter_config and adapter_model files and saved them into the checkpoint-XXX folder after every save step, which I set to 100. I am using Colab, so I downloaded the adapter_model and config to my local computer, then uploaded them to Hugging Face as a LoRA adapter using the Upload files button on the model repo.

from trl import SFTTrainer
from transformers import TrainerCallback
import os

class PeftSavingCallback(TrainerCallback):
    def on_save(self, args, state, control, **kwargs):
        checkpoint_path = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
        kwargs["model"].save_pretrained(checkpoint_path)

        if "pytorch_model.bin" in os.listdir(checkpoint_path):
            os.remove(os.path.join(checkpoint_path, "pytorch_model.bin"))

trainer = SFTTrainer(
    model=model,
    train_dataset=sql,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=176,
    tokenizer=tokenizer,
    args=training_arguments,
    callbacks=[PeftSavingCallback]
)
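
As a side note, an adapter uploaded to the Hub that way can be attached back onto the base model for inference with PeftModel.from_pretrained; a rough sketch, where the repo ids below are placeholders rather than the actual ones used:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model (full precision or quantised, depending on available memory)
base_model = AutoModelForCausalLM.from_pretrained(
    "openlm-research/open_llama_13b",  # assumed base model id
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_13b")

# Attach the LoRA adapter that was uploaded to the Hub
model = PeftModel.from_pretrained(base_model, "your-username/your-lora-adapter")
model.eval()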

@pacman100
Contributor

Interestingly, it failed at exactly 1 epoch.

Hello @nivibilla, PR #24415 should fix this. Can you confirm the same?

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@nivibilla
Author

I think this works. I haven't tested it though. Will close for now.
