
Qlora on open llama 13b fails #24245

Closed
2 of 4 tasks
nivibilla opened this issue Jun 13, 2023 · 18 comments
Comments

@nivibilla

nivibilla commented Jun 13, 2023

System Info

Installed with !pip install -q -U git+https://github.com/huggingface/transformers.git on Databricks.

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import transformers

trainer = transformers.Trainer(
    model=peft_model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        save_steps=250,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=5,
        # max_steps=5,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir=models[model_name]['folder_name'],
        optim="paged_adamw_8bit"
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File <command-412498178049036>:21
      3 trainer = transformers.Trainer(
      4     model=peft_model,
      5     train_dataset=data["train"],
   (...)
     18     data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
     19 )
     20 model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
---> 21 trainer.train()

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35a3008b-a999-41db-a8be-1e0597d78a6b/lib/python3.10/site-packages/transformers/trainer.py:1537, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1532     self.model_wrapped = self.model
   1534 inner_training_loop = find_executable_batch_size(
   1535     self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
   1536 )
-> 1537 return inner_training_loop(
   1538     args=args,
   1539     resume_from_checkpoint=resume_from_checkpoint,
   1540     trial=trial,
   1541     ignore_keys_for_eval=ignore_keys_for_eval,
   1542 )

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35a3008b-a999-41db-a8be-1e0597d78a6b/lib/python3.10/site-packages/transformers/trainer.py:1860, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1855         nn.utils.clip_grad_norm_(
   1856             amp.master_params(self.optimizer),
   1857             args.max_grad_norm,
   1858         )
   1859     else:
-> 1860         self.accelerator.clip_grad_norm_(
   1861             model.parameters(),
   1862             args.max_grad_norm,
   1863         )
   1865 # Optimizer step
   1866 optimizer_was_run = True

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35a3008b-a999-41db-a8be-1e0597d78a6b/lib/python3.10/site-packages/accelerate/accelerator.py:1908, in Accelerator.clip_grad_norm_(self, parameters, max_norm, norm_type)
   1904 elif self.distributed_type == DistributedType.DEEPSPEED:
   1905     # `accelerator.backward(loss)` is doing that automatically. Therefore, its implementation is not needed
   1906     # We cannot return the gradient norm because DeepSpeed does it.
   1907     return None
-> 1908 self.unscale_gradients()
   1909 return torch.nn.utils.clip_grad_norm_(parameters, max_norm, norm_type=norm_type)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35a3008b-a999-41db-a8be-1e0597d78a6b/lib/python3.10/site-packages/accelerate/accelerator.py:1871, in Accelerator.unscale_gradients(self, optimizer)
   1869 while isinstance(opt, AcceleratedOptimizer):
   1870     opt = opt.optimizer
-> 1871 self.scaler.unscale_(opt)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35a3008b-a999-41db-a8be-1e0597d78a6b/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:275, in GradScaler.unscale_(self, optimizer)
    272 optimizer_state = self._per_optimizer_states[id(optimizer)]
    274 if optimizer_state["stage"] is OptState.UNSCALED:
--> 275     raise RuntimeError("unscale_() has already been called on this optimizer since the last update().")
    276 elif optimizer_state["stage"] is OptState.STEPPED:
    277     raise RuntimeError("unscale_() is being called after step().")

RuntimeError: unscale_() has already been called on this optimizer since the last update().

Interestingly, it failed at exactly 1 epoch.

Expected behavior

Training should run normally to completion.

@amyeroberts
Collaborator

Hi @nivibilla,

Please make sure to search the issues first, as it's possible the problem has previously been reported and resolved, e.g.:
#24050
#23935

Could you try installing accelerate, peft and transformers from source, and rerunning your script?

pip install git+https://github.com/huggingface/peft.git git+https://github.com/huggingface/transformers.git git+https://github.com/huggingface/accelerate.git
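
To confirm that the from-source installs are the ones actually being picked up, a quick sanity check like the following can help (an illustrative snippet, not an official diagnostic step):

import transformers, peft, accelerate

# Source installs typically report a ".dev0" suffix in their version strings
print("transformers:", transformers.__version__)
print("peft:", peft.__version__)
print("accelerate:", accelerate.__version__)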

@nivibilla
Author

Sorry, my bad. I am already installing from source, so I'm not sure what went wrong. In any case, I will test again and let you know.

@nivibilla
Author

nivibilla commented Jun 13, 2023

I did as you asked, @amyeroberts, and installed from source, but I still get the same error.

@nivibilla
Author

!pip install -q torch==2.0.1 torchvision torchaudio
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q -U git+https://github.com/huggingface/datasets.git
!pip install -q -U einops
!pip install -q -U sentencepiece

@nivibilla
Author

nivibilla commented Jun 14, 2023

It was fixed when I used this particular commit:

!pip install git+https://github.com/huggingface/transformers@de9255de27abfcae4a1f816b904915f0b1e23cd9

Will this commit be merged?

@nivibilla
Author

Note: I am using 4-bit quantisation in training, which may be the cause of the issue, as mentioned in #23935.
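
For context, a 4-bit QLoRA setup like the one described here typically looks roughly like the sketch below; the model id, LoRA hyperparameters and other values are illustrative assumptions, not the exact ones used in this report:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model with 4-bit (NF4) quantisation
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "openlm-research/open_llama_13b",  # assumed base model id
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_13b")

# Make the quantised model trainable and wrap it with LoRA adapters
model = prepare_model_for_kbit_training(model)
peft_model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM"),
)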

@nivibilla
Author

Another issue I have encountered with the commit I tested is that it doesn't save an adapter_config.json for the checkpoints.

@nivibilla
Author

nivibilla commented Jun 14, 2023

Update:

Fixed the adapter_config.json saving issue with the following callback:

import os

from transformers import TrainerCallback

class PeftSavingCallback(TrainerCallback):
    def on_save(self, args, state, control, **kwargs):
        # Save the PEFT adapter (adapter_config.json + adapter weights) into the checkpoint folder
        checkpoint_path = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
        kwargs["model"].save_pretrained(checkpoint_path)

        # Drop the full model weights; only the adapter files are needed
        if "pytorch_model.bin" in os.listdir(checkpoint_path):
            os.remove(os.path.join(checkpoint_path, "pytorch_model.bin"))

However, the original unscale_() error still remains when using the normal installation instead of the particular commit mentioned above.
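
For reference, a callback like the one above is passed to the Trainer through its callbacks argument; a minimal sketch reusing the setup from the reproduction, with placeholder training arguments rather than the exact values used:

trainer = transformers.Trainer(
    model=peft_model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        output_dir="outputs",            # placeholder output directory
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        save_steps=250,
        fp16=True,
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    callbacks=[PeftSavingCallback()],    # writes the adapter files at every checkpoint
)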

@amyeroberts
Collaborator

It was fixed when I used this particular commit

That's great to hear! Peculiar that it didn't work from source though 🤔

Will this commit be merged?

This commit has already been merged, I believe, and is part of the latest release. Could you confirm the version of transformers that was installed when the problem was happening initially?

Another issue I have encountered with the commit I tested is that it doesn't save an adapter_config.json for the checkpoints.

Hmmm... I have no idea about this one. cc @pacman100, who knows a lot more about PEFT and Trainer :)

@nivibilla
Author

Could you confirm the version?

I checked transformers.__version__ and got 4.31.0.dev0.

@richardr1126

I had this same issue today; it always stopped around 1 epoch with the same error. I was trying to fine-tune llama-13b as well, on my own dataset, which I know is correctly formatted.

@richardr1126

I'm using the git source pip install too. Trying !pip install git+https://github.com/huggingface/transformers@de9255de27abfcae4a1f816b904915f0b1e23cd9 is currently working; it's on the second epoch. Thank you @nivibilla!

@amyeroberts
Collaborator

cc @younesbelkada, as you've been working on the related issue.

@nivibilla
Author

@richardr1126 are your checkpoints saving properly? I had to write a custom callback as the adapter_config.json wasn't being written.

@richardr1126

richardr1126 commented Jun 16, 2023

@richardr1126 are your checkpoints saving properly? I had to write a custom callback as the adapter_config.json wasn't being written.

Yeah, I used your PeftSavingCallback (below) and added it to the callbacks param in the Trainer. It created the adapter_config and adapter_model files and saved them into the checkpoint-XXX folder after every save step, which I set to 100. I am using Colab, so I downloaded the adapter_model and config to my local computer, then uploaded them to Hugging Face as a LoRA adapter using the Upload files button on the model repo.

from trl import SFTTrainer
from transformers import TrainerCallback
import os

class PeftSavingCallback(TrainerCallback):
    def on_save(self, args, state, control, **kwargs):
        checkpoint_path = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
        kwargs["model"].save_pretrained(checkpoint_path)

        if "pytorch_model.bin" in os.listdir(checkpoint_path):
            os.remove(os.path.join(checkpoint_path, "pytorch_model.bin"))

trainer = SFTTrainer(
    model=model,
    train_dataset=sql,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=176,
    tokenizer=tokenizer,
    args=training_arguments,
    callbacks=[PeftSavingCallback]
)
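
As a side note, an adapter uploaded to the Hub that way can be attached back onto the base model for inference with PeftModel.from_pretrained; a rough sketch, where the repo ids below are placeholders rather than the actual ones used:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model (full precision or quantised, depending on available memory)
base_model = AutoModelForCausalLM.from_pretrained(
    "openlm-research/open_llama_13b",  # assumed base model id
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_13b")

# Attach the LoRA adapter that was uploaded to the Hub
model = PeftModel.from_pretrained(base_model, "your-username/your-lora-adapter")
model.eval()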

@pacman100
Contributor

Interestingly, it failed at exactly 1 epoch.

Hello @nivibilla, PR #24415 should fix this. Can you confirm the same?

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@nivibilla
Author

I think this works. I haven't tested it though. Will close for now.
