AssertionError: No inf checks were recorded for this optimizer. #4

erjieyong opened this issue Apr 9, 2023 · 16 comments

@erjieyong

First of all, a big thank you for posting the article and YouTube video; it was very insightful!

I've tried to run the code from your article, but I keep hitting the same assertion error. Any advice?

Note that I have been running your code on Colab with the free GPU.
Python version = 3.9.16
CUDA version = 11.8.89

I've also noted the bug that you faced and ran the same workaround (edited for Colab):
cp /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cpu.so

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /usr/lib64-nvidia did not contain libcudart.so as expected! Searching further paths...
warn(msg)
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/sys/fs/cgroup/memory.events /var/colab/cgroup/jupyter-children/memory.events')}
warn(msg)
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('http'), PosixPath('//172.28.0.1'), PosixPath('8013')}
warn(msg)
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('--listen_host=172.28.0.12 --target_host=172.28.0.12 --tunnel_background_save_url=https'), PosixPath('//colab.research.google.com/tun/m/cc48301118ce562b961b3c22d803539adc1e0c19/gpu-t4-s-2ofb2sppvym87 --tunnel_background_save_delay=10s --tunnel_periodic_background_save_frequency=30m0s --enable_output_coalescing=true --output_coalescing_required=true')}
warn(msg)
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/env/python')}
warn(msg)
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('//ipykernel.pylab.backend_inline'), PosixPath('module')}
warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...
2023-04-09 06:56:19.227834: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Training Alpaca-LoRA model with params:
base_model: decapoda-research/llama-7b-hf
data_path: /content/gdrive/MyDrive/LTA_guanacos/alpaca-lora/translated_tasks_de_deepl_4k (1).json
output_dir: ./lora-alpaca
batch_size: 128
micro_batch_size: 4
num_epochs: 3
learning_rate: 0.0003
cutoff_len: 256
val_set_size: 2000
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['q_proj', 'v_proj']
train_on_inputs: True
group_by_length: False
resume_from_checkpoint: None

Overriding torch_dtype=None with torch_dtype=torch.float16 due to requirements of bitsandbytes to enable model loading in mixed int8. Either pass torch_dtype=torch.float16 or don't pass this argument at all to remove this warning.
Loading checkpoint shards: 100% 33/33 [01:11<00:00, 2.15s/it]
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-964d8b7c1c693dbd/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e...
Downloading data files: 100% 1/1 [00:00<00:00, 3246.37it/s]
Extracting data files: 100% 1/1 [00:00<00:00, 64.22it/s]
Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-964d8b7c1c693dbd/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e. Subsequent calls will reuse this data.
100% 1/1 [00:00<00:00, 813.01it/s]
trainable params: 0 || all params: 6755192832 || trainable%: 0.0
/usr/local/lib/python3.9/dist-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
warnings.warn(
0% 0/45 [00:00<?, ?it/s]╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /content/gdrive/MyDrive/LTA_guanacos/alpaca-lora/finetune_language.py:237 in │
│ │
│ │
│ 234 │
│ 235 │
│ 236 if __name__ == "__main__": │
│ ❱ 237 │ fire.Fire(train) │
│ 238 │
│ │
│ /usr/local/lib/python3.9/dist-packages/fire/core.py:141 in Fire │
│ │
│ 138 │ context.update(caller_globals) │
│ 139 │ context.update(caller_locals) │
│ 140 │
│ ❱ 141 component_trace = _Fire(component, args, parsed_flag_args, context, │
│ 142 │
│ 143 if component_trace.HasError(): │
│ 144 │ _DisplayError(component_trace) │
│ │
│ /usr/local/lib/python3.9/dist-packages/fire/core.py:475 in _Fire │
│ │
│ 472 │ is_class = inspect.isclass(component) │
│ 473 │ │
│ 474 │ try: │
│ ❱ 475 │ │ component, remaining_args = _CallAndUpdateTrace( │
│ 476 │ │ │ component, │
│ 477 │ │ │ remaining_args, │
│ 478 │ │ │ component_trace, │
│ │
│ /usr/local/lib/python3.9/dist-packages/fire/core.py:691 in │
│ _CallAndUpdateTrace │
│ │
│ 688 │ loop = asyncio.get_event_loop() │
│ 689 │ component = loop.run_until_complete(fn(*varargs, **kwargs)) │
│ 690 else: │
│ ❱ 691 │ component = fn(*varargs, **kwargs) │
│ 692 │
│ 693 if treatment == 'class': │
│ 694 │ action = trace.INSTANTIATED_CLASS │
│ │
│ /content/gdrive/MyDrive/LTA_guanacos/alpaca-lora/finetune_language.py:206 in │
│ train │
│ │
│ 203 │ if torch.__version__ >= "2" and sys.platform != "win32": │
│ 204 │ │ model = torch.compile(model) │
│ 205 │ │
│ ❱ 206 │ trainer.train(resume_from_checkpoint=resume_from_checkpoint) │
│ 207 │ │
│ 208 │ model.save_pretrained(output_dir) │
│ 209 │
│ │
│ /usr/local/lib/python3.9/dist-packages/transformers/trainer.py:1662 in train │
│ │
│ 1659 │ │ inner_training_loop = find_executable_batch_size( │
│ 1660 │ │ │ self._inner_training_loop, self._train_batch_size, args.a │
│ 1661 │ │ ) │
│ ❱ 1662 │ │ return inner_training_loop( │
│ 1663 │ │ │ args=args, │
│ 1664 │ │ │ resume_from_checkpoint=resume_from_checkpoint, │
│ 1665 │ │ │ trial=trial, │
│ │
│ /usr/local/lib/python3.9/dist-packages/transformers/trainer.py:1991 in │
│ _inner_training_loop │
│ │
│ 1988 │ │ │ │ │ │ │ xm.optimizer_step(self.optimizer) │
│ 1989 │ │ │ │ │ elif self.do_grad_scaling: │
│ 1990 │ │ │ │ │ │ scale_before = self.scaler.get_scale() │
│ ❱ 1991 │ │ │ │ │ │ self.scaler.step(self.optimizer) │
│ 1992 │ │ │ │ │ │ self.scaler.update() │
│ 1993 │ │ │ │ │ │ scale_after = self.scaler.get_scale() │
│ 1994 │ │ │ │ │ │ optimizer_was_run = scale_before <= scale_aft │
│ │
│ /usr/local/lib/python3.9/dist-packages/torch/cuda/amp/grad_scaler.py:368 in │
│ step │
│ │
│ 365 │ │ if optimizer_state["stage"] is OptState.READY: │
│ 366 │ │ │ self.unscale_(optimizer) │
│ 367 │ │ │
│ ❱ 368 │ │ assert len(optimizer_state["found_inf_per_device"]) > 0, "No i │
│ 369 │ │ │
│ 370 │ │ retval = self._maybe_opt_step(optimizer, optimizer_state, *arg │
│ 371 │
╰──────────────────────────────────────────────────────────────────────────────╯
AssertionError: No inf checks were recorded for this optimizer.

@Sekhar-jami

Facing the same error, were you able to resolve the issue?

@diogopublio

Same error here.
Any tips on fixing it would be great.

@nishantb06

Same error. Possible solutions?

@d4nielmeyer

Same error. Deep gratitude for any ideas.

@Kraegge

Kraegge commented Apr 23, 2023

I have the same problem.

@erjieyong
Author

erjieyong commented Apr 24, 2023

Based on the example code given, I believe most of you would have seen the same log message showing that 0 parameters are trainable, which is what triggers the inf-check error:

trainable params: 0 || all params: 6755192832 || trainable%: 0.0

I managed to find another way to load and train from an existing adapter by using resume_from_checkpoint in the original alpaca-lora repo.

What you need to do is:

  1. Use the original alpaca-lora repo instead.
  2. Instead of adjusting any code in finetune.py, just pass your downloaded adapter path in as a parameter. Example:
python finetune.py \
    --base_model='decapoda-research/llama-7b-hf' \
    --num_epochs=10 \
    --cutoff_len=512 \
    --group_by_length \
    --output_dir='./lora-alpaca' \
    --lora_target_modules='[q_proj,k_proj,v_proj,o_proj]' \
    --lora_r=16 \
    --micro_batch_size=8 \
    --resume_from_checkpoint='./alpaca-lora-7b'
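
As a quick sanity check before trainer.train(), a minimal sketch in plain PyTorch (not part of the original command; model here is assumed to be the PEFT-wrapped model built in finetune.py):

# If the trainable count is 0, the GradScaler later fails with
# "No inf checks were recorded for this optimizer."
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable} || all params: {total} || trainable%: {100 * trainable / total:.4f}")
assert trainable > 0, "no trainable parameters - check how the LoRA adapter was attached"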

@AzizovAlisher

@erjieyong not working; it starts fine-tuning LLaMA instead of Alpaca. Any other ways?

@AzizovAlisher

@erjieyong but I have to point out that it does resolve the problem of 0 trainable params, just not in the right way.

@seyyedaliayati

(quoting @erjieyong's workaround above)

Do you mean replacing the base model with Alpaca? Because this command fine-tunes LLaMA. @erjieyong


@d4nielmeyer

Finally I solved it by calling model = get_peft_model(model, config) after model = PeftModel.from_pretrained(model, LORA_WEIGHTS, torch_dtype=torch.float16) and config = LoraConfig(...). So don't comment out the config. Worked quite well for me.

@fredi-python

fredi-python commented May 9, 2023

@d4nielmeyer could you make a pull request?

@fredi-python

@d4nielmeyer or could you please post the code here?

@d4nielmeyer

finetune.py

[...]
model = LlamaForCausalLM.from_pretrained(
        base_model,
        load_in_8bit=True,
        torch_dtype=torch.float16,
        device_map=device_map,
    )
[...]
# prepare the 8-bit base model for training
model = prepare_model_for_int8_training(model)

# load the existing Alpaca-LoRA adapter weights on top of the base model
LORA_WEIGHTS = "tloen/alpaca-lora-7b"
model = PeftModel.from_pretrained(
        model,
        LORA_WEIGHTS,
        torch_dtype=torch.float16,
    )

# then attach a fresh LoRA config so the model ends up with trainable LoRA parameters
config = LoraConfig(
        r=lora_r,
        lora_alpha=lora_alpha,
        target_modules=lora_target_modules,
        lora_dropout=lora_dropout,
        bias="none",
        task_type="CAUSAL_LM",
    )
model = get_peft_model(model, config)
[...]

@seyyedaliayati

(quoting @d4nielmeyer's fix above)

@d4nielmeyer Thanks! Issue solved for me.

@AhmedSSoliman

AhmedSSoliman commented Jun 30, 2023

In the parameters inside LoraConfig, you may have set inference_mode=True.
Change inference_mode to False, as in the following example:

config = LoraConfig(
    peft_type="LORA",
    r=8,
    lora_alpha=32,
    inference_mode=False,
    target_modules=["q_proj", "v_proj", "out_proj", "fc1", "fc2", "lm_head"],
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_2_SEQ_LM",
)
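
To apply a config like this, the usual PEFT pattern is get_peft_model followed by a trainable-parameter check (a minimal sketch, not from the original comment; model is assumed to be an already loaded base model):

from peft import get_peft_model

model = get_peft_model(model, config)  # config from the snippet above
model.print_trainable_parameters()     # should report a non-zero trainable count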
