AssertionError: No inf checks were recorded for this optimizer. #4

erjieyong opened this issue Apr 9, 2023 · 16 comments

@erjieyong

First of all, a big thank you for posting the article and YouTube video; it was very insightful!

I've tried to run the code from your article, but I keep hitting the same assertion error. Any advice?

Note that I have been running your code on Colab with the free GPU.
Python version = 3.9.16
CUDA version = 11.8.89

I've also noted the bug that you faced and ran the same workaround (edited for Colab):
cp /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cpu.so

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /usr/lib64-nvidia did not contain libcudart.so as expected! Searching further paths...
warn(msg)
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/sys/fs/cgroup/memory.events /var/colab/cgroup/jupyter-children/memory.events')}
warn(msg)
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('http'), PosixPath('//172.28.0.1'), PosixPath('8013')}
warn(msg)
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('--listen_host=172.28.0.12 --target_host=172.28.0.12 --tunnel_background_save_url=https'), PosixPath('//colab.research.google.com/tun/m/cc48301118ce562b961b3c22d803539adc1e0c19/gpu-t4-s-2ofb2sppvym87 --tunnel_background_save_delay=10s --tunnel_periodic_background_save_frequency=30m0s --enable_output_coalescing=true --output_coalescing_required=true')}
warn(msg)
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/env/python')}
warn(msg)
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('//ipykernel.pylab.backend_inline'), PosixPath('module')}
warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...
2023-04-09 06:56:19.227834: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Training Alpaca-LoRA model with params:
base_model: decapoda-research/llama-7b-hf
data_path: /content/gdrive/MyDrive/LTA_guanacos/alpaca-lora/translated_tasks_de_deepl_4k (1).json
output_dir: ./lora-alpaca
batch_size: 128
micro_batch_size: 4
num_epochs: 3
learning_rate: 0.0003
cutoff_len: 256
val_set_size: 2000
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['q_proj', 'v_proj']
train_on_inputs: True
group_by_length: False
resume_from_checkpoint: None

Overriding torch_dtype=None with torch_dtype=torch.float16 due to requirements of bitsandbytes to enable model loading in mixed int8. Either pass torch_dtype=torch.float16 or don't pass this argument at all to remove this warning.
Loading checkpoint shards: 100% 33/33 [01:11<00:00, 2.15s/it]
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-964d8b7c1c693dbd/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e...
Downloading data files: 100% 1/1 [00:00<00:00, 3246.37it/s]
Extracting data files: 100% 1/1 [00:00<00:00, 64.22it/s]
Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-964d8b7c1c693dbd/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e. Subsequent calls will reuse this data.
100% 1/1 [00:00<00:00, 813.01it/s]
trainable params: 0 || all params: 6755192832 || trainable%: 0.0
/usr/local/lib/python3.9/dist-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
warnings.warn(
0% 0/45 [00:00<?, ?it/s]╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /content/gdrive/MyDrive/LTA_guanacos/alpaca-lora/finetune_language.py:237 in │
│ │
│ │
│ 234 │
│ 235 │
│ 236 if __name__ == "__main__": │
│ ❱ 237 │ fire.Fire(train) │
│ 238 │
│ │
│ /usr/local/lib/python3.9/dist-packages/fire/core.py:141 in Fire │
│ │
│ 138 │ context.update(caller_globals) │
│ 139 │ context.update(caller_locals) │
│ 140 │
│ ❱ 141 component_trace = _Fire(component, args, parsed_flag_args, context, │
│ 142 │
│ 143 if component_trace.HasError(): │
│ 144 │ _DisplayError(component_trace) │
│ │
│ /usr/local/lib/python3.9/dist-packages/fire/core.py:475 in _Fire │
│ │
│ 472 │ is_class = inspect.isclass(component) │
│ 473 │ │
│ 474 │ try: │
│ ❱ 475 │ │ component, remaining_args = _CallAndUpdateTrace( │
│ 476 │ │ │ component, │
│ 477 │ │ │ remaining_args, │
│ 478 │ │ │ component_trace, │
│ │
│ /usr/local/lib/python3.9/dist-packages/fire/core.py:691 in │
│ _CallAndUpdateTrace │
│ │
│ 688 │ loop = asyncio.get_event_loop() │
│ 689 │ component = loop.run_until_complete(fn(*varargs, **kwargs)) │
│ 690 else: │
│ ❱ 691 │ component = fn(*varargs, **kwargs) │
│ 692 │
│ 693 if treatment == 'class': │
│ 694 │ action = trace.INSTANTIATED_CLASS │
│ │
│ /content/gdrive/MyDrive/LTA_guanacos/alpaca-lora/finetune_language.py:206 in │
│ train │
│ │
│ 203 │ if torch.__version__ >= "2" and sys.platform != "win32": │
│ 204 │ │ model = torch.compile(model) │
│ 205 │ │
│ ❱ 206 │ trainer.train(resume_from_checkpoint=resume_from_checkpoint) │
│ 207 │ │
│ 208 │ model.save_pretrained(output_dir) │
│ 209 │
│ │
│ /usr/local/lib/python3.9/dist-packages/transformers/trainer.py:1662 in train │
│ │
│ 1659 │ │ inner_training_loop = find_executable_batch_size( │
│ 1660 │ │ │ self._inner_training_loop, self._train_batch_size, args.a │
│ 1661 │ │ ) │
│ ❱ 1662 │ │ return inner_training_loop( │
│ 1663 │ │ │ args=args, │
│ 1664 │ │ │ resume_from_checkpoint=resume_from_checkpoint, │
│ 1665 │ │ │ trial=trial, │
│ │
│ /usr/local/lib/python3.9/dist-packages/transformers/trainer.py:1991 in │
│ _inner_training_loop │
│ │
│ 1988 │ │ │ │ │ │ │ xm.optimizer_step(self.optimizer) │
│ 1989 │ │ │ │ │ elif self.do_grad_scaling: │
│ 1990 │ │ │ │ │ │ scale_before = self.scaler.get_scale() │
│ ❱ 1991 │ │ │ │ │ │ self.scaler.step(self.optimizer) │
│ 1992 │ │ │ │ │ │ self.scaler.update() │
│ 1993 │ │ │ │ │ │ scale_after = self.scaler.get_scale() │
│ 1994 │ │ │ │ │ │ optimizer_was_run = scale_before <= scale_aft │
│ │
│ /usr/local/lib/python3.9/dist-packages/torch/cuda/amp/grad_scaler.py:368 in │
│ step │
│ │
│ 365 │ │ if optimizer_state["stage"] is OptState.READY: │
│ 366 │ │ │ self.unscale_(optimizer) │
│ 367 │ │ │
│ ❱ 368 │ │ assert len(optimizer_state["found_inf_per_device"]) > 0, "No i │
│ 369 │ │ │
│ 370 │ │ retval = self._maybe_opt_step(optimizer, optimizer_state, *arg │
│ 371 │
╰──────────────────────────────────────────────────────────────────────────────╯
AssertionError: No inf checks were recorded for this optimizer.

@Sekhar-jami

Facing the same error, were you able to resolve the issue?

@diogopublio

Same error here.
Any tips on fixing it would be great.

@nishantb06

Same error. Possible solutions?

@d4nielmeyer

Same error. Deep gratitude for any ideas.

@Kraegge

Kraegge commented Apr 23, 2023

I have the same problem.

@erjieyong
Author

erjieyong commented Apr 24, 2023

Based on the example code given, I believe most of you would have seen the same log message showing that 0 parameters are trainable, which is what triggers the inf-check error:

trainable params: 0 || all params: 6755192832 || trainable%: 0.0

I managed to find another way to load and train from an existing adapter by using resume_from_checkpoint in the original alpaca-lora repo.

What you need to do is:

  1. Use the original alpaca-lora repo instead.
  2. Instead of adjusting any code in finetune.py, just pass your downloaded adapter path in as a parameter. Example:
python finetune.py \
    --base_model='decapoda-research/llama-7b-hf' \
    --num_epochs=10 \
    --cutoff_len=512 \
    --group_by_length \
    --output_dir='./lora-alpaca' \
    --lora_target_modules='[q_proj,k_proj,v_proj,o_proj]' \
    --lora_r=16 \
    --micro_batch_size=8 \
    --resume_from_checkpoint='./alpaca-lora-7b'
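
As a quick sanity check before trainer.train(), a minimal sketch in plain PyTorch (not part of the original command; model here is assumed to be the PEFT-wrapped model built in finetune.py):

# If the trainable count is 0, the GradScaler later fails with
# "No inf checks were recorded for this optimizer."
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable} || all params: {total} || trainable%: {100 * trainable / total:.4f}")
assert trainable > 0, "no trainable parameters - check how the LoRA adapter was attached"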

@AzizovAlisher

@erjieyong not working; it starts fine-tuning LLaMA instead of Alpaca. Any other ways?

@AzizovAlisher

@erjieyong but I have to point out that it does resolve the problem of 0 trainable params, just not in the right way.

@seyyedaliayati

(quoting @erjieyong's workaround above)

Do you mean replacing the base model with Alpaca? Because this command fine-tunes LLaMA. @erjieyong


@d4nielmeyer

Finally I solved it by calling model = get_peft_model(model, config) after model = PeftModel.from_pretrained(model, LORA_WEIGHTS, torch_dtype=torch.float16) and config = LoraConfig(...). So don't comment out the config. Worked quite well for me.

@fredi-python

fredi-python commented May 9, 2023

@d4nielmeyer could you make a pull request?

@fredi-python

@d4nielmeyer or could you please post the code here?

@d4nielmeyer

finetune.py

[...]
model = LlamaForCausalLM.from_pretrained(
        base_model,
        load_in_8bit=True,
        torch_dtype=torch.float16,
        device_map=device_map,
    )
[...]
# prepare the 8-bit base model for training
model = prepare_model_for_int8_training(model)

# load the existing Alpaca-LoRA adapter weights on top of the base model
LORA_WEIGHTS = "tloen/alpaca-lora-7b"
model = PeftModel.from_pretrained(
        model,
        LORA_WEIGHTS,
        torch_dtype=torch.float16,
    )

# then attach a fresh LoRA config so the model ends up with trainable LoRA parameters
config = LoraConfig(
        r=lora_r,
        lora_alpha=lora_alpha,
        target_modules=lora_target_modules,
        lora_dropout=lora_dropout,
        bias="none",
        task_type="CAUSAL_LM",
    )
model = get_peft_model(model, config)
[...]

@seyyedaliayati

(quoting @d4nielmeyer's fix above)

@d4nielmeyer Thanks! Issue solved for me.

@AhmedSSoliman

AhmedSSoliman commented Jun 30, 2023

In the parameters inside LoraConfig, you may have set inference_mode=True.
Change inference_mode to False, as in the following example:

config = LoraConfig(
    peft_type="LORA",
    r=8,
    lora_alpha=32,
    inference_mode=False,
    target_modules=["q_proj", "v_proj", "out_proj", "fc1", "fc2", "lm_head"],
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_2_SEQ_LM",
)
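
To apply a config like this, the usual PEFT pattern is get_peft_model followed by a trainable-parameter check (a minimal sketch, not from the original comment; model is assumed to be an already loaded base model):

from peft import get_peft_model

model = get_peft_model(model, config)  # config from the snippet above
model.print_trainable_parameters()     # should report a non-zero trainable count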
