fix double bos for vision model #840

Merged
merged 2 commits into main from fix_double_bos on Jan 14, 2025
Conversation

wukaixingxp (Contributor) commented Jan 14, 2025

What does this PR do?

This PR fixes the double-BOS-token issue for vision models: the BOS token is already included in the chat template, yet the processor adds it again. This affected both inference and fine-tuning.

Fixes Issue #826

Feature/Issue validation/testing

Please describe the tests that you ran to verify your changes and summarize the relevant results. Provide instructions so they can be reproduced.
Please also list any relevant details for your test configuration.

  • Inference Test
python recipes/quickstart/inference/local_inference/multi_modal_infer.py \
    --image_path ~/work/dog.jpg \
    --prompt_text "Describe this image" \
    --model_name "meta-llama/Llama-3.2-11B-Vision-Instruct"
Loading model: meta-llama/Llama-3.2-11B-Vision-Instruct
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:04<00:00,  1.08it/s]
Input Prompt:
 <|begin_of_text|><|start_header_id|>user<|end_header_id|>

<|image|>Describe this image<|eot_id|><|start_header_id|>assistant<|end_header_id|>


Generated Text: This image features a small dog, likely a puppy, standing on a skateboard in the middle of a road or street. The dog's coat is predominantly brown and white, with distinctive black markings on its back and legs. Its floppy ears and dark eyes are prominent features.

The skateboard, which the dog is standing on, is black with red wheels. In the background, a blue door is visible, although out of focus. The overall atmosphere of the image suggests that the dog is being showcased in a humorous or playful manner, possibly as part of a meme or joke.<|eot_id|>
  • Finetuning test
[~/work/llama-recipes (fix_double_bos)]$ torchrun --nnodes 1 --nproc_per_node 4 \
    recipes/quickstart/finetuning/finetuning.py --enable_fsdp --lr 1e-5 --num_epochs 3 \
    --batch_size_training 2 --model_name meta-llama/Llama-3.2-11B-Vision-Instruct \
    --dist_checkpoint_root_folder ./finetuned_model --dist_checkpoint_folder fine-tuned \
    --use_fast_kernels --dataset "custom_dataset" --custom_dataset.test_split "test" \
    --custom_dataset.file "recipes/quickstart/finetuning/datasets/ocrvqa_dataset.py" \
    --run_validation True --batching_strategy padding --use_peft --peft_method lora
W0113 17:47:42.792000 140312926217216 torch/distributed/run.py:757]
W0113 17:47:42.792000 140312926217216 torch/distributed/run.py:757] *****************************************
W0113 17:47:42.792000 140312926217216 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0113 17:47:42.792000 140312926217216 torch/distributed/run.py:757] *****************************************
Clearing GPU cache for all ranks
--> Running with torch dist debug set to detail
Loading checkpoint shards: 100%|████████████| 5/5 [00:00<00:00,  7.62it/s]
Loading checkpoint shards: 100%|████████████| 5/5 [00:00<00:00,  7.47it/s]
Loading checkpoint shards: 100%|████████████| 5/5 [00:00<00:00,  6.76it/s]
Loading checkpoint shards: 100%|████████████| 5/5 [00:01<00:00,  4.91it/s]
--> Model meta-llama/Llama-3.2-11B-Vision-Instruct

--> meta-llama/Llama-3.2-11B-Vision-Instruct has 10670.220835 Million params

trainable params: 5,898,240 || all params: 10,676,119,075 || trainable%: 0.0552470420998934
bFloat16 enabled for mixed precision - using bfSixteen policy
trainable params: 5,898,240 || all params: 10,676,119,075 || trainable%: 0.0552470420998934
trainable params: 5,898,240 || all params: 10,676,119,075 || trainable%: 0.0552470420998934
trainable params: 5,898,240 || all params: 10,676,119,075 || trainable%: 0.0552470420998934
--> applying fsdp activation checkpointing...
--> applying fsdp activation checkpointing...
--> applying fsdp activation checkpointing...
--> applying fsdp activation checkpointing...
--> Training Set Length = 1800
--> Validation Set Length = 200
length of dataset_train 1800
custom_data_collator is used
--> Num of Training Set Batches loaded = 225
length of dataset_train 1800
custom_data_collator is used
--> Num of Training Set Batches loaded = 225
length of dataset_train 1800
custom_data_collator is used
--> Num of Training Set Batches loaded = 225
length of dataset_train 1800
custom_data_collator is used
--> Num of Training Set Batches loaded = 225
--> Num of Validation Set Batches loaded = 50
--> Num of Validation Set Batches loaded = 50
Starting epoch 0/3
train_config.max_train_step: 0
--> Num of Validation Set Batches loaded = 50
--> Num of Validation Set Batches loaded = 50
Starting epoch 0/3
train_config.max_train_step: 0
/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/torch/cuda/memory.py:330: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(
Training Epoch: 1:   0%|                          | 0/225 [00:00<?, ?it/s]--> Num of Validation Set Batches loaded = 50
--> Num of Validation Set Batches loaded = 50
Starting epoch 0/3
train_config.max_train_step: 0
/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/torch/cuda/memory.py:330: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(
Training Epoch: 1:   0%|                          | 0/225 [00:00<?, ?it/s]--> Num of Validation Set Batches loaded = 50
--> Num of Validation Set Batches loaded = 50
Starting epoch 0/3
train_config.max_train_step: 0
/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/torch/cuda/memory.py:330: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(
Training Epoch: 1:   0%|                          | 0/225 [00:00<?, ?it/s]/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/torch/cuda/memory.py:330: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(
Training Epoch: 1:   0%|                          | 0/225 [00:00<?, ?it/s]NCCL version 2.20.5+cuda12.4
/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/torch/utils/checkpoint.py:91: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/torch/utils/checkpoint.py:91: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/torch/utils/checkpoint.py:91: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
/home/kaiwu/miniconda3/envs/llama/lib/python3.10/site-packages/torch/utils/checkpoint.py:91: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
Training Epoch: 1/3, step 13/225 completed (loss: 0.854179859161377):

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Thanks for contributing 🎉!

@wukaixingxp wukaixingxp marked this pull request as ready for review January 14, 2025 18:32
@wukaixingxp wukaixingxp requested a review from init27 January 14, 2025 18:40
@init27 init27 merged commit 9c3964e into main Jan 14, 2025
3 of 4 checks passed
@init27 init27 deleted the fix_double_bos branch January 14, 2025 18:48