Fix llava half precision and autocast issues #29721
Conversation
Thanks for adding this!
Could you add a test, e.g. like create_and_check_reformer_model_fp16_forward, that we have for other models?
Apologies for the wait on this. After implementing the test, it looks like the problem here is not quite what I thought it was. I will try to respond soon with either fixing the correct problem or closing this PR.
Force-pushed from 1f8208b to b4188e3
After some further investigation, I tracked down exactly what was happening. I also fixed the same issue on llava_next. In the process I discovered that there were NaNs occurring upon running llava_next in half precision. To fix this I used:

-    self.image_newline = nn.Parameter(torch.empty(config.text_config.hidden_size, dtype=self.dtype))
+    self.image_newline = nn.Parameter(
+        torch.randn(config.text_config.hidden_size, dtype=config.torch_dtype or self.dtype)
We needed some way to set this to half precision in autocast environments. Let me know if this is incorrect. I am still unclear on when to use torch_dtype and when not to.
Also, I should note I am not quite certain randn is correct here, but the empty was causing NaNs.
Just from looking at this, I suspect the issue is that the param self.image_newline is being created as an empty tensor, but it's not then being initialized in _init_weights.
Right, I will move this initialization to _init_weights.
At a second glance it looks sufficient to do it on initialization. The original repo just had a separate function to initialize the vision tower.
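For context, a small standalone sketch (sizes are illustrative, not taken from the PR) of why the empty init is problematic: torch.empty leaves the parameter's storage uninitialized, so whatever happens to be in memory propagates through the forward pass, while an explicit random init gives well-defined values in any dtype.

import torch
from torch import nn

hidden_size = 64

# Uninitialized storage: contents are whatever happened to be in memory,
# which can include inf/NaN and shows up quickly in half precision.
uninitialized = nn.Parameter(torch.empty(hidden_size, dtype=torch.float16))

# Explicit init: values are well-defined regardless of the target dtype.
initialized = nn.Parameter(torch.randn(hidden_size).to(torch.float16))

print(torch.isfinite(uninitialized).all())  # may be False
print(torch.isfinite(initialized).all())    # always True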
model = LlavaForConditionalGeneration(config=config)
model.to(torch_device)
model.eval()
with torch.autocast(device_type="cuda", dtype=torch.float16):
I had some concerns about putting this here given I haven't seen autocasting tested in the unit tests elsewhere in the repo. Let me know if you prefer this being tested through an integration test with accelerate.
Yes, your intuition is correct :) We shouldn't need to do this autocasting here. Without it, I'm assuming it fails?
Could you specify what you're thinking re an accelerate integration test and the need for it? Other fp16 tests don't specify a specific accelerate version, so I'm not sure what it would be addressing here.
It's kind of the opposite: it works fine with .half() and no autocasting. The issue is when using a trainer with the fp16 or bf16 flag, where the model returns a type error. The trainer uses autocasting behind the scenes through accelerate, so I wrote these tests as the simplest case that captures this failure. I was not quite sure how to handle this in these tests given we are not testing autocasting behavior elsewhere.
OK, just to make sure I've understood, the issue arises when passing the model to Trainer? Does the following test pass?
def create_and_check_llava_model_fp16_forward(self, config, input_ids, pixel_values, attention_mask):
    model = LlavaForConditionalGeneration(config=config).to(torch_device).half().eval()
    output = model(input_ids, attention_mask=attention_mask)["last_hidden_state"]
    self.parent.assertFalse(torch.isnan(output).any().item())
Correct, the issue arises when passing the model to the trainer with the fp16 or bf16 flags. The test you included would work just fine. As far as I can tell this is due to how the model behaves when autocast is used within the trainer (indirectly through accelerate). Using the autocast block, I was able to replicate in this test the same bug I see when I run the model through the trainer.
Ah, OK. I think we should do this in two steps. Adding an integration test for trainer & autocast for models sounds like a good idea. I suspect it might throw up quite a few things to address, and the design of the test is important to make sure it's as lightweight as possible. By splitting up, we can add this fix in quickly and then iterate on the test design / fixing errors it throws.
What I would suggest is:
- Keep the change to llava next and the create_and_check_llava_next_model_fp16_forward test in this PR
- Open a new PR for adding an integration test (see the sketch after this list). This would possibly sit under tests/trainer/test_trainer.py
- @muellerzr will be able to advise here re where and design :)
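For illustration only, a rough sketch of what such a model-specific trainer test might look like, assuming a tiny model fixture and dummy dataset (the function name, fixtures, and step count are placeholders, not the final design):

from transformers import Trainer, TrainingArguments

def test_llava_trainer_fp16(tiny_llava_model, dummy_train_dataset):
    # fp16=True makes Trainer run the forward pass under autocast via accelerate,
    # which is exactly the code path that previously raised the dtype error.
    args = TrainingArguments(
        output_dir="tmp_llava_trainer",
        fp16=True,
        max_steps=2,
        per_device_train_batch_size=1,
        report_to="none",
    )
    trainer = Trainer(model=tiny_llava_model, args=args, train_dataset=dummy_train_dataset)
    trainer.train()  # should complete without a half/float type mismatch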
Fully agree with this way forward @frasermince !
Let's also document the test with some comments explaining why it's needed
Sounds like a plan! I'll get to work on that! Would we want this new integration test still namespaced under the model, something like tests/llava/trainer/test_trainer.py, or are you suggesting we add testing more generally to the trainer?
Maybe test/modeling/llava/test_trainer_llava.py, since for now it's model specific?
Thanks for iterating on this!
This should just be taken from the instance:

Suggested change:
-    torch.randn(config.text_config.hidden_size, dtype=config.torch_dtype or self.dtype)
+    torch.randn(config.text_config.hidden_size, dtype=self.dtype)
Force-pushed from 24b4244 to 0db9a95
I'll need to dig into this when I'm back at my computer, but @amyeroberts, just to quickly describe how the autocast system works in accelerate: we wrap the model's .forward() with an autocast manager, so that the original weights remain in fp32. Not doing so leads to a great deal of precision and end-accuracy issues, which is why we maintain this reference. It uses a pinch more memory than purely casting the weights.
If certain weights need to be cast for certain models, I think that can be okay and taken on a case-by-case basis, but I want to re-read this discussion carefully to understand why. (And we should test training for a few epochs/steps to make sure accuracies etc. remain as expected when using pretrained weights.)
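As a generic illustration of that mechanism (a minimal sketch, not accelerate's actual wrapping code): the master weights stay in fp32 and only the forward computation runs in half precision inside the autocast context.

import torch

model = torch.nn.Linear(16, 16).cuda()   # weights stay in float32
x = torch.randn(2, 16, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    out = model(x)                        # the matmul runs in float16 under autocast

print(model.weight.dtype)  # torch.float32: master weights are untouched
print(out.dtype)           # torch.float16: activations come out in half precision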
Sounds good! Let me know if I can do anything to make the reason for these changes more clear!
Overall I think these are good tests as they are. Let's please keep a close eye on it and see what other models have issues. We may want to change some things with Accelerate (like specifying certain layers to be kept fully in a given precision) if tests show no loss in performance :)
Sounds good! I will also add the trainer test in a separate PR as requested.
@muellerzr before merging, do you have any suggestions on ensuring accuracies remain as expected? Is it basically taking the assertions in some of the integration tests after a couple of training steps?
Yes indeed! (Assertion or
Force-pushed from 8e9b5b7 to 003c89d
Thanks for fixing this and iterating with us to find a solution!
The only thing left is to run the slow model tests:
RUN_SLOW=1 pytest tests/models/llava tests/models/llava_next
Once they're all confirmed to be passing we're good to merge!
There was one test that seemed to be failing already before my changes. I took the liberty of changing the assertion slightly. Let me know if this is acceptable! There are a couple failing on main for me for llava_next as well, but fixing those would require me to change the expected logits. Not sure if that is acceptable or not.
@amyeroberts leaving it to your discretion :)
Given the slow test was failing on main, I changed the logits in the test for now. If that is wrong I can remove it.
@frasermince Great - thanks for checking! I'm going to be a bit annoying here - as the differences are so small, it's quite possible they're just coming from running on different hardware / slightly different env setups, so we might need to revert. @ydshieh How can we run the slow tests on this branch in the new github actions workflow to make sure the values are aligned?
@amyeroberts I am on it and should be ready soon (even possible today 🔥)
It is in #30540 :-)
@frasermince Thanks for your patience! We've just merged in #30540 (thanks @ydshieh!), so we should be able to run the slow tests on our runners. To trigger this could you:
Force-pushed from 4839ccf to f0ca873
Thanks for triggering the tests! OK, so we're hitting OOM errors, which is going to be a pain to try and resolve as it's a wider issue. What I suggest is:
Force-pushed from 3ef96e4 to 75b4a76
Force-pushed from 75b4a76 to 750244c
@frasermince Thanks for all of your patience iterating with us. We're good to merge!
What does this PR do?
Currently, half precision training on LLaVA is broken because the language embedding output is always full precision. This PR matches the behavior of the original LLaVA implementation by casting the result of the language embedding to the image_features' dtype. See here for the relevant line in the original repo.
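In essence (a minimal sketch with illustrative shapes and names, not the actual modeling code), the fix casts the embedding output to match the projected image features before the two are merged:

import torch

# Under fp16 autocast or .half(), the projected image features end up in float16,
# while the language embedding output can remain float32.
image_features = torch.randn(1, 576, 4096, dtype=torch.float16)  # e.g. multi_modal_projector output
inputs_embeds = torch.randn(1, 32, 4096, dtype=torch.float32)    # e.g. embed_tokens output

# The fix: cast the embeddings to the image features' dtype before merging,
# mirroring the original LLaVA implementation.
inputs_embeds = inputs_embeds.to(image_features.dtype)
assert inputs_embeds.dtype == image_features.dtype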
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.