Fix llava half precision and autocast issues #29721
Merged: amyeroberts merged 9 commits into huggingface:main from frasermince:frasermince/half-precision-llava-fix on May 1, 2024.
Commits (9):
6d26a70 Ensure input_embeds and image_features are the same dtype in autocast (frasermince)
34e526f Fix nans in half precision llava-next and fix autocasting behavior. (frasermince)
a78ae59 Fix styling issues. (frasermince)
3d58236 fix randn newline instantiation (frasermince)
f69af48 fix broken slow llava test (frasermince)
74fa019 Fix llava next init. (frasermince)
30fb5ef fix styling issues (frasermince)
750244c [run-slow]llava,llava_next (frasermince)
a399e4c fix styling issues (frasermince)
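The first commit's core idea can be summarized in a short sketch. This is illustrative only, not the PR's actual diff: merge_image_features is a hypothetical helper name, and the real change lives in the llava modeling code.

```python
import torch

def merge_image_features(inputs_embeds: torch.Tensor,
                         image_features: torch.Tensor) -> torch.Tensor:
    # Under torch.autocast, the vision tower can return image features in a
    # different dtype (e.g. float32) than the text embeddings (float16 or
    # bfloat16), which breaks merging the two streams. Casting the image
    # features to the embedding dtype keeps everything consistent.
    image_features = image_features.to(inputs_embeds.dtype)
    return torch.cat([inputs_embeds, image_features], dim=1)
```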
frasermince: I had some concerns about putting this here, given I haven't seen autocasting tested in the unit tests elsewhere in the repo. Let me know if you'd prefer this be tested through an integration test with accelerate.
amyeroberts: Yes, your intuition is correct :) We shouldn't need to do this autocasting here. Without it, I'm assuming it fails?
Could you specify what you're thinking re an accelerate integration test and the need for it? Other fp16 tests don't specify a specific accelerate version, so I'm not sure what it would be addressing here.
frasermince: It's kind of the opposite. It works fine without autocasting in .half(). The issue is that when using a trainer with the fp16 or bf16 flag, the model returns a type error. This uses autocasting behind the scenes through accelerate, so I wrote these tests as the simplest case to capture this failure. I was not quite sure how to handle this in these tests, given we are not testing autocasting behavior elsewhere.
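To make that failure mode concrete, here is a hedged repro sketch. The checkpoint name, prompt format, and dummy image are assumptions for illustration; the point is that wrapping the forward pass in torch.autocast (which Trainer's fp16 flag does via accelerate) surfaced the dtype error, while a plain model.half() did not.

```python
import numpy as np
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumed checkpoint and prompt format -- illustrative only.
model_id = "llava-hf/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(model_id).to("cuda")
processor = AutoProcessor.from_pretrained(model_id)

image = Image.fromarray(np.zeros((336, 336, 3), dtype=np.uint8))
inputs = processor(text="USER: <image>\nWhat is shown? ASSISTANT:",
                   images=image, return_tensors="pt").to("cuda")

# A plain model.half() forward pass works. The autocast path is what failed
# before this fix, because the vision tower's output dtype no longer matched
# the text embeddings.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    outputs = model(**inputs)
```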
amyeroberts: OK, just to make sure I've understood: the issue arises when passing the model to Trainer? Does the following test pass?
frasermince: Correct, the issue arises when passing the model to the trainer with the fp16 or bf16 flags. The test you included would work just fine. As far as I can tell, this is due to how the model behaves when autocast is used within the trainer (indirectly through accelerate). In this test I was able to replicate, using the autocast block, the same bug I see when I run the model through the trainer.
amyeroberts: Ah, OK. I think we should do this in two steps. Adding an integration test for trainer & autocast for models sounds like a good idea. I suspect it might throw up quite a few things to address, and the design of the test is important to make sure it's as lightweight as possible. By splitting this up, we can add the fix in quickly and then iterate on the test design / fix the errors it throws.
What I would suggest is:
- keep the create_and_check_llava_next_model_fp16_forward test in this PR
- add the trainer/autocast integration test in tests/trainer/test_trainer.py (sketched below) - @muellerzr will be able to advise here re where and design :)
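For that second step, a hedged sketch of what such a trainer-level test might look like. The fixtures tiny_model and tiny_dataset are hypothetical placeholders, and the real design was left to follow-up discussion.

```python
from transformers import Trainer, TrainingArguments

def test_trainer_fp16_autocast(tmp_path, tiny_model, tiny_dataset):
    # Illustrative only: tiny_model / tiny_dataset are hypothetical fixtures.
    # fp16=True routes the forward pass through accelerate's autocast, which
    # is the code path that surfaced the llava dtype error.
    args = TrainingArguments(
        output_dir=str(tmp_path),
        fp16=True,
        max_steps=2,
        per_device_train_batch_size=1,
        report_to=[],
    )
    trainer = Trainer(model=tiny_model, args=args, train_dataset=tiny_dataset)
    trainer.train()  # should complete without raising a dtype error
```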
muellerzr: Fully agree with this way forward @frasermince!
amyeroberts: Let's also document the test with some comments explaining why it's needed.
frasermince: Sounds like a plan! I'll get to work on that! Would we want this new integration test still namespaced under the model, something like tests/llava/trainer/test_trainer.py, or are you suggesting we add testing more generally to the trainer?
amyeroberts: Maybe test/modeling/llava/test_trainer_llava.py, since for now it's model specific?