-
Notifications
You must be signed in to change notification settings - Fork 27.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CI: AMD MI300 tests fix #30797
CI: AMD MI300 tests fix #30797
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
cc @ydshieh and @younesbelkada (as I think I remember you had handle something similar) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks ! overall looks ok ! I left one suggestion with respect to the expected texts dict, I'll defer to @ydshieh review here for the next steps
"Hi today I am going to share with you a very easy and simple recipe of <strong><em>Kaju Kat", | ||
], | ||
} | ||
if IS_ROCM_SYSTEM: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this logic needed ? you could just add the key 9
in the EXPECTED_TEXTS
dict here no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@younesbelkada @ydshieh I have been comparing the generate output on H100 and MI300 (both with 9 device capability) and have seen deviations in the generated text. After discussing this with AMD engineers, this is expected as long as output is sensible.
The deviation can happen because different hardware may process differently, and if there are non-linearities, minor deviations tend to get amplified.
These deviations might not manifest across all models or prompts (as seen in this particular test). But there are few tests in this PR where the output can be different and thus handled separately for ROCM.
We have a few options to address this:
- Since, we do not have H100 tests (?) I could merge the dict, and separate them later if necessary.
- Have if/else statements only for tests where the generated output is different.
- Distinguish the ROCM
EXPECTED
output for each tests as done currently in the PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I understand better now, thanks for explaining! in this case, I think option 1 is better + adding a comment explaining that we might need to change that value for H100s in the future! What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me, would make the change!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot @mht-sharma !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
@@ -161,7 +161,6 @@ def test_return_sequences(self): | |||
tokenizer=tokenizer, | |||
data_collator=data_collator, | |||
compute_metrics=lambda x: {"samples": x[0].shape[0]}, | |||
report_to="none", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this intended?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the argument does not exist and error introduced from here: #30266
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok thanks !
will check but likely not today |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay in the review.
LGTM and thank you a lot for the efforts on AMD CI.
} | ||
if IS_ROCM_SYSTEM: | ||
EXPECTED_TEXTS = { | ||
9: [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's update those line with
# 8 is for A100 / A10 and 7 for T4
so people knows what 9
means
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
"Hi today I am going to share with you a very easy and simple recipe of <strong><em>Kaju Kat", | ||
], | ||
} | ||
if IS_ROCM_SYSTEM: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you refactor this block so that one uses directly the dict without the if / else statement? 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
"Hi today I am going to share with you a very easy and simple recipe of <strong><em>Kaju Kat", | ||
], | ||
} | ||
if IS_ROCM_SYSTEM: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here
"Hi today I am going to show you how to make a very simple and easy to make a very simple and", | ||
], | ||
} | ||
if IS_ROCM_SYSTEM: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here
], | ||
} | ||
if IS_ROCM_SYSTEM: | ||
EXPECTED_TEXT_COMPLETION = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here
torch_device | ||
), | ||
} | ||
if IS_ROCM_SYSTEM: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here
), | ||
} | ||
if IS_ROCM_SYSTEM: | ||
EXPECTED_LOGITS_LEFT = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks ! the following suggestion should make the CI happy ! 🤞
|
||
IS_ROCM_SYSTEM = torch.version.hip is not None | ||
IS_CUDA_SYSTEM = torch.version.cuda is not None | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
else: | |
IS_ROCM_SYSTEM = False | |
IS_CUDA_SYSTEM = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing - very nicely handled ❤️
"--report_to", | ||
"none", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How come we have to add this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since Codecarbon
is not supported on ROCm, the report callbacks are skipped from the trainer tests. Ref: #30266
@mht-sharma Do you have permission to merge? If not, I can merge in for you |
@amyeroberts I do not have the permissions. Please merge. Thanks for the review 🤗 |
What does this PR do?
Fixes the failing tests on MI300.