Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI: AMD MI300 tests fix #30797

Merged
merged 7 commits into from
May 21, 2024
Merged

CI: AMD MI300 tests fix #30797

merged 7 commits into from
May 21, 2024

Conversation

mht-sharma
Copy link
Contributor

What does this PR do?

Fixes the failing tests on MI300.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@amyeroberts
Copy link
Collaborator

cc @ydshieh and @younesbelkada (as I think I remember you had handle something similar)

Copy link
Contributor

@younesbelkada younesbelkada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks ! overall looks ok ! I left one suggestion with respect to the expected texts dict, I'll defer to @ydshieh review here for the next steps

"Hi today I am going to share with you a very easy and simple recipe of <strong><em>Kaju Kat",
],
}
if IS_ROCM_SYSTEM:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this logic needed ? you could just add the key 9 in the EXPECTED_TEXTS dict here no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@younesbelkada @ydshieh I have been comparing the generate output on H100 and MI300 (both with 9 device capability) and have seen deviations in the generated text. After discussing this with AMD engineers, this is expected as long as output is sensible.

The deviation can happen because different hardware may process differently, and if there are non-linearities, minor deviations tend to get amplified.

These deviations might not manifest across all models or prompts (as seen in this particular test). But there are few tests in this PR where the output can be different and thus handled separately for ROCM.

We have a few options to address this:

  1. Since, we do not have H100 tests (?) I could merge the dict, and separate them later if necessary.
  2. Have if/else statements only for tests where the generated output is different.
  3. Distinguish the ROCM EXPECTED output for each tests as done currently in the PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I understand better now, thanks for explaining! in this case, I think option 1 is better + adding a comment explaining that we might need to change that value for H100s in the future! What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me, would make the change!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @mht-sharma !

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@@ -161,7 +161,6 @@ def test_return_sequences(self):
tokenizer=tokenizer,
data_collator=data_collator,
compute_metrics=lambda x: {"samples": x[0].shape[0]},
report_to="none",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this intended?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the argument does not exist and error introduced from here: #30266

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok thanks !

@ydshieh ydshieh self-assigned this May 14, 2024
@ydshieh
Copy link
Collaborator

ydshieh commented May 14, 2024

will check but likely not today

Copy link
Collaborator

@ydshieh ydshieh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay in the review.

LGTM and thank you a lot for the efforts on AMD CI.

}
if IS_ROCM_SYSTEM:
EXPECTED_TEXTS = {
9: [
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's update those line with

# 8 is for A100 / A10 and 7 for T4

so people knows what 9 means

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@ydshieh ydshieh marked this pull request as ready for review May 16, 2024 12:56
"Hi today I am going to share with you a very easy and simple recipe of <strong><em>Kaju Kat",
],
}
if IS_ROCM_SYSTEM:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you refactor this block so that one uses directly the dict without the if / else statement? 🙏

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

"Hi today I am going to share with you a very easy and simple recipe of <strong><em>Kaju Kat",
],
}
if IS_ROCM_SYSTEM:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

"Hi today I am going to show you how to make a very simple and easy to make a very simple and",
],
}
if IS_ROCM_SYSTEM:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

],
}
if IS_ROCM_SYSTEM:
EXPECTED_TEXT_COMPLETION = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

torch_device
),
}
if IS_ROCM_SYSTEM:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

),
}
if IS_ROCM_SYSTEM:
EXPECTED_LOGITS_LEFT = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

@mht-sharma mht-sharma requested a review from younesbelkada May 17, 2024 14:30
Copy link
Contributor

@younesbelkada younesbelkada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks ! the following suggestion should make the CI happy ! 🤞


IS_ROCM_SYSTEM = torch.version.hip is not None
IS_CUDA_SYSTEM = torch.version.cuda is not None

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
else:
IS_ROCM_SYSTEM = False
IS_CUDA_SYSTEM = False

Copy link
Contributor Author

@mht-sharma mht-sharma May 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Copy link
Contributor

@younesbelkada younesbelkada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much !

Copy link
Contributor

@younesbelkada younesbelkada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again !

@younesbelkada younesbelkada requested a review from amyeroberts May 21, 2024 07:43
Copy link
Collaborator

@amyeroberts amyeroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing - very nicely handled ❤️

Comment on lines +3067 to +3068
"--report_to",
"none",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come we have to add this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since Codecarbon is not supported on ROCm, the report callbacks are skipped from the trainer tests. Ref: #30266

@amyeroberts
Copy link
Collaborator

@mht-sharma Do you have permission to merge? If not, I can merge in for you

@mht-sharma
Copy link
Contributor Author

@mht-sharma Do you have permission to merge? If not, I can merge in for you

@amyeroberts I do not have the permissions. Please merge. Thanks for the review 🤗

@amyeroberts amyeroberts merged commit 7a4792e into huggingface:main May 21, 2024
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants