Flex xpu bug fix #26135
Conversation
Thanks for fixing!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
@muellerzr Could you confirm if this fix is OK and if there's anywhere else that needs to be updated?
            and (self.device.type != "xpu")
            and (get_xla_device_type(self.device) != "GPU")
            and (self.fp16 or self.fp16_full_eval)
        ):
            raise ValueError(
                "FP16 Mixed precision training with AMP or APEX (`--fp16`) and FP16 half precision evaluation"
-               " (`--fp16_full_eval`) can only be used on CUDA or NPU devices."
+               " (`--fp16_full_eval`) can only be used on CUDA or NPU devices or certain XPU devices (with IPEX)."
I do not see an equivalent of this in accelerate, and it sounds like we need it there as well. Specifically, I'm looking at this chunk of code: https://github.com/huggingface/accelerate/blob/main/src/accelerate/accelerator.py#L427-L428. Can we include a follow-up PR for this in accelerate? Otherwise the trainer section looks fine to me.
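For reference, a minimal sketch of the kind of guard being requested on the accelerate side. This is an assumption about the shape of the check only, not accelerate's actual implementation at the linked lines; `Accelerator.device` and `Accelerator.mixed_precision` are real attributes, but the helper name `_validate_fp16_device` is hypothetical:

```python
# Hedged sketch of the follow-up check suggested for accelerate; not the
# library's actual code. The guard mirrors the transformers change above:
# fp16 is allowed on CUDA, NPU, and (with IPEX) XPU devices, rejected elsewhere.
from accelerate import Accelerator


def _validate_fp16_device(accelerator: Accelerator) -> None:  # hypothetical helper
    supported = {"cuda", "npu", "xpu"}
    if accelerator.mixed_precision == "fp16" and accelerator.device.type not in supported:
        raise ValueError(
            "fp16 mixed precision can only be used on CUDA or NPU devices,"
            " or certain XPU devices (with IPEX)."
        )
```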
@abhilash1910 Would you like to open this PR in accelerate?
@muellerzr Do we need to coordinate these changes at all?
@amyeroberts no, it doesn't need coordination in this case, since training_args is the one raising the error, but it's something we want to make sure is checked in accelerate too.
OK. I'll merge this then 👍
Thanks @muellerzr, yes this is needed in accelerate as well. I will open the PR there.
Thanks @amyeroberts for the suggestions.
flex gpu bug fix
flex gpu bug fix
flex gpu bug fix
What does this PR do?
Some Intel Flex XPUs support fp16 mixed precision, so this exception should not be raised when fp16 is provided as the mixed-precision dtype.
cc @muellerzr @amyeroberts
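For illustration, here is a minimal, self-contained sketch of the relaxed device check. The function name `check_fp16_device` is made up for the example; the real logic lives inside `TrainingArguments` and also consults `get_xla_device_type`, which is omitted here:

```python
# Standalone sketch of the guard after this fix: fp16 mixed precision is
# accepted on CUDA, NPU, and XPU devices and rejected on anything else.
def check_fp16_device(device_type: str, fp16: bool, fp16_full_eval: bool) -> None:
    supported = {"cuda", "npu", "xpu"}
    if (fp16 or fp16_full_eval) and device_type not in supported:
        raise ValueError(
            "FP16 Mixed precision training with AMP or APEX (`--fp16`) and FP16 half precision"
            " evaluation (`--fp16_full_eval`) can only be used on CUDA or NPU devices"
            " or certain XPU devices (with IPEX)."
        )


check_fp16_device("xpu", fp16=True, fp16_full_eval=False)  # no longer raises after this PR
# check_fp16_device("cpu", fp16=True, fp16_full_eval=False)  # would still raise ValueError
```

In practice this means constructing `TrainingArguments` with `fp16=True` should succeed on a machine where the resolved device is an Intel Flex XPU (via IPEX), instead of raising the error quoted in the diff above.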