Zero infinity xpu support #4130
Conversation
Some changes seem unnecessary. I wonder if they are model related. For example, …
Hi @tjruwase, I made a modification to the current PR and removed the env variable dependence. Let me explain some of the changes in this PR.
2) More copies on XPU
Requiring aligned buffers is a deliberate design decision to simplify the library and ensure high performance. We prefer that alignment issues be addressed on the client side, where there is more flexibility; for example, clients can easily add padding in a way that minimizes the performance impact. However, I think the library should be improved to assert on unaligned buffers to make this requirement more explicit.
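To illustrate the client-side padding idea mentioned above, here is a minimal sketch; the helper name, the 1024-byte default, and the pad-with-zeros approach are illustrative assumptions, not DeepSpeed's API:

```python
import torch

# Hypothetical helper, not part of DeepSpeed: pad a flat buffer so its byte
# size is a multiple of an assumed AIO alignment requirement.
def pad_to_alignment(tensor: torch.Tensor, align_bytes: int = 1024) -> torch.Tensor:
    align_elems = align_bytes // tensor.element_size()
    remainder = tensor.numel() % align_elems
    if remainder == 0:
        return tensor
    pad = torch.zeros(align_elems - remainder, dtype=tensor.dtype, device=tensor.device)
    return torch.cat([tensor.flatten(), pad])
```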
Can you please share an example of this failure case, perhaps in the form of a unit test?
The buffers used in swap-in and swap-out are all created and initialized in DeepSpeed's Python code, so on the client side we cannot affect them by changing the model code. By the way, I see that the linked DeepSpeed test has a flag that can create an aligned tensor, which also calls …
In my environment of Megatron-DeepSpeed with Adam on XPU devices, when execution reaches …
@tjruwase Any comments? Thanks!
Actually, the expectation is that the client side sends aligned buffers to the library. This is why DeepSpeed's tensor-swapping utility has a lot of alignment code to guarantee this condition. Can you share a bit about your client side? Perhaps DeepSpeed can expose another wrapper to provide this alignment. Would that help?
Yes, good observation. Our plan is to use this functionality within DeepSpeed and also expose it to clients. We hope to do that soon.
I am very curious about how this mismatch could happen. It concerns me that our swapping logic is making wrong assumptions. Is it possible to share more details of the training run, or to create a unit test? Thanks!
I added the alignment in get_accelerator(), which makes it easy for different platforms to create or adjust their buffers without affecting the current library. Exposing another wrapper to clients also works for me; I think both can coexist.
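A rough sketch of what accelerator-side alignment could look like; the helper and the 1024-byte default are assumptions for illustration, only `get_accelerator().pin_memory()` is an existing call, and this PR's exact parameterization may differ:

```python
import torch
from deepspeed.accelerator import get_accelerator

# Illustrative helper (not the final API): round the buffer size up to the
# alignment boundary before pinning, so every platform gets aligned swap buffers.
def create_aligned_pinned_buffer(numel: int, dtype=torch.float32, align_bytes: int = 1024) -> torch.Tensor:
    elem_size = torch.empty((), dtype=dtype).element_size()
    align_elems = align_bytes // elem_size
    padded_numel = ((numel + align_elems - 1) // align_elems) * align_elems
    buffer = torch.zeros(padded_numel, dtype=dtype, device='cpu')
    return get_accelerator().pin_memory(buffer)
```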
I have found the root cause of this mismatch now. DeepSpeed assumes that optimizer state tensors have the same numel as the param, so swap_in/swap_out can use either each tensor's numel or the param's numel as the length. But in reality, different optimizers behave differently, and the numel can differ. For apex.optimizers.FusedAdam, the optimizer states have the same numel as the param (see apex optimizer states init). I found that my previous change was naive and not sufficient, so I fell back to the SGD optimizer, whose state has the same numel as the param, to bypass this issue. For this part, it can go in another PR, or do you want to fix it? @tjruwase
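To make the mismatch concrete, here is a small diagnostic sketch (a hypothetical helper, not DeepSpeed code) that flags optimizer state tensors whose numel differs from their parameter's numel, which is the assumption the swap logic relies on:

```python
import torch

# Hypothetical diagnostic: the swap logic assumes state.numel() == param.numel()
# for every optimizer state tensor; report any tensor that violates this.
def report_state_numel_mismatches(optimizer: torch.optim.Optimizer) -> None:
    for group in optimizer.param_groups:
        for param in group['params']:
            for name, state in optimizer.state.get(param, {}).items():
                if torch.is_tensor(state) and state.numel() != param.numel():
                    print(f'{name}: state numel {state.numel()} != param numel {param.numel()}')
```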
This solution is fine with me. Thanks!
Sorry, one more adjustment to this. Perhaps we can use …
@Liangliang-Ma, this PR looks good to me after addressing the final suggestions. Thanks so much for this great work. |
@tjruwase I committed the change as you suggested. Thanks for your reviews over these days!
I think this is failing because the unit-test environment is Ubuntu 18.04, which does not have …
This should be updated to 20.04 now. cc: @tjruwase and @Liangliang-Ma. Is there somewhere we should specify a minimum libaio version/Ubuntu version required for this?
@loadams, thanks for updating the CI Ubuntu version. I think the package version is tied to the OS version, so we need a way to specify a minimum Ubuntu version. Unfortunately, I don't know how to do that, and I am not sure we have done this previously. @mrwyattii, @jeffra, any thoughts here? Thanks!
@Liangliang-Ma, it looks like the CI issues are resolved except for formatting. Can you please address that? Thanks!
I think at a minimum we should put this in the README in the requirements section. Also, perhaps list the failure mode if one doesn't have it?
Great point, @loadams. @Liangliang-Ma, could you please replace … with a checker function?
@tjruwase Thanks for solving the CI issue! The checker function and formatting modifications are committed.
@tjruwase It seems this PR has been blocked by a CPU inference unit test in the merge queue, which is outside this PR's scope. Can you please help check it? Thanks!
Hi @Liangliang-Ma - this is a known issue; it should be resolved in this PR, and then we can unblock the merge queue.
* zero infinity xpu support
* remove env var depends
* client align mem
* sync with all accelerators'
* format fix
* add align in pin_memory api
* add missing brackets
* remove align
* modify pin_memory api
* modify pin_memory api to use only on align para
* change value of align bytes
* Update csrc/aio/common/deepspeed_aio_common.cpp
* add version check and change format

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Currently the AIO kernel has a gap when used on XPU devices.
This PR fixes issues so that …
Developed and tested on Megatron-DeepSpeed.
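For reference, a minimal sketch of the kind of ZeRO-Infinity configuration (NVMe offload plus AIO settings) this change targets; the values and nvme_path are placeholders, not recommendations:

```python
# Minimal sketch of a ZeRO-Infinity style config with NVMe offload, the
# functionality this PR enables on XPU. Values and nvme_path are placeholders.
ds_config = {
    "train_batch_size": 8,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/local_nvme",  # placeholder path
        },
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/local_nvme",  # placeholder path
        },
    },
    "aio": {
        "block_size": 1048576,
        "queue_depth": 8,
        "single_submit": False,
        "overlap_events": True,
        "thread_count": 1,
    },
}
```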