Zero infinity xpu support #4130
Conversation
Some changes seem unnecessary. I wonder if they are model related. For example, …
Hi @tjruwase, I made a modification to the current PR and removed the env variable dependence. Let me explain some of the changes in this PR.
2) More copies on XPU
Requiring aligned buffers is a deliberate design decision to simplify the library and ensure high performance. We prefer that alignment issues be addressed on the client side, where there is more flexibility; for example, clients can easily add padding in a way that minimizes the performance impact. However, I think the library should be improved to assert on unaligned buffers to make this requirement more explicit.
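To illustrate the client-side padding idea mentioned above, here is a minimal sketch; the helper name, the 1024-byte default, and the pad-with-zeros approach are illustrative assumptions, not DeepSpeed's API:

```python
import torch

# Hypothetical helper, not part of DeepSpeed: pad a flat buffer so its byte
# size is a multiple of an assumed AIO alignment requirement.
def pad_to_alignment(tensor: torch.Tensor, align_bytes: int = 1024) -> torch.Tensor:
    align_elems = align_bytes // tensor.element_size()
    remainder = tensor.numel() % align_elems
    if remainder == 0:
        return tensor
    pad = torch.zeros(align_elems - remainder, dtype=tensor.dtype, device=tensor.device)
    return torch.cat([tensor.flatten(), pad])
```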
Can you please share an example of this failure case, perhaps in the form of a unit test?
The buffers used in swap-in and swap-out are all created and initialized in DeepSpeed's Python code, so on the client side we cannot affect them by changing the model code. By the way, I see that the linked DeepSpeed test has a flag that can create an aligned tensor, which also calls …
In my environment of Megatron-DeepSpeed with Adam on XPU devices, when execution reaches …
@tjruwase Any comments? Thanks!
Actually, the expectation is that the client side sends aligned buffers to the library. This is why DeepSpeed's tensor-swapping utility has a lot of alignment code to guarantee this condition. Can you share a bit about your client side? Perhaps DeepSpeed can expose another wrapper to provide this alignment. Would that help?
Yes, good observation. Our plan is to use this functionality within DeepSpeed and also expose it to clients. We hope to do that soon.
I am very curious about how this mismatch could happen. It concerns me that our swapping logic is making wrong assumptions. Is it possible to share more details of the training run, or to create a unit test? Thanks!
I added the alignment in get_accelerator(), which makes it easy for different platforms to create or adjust their buffers without affecting the current library. Exposing another wrapper to clients also works for me; I think both can coexist.
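A rough sketch of what accelerator-side alignment could look like; the helper and the 1024-byte default are assumptions for illustration, only `get_accelerator().pin_memory()` is an existing call, and this PR's exact parameterization may differ:

```python
import torch
from deepspeed.accelerator import get_accelerator

# Illustrative helper (not the final API): round the buffer size up to the
# alignment boundary before pinning, so every platform gets aligned swap buffers.
def create_aligned_pinned_buffer(numel: int, dtype=torch.float32, align_bytes: int = 1024) -> torch.Tensor:
    elem_size = torch.empty((), dtype=dtype).element_size()
    align_elems = align_bytes // elem_size
    padded_numel = ((numel + align_elems - 1) // align_elems) * align_elems
    buffer = torch.zeros(padded_numel, dtype=dtype, device='cpu')
    return get_accelerator().pin_memory(buffer)
```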
I have found the root cause of this mismatch now. DeepSpeed assumes that optimizer state tensors have the same numel as the param, so swap_in/swap_out can use either each tensor's numel or the param's numel as the length. But in reality, different optimizers behave differently, and the numel can differ. For apex.optimizers.FusedAdam, the optimizer states have the same numel as the param (see apex optimizer states init). I found that my previous change was naive and not sufficient, so I fell back to the SGD optimizer, whose state has the same numel as the param, to bypass this issue. For this part, it can go in another PR, or do you want to fix it? @tjruwase
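To make the mismatch concrete, here is a small diagnostic sketch (a hypothetical helper, not DeepSpeed code) that flags optimizer state tensors whose numel differs from their parameter's numel, which is the assumption the swap logic relies on:

```python
import torch

# Hypothetical diagnostic: the swap logic assumes state.numel() == param.numel()
# for every optimizer state tensor; report any tensor that violates this.
def report_state_numel_mismatches(optimizer: torch.optim.Optimizer) -> None:
    for group in optimizer.param_groups:
        for param in group['params']:
            for name, state in optimizer.state.get(param, {}).items():
                if torch.is_tensor(state) and state.numel() != param.numel():
                    print(f'{name}: state numel {state.numel()} != param numel {param.numel()}')
```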
This solution is fine with me. Thanks!
Sorry, one more adjustment to this. Perhaps we can use …
@Liangliang-Ma, this PR looks good to me after addressing the final suggestions. Thanks so much for this great work. |
@tjruwase I committed the change as you suggested. Thanks for your reviews over these days!
I think this is failing because the unit-test environment is Ubuntu 18.04, which does not have …
This should be updated to 20.04 now. cc: @tjruwase and @Liangliang-Ma. Is there somewhere we should specify a minimum libaio version/Ubuntu version required for this?
@loadams, thanks for updating the CI Ubuntu version. I think the package version is tied to the OS version, so we need a way to specify a minimum Ubuntu version. Unfortunately, I don't know how to do that, and I am not sure we have done this previously. @mrwyattii, @jeffra, any thoughts here? Thanks!
@Liangliang-Ma, it looks like the CI issues are resolved except for formatting. Can you please address that? Thanks!
I think at a minimum we should put this in the README in the requirements section. Also, perhaps list the failure mode if one doesn't have it?
Great point, @loadams. @Liangliang-Ma, could you please replace … with a checker function?
@tjruwase Thanks for solving the CI issue! The checker function and formatting modifications are committed.
@tjruwase It seems this PR has been blocked by a CPU inference unit test in the merge queue, which is outside this PR's scope. Can you please help check it? Thanks!
Hi @Liangliang-Ma - this is a known issue; it should be resolved in this PR, and then we can unblock the merge queue.
* zero infinity xpu support
* remove env var depends
* client align mem
* sync with all accelerators'
* format fix
* add align in pin_memory api
* add missing brackets
* remove align
* modify pin_memory api
* modify pin_memory api to use only on align para
* change value of align bytes
* Update csrc/aio/common/deepspeed_aio_common.cpp
* add version check and change format

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Currently the AIO kernel has a gap when used on XPU devices.
This PR fixes issues so that …
Developed and tested on Megatron-DeepSpeed.
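For reference, a minimal sketch of the kind of ZeRO-Infinity configuration (NVMe offload plus AIO settings) this change targets; the values and nvme_path are placeholders, not recommendations:

```python
# Minimal sketch of a ZeRO-Infinity style config with NVMe offload, the
# functionality this PR enables on XPU. Values and nvme_path are placeholders.
ds_config = {
    "train_batch_size": 8,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/local_nvme",  # placeholder path
        },
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/local_nvme",  # placeholder path
        },
    },
    "aio": {
        "block_size": 1048576,
        "queue_depth": 8,
        "single_submit": False,
        "overlap_events": True,
        "thread_count": 1,
    },
}
```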