
[CCLBackend] Using parallel memcpy for inference_all_reduce #4404

Merged (5 commits) on Oct 3, 2023

Conversation

@delock (Collaborator) commented on Sep 26, 2023

This PR introduces a parallel version of memcpy and uses it in inference_all_reduce. This allows memcpy to fully utilize host memory bandwidth and improves performance. For some typical message sizes, this PR improves latency as follows (tested on a 2-socket 4th Gen Xeon Scalable system; a rough sketch of the parallel copy follows the list):
32 KB: 26 us --> 23 us
128 KB: 59 us --> 29 us
512 KB: 210 us --> 40 us
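
A minimal sketch of the idea, assuming an OpenMP-based split of one large copy across threads so several cores drive memory bandwidth at once. The function name `parallel_memcpy` and the default thread count are illustrative only and not taken from the PR's diff:

```cpp
// Hypothetical illustration of a parallel memcpy; the actual DeepSpeed
// implementation may differ (thread count, chunking, alignment handling).
#include <cstddef>
#include <cstring>
#include <omp.h>

void parallel_memcpy(void* dst, const void* src, size_t bytes, int num_threads = 8)
{
#pragma omp parallel num_threads(num_threads)
    {
        const int tid = omp_get_thread_num();
        const int nth = omp_get_num_threads();
        // Evenly partition the buffer; the last thread takes the remainder.
        const size_t chunk  = bytes / nth;
        const size_t offset = static_cast<size_t>(tid) * chunk;
        const size_t len    = (tid == nth - 1) ? bytes - offset : chunk;
        std::memcpy(static_cast<char*>(dst) + offset,
                    static_cast<const char*>(src) + offset,
                    len);
    }
}
```

Splitting the copy this way helps mainly for medium and large messages, where a single core cannot saturate host memory bandwidth; very small copies are dominated by synchronization overhead.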

In addition, handling of large messages is improved: a fixed-size shared-memory buffer can now support any message size, which helps the first token of a long input sequence. A rough sketch of this chunked approach follows.
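
The PR does not reproduce the chunking code here; the sketch below is an assumed structure, not the actual implementation. `MAX_BUF_SIZE`, `reduce_through_shared_buffer`, and `shm_buf` are hypothetical names; the 16 MB figure comes from the commit message "include max buf size to 16MB per rank":

```cpp
// Illustrative chunk loop: stream an arbitrarily large input through a
// fixed-size shared-memory buffer in passes, so the buffer size no longer
// limits the supported message size.
#include <algorithm>
#include <cstddef>
#include <cstring>

constexpr size_t MAX_BUF_SIZE = 16 * 1024 * 1024;  // 16 MB per rank (assumed)

void reduce_through_shared_buffer(char* data, size_t total_bytes, char* shm_buf)
{
    for (size_t done = 0; done < total_bytes; done += MAX_BUF_SIZE) {
        const size_t len = std::min(MAX_BUF_SIZE, total_bytes - done);
        // Stage this chunk in shared memory, let all ranks reduce it,
        // then copy the reduced result back before the next chunk.
        std::memcpy(shm_buf, data + done, len);
        // ... barrier + cross-rank reduction on shm_buf would go here ...
        std::memcpy(data + done, shm_buf, len);
    }
}
```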

@tjruwase added this pull request to the merge queue on Oct 3, 2023
Merged via the queue into deepspeedai:master with commit 9a55291 Oct 3, 2023
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this pull request Oct 9, 2023
[CCLBackend] Using parallel memcpy for inference_all_reduce (deepspeedai#4404)

* use parallel version of memcpy

* include max buf size to 16MB per rank

* support any input buffer size

* fix format error