
[Bug Fix] Support threads_per_head < 64 for wavefront size of 64 #6622

Merged
merged 11 commits into from
Nov 4, 2024

Conversation

@jagadish-amd (Contributor) commented Oct 11, 2024

When launching the apply_rotary_pos_half kernel, only a threads_per_head of 64 is supported for a wavefront size of 64.
This change adds support for threads_per_head values smaller than 64, such as 4, 8, and 16.

Fixes the issue introduced in #5402

It also removes the ROCm and wavefront size check condition.

Signed-off-by: Jagadish Krishnamoorthy <[email protected]>
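The layout the fix implies can be sketched in plain Python (all names here are hypothetical, for illustration only, not the kernel's actual variables): with a 64-lane wavefront and threads_per_head < 64, one wavefront serves several heads at once, so each lane must derive both a head index and an element index from its lane id.

```python
# Hypothetical sketch of the lane-to-element mapping implied by the fix:
# a 64-lane wavefront with threads_per_head < 64 covers several heads,
# so each lane computes which head and which rotary element it owns.

WAVEFRONT_SIZE = 64  # AMD Instinct wavefront width (assumption for this sketch)

def lane_mapping(lane_id: int, threads_per_head: int):
    """Return (head_within_wave, element_within_head) for one lane."""
    assert WAVEFRONT_SIZE % threads_per_head == 0
    head = lane_id // threads_per_head   # which head this lane serves
    elem = lane_id % threads_per_head    # which rotary element within the head
    return head, elem
```

With threads_per_head = 16, for example, one wavefront covers 4 heads: lanes 0-15 map to head 0, lanes 16-31 to head 1, and so on, which is why the 64-lane-only assumption had to be relaxed.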
@jagadish-amd jagadish-amd marked this pull request as ready for review October 12, 2024 03:27
@jagadish-amd jagadish-amd requested a review from awan-10 as a code owner October 12, 2024 03:27
@jagadish-amd (Contributor Author) commented:
ping @jithunnair-amd @jeffdaily @loadams

@jagadish-amd (Contributor Author) commented:
@loadams any comments on this PR?

@tjruwase tjruwase requested review from tjruwase and removed request for awan-10 October 18, 2024 15:45
@jithunnair-amd (Contributor) left a comment:

LGTM, but let's add a unit test to ensure this functionality can be tested on ROCm (and CUDA)

@loadams loadams self-assigned this Oct 28, 2024
@loadams (Collaborator) commented Oct 30, 2024

> LGTM, but let's add a unit test to ensure this functionality can be tested on ROCm (and CUDA)

@jagadish-amd - thoughts on adding unit tests for this?

@jagadish-amd (Contributor Author) commented:
> LGTM, but let's add a unit test to ensure this functionality can be tested on ROCm (and CUDA)
>
> @jagadish-amd - thoughts on adding unit tests for this?

I will add the unit tests. Thanks

@jagadish-amd jagadish-amd requested a review from tohtana as a code owner November 4, 2024 07:48
@jagadish-amd (Contributor Author) commented:

> LGTM, but let's add a unit test to ensure this functionality can be tested on ROCm (and CUDA)
>
> @jagadish-amd - thoughts on adding unit tests for this?
>
> I will add the unit tests. Thanks

@loadams I have added a test case for the threads_per_head / warp size alignment issue.
Unfortunately, I lost access to the AI model / node on which the "Assertion `false' failed" error had triggered with warp_size 64. Hence the values in the test cases are assumed here, but they still exercise the intended fix. I will add the exact values in the future.
I noticed that the files in unit/ops/transformer/inference were refactored. I have used the InferenceBuilder().load() approach to test the apply_rotary_pos_emb function. If this is not right, please let me know; we can merge the PR without the newly added test, and folks can add test cases later, since this change affects only the warp_size 64 case / AMD Instinct devices.

These are the results:
On warp_size = 32, test cases pass regardless of the changes in apply_rotary_pos_emb.cu:
==================================== 4 passed, 2 warnings in 25.24s ====================================
On warp_size = 64, with the fix:
==================================== 4 passed, 2 warnings in 5.58s =====================================
On warp_size = 64, without the fix:
python: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/apply_rotary_pos_emb.hip:169: void launch_apply_rotary_pos_emb(T *, T *, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, float, hipStream_t, int) [T = float]: Assertion `false' failed.
Fatal Python error: Aborted

The test run is aborted (as expected) due to the error in the kernel. I am not sure whether there is a better way to handle this?
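A unit test like the one described would typically compare the kernel's output against a slow host-side reference. The sketch below is one such reference for rotary position embedding, written in plain Python; the function name, the pairing of element i with element i + dim/2, and the default base of 10000 are assumptions for illustration, not the DeepSpeed kernel's exact conventions.

```python
import math

def rotary_reference(x, base=10000.0):
    """Host-side reference rotary embedding on x[seq][dim] (dim even).

    Pairs element i with element i + dim//2 and rotates each pair by an
    angle that depends on the sequence position and the pair index.
    """
    seq_len, dim = len(x), len(x[0])
    half = dim // 2
    out = [[0.0] * dim for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(half):
            theta = pos / (base ** (2 * i / dim))
            c, s = math.cos(theta), math.sin(theta)
            a, b = x[pos][i], x[pos][i + half]
            out[pos][i] = a * c - b * s
            out[pos][i + half] = a * s + b * c
    return out
```

At position 0 every rotation angle is zero, so the output equals the input; that property alone makes a cheap sanity check for a kernel test, independent of the exact head/warp geometry.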

@loadams (Collaborator) commented Nov 4, 2024


@jagadish-amd - this should be fine for now. I believe the only remaining item for this PR is the CLA agreement; you should just need to reply to it accepting, with the company listed as AMD.

@jagadish-amd (Contributor Author) commented:
> @jagadish-amd please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.
>
> @microsoft-github-policy-service agree [company="{your company}"]
>
> Options:
>
> • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
> @microsoft-github-policy-service agree
> • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term "You" includes me and my employer.
> @microsoft-github-policy-service agree company="Microsoft"
>
> Contributor License Agreement

@microsoft-github-policy-service agree company="AMD"

@loadams loadams added this pull request to the merge queue Nov 4, 2024
Merged via the queue into deepspeedai:master with commit 2b41d62 Nov 4, 2024
11 checks passed