-
Notifications
You must be signed in to change notification settings - Fork 237
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Enhancements] Several bugfixes and refactoring of dynamic generic reduction #1156
Conversation
git-subtree-dir: src/composable_kernel git-subtree-split: f6edda6119ebbb237dfa6270797b34f960d7b190
…le_kernel_init_integration_v3
…init_integration_v3
…ernel wrapper files and simplify reduce_all kernel wrappers
Commit 21cbb is a big commit, it does the following
|
Some Correction to the above comments:
|
No need. Since
|
@qianfengz Okay.
Is PR ready for review/testing/merge? |
Yes, please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some code-quality related nitpicks.
src/composable_kernel/composable_kernel/include/utility/reduction_operator.hpp
Show resolved
Hide resolved
const void* p_src2dDesc = ws_global; | ||
const void* p_dst1dDesc = static_cast<char*>(ws_global) + 2048; | ||
const void* p_src2dDesc = cast_pointer_to_generic_address_space(ws_global); | ||
const void* p_dst1dDesc = static_cast<const char*>(p_src2dDesc) + 2048; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Question] Why 2048?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just an implicit convention between two kernels to use one device page to store the two descriptors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please define a symbol (a macro would be fine) and use that symbol instead of literal value. Please also put a comment which explains the convention nearby the definition of a symbol.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ping. I'm worrying about the clarity of the code. Or this is documented elsewhere already? --
This is just an implicit convention between two kernels to use one device page to store the two descriptors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please define a symbol (a macro would be fine) and use that symbol instead of literal value. Please also put a comment which explains the convention nearby the definition of a symbol.
I will add some comments in next P.R
make_naive_tensor_descriptor(ref_tupleDstLengths, ref_tupleDstLengths); | ||
|
||
static constexpr index_t ref_invariantLen = ref_dstDesc.GetLength(Number<0>{}); | ||
static constexpr index_t ref_toReduceLen = 8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Question] Why 8?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any positive integer constant should be ok here. The compiler-time codes here is to get the descriptor object's type, so that we can copy the descriptor object generated by the "preparing kernel". According to @asroy , the value of the constant here does not affect the descriptor object's type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This requires some explanation. Can you please use the same approach that I requested here, thanks.
...kernel/src/kernel_wrapper/gridwise_generic_reduction_first_call_warpwise_reduce_all_dims.cpp
Show resolved
Hide resolved
I recommend improving clarity of the code. For example, adding explanations for all the non-trivial literals in a program. @qianfengz It's up to you. I will approve this now. |
Thanks. I will add some comments in next P.R |
Let's wrap up this PR for now :) Priority has shifted and we can revisit reduction only after the urgent task. |
646fcc268 Merge pull request #47 from ROCmSoftwarePlatform/develop 6014185ac [Bug Fix] GridwiseGemm_bk0mk1_bk0nk1_mn_xdlops_v2r4 loop issue (#44) 3e9113707 Merge pull request #46 from ROCmSoftwarePlatform/miopen_downstream_all 211dae822 Merge branch 'develop' into miopen_downstream_all 5890e3007 [Composable Kernel] update develop branch code to ck_upstream d5297abae fix bug in gridwise gemm xdlops v2r3 (#45) 38a90b6ed Merge pull request #43 from ROCmSoftwarePlatform/develop c3018794b bug fix (#39) fd49ff808 add nchw atomic , nhwc and nhwc atomic method for backward weight (#30) b2dc55f82 [MIOpen Downstream] Fix Reduction Kernel (#34) b3e8d57d5 Tweak GEMM kernel (#38) 846f462bd Add VectorType support into StaticBuffer (#27) dfb80c4e3 [Enhancements] Several bugfixes and refactoring of dynamic generic reduction (#1156) 8557901d0 Merge pull request #1165 from ROCmSoftwarePlatform/develop f305bebdc Merge pull request #31 from ROCmSoftwarePlatform/miopen_downstream-dynamic_reduction_pr b725e3fc8 Merge remote-tracking branch 'origin/develop' into miopen_downstream-dynamic_reduction_pr 88833bd9a Merge pull request #32 from ROCmSoftwarePlatform/develop df0d68106 :Merge remote-tracking branch 'origin/develop' into CK_upstream f3acd2510 Add a version of Merge transform that use integerdivision and mod (#25) 19613902b GEMM driver and kernel (#29) 627d8ef35 Backward weight v4r4r2 with xdlops (#18) 10bb81106 Misc fixes (#24) 9e80cdceb [SWDEV-281541][MSRCHA-100] Implementation of Dynamic Generic Reduction (#1108) a7a758d8c GlobalAtomicAdd for fp32/int32 (#23) 9d3f634a3 Xdlops refactor fix (#22) c6f26bb48 magic division use __umulhi() (#19) 6fe3627a9 Composable kernel init integration v3 (#1097) a2ad6d353 refactor dynamic xdlops iGemm (#13) ba6f79a75 Added host_conv_wrw for verification (#15) git-subtree-dir: src/composable_kernel git-subtree-split: 646fcc268ede841a16cdaafb68aa64803d8390e1
…duction (#1156) * Squashed 'src/composable_kernel/' content from commit f6edda611 git-subtree-dir: src/composable_kernel git-subtree-split: f6edda6119ebbb237dfa6270797b34f960d7b190 * add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files * Squashed 'src/composable_kernel/' changes from f6edda611..5781adf5c 5781adf5c Update develop (#5) (#6) 97e6d514f Merge pull request #4 from ROCmSoftwarePlatform/separate_online_compile 7b1ec41e5 refactor 49c33aaea refactor 54b3e73d1 rename git-subtree-dir: src/composable_kernel git-subtree-split: 5781adf5cf4ac753e2e36da7385791775b744bf7 * fix * refactor * remove online compilation from CK * refactor * fix * add ctest * tidy * add tidy * tidy * tidy * tidy * tidy * tidy * tidy * tidy * tidy * tidy * add c-style pointer cast * vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast * fix clang warning suppression * tidy * suppress cppcheck * fix enum issue * revert chagnes to hip build * fix kernel filename * update CK build script * rename * rename * make innner product compatiable on gfx900 * Update src/include/miopen/solver/ck_utility_common.hpp Co-authored-by: JD <[email protected]> * compiler parameter use stream * use int instead of index_t in kernel wrapper * DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element * refactor * refactor * change cmakelist * change ck common utility * fix * Squashed 'src/composable_kernel/' changes from 5781adf5c..31b403526 31b403526 Merge pull request #16 from ROCmSoftwarePlatform/develop b62bf8c3f Merge pull request #14 from ROCmSoftwarePlatform/miopen_downstream_init_integration ccc4a1d36 Merge pull request #8 from ROCmSoftwarePlatform/miopen_downstream_init_integration 67ad47e7c refactor 16effa767 refactor a91b68dfc DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element 2cbabbba5 use int instead of index_t in kernel wrapper 0834bc763 compiler parameter use stream f2ac7832c make innner product compatiable on gfx900 4e57b30a6 rename c03045ce2 rename b2589957f update CK build script 2c48039d0 fix kernel filename d626dccc9 fix enum issue 643ebd4f3 tidy ddd49ec9e fix clang warning suppression 4f566c622 vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast 172036d72 add c-style pointer cast 76f313193 tidy d18428901 tidy f885c131d tidy 80120f0a0 tidy c3efeb5e2 tidy 56fc0842b tidy 54fba515b tidy e62bae7a4 tidy 24c872894 add tidy 61487e0a0 fix ae98b52ad remove online compilation from CK cb9542131 refactor 73ca97015 Merge commit '437cc595c6e206dfebb118985b5171bbc1e29eab' into composable_kernel_init_integration_v3 3b8664611 Merge pull request #7 from ROCmSoftwarePlatform/master d09ea4f4e Update develop (#5) 3d32ae940 add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files git-subtree-dir: src/composable_kernel git-subtree-split: 31b403526ec54abf13c4bb58dfb6635b4d2aa619 * Tiny fix in using data type template parameters in blockwise and direct_threadwise kernel * Fix with regard to implementing GetZeroVal() in both kernel and host * Avoid convert to compType from dstDataType before writting the output value * Add half_t support to NumericLimits and make constexpr GetZeroVal() of binary operator * Add CONSTANT decorator for descriptor read buffer * Use get_thread_local_1d_id() for thread local Id * Rename GetZeroVal() to GetReductionZeroVal() in the kernels * Remove constexpr from initialized zeroVal and tiny fix in reduction_operator.hpp * Occasional tiny simplification and update in the kernel files * Update in src/reducetensor.cpp for consistent IDs passing to the kernel * Update to re-order tensor dimensions on the host, split second_call kernel wrapper files and simplify reduce_all kernel wrappers * Update to remove OpenCL tidy checking failures * Small updates in src/reducetensor.cpp * Update for better readability * Remove unused codes and not-needed template parameters in the kernel wrappers Co-authored-by: Chao Liu <[email protected]> Co-authored-by: JD <[email protected]>
This P.R fixes several visible bugs in dynamic generic reduction, and commits are also added to refactor the codes following the review from 54615.
Here are my explanations about three commits
dstDataType
is converted tocompType
before writing out, this could cause losing of precision whensrcDataType
andcompType
are both half, while thedstDataType
is float, even though this situation might not be common,miopenReduceTensor()
does support itThe commits have passed the following testing locally on MI100
bin/test_reduce_test --all
bin/test_reduce_test --all --half
bin/test_reduce_test --all --double
MIOpenDriver script 1 for non-indexable operations
MIOpenDriver script 2 for indexable operations