Some Confusion about nd_range's API #1989
Hi @Chenzejun-Dummkopf, I wish the SYCL spec were more detailed about this. You should probably ask the same question in KhronosGroup/SYCL-Docs so the spec will be improved in future revisions
The first argument is a global range (the total number of work-items in each dimension).
The second argument is a local range (the number of work-items per work-group in each dimension).
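To illustrate the relationship between the two arguments, here is a plain C++ sketch (not SYCL code): `group_range` is a hypothetical helper mirroring how `nd_range` derives its group range from the global and local ranges, namely an element-wise division.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper modeling nd_range<3>: the group range is the
// global range divided element-wise by the local range.
std::array<std::size_t, 3> group_range(std::array<std::size_t, 3> global_range,
                                       std::array<std::size_t, 3> local_range) {
    std::array<std::size_t, 3> groups{};
    for (std::size_t i = 0; i < 3; ++i) {
        // The global range must be divisible by the local range in each dimension.
        assert(global_range[i] % local_range[i] == 0);
        groups[i] = global_range[i] / local_range[i];
    }
    return groups;
}
```

With the values from the example in this thread, `group_range({32, 64, 128}, {16, 32, 64})` yields `{2, 2, 2}`, matching `get_group_range()`.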
Thank you for your helpful answers. :) I have now raised the same issue with the Khronos Group and hope they can make this point explicit in the next version of the SYCL spec.
Now I have another question about DPC++, which is mainly based on SYCL. Given `queue q;`, I find that the class handler has no API called 'depends_on()' in the SYCL 1.2 spec.
Yes, everything seems to be correct
Yes, DPC++ is based on SYCL 1.2.1, but it adds several extensions to it to leverage different hardware capabilities and useful functionality. You can find more details and the full list of extensions in the oneAPI spec. This one is from the USM extension.
This is enough. Right now all extensions are enabled by default.
Thank you for your attention and efforts! Here is an illustration about work-items in an ND-range from the book about DPC++: 'Second, although the work-items in a work-group are scheduled concurrently, they are not guaranteed to make independent forward progress — executing the work-items within a work-group sequentially between barriers and collectives is a valid implementation.'
I think it would be better to answer your questions in reverse order:

If I understand correctly, the work-items of your kernel are executed somehow; i.e. it is implementation-defined whether they will all be executed by a single compute unit sequentially, or by several compute units in parallel. And even if several compute units are used to execute some number of work-items, some of them might still be executed sequentially. So, no ordering is defined at all: even for sequential execution the implementation is free to start from the end of an ND-range (max value of global id) rather than from the beginning (global id = 0).

I will try to illustrate that, but I'm afraid my examples won't be the best ones. Tagging @Pennycook here - I believe John can explain it much better than me.

// Sorry, it is easier for me to write OpenCL C code than SYCL code, but I hope the idea I want to express is clear
__kernel void test(__global int *a, __global int *b) {
int id = get_global_id(0);
if (id != 0)
a[id] = b[id] + 3; // perform some calculations
else
for (int i = 0; i < get_local_size(0); ++i)
a[0] += a[i]; // This code is incorrect. There is no guarantee that work-item "i" has been executed already
}
In the example above I tried to implement some kind of reduction algorithm: each work-item computes a partial result, and the results are then accumulated into `a[0]`, but without synchronization that accumulation is broken.
The same example, rewritten with a barrier:
__kernel void test(__global int *a, __global int *b) {
int id = get_global_id(0);
if (id != 0)
a[id] = b[id] + 3; // perform some calculations
barrier(CLK_GLOBAL_MEM_FENCE); // you have a guarantee that each work-item will proceed past this point *only* when all work-items in the work-group have hit this barrier call
if (id == 0)
for (int i = 0; i < get_local_size(0); ++i)
a[0] += a[i]; // This code is correct. We have a barrier above, which means that all work-items already completed calculations at the beginning of a kernel
}
// Note: the code above is not meant to be performant; it just shows the idea of the `barrier` built-in

Hope this helps. @Pennycook, please correct me if I'm wrong somewhere or my explanations/examples are unclear.
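The "sequential between barriers" execution model quoted from the book can be sketched in plain C++ (no SYCL here; `run_group` is a hypothetical model in which work-items are loop iterations and the barrier is simply the boundary between the two loops, so everything before the barrier finishes for all work-items before anything after it runs):

```cpp
#include <vector>

// Model of one work-group executing the corrected kernel sequentially.
// Work-item "id" is one iteration of a loop; the barrier is modeled by
// finishing the first loop entirely before starting the second.
int run_group(const std::vector<int>& b) {
    const int local_size = static_cast<int>(b.size());
    std::vector<int> a(local_size, 0);

    // Region 1: every work-item except 0 performs its own calculation.
    for (int id = 1; id < local_size; ++id)
        a[id] = b[id] + 3;

    // --- barrier: at this point all work-items have finished region 1 ---

    // Region 2: work-item 0 combines the results.
    for (int i = 0; i < local_size; ++i)
        a[0] += a[i];
    return a[0];
}
```

Running the regions in the opposite per-work-item order (all of region 1, then all of region 2) is exactly what makes this a valid implementation despite being fully sequential.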
@AlexeySachkov's explanation is great, so I don't have much to add here.

There's also a subtle difference between concurrency and parallelism as used in the specification and the book, which may be the cause of some confusion. When we say that the work-items in one group are executed concurrently, we really just mean that a runtime must schedule them in a way that allows for cooperation, as in Alexey's examples. It wouldn't be valid to try and execute the whole kernel for each work-item sequentially, because the first work-item to hit a barrier would wait at the barrier forever -- the other work-items aren't running. But it would be valid to execute each section of the kernel between barriers, and to switch between which work-item is being executed whenever a barrier is encountered. One way I like to think of this is by imagining work-items as fibers or co-routines, where each barrier or collective acts like a yield.

As for an example of the collectives, they're really just shorthands for common patterns that require barriers on entry and exit. Building on Alexey's summation example and switching to DPC++ syntax:

// Each work-item in the group has a value x to contribute to a work-group sum
int lid = it.get_local_id(0);
partial[lid] = x;
// Barrier before the reduction ensures the partial results are visible to all other work-items in the work-group
it.barrier();
if (lid == 0) {
  for (int i = 1; i < it.get_local_range()[0]; ++i) {
    partial[0] += partial[i];
  }
}
// Barrier after the reduction ensures the final sum is visible to all work-items in the work-group
// partial[0] contains the final sum
it.barrier();

There's a barrier required at the beginning of the reduction and at the end, and some computation happens between the barriers to combine the results. There are more efficient ways to implement this combination step, but we can ignore that, because DPC++ provides a library function for this pattern:

// Each work-item in the group has a value x to contribute to a work-group sum
// sum contains the final sum
float sum = reduce(it.get_group(), x, plus<>());

A correct implementation of `reduce` includes those entry and exit barriers internally.
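For the "more efficient combination step" mentioned above, a common approach is a tree reduction with a barrier between phases. Here is a plain C++ sketch (not SYCL; `tree_reduce` is a hypothetical serial model where each pass over `stride` stands for one barrier-separated phase, and a power-of-two group size is assumed):

```cpp
#include <cstddef>
#include <vector>

// Tree reduction over a work-group's partial results: log2(n) phases
// instead of one serial loop. Each phase halves the number of active
// work-items; a barrier would separate consecutive phases on a device.
int tree_reduce(std::vector<int> partial) {
    const std::size_t n = partial.size(); // assumed to be a power of two
    for (std::size_t stride = n / 2; stride > 0; stride /= 2) {
        // One phase: work-items with lid < stride combine pairs of elements.
        for (std::size_t lid = 0; lid < stride; ++lid)
            partial[lid] += partial[lid + stride];
        // --- barrier between phases ---
    }
    return partial[0];
}
```

The point of the library collective is precisely that the implementation is free to pick a combination strategy like this one, tuned for the hardware, while the user only writes the one-line `reduce` call.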
Thank you both so much for your meaningful answers! :)
Thank you for your attention!
General DPC++/SYCL user questions are better asked at
I found some examples of SYCL's class nd_range.
I am confused about some APIs like 'get_global_range', 'get_local_range', 'get_group_range' and 'get_offset'.
How should I understand the global range, local range and group range?
Does the global range mean the range of the work-groups?
Does the local range mean the range of the work-items?
What does the group range mean?
cl::sycl::nd_range<3> three_dim_nd_range({32, 64, 128}, {16, 32, 64});
assert(three_dim_nd_range.get_global_range() == cl::sycl::range<3>(32, 64, 128));
assert(three_dim_nd_range.get_local_range() == cl::sycl::range<3>(16, 32, 64));
assert(three_dim_nd_range.get_group_range() == cl::sycl::range<3>(2, 2, 2));
assert(three_dim_nd_range.get_offset() == cl::sycl::id<3>(0, 0, 0));