【Some Confusion about nd_range's API】 #1989

Closed
ddummkopfer opened this issue Jun 26, 2020 · 9 comments

@ddummkopfer

ddummkopfer commented Jun 26, 2020

I found some examples using SYCL's nd_range class.
I am confused about some of its APIs, like 'get_global_range', 'get_local_range', 'get_group_range' and 'get_offset'.
How should I understand the global range, local range and group range?
Does the global range mean the range of the work-group?
Does the local range mean the range of the work-item?
What does the group range mean?

cl::sycl::nd_range<3> three_dim_nd_range({32, 64, 128}, {16, 32, 64});

assert(three_dim_nd_range.get_global_range() == cl::sycl::range<3>(32, 64, 128));
assert(three_dim_nd_range.get_local_range() == cl::sycl::range<3>(16, 32, 64));
assert(three_dim_nd_range.get_group_range() == cl::sycl::range<3>(2, 2, 2));
assert(three_dim_nd_range.get_offset() == cl::sycl::id<3>(0, 0, 0));

@AlexeySachkov
Contributor

Hi @Chenzejun-Dummkopf,

I wish the SYCL spec were more detailed about this. You should probably ask the same question in KhronosGroup/SYCL-Docs so that the spec can be improved in future revisions.

How should I understand the global range, local range and group range?

nd_range is inherited from the OpenCL ND-range execution model; if you are not yet familiar with it, I recommend reading the OpenCL 1.2 spec, Section 3.2, Execution Model.

Does the global range mean the range of the work-group?

The first argument is a range that describes the total number of work-items that should execute a kernel; get_global_range can be used to retrieve that range.

Does the local range mean the range of the work-item?

The second argument is a range that describes the number/shape of work-items that form a work-group - it can be retrieved via get_local_range.

What does the group range mean?

get_group_range returns a range corresponding to the number of work-groups that execute a kernel. It can be calculated as get_global_range() / get_local_range(): the total number of work-items in each dimension is divided by the work-group size in that dimension to obtain the number of work-groups in the ND-range.
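
To make the relationship concrete, here is a minimal sketch of my own (not from the spec; the sizes and the kernel name show_ranges are arbitrary) showing how the same three ranges look from inside a kernel via nd_item:

#include <CL/sycl.hpp>
#include <cassert>

int main() {
  cl::sycl::queue q;
  // 8 work-items in total, grouped into work-groups of 4
  cl::sycl::nd_range<1> ndr(cl::sycl::range<1>(8), cl::sycl::range<1>(4));
  assert(ndr.get_group_range() == cl::sycl::range<1>(2)); // 8 / 4 = 2 work-groups

  q.submit([&](cl::sycl::handler &h) {
    h.parallel_for<class show_ranges>(ndr, [=](cl::sycl::nd_item<1> it) {
      // Every work-item of this kernel observes the same three ranges:
      //   it.get_global_range()[0] == 8  (total work-items)
      //   it.get_local_range()[0]  == 4  (work-items per work-group)
      //   it.get_group_range()[0]  == 2  (number of work-groups)
      (void)it;
    });
  });
  q.wait();
  return 0;
}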

@ddummkopfer
Author

ddummkopfer commented Jun 27, 2020

Thank you for your helpful answers. :)
Now I understand the meaning of the nd_range APIs.
For the constructor of 'three_dim_nd_range', the first argument 'global_range' means the range of all work-items.
So {32, 64, 128} means the total number of work-items is 32x64x128 = 262144.
The second argument 'local_range' means the range of the work-items in a single work-group.
So {16, 32, 64} means the number of work-items in a single work-group is 16x32x64 = 32768.
So the number of work-groups is 262144/32768 = 8. There are 8 work-groups; per dimension, that is (2, 2, 2).
Is my understanding right?

I have now filed the same issue with the Khronos Group and hope this point can be made explicit in the next version of the SYCL spec.
Thank you so much!

@ddummkopfer
Author

ddummkopfer commented Jun 29, 2020

Now I have another question about DPC++, which is mainly based on SYCL.
Here is some DPC++ code:

queue q;
auto e1 = q.memcpy(d_a, a, sizeof(int)*N);
auto e2 = q.submit([&](handler &h) {
  h.depends_on(e1);
});

I find that the handler class has no API called 'depends_on()' in the SYCL 1.2 spec.
The queue class has no API called 'memcpy' in the SYCL 1.2 spec.
Does DPC++ use some extensions?
Another question: how do I compile DPC++ code?
I use clang++ -fsycl to compile SYCL code.

@AlexeySachkov
Contributor

For the constructor of 'three_dim_nd_range', the first argument 'global_range' means the range of all work-items.
So {32, 64, 128} means the total number of work-items is 32x64x128 = 262144.
The second argument 'local_range' means the range of the work-items in a single work-group.
So {16, 32, 64} means the number of work-items in a single work-group is 16x32x64 = 32768.
So the number of work-groups is 262144/32768 = 8. There are 8 work-groups; per dimension, that is (2, 2, 2).
Is my understanding right?

Yes, everything seems to be correct.

Now I have another question about DPC++, which is mainly based on SYCL.

I find that the handler class has no API called 'depends_on()' in the SYCL 1.2 spec.
The queue class has no API called 'memcpy' in the SYCL 1.2 spec.
Does DPC++ use some extensions?

Yes, DPC++ is based on SYCL 1.2.1, but it brings several extensions on top of it to leverage different HW capabilities and useful functionality. You can find more details and the full list of extensions in the oneAPI spec.

This one is from the USM extension.
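
For reference, here is a hedged, self-contained sketch of that pattern of my own (not from this thread). It uses the SYCL 2020 spellings of the USM APIs that started out as DPC++ extensions; N and the kernel body are arbitrary placeholders:

#include <sycl/sycl.hpp>

int main() {
  constexpr size_t N = 1024;
  sycl::queue q;
  int *a = sycl::malloc_host<int>(N, q);     // host USM allocation
  int *d_a = sycl::malloc_device<int>(N, q); // device USM allocation
  for (size_t i = 0; i < N; ++i) a[i] = static_cast<int>(i);

  // Asynchronous copy to the device; the returned event identifies the copy
  sycl::event e1 = q.memcpy(d_a, a, sizeof(int) * N);

  sycl::event e2 = q.submit([&](sycl::handler &h) {
    h.depends_on(e1); // the kernel may not start before the copy has finished
    h.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) { d_a[i] += 1; });
  });
  e2.wait();

  sycl::free(d_a, q);
  sycl::free(a, q);
  return 0;
}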

Another question: how do I compile DPC++ code?
I use clang++ -fsycl to compile SYCL code.

This is enough. Right now all extensions are enabled by default, so clang++ -fsycl will compile DPC++ code just fine. There is an activity to add more controls (#806), but it hasn't been implemented yet.

@ddummkopfer
Author

ddummkopfer commented Jun 30, 2020

Thank you for your attention and efforts!

Here is some illustration about work-items in an ND-range from a book about DPC++.

'Second, although the work-items in a work-group are scheduled concurrently, they are not guaranteed to make independent forward progress — executing the work-items within a work-group sequentially between barriers and collectives is a valid implementation.'

  1. How should I understand the words 'barriers' and 'collectives' here?

  2. For example, assume that I have a total of 32 compute units in my single dGPU. Now I want to use DPC++ to compute an array whose size is 40. How does DPC++ guarantee that the 40 elements of this array are processed concurrently? Or are a[0] ~ a[31] processed concurrently first, and a[32] ~ a[39] processed afterwards?

@AlexeySachkov
Contributor

I think it would be better to answer your questions in reverse order:

For example, assume that I have a total of 32 compute units in my single dGPU. Now I want to use DPC++ to compute an array whose size is 40. How does DPC++ guarantee that the 40 elements of this array are processed concurrently? Or are a[0] ~ a[31] processed concurrently first, and a[32] ~ a[39] processed afterwards?

If I understand your question correctly: there is no guarantee about how the work-items of your kernel are executed. It is implementation-defined whether they will all be executed by a single compute unit sequentially or by several compute units in parallel. And even if several compute units are used to execute some of the work-items, some of them might still be executed sequentially. So no ordering is defined at all: even for sequential execution, the implementation is free to start from the end of the ND-range (the max value of the global id) rather than from the beginning (global id = 0).

I will try to illustrate that, but I'm afraid my examples won't be the best ones. Tagging @Pennycook here - I believe John can explain it much better than me.

// Sorry, it is easier for me to write OpenCL C code than SYCL code, but I hope the idea I want to express is clear
__kernel void test(__global int *a, __global int *b) {
  int id = get_global_id(0);
  if (id != 0)
    a[id] = b[id] + 3; // perform some calculations
  else
    for (int i = 0; i < get_local_size(0); ++i)
      a[0] += a[i]; // This code is incorrect. There is no guarantee that work-item "i" has been executed already
}

How should I understand the words 'barriers' and 'collectives' here?

In the example above I tried to implement some kind of reduction algorithm: the element of a that corresponds to the first work-item in each work-group contains a partial sum of the elements calculated by the other work-items. Note that I want the work-items to co-operate, share results and wait for each other, but the execution order is not guaranteed. There is, however, a way to perform some level of synchronization between work-items:

  • barrier is a special function which guarantees that all work-items in a work-group have reached a certain point in a kernel before any of them proceeds further
  • collectives are algorithms that allow several work-items to perform some operations "synchronously", like a reduction of values from different work-items in a work-group. I will let @Pennycook comment here and share some good examples

Here is the example above rewritten with a barrier so that it is correct:

__kernel void test(__global int *a, __global int *b) {
  int id = get_global_id(0);
  if (id != 0)
    a[id] = b[id] + 3; // perform some calculations

  barrier(CLK_GLOBAL_MEM_FENCE); // you are guaranteed that each work-item proceeds further in the kernel *only* once all work-items in its work-group have hit this barrier call
  if (id == 0)
    for (int i = 0; i < get_local_size(0); ++i)
      a[0] += a[i]; // This code is correct: the barrier above guarantees that all work-items have already completed the calculations at the beginning of the kernel
}
// Note: code above is not meant to be performant, it just shows the idea of `barrier` built-in

Hope this helps. @Pennycook, please correct me if I'm wrong somewhere or if my explanations/examples are unclear.

@Pennycook
Contributor

@AlexeySachkov's explanation is great, so I don't have much to add here.

There's also a subtle difference between concurrency and parallelism as used in the specification and the book, which may be the cause of some confusion. When we say that the work-items in one group are executed concurrently, we really just mean that a runtime must schedule them in a way that allows for cooperation as in Alexey's examples. It wouldn't be valid to try and execute the whole kernel for each work-item sequentially, because the first work-item to hit a barrier would wait at the barrier forever -- the other work-items aren't running. But it would be valid to execute each section of the kernel between barriers, and to switch between which work-item is being executed whenever a barrier is encountered. One way I like to think of this is by imagining work-items as fibers or co-routines, where each barrier or collective acts like a yield statement.

As for an example of the collectives, they're really just shorthands for common patterns that require barriers on entry and exit. Building on Alexey's summation example and switching to DPC++ syntax:

// Each work-item in the group has a value x to contribute to a work-group sum
// (partial is assumed to be a work-group-local array, e.g. a local accessor)
int lid = it.get_local_id();
partial[lid] = x;

// Barrier before the reduction ensures the partial results are visible to all other work-items in the work-group
it.barrier();
if (lid == 0) { // only one work-item combines the results, to avoid a data race
  for (int i = 1; i < it.get_local_range()[0]; ++i) {
    partial[0] += partial[i];
  }
}

// Barrier after the reduction ensures the final sum is visible to all work-items in the work-group
// partial[0] contains the final sum
it.barrier();

There's a barrier required at the beginning of the reduction and at the end, and some computation happens between the barriers to combine the results. There are more efficient ways to implement this combination step, but we can ignore that, because DPC++ provides a library function for this pattern:

// Each work-item in the group has a value x to contribute to a work-group sum
// sum contains the final sum
float sum = reduce(it.get_group(), x, plus<>());

A correct implementation of reduce relies on it having the same execution model guarantees as a barrier. All the work-items must be present to contribute their value to the reduction, and they must wait for all other work-items to encounter the function and complete the reduction before they can proceed past it. That's why the execution model section of the book talks about collectives as well as barriers. You can read more about the other collectives supported in DPC++ here.
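
For completeness, here is a hedged, self-contained sketch of the same work-group sum; it assumes the SYCL 2020 spelling of the collective, reduce_over_group, and the array size and work-group size are arbitrary choices for illustration:

#include <sycl/sycl.hpp>
#include <cstdio>

int main() {
  constexpr size_t N = 16;
  sycl::queue q;
  float *data = sycl::malloc_shared<float>(N, q);
  float *group_sums = sycl::malloc_shared<float>(N / 8, q);
  for (size_t i = 0; i < N; ++i) data[i] = 1.0f;

  q.parallel_for(sycl::nd_range<1>{N, 8}, [=](sycl::nd_item<1> it) {
    float x = data[it.get_global_id(0)];
    // Every work-item in the group must reach this call, just like a barrier
    float sum = sycl::reduce_over_group(it.get_group(), x, sycl::plus<>());
    if (it.get_local_id(0) == 0)
      group_sums[it.get_group(0)] = sum; // one result per work-group
  }).wait();

  for (size_t g = 0; g < N / 8; ++g)
    std::printf("group %zu sum = %f\n", g, group_sums[g]);

  sycl::free(group_sums, q);
  sycl::free(data, q);
  return 0;
}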

@ddummkopfer
Author

Thank you both so much for your helpful answers! :)

  1. Now I understand that the 'barrier' is like a 'gate': all the work-items reach the closed gate one after another, and only then does the gate open. It is used for synchronization. Is my understanding right?

  2. 'Concurrency' means the runtime may schedule the work-items sequentially; it can't guarantee that all the work-items within a single work-group will be executed simultaneously. Is my understanding right?

  3. Another question: how can we make the work-items in a single work-group execute in the order we arrange? Are there some examples written in DPC++?

  4. Where can I get the official DPC++ examples for learning?

Thank you for your attention!
