[SYCL][CUDA][LIT] Extend test for non-OpenCL #1654
Conversation
If that is the only problem with CUDA in this test, then please move those 100 lines of code checking reqd_wg_size into a new test.
The rest of the existing test would stay enabled for CUDA (roughly 575 - 100 = 475 lines after separating the 100 lines that check reqd_wg_size).
It isn't the only OpenCL-specific part of the test:
I will extract the non-OpenCL tests, though it seems there won't be much to extract.
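For readers outside the thread: a minimal sketch of the kind of reqd_wg_size check being discussed, assuming a SYCL 1.2.1-style setup; the kernel name and exact sizes here are illustrative, not the actual test's.

#include <CL/sycl.hpp>
using namespace cl::sycl;

// Illustrative kernel functor that requires a 4x4x4 work-group size.
struct ReqdWGSizeKernel {
  [[cl::reqd_work_group_size(4, 4, 4)]] void operator()(nd_item<3>) const {}
};

int main() {
  queue q;
  try {
    // Launching with an 8x8x8 local size contradicts the attribute, so the
    // runtime (or the backend, via piEnqueueKernelLaunch) should report a
    // synchronous error.
    q.submit([&](handler &cgh) {
      cgh.parallel_for(nd_range<3>(range<3>(16, 16, 16), range<3>(8, 8, 8)),
                       ReqdWGSizeKernel{});
    });
    q.wait_and_throw();
  } catch (const nd_range_error &) {
    // Expected on conforming backends.
  }
  return 0;
}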
Work on this is delayed due to a higher-priority bug. Extracting as many tests as possible also requires me to better understand how SYCL on top of different OpenCL versions (1.2 vs 2.x) differs, so I can create a more generic test for non-OpenCL backends (CUDA).
Force-pushed from 43ab921 to d7d9bbc.
As PI is "inspired" by OpenCL, even non-OpenCL PI backends should behave like OpenCL 1.2 with regard to kernel launch error reporting. I have changed the PR to:
To ease checking the logical (non-formatting) changes to the LIT test, it helps to look at the separate commits.
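Roughly, the OpenCL 1.2 semantics referred to here: a launch whose local size does not evenly divide the global size (non-uniform work groups) must fail up front. Below is a sketch of such validation in PI style; the function name and exact error codes are illustrative, not the actual pi_cuda.cpp code.

// Sketch of OpenCL 1.2-style NDRange validation a PI backend could perform
// in piEnqueueKernelLaunch. Names and error codes are illustrative.
pi_result validateNDRange(pi_uint32 work_dim, const size_t *global_work_size,
                          const size_t *local_work_size,
                          const size_t *max_work_item_sizes) {
  if (!local_work_size)
    return PI_SUCCESS; // the implementation is free to pick the local size
  for (pi_uint32 i = 0; i < work_dim; ++i) {
    if (local_work_size[i] == 0 ||
        local_work_size[i] > max_work_item_sizes[i])
      return PI_INVALID_WORK_ITEM_SIZE;
    // OpenCL 1.2: the local size must evenly divide the global size,
    // i.e. all work groups are uniform.
    if (global_work_size[i] % local_work_size[i] != 0)
      return PI_INVALID_WORK_GROUP_SIZE;
  }
  return PI_SUCCESS;
}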
I am not sure why …
Looks good.
I think the title of this PR should be updated to summarize the new meaning of the patch.
In the buildbot stdout output I am seeing:
Testing Time: 406.93s
Unsupported : 39
Passed : 319
Expectedly Failed: 6
llvm-lit.py: D:\BuildBot\Product_worker_intel\sycl-win-x64-pr\llvm.obj\bin\..\..\llvm.src\llvm\utils\lit\lit\llvm\config.py:345: note: using clang: d:\buildbot\product_worker_intel\sycl-win-x64-pr\llvm.obj\bin\clang.exe
llvm-lit.py: D:/BuildBot/Product_worker_intel/sycl-win-x64-pr/llvm.src/sycl/test/lit.cfg.py:77: note: Backend (SYCL_BE): PI_OPENCL
llvm-lit.py: D:/BuildBot/Product_worker_intel/sycl-win-x64-pr/llvm.src/sycl/test/lit.cfg.py:131: note: Found available CPU device
llvm-lit.py: D:/BuildBot/Product_worker_intel/sycl-win-x64-pr/llvm.src/sycl/test/lit.cfg.py:157: note: Found available GPU device
llvm-lit.py: D:/BuildBot/Product_worker_intel/sycl-win-x64-pr/llvm.src/sycl/test/lit.cfg.py:186: warning: Accelerator device not found
llvm-lit.py: D:/BuildBot/Product_worker_intel/sycl-win-x64-pr/llvm.src/sycl/test/lit.cfg.py:200: note: Using opencl-aot version which is built as part of the project
llvm-lit.py: D:/BuildBot/Product_worker_intel/sycl-win-x64-pr/llvm.src/sycl/test/lit.cfg.py:215: warning: Couldn't find pre-installed AOT device compiler ocloc
llvm-lit.py: D:/BuildBot/Product_worker_intel/sycl-win-x64-pr/llvm.src/sycl/test/lit.cfg.py:212: note: Found pre-installed AOT device compiler aoc
llvm-lit.py: D:/BuildBot/Product_worker_intel/sycl-win-x64-pr/llvm.src/sycl/test/Unit/lit.cfg.py:71: note: Backend (SYCL_BE): PI_OPENCL
2 warning(s) in tests
-- Testing: 364 tests, 12 workers --
command timed out: 1200 seconds without output running [b'python3', b'llvm_ci/intel/worker/tools/build.py', b'-n', b'3574', b'-b', b'pull/1654/head', b'-r', b'1654', b'-t', b'check-sycl', b'-p', b'sycl', b'-s', b'lit', b'-P', b'intel/llvm', b'-m', b'sycl-win-x64-pr', b'-e', b'd7d9bbc1c847c79d89347b91ddf98a2598689ee6', b'-U', b'http://ci.llvm.intel.com:8010/#/builders/18/builds/3574'], attempting to kill
program finished with exit code 1
elapsedTime=1882.897134
Testing: 0.. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80.. 90..
Does that mean that the first set of tests passed for OpenCL? I guess the following run is for Level0 (as this buildbot run does not seem to test CUDA)? Level0 passed for me on Linux when I tested this. How can I work around this for now? Should I add a …
It seems so. Actually it's quite confusing. I see that on Linux the Level Zero back-end is tested first and the OpenCL back-end second. The Windows buildbot order is reversed.
I think this or …
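For reference, the workaround eventually adopted (see the commit message later in this thread) is a LIT directive in the test header along these lines; the comment wording here is mine, not the committed one.

// UNSUPPORTED: windows
// Level Zero runs of this test time out on the Windows buildbot; skip the
// test on Windows until the timeout is understood.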
Force-pushed from d7d9bbc to e212b19.
I have updated the if-brackets and added …
@bjoernknafla, could you apply the clang-format patch, please?
Force-pushed from e212b19 to 9695efb.
Looks good, but there are conflicts in parallel_for_range.cpp that need resolution.
I'll resolve the conflicts once my keyboard has recovered from the water I spilled all over it...
sycl/plugins/cuda/pi_cuda.cpp (outdated)

// Determine local work sizes that result in uniform work groups.
for (size_t i = 0; i < work_dim; i++) {
  threadsPerBlock[i] =
      std::min(static_cast<int>(maxThreadsPerBlock[i]),
               std::min(static_cast<int>(global_work_size[i]),
                        threadsPerBlock[i]));
}
I realized that if work_dim > 1, the iterations of the loop at L2310 do no useful work for i = 1 and i = 2, because at the beginning of the function threadsPerBlock is initialized as {32, 1, 1}. Also, global_work_size[i] is always > 0.
Thus for i = 1 and i = 2:
tmp1 = min(threadsPerBlock[i] /* == 1 */, global_work_size[i] /* > 0 */); // tmp1 is always set to 1
threadsPerBlock[i] = min(maxThreadsPerBlock[i], tmp1); // always set to 1
Yes. I wrote this as general code and then left it to keep working should the default values change. This pessimizes performance.
I'll remove the unnecessary loop iterations in a way that will make it easy to reintroduce the loop if necessary later - this might look a bit funny (professional term...).
I simplified the code and added a comment about only processing the first dimension due to the default values.
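The simplified shape could look roughly like this; a sketch of the idea under the {32, 1, 1} default described above, not the exact committed code.

// Only dimension 0 needs computing: threadsPerBlock starts as {32, 1, 1},
// so dimensions 1 and 2 already hold their final value of 1. Reintroduce
// the loop over work_dim if those defaults ever change.
threadsPerBlock[0] = std::min(
    static_cast<int>(maxThreadsPerBlock[0]),
    std::min(static_cast<int>(global_work_size[0]), threadsPerBlock[0]));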
Force-pushed from 9695efb to c1bc04a.
Clang format checking failed due to a server not being reachable.
Interesting... it looks like something went wrong with this particular package.
@romanovvlad, ping.
Force-pushed from c1bc04a to 22948c0.
The DPCPP runtime relies on piEnqueueKernelLaunch for NDRange parameter validity checks. Add missing checks to the PI CUDA backend. Signed-off-by: Bjoern Knafla <[email protected]>
Rewrite parallel_for_range.cpp test to work with non-OpenCL PI backends that behave like OpenCL 1.2. Level0 testing times out on Windows, mark `windows` as unsupported. Signed-off-by: Bjoern Knafla <[email protected]>
Signed-off-by: Bjoern Knafla <[email protected]>
Force-pushed from 22948c0 to f8b72c7.
Mark LIT test that uses cl::reqd_work_group_size as unsupported by
CUDA.
Signed-off-by: Bjoern Knafla [email protected]