
[SYCL][CUDA][LIT] Extend test for non-OpenCL #1654

Conversation

bjoernknafla
Contributor

Mark LIT test that uses cl::reqd_work_group_size as unsupported by
CUDA.

Signed-off-by: Bjoern Knafla [email protected]

@bjoernknafla bjoernknafla requested a review from a team as a code owner May 7, 2020 14:44
@bader bader added the cuda CUDA back-end label May 7, 2020
Contributor

@v-klochkov v-klochkov left a comment

If that is the only CUDA-related problem in this test, then please move those 100 lines of code checking reqd_wg_size into a new test.

The rest of the existing test would stay enabled for CUDA (about 575-100 lines after separating those 100 lines checking reqd_wg_size).

@v-klochkov v-klochkov requested a review from bader May 7, 2020 19:05
@bjoernknafla
Contributor Author

It isn't the only OpenCL-specific part of the test:

  • many tests build the program via .build_with_kernel_type with a -cl-std=CL2.0 flag,
  • other tests check the device version info with a hard-coded check for OpenCL (if (OCLVersion...).

I will extract the non-OpenCL tests, though it seems there won't be much to extract.
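
For illustration, a minimal sketch of the two OpenCL-specific patterns listed above (the function and kernel names are hypothetical, not taken from the actual test):

```cpp
// Hypothetical sketch, not the real test code.
#include <CL/sycl.hpp>
#include <string>

void opencl_specific_patterns(cl::sycl::queue &Queue) {
  // 1) Building the program with an OpenCL-only flag; this has no meaning
  //    for a non-OpenCL back-end such as CUDA.
  cl::sycl::program Prog(Queue.get_context());
  Prog.build_with_kernel_type<class SomeKernel>("-cl-std=CL2.0");

  // 2) Hard-coded check of the OpenCL version string reported by the device.
  std::string OCLVersion =
      Queue.get_device().get_info<cl::sycl::info::device::version>();
  if (OCLVersion.find("OpenCL 1.2") != std::string::npos) {
    // OpenCL-1.2-specific expectations would go here.
  }
}
```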

@bjoernknafla
Contributor Author

bjoernknafla commented May 28, 2020

Work on this is delayed due to a higher-priority bug, and extracting as many tests as possible requires me to better understand how SYCL behaves on top of different OpenCL versions (1.2 vs 2.x), so I can create a more generic test for non-OpenCL backends (CUDA).

@bjoernknafla bjoernknafla force-pushed the bjoern/cuda-does-not-support-reqd-work-group-size branch from 43ab921 to d7d9bbc Compare July 24, 2020 19:26
@bjoernknafla bjoernknafla requested a review from a team as a code owner July 24, 2020 19:26
@bjoernknafla
Contributor Author

As PI is "inspired" by OpenCL, even non-OpenCL PI backends should behave like OpenCL 1.2 in regard to kernel launch error reporting.

I have changed the PR to:

  • Add more error detection and reporting to PI CUDA.
  • Adapt the parallel_for_range.cpp LIT tests to work with non-OpenCL PI backends. This way we can avoid duplicating a lot of test code that is tightly coupled to sycl/source/detail/error_handling/enqueue_kernel.cpp#L25.
  • Clang-format the LIT test in a separate commit.

To ease reviewing the logical (non-formatting) changes to the LIT test, it helps to look at the separate commits.
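
For context, here is a hedged sketch of the kind of OpenCL-1.2-style NDRange check meant by "more error detection and reporting"; the enum and function names are invented for illustration and do not claim to match the actual PI CUDA code:

```cpp
#include <cstddef>

// Hypothetical result codes mirroring OpenCL's CL_SUCCESS /
// CL_INVALID_WORK_GROUP_SIZE; the real PI error codes may differ.
enum HypotheticalPiResult { Success = 0, InvalidWorkGroupSize = 1 };

// OpenCL 1.2 requires uniform work-groups: if a local size is given, it must
// evenly divide the global size in every dimension.
HypotheticalPiResult validateNDRange(size_t work_dim,
                                     const size_t *global_work_size,
                                     const size_t *local_work_size) {
  if (!local_work_size)
    return Success; // the runtime is free to pick a local size itself
  for (size_t i = 0; i < work_dim; ++i) {
    if (local_work_size[i] == 0 ||
        global_work_size[i] % local_work_size[i] != 0)
      return InvalidWorkGroupSize;
  }
  return Success;
}
```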

bader
bader previously approved these changes Jul 26, 2020
@bjoernknafla
Contributor Author

I am not sure why buildbot/sycl-win-x64-pr failed and what to do about it - it seemed to have timed out but it produced the test results.

@romanovvlad
Contributor

> I am not sure why buildbot/sycl-win-x64-pr failed and what to do about it - it seemed to have timed out but it produced the test results.

There are several LIT runs for different backends. So testing passed for one BE but "hangs" for another. I restarted the job and it failed again, so I believe there is some bug in your patch.

v-klochkov
v-klochkov previously approved these changes Jul 29, 2020
Contributor

@v-klochkov v-klochkov left a comment

Looks good.
I think the title of this PR should be updated to summarize the new scope of the patch.

@bjoernknafla
Contributor Author

In the buildbot stdout output I am seeing:

Testing Time: 406.93s
  Unsupported      :  39
  Passed           : 319
  Expectedly Failed:   6
llvm-lit.py: D:\BuildBot\Product_worker_intel\sycl-win-x64-pr\llvm.obj\bin\..\..\llvm.src\llvm\utils\lit\lit\llvm\config.py:345: note: using clang: d:\buildbot\product_worker_intel\sycl-win-x64-pr\llvm.obj\bin\clang.exe
llvm-lit.py: D:/BuildBot/Product_worker_intel/sycl-win-x64-pr/llvm.src/sycl/test/lit.cfg.py:77: note: Backend (SYCL_BE): PI_OPENCL
llvm-lit.py: D:/BuildBot/Product_worker_intel/sycl-win-x64-pr/llvm.src/sycl/test/lit.cfg.py:131: note: Found available CPU device
llvm-lit.py: D:/BuildBot/Product_worker_intel/sycl-win-x64-pr/llvm.src/sycl/test/lit.cfg.py:157: note: Found available GPU device
llvm-lit.py: D:/BuildBot/Product_worker_intel/sycl-win-x64-pr/llvm.src/sycl/test/lit.cfg.py:186: warning: Accelerator device not found
llvm-lit.py: D:/BuildBot/Product_worker_intel/sycl-win-x64-pr/llvm.src/sycl/test/lit.cfg.py:200: note: Using opencl-aot version which is built as part of the project
llvm-lit.py: D:/BuildBot/Product_worker_intel/sycl-win-x64-pr/llvm.src/sycl/test/lit.cfg.py:215: warning: Couldn't find pre-installed AOT device compiler ocloc
llvm-lit.py: D:/BuildBot/Product_worker_intel/sycl-win-x64-pr/llvm.src/sycl/test/lit.cfg.py:212: note: Found pre-installed AOT device compiler aoc
llvm-lit.py: D:/BuildBot/Product_worker_intel/sycl-win-x64-pr/llvm.src/sycl/test/Unit/lit.cfg.py:71: note: Backend (SYCL_BE): PI_OPENCL
2 warning(s) in tests
-- Testing: 364 tests, 12 workers --
command timed out: 1200 seconds without output running [b'python3', b'llvm_ci/intel/worker/tools/build.py', b'-n', b'3574', b'-b', b'pull/1654/head', b'-r', b'1654', b'-t', b'check-sycl', b'-p', b'sycl', b'-s', b'lit', b'-P', b'intel/llvm', b'-m', b'sycl-win-x64-pr', b'-e', b'd7d9bbc1c847c79d89347b91ddf98a2598689ee6', b'-U', b'http://ci.llvm.intel.com:8010/#/builders/18/builds/3574'], attempting to kill
program finished with exit code 1
elapsedTime=1882.897134
Testing:  0.. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80.. 90..

Does that mean that the first set of tests passed for OpenCL?

I guess the following run is for Level0 (as this buildbot run does not seem to test CUDA)? Level0 passes for me on Linux when I tested this. How to work around this for now? Should I add a REQUIRES: linux?

@bjoernknafla bjoernknafla changed the title [SYCL][CUDA][LIT] reqd_wg_size unsupported [SYCL][CUDA][LIT] Extend test for non-OpenCL Jul 30, 2020
@bader
Contributor

bader commented Jul 30, 2020

> Does that mean that the first set of tests passed for OpenCL?

It seems so. Actually it's quite confusing. I see that on Linux the Level Zero back-end is tested first and the OpenCL back-end second, while the Windows buildbot order is reversed.

> I guess the following run is for Level0 (as this buildbot run does not seem to test CUDA)? Level0 passes for me on Linux when I tested this. How to work around this for now? Should I add a REQUIRES: linux?

I think this or UNSUPPORTED: windows should work.

@bjoernknafla bjoernknafla dismissed stale reviews from v-klochkov and bader via e212b19 July 30, 2020 10:10
@bjoernknafla bjoernknafla force-pushed the bjoern/cuda-does-not-support-reqd-work-group-size branch from d7d9bbc to e212b19 Compare July 30, 2020 10:10
@bjoernknafla
Contributor Author

I have updated the if-brackets and added UNSUPPORTED: windows to the test that hangs with Level0 on Windows.
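
For reference, a LIT header along these lines is what "added UNSUPPORTED: windows" refers to; the RUN lines below are assumptions for illustration, not copied from the actual test:

```cpp
// UNSUPPORTED: windows
//
// Hypothetical RUN lines; the real test's compile/run commands may differ.
// RUN: %clangxx -fsycl %s -o %t.out
// RUN: %t.out
```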

@bader
Contributor

bader commented Jul 30, 2020

@bjoernknafla, could you apply the clang-format patch, please?

@bjoernknafla bjoernknafla force-pushed the bjoern/cuda-does-not-support-reqd-work-group-size branch from e212b19 to 9695efb Compare July 30, 2020 11:30
bader
bader previously approved these changes Jul 30, 2020
v-klochkov
v-klochkov previously approved these changes Jul 30, 2020
Contributor

@v-klochkov v-klochkov left a comment

Looks good, but there are conflicts in parallel_for_range.cpp that need resolution.

@bjoernknafla
Contributor Author

I’ll resolve the conflicts once my keyboard has recovered from spilling water all over it...

// Determine local work sizes that result in uniform work groups.
for (size_t i = 0; i < work_dim; i++) {
  threadsPerBlock[i] =
      std::min(static_cast<int>(maxThreadsPerBlock[i]),
Contributor

I realized that if work_dim > 1, the iterations of the loop at L2310 do no useful work for i = 1 and i = 2, because at the beginning of the function threadsPerBlock is initialized as {32, 1, 1}. Also, global_work_size[i] is always > 0.
Thus for i = 1 and i = 2:
tmp1 = min(threadsPerBlock[i] /* == 1 */, global_work_size[i] /* > 0 */); // tmp1 is always set to 1
threadsPerBlock[i] = min(maxThreadsPerBlock[i], tmp1); // always set to 1

Contributor Author

@bjoernknafla bjoernknafla Jul 31, 2020

Yes. I wrote this as general code and then left it to keep working should the default values change. This pessimizes performance.

I'll remove the unnecessary loop iterations in a way that will make it easy to reintroduce the loop if necessary later - this might look a bit funny (professional term...).

Contributor Author

I simplified the code and added a comment about only processing the first dimension due to the default values.
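
A minimal sketch of what that simplification could look like, wrapped in a hypothetical standalone function for illustration (variable names follow the snippet under review; the actual PI CUDA code may differ, e.g. it also aims for uniform work-groups per its own comment):

```cpp
#include <algorithm>
#include <cstddef>

void setThreadsPerBlock(int threadsPerBlock[3],
                        const int maxThreadsPerBlock[3],
                        const size_t *global_work_size) {
  // threadsPerBlock starts as {32, 1, 1}, so only dimension 0 can change;
  // dimensions 1 and 2 would always remain 1 and are deliberately skipped.
  threadsPerBlock[0] = std::min(
      maxThreadsPerBlock[0],
      std::min(static_cast<int>(global_work_size[0]), threadsPerBlock[0]));
}
```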

@bjoernknafla bjoernknafla dismissed stale reviews from v-klochkov and bader via c1bc04a July 31, 2020 11:48
@bjoernknafla bjoernknafla force-pushed the bjoern/cuda-does-not-support-reqd-work-group-size branch from 9695efb to c1bc04a Compare July 31, 2020 11:48
@bjoernknafla
Contributor Author

Clang-format checking failed due to a server not being reachable.

bader
bader previously approved these changes Jul 31, 2020
@bader
Contributor

bader commented Jul 31, 2020

> Clang-format checking failed due to a server not being reachable.

Interesting... it looks like something went wrong with this particular package.
#2230 - clang-format-9 seems to work okay.

@bader bader requested a review from v-klochkov July 31, 2020 13:59
@bader
Contributor

bader commented Jul 31, 2020

@romanovvlad, ping.

@bjoernknafla bjoernknafla force-pushed the bjoern/cuda-does-not-support-reqd-work-group-size branch from c1bc04a to 22948c0 Compare July 31, 2020 16:31
The DPCPP runtime relies on piEnqueueKernelLaunch for NDRange parameter
validity checks. Add missing checks to the PI CUDA backend.

Signed-off-by: Bjoern Knafla <[email protected]>
Rewrite parallel_for_range.cpp test to work with non-OpenCL PI backends
that behave like OpenCL 1.2.

Level0 testing times out on Windows, mark `windows` as unsupported.

Signed-off-by: Bjoern Knafla <[email protected]>
@bjoernknafla bjoernknafla force-pushed the bjoern/cuda-does-not-support-reqd-work-group-size branch from 22948c0 to f8b72c7 Compare July 31, 2020 17:10
@bader bader merged commit 6909c06 into intel:sycl Aug 3, 2020
Labels
cuda CUDA back-end