[SYCL][CUDA] Fix LIT fails on machines without non-NVIDIA OpenCL #1613
sycl/test/Unit/lit.cfg.py
Outdated
config.environment['SYCL_BE'] = lit_config.params.get('SYCL_BE', "PI_OPENCL")

lit_config.note("Environment: {}".format(config.environment))
I wonder if there is a better way to print the implicit environment with which LIT runs the unit tests, e.g., one that allows a copy-and-paste approach to setting up the environment locally to investigate failures?
That looks like debug info. I would prefer to remove it from the final submission.
I am not too happy about this either. I am using it in sycl/test/basic_tests/diagnostics/device-check.cpp (see changes just below).
An alternative could be to run the test below only for OpenCL (by adding REQUIRES: opencl), as it tests a backend-independent code path anyway.
What do you prefer?
I am not awake - completely missed what you tried to say here 🤦
You are right: this is information to enable debugging of a failed test. Otherwise the environment in which the unit tests run is invisible, which makes replicating bugs very hard.
I will replace this with output of the SYCL_BE environment variable, as we also do in the non-unit tests: sycl/test/lit.cfg.py#L76
The user then still gets the info required to re-run unit tests by hand and recreate failures.
Could you clarify which info is missing? If you look into failing tests (e.g. http://ci.llvm.intel.com:8010/#/builders/37/builds/689/steps/15/logs/FAIL__SYCL__reduction_nd_conditional_cpp):
'RUN: at line 4'; env SYCL_DEVICE_TYPE=GPU SYCL_BE=PI_CUDA /localdisk2/sycl_ci/buildbot/worker/Lit_With_Cuda/llvm.obj/tools/sycl/test/reduction/Output/reduction_nd_conditional.cpp.tmp.out
You see the command line with specific environment variables set.
There are two approaches to setting environment variables in use:
- explicitly (as shown by your example), as can be seen in the lit.cfg.py file here: sycl/test/lit.cfg.py#L127 - these will show up when a LIT test fails, and
- implicitly, by manipulating the environment a LIT test will run in, as seen here: sycl/test/lit.cfg.py#L45 - these are not shown when a LIT test fails.
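To make the difference concrete, here is a minimal sketch of both approaches. It does not import LIT: `config` is a hypothetical stand-in for the object LIT passes to lit.cfg.py, the `expand` helper is a simplified illustration of RUN-line expansion (not LIT's actual implementation), and the backend value is only an example.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the `config` object that LIT hands to lit.cfg.py.
config = SimpleNamespace(environment={}, substitutions=[])
backend = "PI_CUDA"  # illustrative value only

# Implicit: the variable is placed in the environment of every test process.
# It never appears in the RUN line that is printed for a failing test.
config.environment["SYCL_BE"] = backend

# Explicit: the variable is part of a substitution, so it is expanded into
# the RUN line and is visible in the failure output.
config.substitutions.append(
    ("%GPU_RUN_PLACEHOLDER", "env SYCL_DEVICE_TYPE=GPU SYCL_BE={}".format(backend))
)


def expand(run_line, substitutions):
    """Apply plain-text substitutions to a RUN line (simplified illustration)."""
    for key, value in substitutions:
        run_line = run_line.replace(key, value)
    return run_line


expanded = expand("%GPU_RUN_PLACEHOLDER %t.out", config.substitutions)
print(expanded)
```

Only the explicit setting ends up in the expanded command line; the implicit one stays hidden in `config.environment`.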
My current thinking about this is:
- Use the implicit environment variables to pass the existing environment through. Failing tests will then run in the same environment (ignoring CI/buildbot systems for now) when running them by hand. Otherwise LIT filters out most environment variables: llvm/utils/lit/lit/TestingConfig.py#L24
- Use the explicit environment variable approach for settings that are very specific to the test and test setup and that we configure inside of LIT, as these will be visible when a test fails (or when running lit with the -a flag) and allow copy-and-paste rerunning.
Sadly, the explicit approach does not work when running unit tests, as LIT provides no way (that I know of) to set the environment explicitly when running GTest tests...
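Since the unit tests cannot carry their settings in a visible command line, one option is to print the relevant part of the implicit environment in a form that can be pasted into a shell. The following is only a sketch under assumptions: `format_env_command` is a hypothetical helper, and the environment dictionary is made-up example data rather than what LIT actually passes through.

```python
import shlex


def format_env_command(environment, names):
    """Render the selected variables as an `env NAME=value ...` prefix."""
    parts = ["env"]
    for name in sorted(names):
        if name in environment:
            # shlex.quote keeps values with spaces or shell metacharacters safe.
            parts.append("{}={}".format(name, shlex.quote(environment[name])))
    return " ".join(parts)


# Made-up example of what config.environment might contain for a unit-test run.
environment = {"SYCL_BE": "PI_CUDA", "PATH": "/usr/bin:/bin", "HOME": "/home/user"}

command_prefix = format_env_command(environment, ["SYCL_BE"])
print(command_prefix)
```

In lit.cfg.py the resulting string could be handed to `lit_config.note(...)` instead of dumping the whole environment, so the note stays short and can be prepended to a unit-test binary when re-running it by hand.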
Branch updated from 2d1c256 to 5e3199e.
approve for testing
The reduction failures on buildbot/Lit_With_Cuda were fixed and merged 2h ago: #1641. I have pushed the rebased PR to include the fixes for the failing tests.
Branch updated from 5e3199e to d3e2fb8.
Branch updated from d3e2fb8 to 1d7f491.
// RUN: env SYCL_DEVICE_TYPE=HOST %t.out | FileCheck %s
// RUN: %CPU_RUN_PLACEHOLDER %t.out %CPU_CHECK_PLACEHOLDER
// RUN: %GPU_RUN_PLACEHOLDER %t.out %GPU_CHECK_PLACEHOLDER
// RUN: %ACC_RUN_PLACEHOLDER %t.out %ACC_CHECK_PLACEHOLDER
Are these changes needed?
#1543 changed only line 1. Isn't it enough?
Is it related to the issue in sycl/test/basic_tests/get_nonhost_devices.cpp? If so, should it be resolved the same way?
My understanding is that the test code triggers the default selector implicitly when it creates the Queue. Therefore the PLACEHOLDER substitutions are used (for RUN and CHECK) to more explicitly control what is tested.
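As a rough illustration of how such placeholders can control what is tested, the sketch below builds the RUN and CHECK substitutions for one device type and expands a RUN line with them. The expansion values are assumptions modelled on the failing command line quoted earlier in this thread; the no-op fallback for a missing device, the helper names, and the simplified `expand` function are hypothetical and not taken from sycl/test/lit.cfg.py.

```python
def placeholder_substitutions(device, available, backend):
    """Build (RUN, CHECK) substitutions for one device type (illustrative only)."""
    name = device.upper()
    if available:
        run = "env SYCL_DEVICE_TYPE={} SYCL_BE={}".format(name, backend)
        check = "| FileCheck %s"
    else:
        # Assumed behaviour: the RUN line becomes a no-op if the device is missing.
        run = "true"
        check = ""
    return [
        ("%{}_RUN_PLACEHOLDER".format(name), run),
        ("%{}_CHECK_PLACEHOLDER".format(name), check),
    ]


def expand(run_line, substitutions):
    """Apply plain-text substitutions to a RUN line (simplified illustration)."""
    for key, value in substitutions:
        run_line = run_line.replace(key, value)
    return run_line.strip()


line = "%GPU_RUN_PLACEHOLDER %t.out %GPU_CHECK_PLACEHOLDER"
with_gpu = expand(line, placeholder_substitutions("gpu", True, "PI_CUDA"))
without_gpu = expand(line, placeholder_substitutions("gpu", False, "PI_CUDA"))
print(with_gpu)
print(without_gpu)
```

With a device available, the device type and backend are pinned in the command itself, so the default selector has no freedom to pick something else.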
- // RUN: env SYCL_DEVICE_TYPE=HOST %t.out | FileCheck %s
- // RUN: %CPU_RUN_PLACEHOLDER %t.out %CPU_CHECK_PLACEHOLDER
- // RUN: %GPU_RUN_PLACEHOLDER %t.out %GPU_CHECK_PLACEHOLDER
- // RUN: %ACC_RUN_PLACEHOLDER %t.out %ACC_CHECK_PLACEHOLDER
+ // RUN: env SYCL_BE=%sycl_be %t.out | FileCheck %s
Should we really run this test on 4 different devices to validate exception handling?
I'd like to keep the existing testing approach and align on the overall testing strategy separately.
+1. The check and exception generation are done on the host side (before offloading). There is no reason to run it on every device.
I see the point you are both making. I will change this with the next update.
sycl/test/basic_tests/queue.cpp
Outdated
// RUN: %CPU_RUN_PLACEHOLDER %t.out
// RUN: %GPU_RUN_PLACEHOLDER %t.out
// RUN: %ACC_RUN_PLACEHOLDER %t.out
- // RUN: %CPU_RUN_PLACEHOLDER %t.out
- // RUN: %GPU_RUN_PLACEHOLDER %t.out
- // RUN: %ACC_RUN_PLACEHOLDER %t.out
+ // RUN: env SYCL_BE=%sycl_be %t.out
Fixed.
Make the backend used explicit in more LIT tests. These tests failed on machines with only NVIDIA OpenCL available as it is not supported.
Signed-off-by: Bjoern Knafla <[email protected]>
Pass the SYCL_BE environment variable to the SYCL-Unit unit tests.
Signed-off-by: Bjoern Knafla <[email protected]>
Branch updated from 1d7f491 to a9bf0b6.
It allows injecting an extra "launcher" prefix into active *_RUN_PLACEHOLDER substitutions and can be used, for example, to execute all the tests under valgrind. However, local experiments with some internal infrastructure showed that, while helpful, it is not enough, so two more minor modifications are made as part of this change:
- Enable recursive substitutions when SYCL_E2E_RUN_LAUNCHER is enabled.
- Provide the %e2e_tests_root substitution. It is expected to be used in conjunction with the existing "%s" substitution to get a unique relative path to the current test.
We are observing LIT fails on machines that only have the NVIDIA OpenCL, even if running for the SYCL PI CUDA backend.
These commits try to fix the problem but still require testing.