Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL][CUDA] Fix LIT fails on machines without non-NVIDIA OpenCL #1613

Merged

Conversation

bjoernknafla
Copy link
Contributor

We are observing LIT fails on machines that only have the NVIDIA OpenCL, even if running for the SYCL PI CUDA backend.

These commits try to fix the problem but still require testing.

@bjoernknafla bjoernknafla requested a review from a team as a code owner April 29, 2020 17:34

config.environment['SYCL_BE'] = lit_config.params.get('SYCL_BE', "PI_OPENCL")

lit_config.note("Environment: {}".format(config.environment))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there is a better way to print the implicit environment with which LIT runs the unit tests, e.g., one that allows a copy and paste approach to setting up the environment locally to investiage fails?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks like debug info. I would prefer to remove it from final sumission.

Copy link
Contributor Author

@bjoernknafla bjoernknafla Apr 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not too happy about this either. I am using it in sycl/test/basic_tests/diagnostics/device-check.cpp (see changes just below).

Alternative could be to only run the test below for OpenCL (by adding REQUIRES: opencl) as it tests a backend independent code path anyway.

What do you prefer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not awake - completely missed what you tried to say here 🤦

You are right, this is information to enable debugging of a failed test as otherwise the environment in which the unit tests are run is invisible which makes replication of bugs very hard.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will replace this with output of the SYCL_BE environment as we also do in the non-unit tests: sycl/test/lit.cfg.py#L76

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user then still gets the required info to re-run unittests by hand to recreate fails.

Copy link
Contributor

@vladimirlaz vladimirlaz May 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify which info is missed? if you look into failing tests (e.g. http://ci.llvm.intel.com:8010/#/builders/37/builds/689/steps/15/logs/FAIL__SYCL__reduction_nd_conditional_cpp):
'RUN: at line 4'; env SYCL_DEVICE_TYPE=GPU SYCL_BE=PI_CUDA /localdisk2/sycl_ci/buildbot/worker/Lit_With_Cuda/llvm.obj/tools/sycl/test/reduction/Output/reduction_nd_conditional.cpp.tmp.out

You see the command line with specific environment variables set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two approaches of setting environment variables in use:

  • explicitly (as shown by your example) as can bee seen in the lit.cfg.py file here: sycl/test/lit.cfg.py#L127 - these will show up when a LIT test fails, and
  • implicitly by manipulating the environment a LIT test will run in as seen here: sycl/test/lit.cfg.py#L45 - these are not shown when a LIT test fails.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My current thinking about this is:

  • Use the implicit environment variables to pass the existing environment through. Failing tests will run in the same environment (ignoring CI/buildbot systems for now) when running them by hand. Otherwise LIT filters out most environment variables: llvm/utils/lit/lit/TestingConfig.py#L24

  • Use the explicit environment variable approach for settings that are very specific to the test and test setup and that we configure inside of LIT as these will be visible when a test fails (or when running lit with the -a flag) and allow copy-and-paste rerunning.

Sadly the explicit approach does not work when running unit tests as there is no way foreseen by LIT (that I know of) to set the environment explicitly when running GTest tests...

sycl/test/basic_tests/get_nonhost_devices.cpp Outdated Show resolved Hide resolved

config.environment['SYCL_BE'] = lit_config.params.get('SYCL_BE', "PI_OPENCL")

lit_config.note("Environment: {}".format(config.environment))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks like debug info. I would prefer to remove it from final sumission.

@bjoernknafla bjoernknafla force-pushed the bjoern/add-more-cuda-support-to-lit branch from 2d1c256 to 5e3199e Compare May 6, 2020 15:26
@bjoernknafla bjoernknafla requested a review from vladimirlaz May 6, 2020 15:27
vladimirlaz
vladimirlaz previously approved these changes May 6, 2020
Copy link
Contributor

@vladimirlaz vladimirlaz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approve for testing

@vladimirlaz vladimirlaz self-requested a review May 6, 2020 15:52
@bader bader added the cuda CUDA back-end label May 6, 2020
@bjoernknafla
Copy link
Contributor Author

bjoernknafla commented May 6, 2020

The reduction fails of buildbot/Lit_With_Cuda have been fixed and merged 2h ago: #1641

I have pushed the rebased PR to include the fixes for the failing tests.

@bjoernknafla bjoernknafla force-pushed the bjoern/add-more-cuda-support-to-lit branch from 5e3199e to d3e2fb8 Compare May 6, 2020 19:53
@bjoernknafla bjoernknafla force-pushed the bjoern/add-more-cuda-support-to-lit branch from d3e2fb8 to 1d7f491 Compare May 7, 2020 14:49
Comment on lines 2 to 5
// RUN: env SYCL_DEVICE_TYPE=HOST %t.out | FileCheck %s
// RUN: %CPU_RUN_PLACEHOLDER %t.out %CPU_CHECK_PLACEHOLDER
// RUN: %GPU_RUN_PLACEHOLDER %t.out %GPU_CHECK_PLACEHOLDER
// RUN: %ACC_RUN_PLACEHOLDER %t.out %ACC_CHECK_PLACEHOLDER
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these changes needed?
#1543 changed only line 1. Isn't it enough?

Is it related to the issue in sycl/test/basic_tests/get_nonhost_devices.cpp? If so, should it be resolved the same way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that the test code triggers the default selector implicitly when it creates the Queue. Therefore the PLACEHOLDER substitutions are used (for RUN and CHECK) to more explicitly control what is tested.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// RUN: env SYCL_DEVICE_TYPE=HOST %t.out | FileCheck %s
// RUN: %CPU_RUN_PLACEHOLDER %t.out %CPU_CHECK_PLACEHOLDER
// RUN: %GPU_RUN_PLACEHOLDER %t.out %GPU_CHECK_PLACEHOLDER
// RUN: %ACC_RUN_PLACEHOLDER %t.out %ACC_CHECK_PLACEHOLDER
// RUN: env SYCL_BE=%sycl_be %t.out | FileCheck %s

Should we really run this test on 4 different devices to validate exceptions handling?

I'd like to keep existing testing approach and align on overall testing strategy separately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 The check and exception generation is done on host side (before offloading). There is no reason to run it on every device.

Copy link
Contributor Author

@bjoernknafla bjoernknafla May 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the point you both are making. Change with next pull.

@bader bader requested a review from vladimirlaz May 7, 2020 15:04
Comment on lines 3 to 5
// RUN: %CPU_RUN_PLACEHOLDER %t.out
// RUN: %GPU_RUN_PLACEHOLDER %t.out
// RUN: %ACC_RUN_PLACEHOLDER %t.out
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// RUN: %CPU_RUN_PLACEHOLDER %t.out
// RUN: %GPU_RUN_PLACEHOLDER %t.out
// RUN: %ACC_RUN_PLACEHOLDER %t.out
// RUN: env SYCL_BE=%sycl_be %t.out

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Comment on lines 2 to 5
// RUN: env SYCL_DEVICE_TYPE=HOST %t.out | FileCheck %s
// RUN: %CPU_RUN_PLACEHOLDER %t.out %CPU_CHECK_PLACEHOLDER
// RUN: %GPU_RUN_PLACEHOLDER %t.out %GPU_CHECK_PLACEHOLDER
// RUN: %ACC_RUN_PLACEHOLDER %t.out %ACC_CHECK_PLACEHOLDER
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// RUN: env SYCL_DEVICE_TYPE=HOST %t.out | FileCheck %s
// RUN: %CPU_RUN_PLACEHOLDER %t.out %CPU_CHECK_PLACEHOLDER
// RUN: %GPU_RUN_PLACEHOLDER %t.out %GPU_CHECK_PLACEHOLDER
// RUN: %ACC_RUN_PLACEHOLDER %t.out %ACC_CHECK_PLACEHOLDER
// RUN: env SYCL_BE=%sycl_be %t.out | FileCheck %s

Should we really run this test on 4 different devices to validate exceptions handling?

I'd like to keep existing testing approach and align on overall testing strategy separately.

Make the backend used explicit in more LIT tests.
These tests failed on machines with only NVIDIA OpenCL available as it
is not supported.

Signed-off-by: Bjoern Knafla <[email protected]>
Pass the SYCL_BE environment variable to the SYCL-Unit unit tests.

Signed-off-by: Bjoern Knafla <[email protected]>
@bjoernknafla bjoernknafla force-pushed the bjoern/add-more-cuda-support-to-lit branch from 1d7f491 to a9bf0b6 Compare May 13, 2020 16:54
@bader bader merged commit bce2da2 into intel:sycl May 14, 2020
@bjoernknafla bjoernknafla deleted the bjoern/add-more-cuda-support-to-lit branch May 14, 2020 08:18
aelovikov-intel added a commit to aelovikov-intel/llvm that referenced this pull request Feb 23, 2023
It allows to inject extra "launcher" prefix into
active *_RUN_PLACEHOLDER substitutions and can be used, for example, to
execute all the tests under valgrind.

However, local experiments with some internal ifrastructure showed that,
while helpful, it's not enough, so two more minor modifications are done
as part of this change:
  - Enable recursive substitutions when SYCL_E2E_RUN_LAUNCHER is enabled
  - Provide %e2e_tests_root substitution. It is expected to be used in
    conjunction with existing "%s" substitution to be able to get a
    unique relative path to the current test.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuda CUDA back-end
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants