Hypnos: CUDA 9 driver Restart Hangs #2344
Another simulation failed in the same way. EDIT: count of simulations hanging since the report above: 2
Workaround for now: because a hanging simulation can block the cluster for hours to days, I implemented the following workaround.

```bash
#!/bin/bash
sleep 900
grep "initialization time" output > /dev/null
if [ $? -eq 1 ]
then
    mpiexec --prefix $MPIHOME -x LIBRARY_PATH -x LD_LIBRARY_PATH -npernode 8 -n 64 killall -9 picongpu 2>/dev/null
fi
```

It waits for 15 minutes and then kills picongpu if initialization failed (hung up).
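A slightly more patient variant (just a sketch; it assumes the same `output` file name and mpiexec options as above) polls once per minute for up to 15 minutes, so a run that merely initializes slowly is not killed:

```bash
#!/bin/bash
# Poll for a successful initialization instead of checking only once.
for i in $(seq 1 15)
do
    sleep 60
    if grep -q "initialization time" output
    then
        exit 0  # initialization finished, nothing to kill
    fi
done
# No successful initialization within 15 minutes: kill the hung ranks.
mpiexec --prefix $MPIHOME -x LIBRARY_PATH -x LD_LIBRARY_PATH \
        -npernode 8 -n 64 killall -9 picongpu 2>/dev/null
```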
Thanks for the report. Since we have not tested the fresh Hypnos CUDA 9 toolchain yet, please do not use it for production for now and stay on the currently recommended one. The issue could come from the MPI(-IO) or HDF5 layer of the new toolchain you experimented with; it could also again be a missing CUDA-awareness issue in MPI. Please report the full set of modules you load.
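If you want to check the CUDA-awareness part yourself, a quick test (assuming the `ompi_info` in your `PATH` comes from the loaded Open MPI module) is:

```bash
# Check whether the loaded Open MPI build is CUDA-aware; a CUDA-aware
# build reports a line ending in ":value:true".
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
```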
Which commit do you use? 0.3.1?
@ax3l Sorry for the misleading title - I am using CUDA 8, just with the new (CUDA 9 capable) drivers on the K80 partition of Hypnos. So there is no way to avoid this in production runs right now. I am loading the following modules:

```
Currently Loaded Modulefiles:
  1) gcc/4.9.2       4) cuda/8.0                      7) hdf5-parallel/1.8.15
  2) cmake/3.7.2     5) openmpi/1.8.6.kepler.cuda80   8) libsplash/1.6.0
  3) boost/1.62.0    6) pngwriter/0.5.6               9) editor/emacs/23.1
```

I am running on the following PIConGPU versions:
There is currently an issue open for the IT team as well (number 7047). They just gave the following reply:
Ah, that makes sense, thanks. Can you try the actual latest release, 0.3.1? There have been three further backports to the release branch after rc2 (at first glance I do not see anything among them that affects you, but it is easier to support the release than a pre-release). The error you report is a

Unfortunately, this really is a system acceptance test that needs to be done by the IT team before driver roll-outs, and if they do not have the coverage we cannot jump in and provide it from the user side. We will still try to debug your issue as well as possible, since we now have to deal with it. Maybe they also just need to rebuild their openmpi-cuda8 module in case it links statically against some CUDA driver libs (dunno).
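If it helps, a way to check the static-linking guess could look like this (a sketch only; the exact library paths under `$MPIHOME` are an assumption and may differ on Hypnos):

```bash
# List the CUDA/driver libraries that the Open MPI core library links
# against dynamically; a CUDA-aware build that shows nothing here may
# have linked them in statically.
ldd "$MPIHOME/lib/libmpi.so" | grep -i -E "cuda|nvidia"

# The CUDA-aware shared-memory BTL component, if present, deserves the
# same check.
ldd "$MPIHOME"/lib/openmpi/mca_btl_smcuda.so 2>/dev/null | grep -i -E "cuda|nvidia"
```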
With the newly compiled modules, the setup still hangs during restart.
This was an issue with the last cluster update of hypnos (@hzdr cluster). |
With the new driver for CUDA 9, I encountered multiple hangs while restarting PIConGPU from an HDF5 checkpoint using libSplash.
The error message was:

pointing to a double free or to memory that was not allocated.
The hang during restart seems to be random: sometimes everything goes fine, sometimes the run just stops working and does not crash.
I compiled for the K80 architecture using compute architecture 37, activated the blocking kernel, and set SPLASH_VERBOSE=100. Nothing worked. The last entry before the error message came from libSplash:
(qdel was needed to stop the job.) This might be a bug in libSplash, but since this only started to occur after the CUDA 9 driver update, I expect it to be something in PIConGPU.
I am working on the latest release and with various setups (LWFA, TWTS).
Any suggestions on how to test this further?
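One thing I could try next (a sketch; it assumes `gdb` is available on the compute nodes and that I can log in to a node hosting a hung rank) is to grab backtraces of the stuck processes, to see whether they sit in libSplash/HDF5, in MPI, or in the CUDA driver:

```bash
# Attach to one hung picongpu process and dump a backtrace of all its
# threads into a file that can be attached to this issue.
pid=$(pgrep -u "$USER" picongpu | head -n 1)
gdb -batch -ex "thread apply all bt" -p "$pid" > "picongpu_hang_${pid}.txt"
```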