Hypnos: CUDA 9 driver Restart Hangs #2344

PrometheusPi · 2017-10-29T14:41:56Z

With the new driver for CUDA 9, I encountered multiple hangs while restarting PIConGPU from an HDF5 checkpoint using libSplash.
The error message was:

Error in `[...]/picongpu/bin/picongpu': corrupted size vs. prev_size: 0x0000000005082cf0

pointing to a double free or to memory that was not allocated.
The hang during restart seems to be random: sometimes everything goes fine, sometimes the run just stops making progress without crashing.

I compiled for the k80 architecture (using 37, i.e. compute capability 3.7), activated blocking kernels and set SPLASH_VERBOSE=100. Nothing worked.
The last entry before the error message came from libsplash:

[1,62]<stderr>:[SPLASH_LOG:62] Entry 'particles/e/particlePatches/offset/x' (17) is of type: UInt64
[1,62]<stderr>:[SPLASH_LOG:62] readCompleteDataSet
[1,62]<stderr>:[SPLASH_LOG:62] DCDataSet::read (x)
[1,62]<stderr>:[SPLASH_LOG:62] 
[1,62]<stderr>: ndims         = 1
[1,62]<stderr>: logical_size  = (64,1,1)
[1,62]<stderr>: physical_size = (64,1,1)
[1,62]<stderr>: dstBuffer     = (64,1,1)
[1,62]<stderr>: dstOffset     = (0,0,0)
[1,62]<stderr>: srcSize       = (64,1,1)
[1,62]<stderr>: srcOffset     = (0,0,0)
[1,62]<stderr>:
[1,8]<stderr>:*** Error in `[...]/picongpu/bin/picongpu': corrupted size vs. prev_size: 0x0000000004266de0 ***
[1,8]<stderr>:[kepler021:16502] *** Process received signal ***
[1,8]<stderr>:[kepler021:16502] Signal: Aborted (6)
[1,8]<stderr>:[kepler021:16502] Signal code:  (-6)

(qdel was needed to stop the job)
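To narrow down which process actually aborted, the rank-tagged stderr can be filtered. The following is a minimal sketch, assuming the `[job,rank]<stderr>:` prefix seen in the log above (it looks like Open MPI's tagged output); `failing_ranks` and `last_splash_entry` are hypothetical helper names, not part of PIConGPU or libSplash.

```shell
#!/bin/bash
# Sketch only: both helper names are made up for illustration.

# failing_ranks LOGFILE
# Print the MPI ranks whose stderr contains the glibc heap-corruption report.
# Assumes lines are prefixed with "[job,rank]<stderr>:" as in the log above.
failing_ranks() {
    local logfile="$1"
    grep -F 'corrupted size vs. prev_size' "$logfile" \
        | sed -n 's/^\[[0-9]*,\([0-9]*\)]<stderr>:.*/\1/p' \
        | sort -n -u
}

# last_splash_entry LOGFILE RANK
# Print the last SPLASH_LOG line written by the given rank (empty if none).
last_splash_entry() {
    local logfile="$1" rank="$2"
    grep -F "[SPLASH_LOG:${rank}]" "$logfile" | tail -n 1
}
```

Applied to the excerpt above, the abort is reported by rank 8 while the last libSplash entry shown belongs to rank 62, so the two need not come from the same process.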

This might be a bug in libSplash, but since this only started to occur after the CUDA 9 driver update, I expect it to be something in PIConGPU.

I am working on the latest release and with various setups (LWFA, TWTS).

Any suggestions on how to test this further?


PrometheusPi · 2017-10-29T19:57:22Z

Another simulation failed after particles/e/particlePatches/offset/x again.

EDIT: number of simulations hanging after the above read: 2

PrometheusPi · 2017-10-30T09:30:14Z

Workaround for now:

Because a hanging simulation can block the cluster for hours to days, I implemented the following workaround.
Right before calling picongpu, I execute the following script in the background (~/checkRestart.sh &):

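#!/bin/bash

# Give PIConGPU 15 minutes to get through initialization.
sleep 900

# The line "initialization time" is expected in the file "output" once initialization finished.
grep "initialization time" output > /dev/null

# grep exits with 1 if the line is missing: treat this as a hung restart and kill all ranks.
if [ $? -eq 1 ]
then
    mpiexec --prefix $MPIHOME -x LIBRARY_PATH -x LD_LIBRARY_PATH -npernode 8 -n 64 killall -9 picongpu 2>/dev/null
fi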

It waits for 15 minutes and then kills picongpu if initialization has not finished (i.e. it hung).
Because the job uses `k80_profileRestart.tpl`, it is resubmitted and another restart attempt is made.
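The same idea can be written as a polling loop, so the watchdog returns as soon as initialization is reported instead of always sleeping the full 15 minutes. This is only a sketch: `wait_for_marker` is a hypothetical helper name, and the file name `output` and the marker text are taken from the script above.

```shell
#!/bin/bash
# Sketch only: wait_for_marker is a made-up helper name.

# wait_for_marker FILE PATTERN TIMEOUT [INTERVAL]
# Return 0 as soon as PATTERN appears in FILE,
# return 1 if it has not appeared after roughly TIMEOUT seconds.
wait_for_marker() {
    local file="$1" pattern="$2" timeout="$3" interval="${4:-10}"
    local waited=0
    while :; do
        # -q: print nothing, -s: stay silent while FILE does not exist yet
        grep -qs -- "$pattern" "$file" && return 0
        [ "$waited" -ge "$timeout" ] && return 1
        sleep "$interval"
        waited=$((waited + interval))
    done
}

# Usage sketch (kill command copied from the script above):
# wait_for_marker output "initialization time" 900 30 || \
#     mpiexec --prefix $MPIHOME -x LIBRARY_PATH -x LD_LIBRARY_PATH \
#         -npernode 8 -n 64 killall -9 picongpu 2>/dev/null
```

One difference in behavior from the one-shot script: a missing `output` file also counts as "not initialized" here, whereas the original check only reacts when grep exits with 1.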

ax3l · 2017-11-01T09:51:33Z

Thanks for the report.

Since we have not tested the fresh Hypnos CUDA 9 chain yet, do not use it for production yet please. The currently recommended picongpu.profile is in our manual (using CUDA 8) and under etc/picongpu/hypnos-hzdr.

The issue could come from the MPI(IO) or HDF5 layer of the new toolchain you experimented with. It could also just be a missing CUDA awareness issue again in MPI. Please report the full set of modules / picongpu.profile file you used so others can follow. It's likely an environment issue, less likely a libSplash issue and unlikely a PIConGPU issue.
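A report of the environment could be collected along the following lines. This is a minimal sketch, not an official PIConGPU tool: `report_env` is a hypothetical helper name, and it assumes an Environment Modules setup, `nvidia-smi` and Open MPI's `ompi_info` are available on the node; each step is skipped with a note if the tool is missing.

```shell
#!/bin/bash
# Sketch only: report_env is a made-up helper name.

# report_env
# Print the loaded modules, the GPU driver version and whether
# Open MPI reports that it was built with CUDA support.
report_env() {
    echo "== loaded modules =="
    # "module" is usually a shell function; it prints its list to stderr.
    if type module >/dev/null 2>&1; then
        module list 2>&1
    else
        echo "module command not available"
    fi

    echo "== NVIDIA driver =="
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=name,driver_version --format=csv,noheader \
            || echo "nvidia-smi query failed"
    else
        echo "nvidia-smi not found"
    fi

    echo "== Open MPI CUDA support =="
    if command -v ompi_info >/dev/null 2>&1; then
        ompi_info --parsable --all | grep mpi_built_with_cuda_support:value \
            || echo "no mpi_built_with_cuda_support entry reported"
    else
        echo "ompi_info not found"
    fi
}
```

Running `report_env > env_report.txt` on a compute node and attaching the file, together with the picongpu.profile that was sourced, would give others the information requested above.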

psychocoderHPC · 2017-11-01T09:56:04Z

Which commit do you use? 0.3.1?

PrometheusPi · 2017-11-01T10:08:12Z

@ax3l Sorry for the misleading title - I am using CUDA8 just with the new (CUDA9 capable) drivers on the k80 on hypnos. So there is no way to avoid this in production runs right now.

I am loading the following modules:

Currently Loaded Modulefiles:
  1) gcc/4.9.2                     4) cuda/8.0                      7) hdf5-parallel/1.8.15
  2) cmake/3.7.2                   5) openmpi/1.8.6.kepler.cuda80   8) libsplash/1.6.0
  3) boost/1.62.0                  6) pngwriter/0.5.6               9) editor/emacs/23.1

I am running on the following PIConGPU versions:

Backports for 0.3.1-rc2 (#2165), commit 8491da8d9f0cb14d11400bec7536a42c72b10480 (without any changes)
Backport of the fix "Background Fields: Fix Restart GUARD" (#2139) to 0.3.0, commit 30903b10ff775465ff4e55f27cd0d5ff5f42b017, with various changes to the TWTS pulse

There is currently an issue open for the IT team as well (number 7047). They just gave the following reply:
(translated from German)

In addition to the Nvidia driver 384.81, we updated the Ubuntu release on the k20 and k80 nodes to 14.04.5. Note also that Nvidia actually no longer supports Ubuntu 14.04, i.e. the driver installer is meant for version 16.04, but it ran through without errors.

However, I rather suspect the error to be in the area of HDF5 or libsplash. As a first step, I would recompile OpenMPI and the modules that follow it in the chain on the basis of cuda/8.0.

The following modules have now been recompiled for gcc/4.9.2 and cuda/8.0:

openmpi/1.8.6.kepler.cuda80
hdf5-parallel/1.8.15
libsplash/1.6.0
pngwriter/0.5.6
boost/1.62.0

Does this change anything about the failure behavior?

ax3l · 2017-11-01T10:19:52Z

Ah that makes sense, thanks.

Can you try the actual latest release, 0.3.1? There have been three further backports to the release branch after rc2. At first glance I do not see anything in them that affects you, but it is easier to support the release than a pre-release.

The error you report is a malloc.c stdlib issue and looks a lot like a driver/MPI/CUDA issue. We can only try to reproduce this with our default examples or smaller projects (outside PIConGPU).

Unfortunately, this really is a system acceptance test that needs to be done first by the IT before driver roll-outs and if they do not have the coverage we can not jump in and provide it from the user side. We will still try to debug your issue as well as possible since we now have to deal with it. Maybe they also just need to rebuild their openmpi-cuda8 module in case it links statically against some CUDA driver libs (dunno).

PrometheusPi · 2017-11-02T08:57:41Z

With the newly compiled modules, the setup still hangs during restart.

psychocoderHPC · 2017-11-08T09:29:24Z

This was an issue with the last cluster update of hypnos (@hzdr cluster).


PrometheusPi added the "bug" label (a bug in the project's code) Oct 29, 2017

ax3l added this to the 0.4.0 / 1.0.0: Next Stable milestone Nov 1, 2017

ax3l added the "component: plugin" (in PIConGPU plugin) and "backend: cuda" (CUDA backend) labels Nov 1, 2017

ax3l changed the title ~~restart hangs with new driver on hypnos~~ Hypnos: CUDA 9 Restart Hangs Nov 1, 2017

PrometheusPi changed the title ~~Hypnos: CUDA 9 Restart Hangs~~ Hypnos: CUDA 9 driver Restart Hangs Nov 1, 2017

psychocoderHPC closed this as completed Nov 8, 2017

psychocoderHPC added the "outdated/wontfix" label (outdated or out of scope) Nov 8, 2017

ax3l added the "machine/system" label (machine & HPC system specific issues) Feb 14, 2018
