Hypnos: CUDA 9 driver Restart Hangs #2344

Closed
PrometheusPi opened this issue Oct 29, 2017 · 8 comments
Labels
backend: cuda (CUDA backend), bug (a bug in the project's code), component: plugin (in PIConGPU plugin), machine/system (machine & HPC system specific issues), outdated/wontfix (outdated or out of scope)

Comments

@PrometheusPi
Member

PrometheusPi commented Oct 29, 2017

With the new driver for CUDA 9, I encountered multiple hangs while restarting PIConGPU from an HDF5 checkpoint using libSplash.
The error message was:

Error in `[...]/picongpu/bin/picongpu': corrupted size vs. prev_size: 0x0000000005082cf0

which points to a double free or to memory that was never allocated.
The hangs during restart seem to be random: sometimes everything goes fine, sometimes the run just stops making progress without crashing.

I compiled for the k80 architecture (using 37), activated blocking kernels and set SPLASH_VERBOSE=100. Nothing helped.
The last entry before the error message came from libsplash:

[1,62]<stderr>:[SPLASH_LOG:62] Entry 'particles/e/particlePatches/offset/x' (17) is of type: UInt64
[1,62]<stderr>:[SPLASH_LOG:62] readCompleteDataSet
[1,62]<stderr>:[SPLASH_LOG:62] DCDataSet::read (x)
[1,62]<stderr>:[SPLASH_LOG:62] 
[1,62]<stderr>: ndims         = 1
[1,62]<stderr>: logical_size  = (64,1,1)
[1,62]<stderr>: physical_size = (64,1,1)
[1,62]<stderr>: dstBuffer     = (64,1,1)
[1,62]<stderr>: dstOffset     = (0,0,0)
[1,62]<stderr>: srcSize       = (64,1,1)
[1,62]<stderr>: srcOffset     = (0,0,0)
[1,62]<stderr>:
[1,8]<stderr>:*** Error in `[...]/picongpu/bin/picongpu': corrupted size vs. prev_size: 0x0000000004266de0 ***
[1,8]<stderr>:[kepler021:16502] *** Process received signal ***
[1,8]<stderr>:[kepler021:16502] Signal: Aborted (6)
[1,8]<stderr>:[kepler021:16502] Signal code:  (-6)

(qdel was needed to stop the job)
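
With 64 ranks writing interleaved output, it can help to see where each rank stopped. The following is only a sketch: it assumes every line carries the `[job,rank]<stderr>:` prefix seen in the log above, and the function name `last_line_per_rank` is made up for illustration.

```shell
#!/bin/bash
# Print the last non-empty stderr message of every MPI rank in a log file.
# Assumption: each line starts with a "[job,rank]<stderr>:" prefix,
# as in the log excerpt above.
last_line_per_rank() {
    awk '
        match($0, /^\[[0-9]+,[0-9]+\]<stderr>:/) {
            prefix = substr($0, 1, RLENGTH)
            rest   = substr($0, RLENGTH + 1)
            # Skip lines that are empty after the prefix so the last
            # real message of a rank is kept.
            if (rest ~ /^[[:space:]]*$/) next
            last[prefix] = rest
        }
        END {
            for (p in last) print p last[p]
        }
    ' "$1" | sort -t, -k2n   # order by rank number
}
```

Called as `last_line_per_rank stderr.log` (a hypothetical file name), it prints one line per rank, which makes it easier to tell whether all ranks stall at the same dataset or only some of them.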

This might be a bug in libSplash, but since it only started to occur after the CUDA 9 driver update, I expect it to be something in PIConGPU.

I am working on the latest release and with various setups (LWFA, TWTS).

Any suggestions on how to test this further?

@PrometheusPi PrometheusPi added the bug a bug in the project's code label Oct 29, 2017
@PrometheusPi
Member Author

PrometheusPi commented Oct 29, 2017

Another simulation failed, again after reading particles/e/particlePatches/offset/x.

EDIT: number of simulations hanging after the above read: 2

@PrometheusPi
Member Author

Workaround for now:

Because a hanging simulation can block the cluster for hours to days, I implemented the following workaround.
Right before calling picongpu, I start the following script in the background (~/checkRestart.sh &):

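#!/bin/bash

# Give picongpu 15 minutes to finish its initialization.
sleep 900

# A successful start writes "initialization time" to the output file.
# grep exits with status 1 if no line matches.
grep "initialization time" output > /dev/null

if [ $? -eq 1 ]
then
    # Initialization never finished: kill picongpu on all nodes.
    mpiexec --prefix $MPIHOME -x LIBRARY_PATH -x LD_LIBRARY_PATH -npernode 8 -n 64 killall -9 picongpu 2>/dev/null
fi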

It waits for 15 minutes and then kills picongpu if the initialization has not finished (i.e. the run hung).
Because the job uses k80_profileRestart.tpl, it is then resubmitted and makes another restart attempt.
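
A variant of the script above could poll the output file instead of sleeping for a fixed 15 minutes, so the watchdog exits as soon as the initialization has succeeded. This is a minimal sketch, not a tested replacement: the helper name `wait_for_marker` and its arguments are made up for illustration, and unlike the original it also treats a missing output file as "not initialized".

```shell
#!/bin/bash
# wait_for_marker FILE MARKER TIMEOUT [INTERVAL]
# Returns 0 as soon as MARKER appears in FILE,
# 1 if roughly TIMEOUT seconds pass without it.
wait_for_marker() {
    local file=$1 marker=$2 timeout=$3 interval=${4:-10}
    local waited=0
    while [ "$waited" -lt "$timeout" ]; do
        # -q: quiet, -s: stay silent if FILE does not exist yet,
        # -F: treat MARKER as a fixed string, not a regex
        if grep -qsF -- "$marker" "$file"; then
            return 0
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    # One last check after the final sleep.
    if grep -qsF -- "$marker" "$file"; then
        return 0
    fi
    return 1
}
```

In the watchdog it would be used as `wait_for_marker output "initialization time" 900 || mpiexec ... killall -9 picongpu`, with the same mpiexec arguments as in the script above.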

@ax3l ax3l added this to the 0.4.0 / 1.0.0: Next Stable milestone Nov 1, 2017
@ax3l ax3l added component: plugin in PIConGPU plugin backend: cuda CUDA backend labels Nov 1, 2017
@ax3l
Member

ax3l commented Nov 1, 2017

Thanks for the report.

Since we have not tested the fresh Hypnos CUDA 9 chain yet, please do not use it for production. The currently recommended picongpu.profile is in our manual (using CUDA 8) and under etc/picongpu/hypnos-hzdr.

The issue could come from the MPI(IO) or HDF5 layer of the new toolchain you experimented with. It could also just be a missing CUDA awareness issue again in MPI. Please report the full set of modules / picongpu.profile file you used so others can follow. It's likely an environment issue, less likely a libSplash issue and unlikely a PIConGPU issue.
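
One quick check for the CUDA awareness point is to ask Open MPI how it was built. The sketch below assumes an Open MPI installation whose `ompi_info` reports the `mpi_built_with_cuda_support` parameter; the helper name `built_with_cuda_support` is made up for illustration. It only shows how the library was compiled, not whether CUDA support works at run time.

```shell
#!/bin/bash
# Reads the output of `ompi_info --parsable --all` on stdin and
# succeeds (exit status 0) if Open MPI reports that it was built
# with CUDA support.
built_with_cuda_support() {
    grep -qF 'mpi_built_with_cuda_support:value:true'
}
```

Typical use would be `ompi_info --parsable --all | built_with_cuda_support && echo "CUDA-aware build"`, run with the same modules loaded as for the simulation.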

@psychocoderHPC
Member

psychocoderHPC commented Nov 1, 2017

Which commit do you use? 0.3.1?

@ax3l ax3l changed the title from "restart hangs with new driver on hypnos" to "Hypnos: CUDA 9 Restart Hangs" Nov 1, 2017
@PrometheusPi PrometheusPi changed the title from "Hypnos: CUDA 9 Restart Hangs" to "Hypnos: CUDA 9 driver Restart Hangs" Nov 1, 2017
@PrometheusPi
Member Author

@ax3l Sorry for the misleading title - I am using CUDA 8, just with the new (CUDA 9 capable) drivers on the k80 on hypnos. So there is no way to avoid this in production runs right now.

I am loading the following modules:

Currently Loaded Modulefiles:
  1) gcc/4.9.2                     4) cuda/8.0                      7) hdf5-parallel/1.8.15
  2) cmake/3.7.2                   5) openmpi/1.8.6.kepler.cuda80   8) libsplash/1.6.0
  3) boost/1.62.0                  6) pngwriter/0.5.6               9) editor/emacs/23.1

I am running on the following PIConGPU versions:

There is currently an issue open for the IT team as well (number 7047). They just gave the following reply:
(translated from German)

In addition to the Nvidia driver 384.81, we updated the Ubuntu release on the k20 and k80 nodes to 14.04.5. Note also that Nvidia no longer really supports Ubuntu 14.04, i.e. the driver installer is meant for version 16.04, but it ran through without errors.

However, I rather suspect the error is in the area of HDF5 or libsplash. As a first step, I would recompile OpenMPI and the modules that follow it in the chain on the basis of cuda/8.0.

The following modules have now been recompiled for gcc/4.9.2 and cuda/8.0:

openmpi/1.8.6.kepler.cuda80
hdf5-parallel/1.8.15
libsplash/1.6.0
pngwriter/0.5.6
boost/1.62.0

Does this change anything about the error behavior?

@ax3l
Member

ax3l commented Nov 1, 2017

Ah that makes sense, thanks.

Can you try the actual latest release, 0.3.1? There have been three further backports to the release branch after rc2. At first glance I do not see anything in them that affects you, but it is easier to support a release than a pre-release.

The error you report is a malloc.c stdlib issue and looks a lot like a driver/MPI/CUDA issue. We can only try to reproduce this with our default examples or smaller projects (outside PIConGPU).

Unfortunately, this really is a system acceptance test that needs to be done by IT before driver roll-outs, and if they do not have the coverage, we cannot jump in and provide it from the user side. We will still try to debug your issue as well as possible since we now have to deal with it. Maybe they also just need to rebuild their openmpi-cuda8 module in case it links statically against some CUDA driver libs (dunno).

@PrometheusPi
Member Author

With the newly compiled modules, the setup still hangs during restart.

@psychocoderHPC
Member

This was an issue with the last cluster update of hypnos (@hzdr cluster).

@psychocoderHPC psychocoderHPC added the outdated/wontfix outdated or out of scope label Nov 8, 2017
@ax3l ax3l added the machine/system machine & HPC system specific issues label Feb 14, 2018
psychocoderHPC pushed a commit to psychocoderHPC/picongpu that referenced this issue Aug 7, 2024
106a4975f4 fix getFunctionAttributes for the SYCL backend
f36e1156af update CUDA version in CI
3f8456973e use inline for CUDA/HIP code when rdc is on, otherwise use static
8b9cc3c557 fix gh-pages jobA
89d5ce671c Ignore VI temporary files
4b7bd17493 Fix the device used by KernelExecutionFixture (ComputationalRadiationPhysics#2344)
2c386dc5e9 Make alpaka follow CMAKE_CUDA_RUNTIME_LIBRARY
2d652dd233 Add thread count to CPU blocks accelerators (ComputationalRadiationPhysics#2338)
dbc5ebe1e9 Fix complex math pow test (ComputationalRadiationPhysics#2336)
4995c5b22a Fix isValidWorkDivKernel to use the correct device
f571ce9197 Remove unnecessary include
a26cdbcd41 Re-enable the KernelNoTemplateGpu test
a9217fb780 Link libcudart even when libcurand is not used
9c8614143b Suppress GCC warning about casting a function to void*
ba169cdc52 Rewrite the getValidWorkDivForKernel tests
948eb757d4 Fix getValidWorkDivForKernel tests for the SYCL CPU backend
f6f94f13b5 Fix getValidWorkDivForKernel tests for the CUDA backend
f612f971a0 Reduce code duplications in matrixMulWithMdSpan (ComputationalRadiationPhysics#2326)
d1cc2e01c1 Add a matrix multiplication example using mdspan
536a183cce Add missing whitespace in enqueue log messages
81d4410eec Reduce code duplication in CUDA/HIP kernel launch
6fdec14904  add remove-restrict
5323600508 CI: improve script utils
01d123e605 fix missing C++20 STL for ICPX in the CI
d254bcd6a3 ctest: display only output of tests, which failed
c9b8c941af change documentation
b9ed742913 remove getValidWorkDiv itself
048ef8afca use getValidWorkDivForKernel in kernelfixture of tests
38805498f0 fix random strategies
4f175420f2 remove getValidWorkDiv first
7f08120428 CI_FILTER: ^linux_nvcc11.*
789344f019 ALPAKA_FN_HOST is not a type
4efdb9dc63 fix explicit instantiation issue
fe4106f88a CI_FILTER: ^linux_nvcc11.*gcc9
e6b4881b70 CI_FILTER: ^linux_nvcc11.*gcc9
e3e760ed9e make conv2dmdspan use kernelbundle
62efffe605 Add getValidWorkDivForKernel function and KernelBundle with tests
690da679bd Let the SYCL queue implement `ConceptCurrentThreadWaitFor`, `ConceptGetDev` and `ConceptQueue` (ComputationalRadiationPhysics#2314)
995c57b54b set alpaka_CXX_STANDARD in the job generator
6ad09baa38 remove nvcc11.0 and nvcc11.1 support (ComputationalRadiationPhysics#2310)
0775f7c066 clang-format and fix typo
18eeeb7b49 move complex declaration to internal namespace
3468d2f8ac add trait IsKernelTriviallyCopyable
3015eae06b update CI container to version 3.2
56c0e416bc Update Catch2 to v3.5.2 (ComputationalRadiationPhysics#2300)

git-subtree-dir: thirdParty/alpaka
git-subtree-split: 106a4975f48dc38cc34f6a2494a3d16774282951
psychocoderHPC pushed a commit to psychocoderHPC/picongpu that referenced this issue Aug 9, 2024
ab3092357b Use shared CUDA libraries by default

git-subtree-dir: thirdParty/alpaka
git-subtree-split: ab3092357bb0b917b4cc396ce49e47f7ac1924e1