DipolarBarnesHutGpu and DipolarDirectSumGpu tests fail on ROCm #3895

espresso-ci · 2020-09-14T15:06:39Z

https://gitlab.icp.uni-stuttgart.de/espressomd/espresso/pipelines/13105

jngrad · 2020-09-15T16:51:45Z

Currently affecting the following PRs: #3891 #3896

These tests fail randomly since yesterday:

The following tests FAILED:
	 32 - dawaanr-and-dds-gpu (Failed)
	 33 - dawaanr-and-bh-gpu (Failed)
	 34 - dds-and-bh-gpu (Failed)

The failures are reproducible with the docker image on coyote11. More details in #3891 (comment).

jngrad · 2020-09-15T17:36:44Z

Reproducible as far back as 79b53f6. Before that, we used a different docker image and a different compiler (HCC).

mkuron · 2020-09-15T18:34:40Z

Hmm then why did it just start coming up in CI this week?

@psci2195 any ideas what might be causing it? I don‘t know enough about the algorithm to understand your code.

mkuron · 2020-09-16T09:23:41Z

coyote10 was just rebooted and the issue appears resolved. Looks like a hardware or driver glitch.

jngrad · 2020-09-16T09:23:56Z

Rebooting the runner seems to have fixed the issue. Before the reboot:

BH GPU0
- free(): invalid pointer
- Forces on particle do not match
- Torques on particle do not match
BH GPU1
- Forces on particle do not match
- Torques on particle do not match
DDS GPU0
- MPI deadlock
- std::runtime_error
- free(): invalid pointer
DDS GPU1: all good

After the reboot: all good

jngrad · 2020-09-16T10:25:35Z

Rebooting the runner only made the issue irreproducible in docker in a ssh terminal. It's still failing in CI pipelines.

The 4.1.4 release might be affected too. In docker in a ssh terminal, there is a random MPI deadlock with dawaanr-and-bh-gpu.

jngrad · 2020-09-16T12:52:38Z

We're temporarily running the ROCm jobs on lama to clear the backlog.

jngrad · 2020-10-13T18:35:14Z

We exchanged one Vega 56 from coyote10 with the Vega 58 in lama. The ROCm job was run multiple times on the lama and coyote10runners without failing:

logfiles for coyote10 with GPU from lama:
256766
256770
256771
256773
256777
256779
256783
256782
logfiles for lama with GPU from coyote10:
256767
256769
256772
256774
256780
256781

We will leave the GPUs in this configuration for now. If the issue resurfaces on the Vega 56 but not on the Vega 58 in coyote10, we can pause the corresponding runner (GPU0 or GPU1).

jngrad · 2020-10-23T07:58:57Z

Failing again on coyote10 GPU0 (logfile).

Closes #2973, closes #3895, follow-up to espressomd/docker#190 Due to their fast release cycle, ROCm packages are not stable enough for ESPResSo. We are currently supporting ROCm 3.0 to 3.7, which means supporting two compilers (HCC and HIP-Clang) and keeping patches for each ROCm release in our codebase. Maintaining these patches and the CMake build system for ROCm is time-consuming. The ROCm job also has a tendency to break the CI pipeline (#2973), sometimes due to random or irreproducible software bugs in ROCm, sometimes due to failing hardware in both the main CI runner and backup CI runner. The frequency of false positives in CI is too large compared to the number of true positives. The last true positives were 5da80a9 (April 2020) and #2973 (comment) (July 2019). There are no known users of ESPResSo on AMD GPUs according to the [May 2020 user survey](https://lists.nongnu.org/archive/html/espressomd-users/2020-05/msg00001.html). The core team has decided to drop support for ROCm ([ESPResSo meeting 2020-10-20](https://github.com/espressomd/espresso/wiki/Espresso-meeting-2020-10-20)). The Intel image cannot be built automatically in the espressomd/docker pipeline. Re-building it manually is a time-consuming procedure, usually several hours, due to long response time from the licensing server and the size of the Parallel Studio XE source code. When a new Intel compiler version is available, it usually requires the expertise of two people to update the dockerfile. The core team has decided to remove the Intel job from CI and use the Fedora job to test ESPResSo with MPICH.

jngrad · 2020-10-23T10:06:50Z

Failing on both coyote10 GPU0 (logfile) and coyote10 GPU1 (logfile). The lama backup has timeout issues.

jngrad changed the title ~~CI build failed for merged PR~~ DipolarBarnesHutGpu and DipolarDirectSumGpu tests fail on ROCm Sep 15, 2020

jngrad mentioned this issue Sep 16, 2020

CI build failed for merged PR #3897

Closed

espresso-ci closed this as completed Sep 16, 2020

jngrad reopened this Sep 16, 2020

jngrad mentioned this issue Sep 16, 2020

HIP issue list as discussed in the offline meeting #2973

Closed

This was referenced Oct 22, 2020

Remove ROCm integration #3966

Merged

CI build failed for merged PR #3967

Closed

kodiakhq bot closed this as completed in #3966 Oct 23, 2020

jngrad mentioned this issue Oct 23, 2020

New shape: Quarterpipe #3960

Closed

jngrad mentioned this issue Nov 16, 2023

Barnes-Hut calculations are wrong when ESPResSo is compiled in Debug mode #4774

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DipolarBarnesHutGpu and DipolarDirectSumGpu tests fail on ROCm #3895

DipolarBarnesHutGpu and DipolarDirectSumGpu tests fail on ROCm #3895

espresso-ci commented Sep 14, 2020

jngrad commented Sep 15, 2020

jngrad commented Sep 15, 2020

mkuron commented Sep 15, 2020

mkuron commented Sep 16, 2020

jngrad commented Sep 16, 2020

jngrad commented Sep 16, 2020

jngrad commented Sep 16, 2020

jngrad commented Oct 13, 2020

jngrad commented Oct 23, 2020

jngrad commented Oct 23, 2020

DipolarBarnesHutGpu and DipolarDirectSumGpu tests fail on ROCm #3895

DipolarBarnesHutGpu and DipolarDirectSumGpu tests fail on ROCm #3895

Comments

espresso-ci commented Sep 14, 2020

jngrad commented Sep 15, 2020

jngrad commented Sep 15, 2020

mkuron commented Sep 15, 2020

mkuron commented Sep 16, 2020

jngrad commented Sep 16, 2020

jngrad commented Sep 16, 2020

jngrad commented Sep 16, 2020

jngrad commented Oct 13, 2020

jngrad commented Oct 23, 2020

jngrad commented Oct 23, 2020