CI build failed for merged PR #3587

Closed
espresso-ci opened this issue Mar 19, 2020 · 4 comments · Fixed by #3623

Comments

@espresso-ci

https://gitlab.icp.uni-stuttgart.de/espressomd/espresso/pipelines/11478

@jngrad
Member

jngrad commented Mar 19, 2020

https://gitlab.icp.uni-stuttgart.de/espressomd/espresso/-/jobs/215951
duplicate of #3581
@mkuron the ek_charged_plate test has become flaky on ROCm recently.

@mkuron
Member

mkuron commented Mar 19, 2020

That's odd. There haven't been any changes to the EK recently (or to ROCm) as far as I remember. Do you have an easy way of checking the CI logs to see when it popped up for the first time?

@jngrad
Member

jngrad commented Mar 19, 2020

As root on elk, you can do

grep -rF 'ek_charged_plate (Failed)' /srv/gitlab/artifacts/*/*/*/2020_03_*/*/*/job.log

@jngrad
Member

jngrad commented Mar 19, 2020

To limit the results to the python branch, to ROCm builds and to the last 3 months:

grep -l -P 'Checking out .{6,12} as python' $(grep -l -F 'rocm-python3:latest' $(grep -l -F 'ek_charged_plate (Failed)' /srv/gitlab/artifacts/*/*/*/2020_0?_*/*/*/job.log))

There are only two results. Read the files with less -R so that the color escape sequences are interpreted instead of being shown as raw symbols. Removing the python constraint returns a lot of PR logfiles in which most tests failed.
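The nested command substitutions above can be awkward when an inner grep matches nothing, since the outer grep is then left without file operands and reads from standard input. A minimal sketch of the same three-way filter as a loop follows. The function name `find_flaky_rocm_logs` is hypothetical, and `-E` is used in place of `-P` on the assumption that the pattern needs no Perl-specific syntax.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the ESPResSo tooling): print every job log
# given as an argument that satisfies all three criteria from the command above.
find_flaky_rocm_logs() {
    local log
    for log in "$@"; do
        # Skip arguments that are not regular files (e.g. an unmatched glob).
        [ -f "$log" ] || continue
        # 1. The ek_charged_plate test failed in this job.
        grep -q -F 'ek_charged_plate (Failed)' "$log" || continue
        # 2. The job ran in the ROCm image.
        grep -q -F 'rocm-python3:latest' "$log" || continue
        # 3. The job checked out the python branch.
        grep -q -E 'Checking out .{6,12} as python' "$log" || continue
        printf '%s\n' "$log"
    done
}
```

It could be called with the same glob, for example `find_flaky_rocm_logs /srv/gitlab/artifacts/*/*/*/2020_0?_*/*/*/job.log`, and prints nothing when no log matches.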

@kodiakhq kodiakhq bot closed this as completed in #3623 Apr 3, 2020
kodiakhq bot added a commit that referenced this issue Apr 3, 2020
The `ln -s /opt/rocm/bin/hcc* /opt/rocm/hip/bin/` issue has been worked around by properly setting `HCC_PATH` on the CMake side.
The shutdown issue has been worked around by replacing interrupts with polling (suggested at ROCm/roctracer#22 (comment)). Something is wrong with the destruction order in our code, but I cannot easily identify what. It's not the missing `cudaStreamDestroy` though.

Fixes #3620 (according to `ctest -R save_checkpoint_lb.cpu-p3m.cpu-lj-therm.lb_1 --repeat-until-fail 1000`).
Fixes #3587 (according to `ctest -R ek_charged_plate --repeat-until-fail 100`).

**TODO**
- https://github.com/espressomd/docker/blob/master/docker/rocm-python3/Dockerfile-latest needs to be updated to ROCm 3.3 once this pull request is merged.