tool.drcachesim.threads-with-config-file occasional timeout on A64 Jenkins #4954

Closed
derekbruening opened this issue Jun 18, 2021 · 7 comments · Fixed by #5375

Comments

@derekbruening

Hit a timeout in PR #4952 on tool.drcachesim.threads-with-config-file: took 90.00s:
http://139.178.84.19:8080/job/DynamoRIO-AArch64-Precommit/98/consoleFull

$
] 90.00 sec

    -------------------------------------------------------------------
     Performance for solving AX=B Linear Equation using Jacobi method
     Running on DynamoRIO
     Client version (null)
    ...................................................................

     Matrix Size :  64
     Threads     :  4


     Started iteration 1 of the computation...

     Finished computing current solution distance in mode 0.
     Mode changed to 0.

     Started iteration 2 of the computation...

     Finished computing current solution distance in mode 0.
     Mode changed to 0.

     Started iteration 3 of the computation...

It ends there, in iteration 3.

This happened on PR #4949 too: http://139.178.84.19:8080/job/DynamoRIO-AArch64-Precommit/95/

It passed on re-running and took very little time:

41/50 Test #177: code_api|tool.drcachesim.threads-with-config-file ...............   Passed    0.30 sec

So this does not look like a test that always runs close to the time limit; it may be an actual hang.
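
A minimal sketch of one way to separate a genuine hang from a run that is merely slow: rerun the test many times under a fixed time limit and stop at the first failure. The helper name `repeat_until_fail` is hypothetical and not part of DynamoRIO or CTest; it assumes the coreutils `timeout` command, which exits with status 124 when the limit expires.

```shell
# Hypothetical helper (not part of DynamoRIO or CTest): rerun a command up to
# RUNS times, each run limited to LIMIT seconds, stopping at the first failure.
# Assumes coreutils `timeout`, which exits with status 124 on expiry.
repeat_until_fail() {
    runs=$1; limit=$2; shift 2
    i=1
    while [ "$i" -le "$runs" ]; do
        status=0
        timeout "$limit" "$@" || status=$?
        if [ "$status" -eq 124 ]; then
            # The command was still running when the limit expired.
            echo "run $i: timed out after ${limit}s (possible hang)"
            return 124
        elif [ "$status" -ne 0 ]; then
            # The command finished on its own but reported an error.
            echo "run $i: failed with status $status"
            return "$status"
        fi
        i=$((i + 1))
    done
    echo "all $runs runs passed"
}
```

For example, from a build directory, `repeat_until_fail 100 90 ctest -R threads-with-config-file` would rerun the test up to 100 times with a 90-second limit per run. A timeout on a test that normally finishes in well under a second points to a hang rather than a tight time limit.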

@derekbruening

The coherence test hit the same hang. It runs the same app so likely the same problem:

http://139.178.84.19:8080/job/DynamoRIO-AArch64-Precommit/126/consoleFull

50/50 Test #178: code_api|tool.drcachesim.coherence ..............................***Timeout 150.02 sec

    -------------------------------------------------------------------
     Performance for solving AX=B Linear Equation using Jacobi method
     Running on DynamoRIO
     Client version (null)
    ...................................................................

     Matrix Size :  64
     Threads     :  4


     Started iteration 1 of the computation...

     Finished computing current solution distance in mode 0.
     Mode changed to 0.

     Started iteration 2 of the computation...

@johnfxgalea

@derekbruening

Xref #3971 where coherence hung on Windows.

Hung again on AArch64 Jenkins: http://139.178.84.19:8080/job/DynamoRIO-AArch64-Precommit/161/consoleFull

@abhinav92003

I think we should add this test to the ignore list. Failures have become very frequent, and it's blocking #4941 right now.

I see the original issue was a timeout, but we're seeing a regex mismatch too now: http://139.178.84.19:8080/job/DynamoRIO-AArch64-Precommit/178/console, http://139.178.84.19:8080/job/DynamoRIO-AArch64-Precommit/234/

abhinav92003 added a commit that referenced this issue Jul 15, 2021
Adds nativeexec tests to ignore list for x86-64. We've been seeing a lot of red
on CI due to failures on different options for these tests.

Also ignores common.fib for one combination of options on vs2017-32.

Also ignores tool.drcachesim.threads-with-config-file failures on AArch64.

Issue: #5010, #1807, #4954
@derekbruening

Hit again on coherence test on Jenkins: http://139.178.84.19:8080/job/DynamoRIO-AArch64-Precommit/271/consoleFull

@derekbruening

There may be just one underlying hang bug causing all of these failures: this one, #4928, #2417

@derekbruening

These all hang only in release builds; I can't reproduce the problem in debug builds.

derekbruening added a commit that referenced this issue Feb 18, 2022
Adds private loader redirection of open, close, read, and write to
DR's syscall-wrapper versions (plus file descriptor isolation, for
open and close).  The libc write invokes pthread code for cancel
features, and we are not able to create a private libpthread or
isolate pthread resources (#956) which leads to poor interactions with
application pthread uses and observed hangs.

Tested on the AArch64 Jenkins machine where these tests all hung every
5 to 10 runs in release build before and now they succeed 20,000 times
in a row:
```
--------------------------------------------------
derek@dynamorio:~/dr/build_rel$ for i in sim.threads\$ sim.TLB-threads sim.coherence sim.threads-with; do echo $i; ctest --repeat-until-fail 20000 -R $i > RUN-$i 2>&1; done
sim.threads$
sim.TLB-threads
sim.coherence
sim.threads-with
derek@dynamorio:~/dr/build_rel$ grep -c Passed RUN-*
RUN-sim.coherence:20000
RUN-sim.threads$:20000
RUN-sim.threads-with:20000
RUN-sim.TLB-threads:20000
derek@dynamorio:~/dr/build_rel$ grep failed RUN-*
RUN-sim.coherence:100% tests passed, 0 tests failed out of 1
RUN-sim.threads$:100% tests passed, 0 tests failed out of 1
RUN-sim.threads-with:100% tests passed, 0 tests failed out of 1
RUN-sim.TLB-threads:100% tests passed, 0 tests failed out of 1
--------------------------------------------------
```
While at it, removes drcachesim.invariants which was tested as well
and has no failures, under the theory that the original failures were
these same release-build hangs.  Today, it's a debug-only test.

Issue: #4928, #4954, #2417, #956
Fixes #4928
Fixes #4954
Fixes #2892
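
The check in the transcript above (counting `Passed` lines in each log) can be sketched as a small helper that flags any log with fewer passes than expected. The function name `check_runs` is hypothetical, and the sketch assumes each successful run contributes exactly one line containing `Passed`, as the `grep -c` counts in the transcript suggest.

```shell
# Hypothetical helper: report every log whose count of "Passed" lines differs
# from the expected number of runs. Returns non-zero if any log falls short.
check_runs() {
    expected=$1; shift
    bad=0
    for log in "$@"; do
        # grep -c prints 0 and exits non-zero when nothing matches.
        count=$(grep -c Passed "$log") || true
        count=${count:-0}
        if [ "$count" -ne "$expected" ]; then
            echo "$log: $count of $expected runs passed"
            bad=1
        fi
    done
    return "$bad"
}
```

With the logs from the transcript, `check_runs 20000 RUN-*` would print nothing and return 0 if every test passed all 20,000 runs.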