Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up test suite to <20 minutes from >30 minutes #6127

Closed
derekbruening opened this issue Jun 8, 2023 · 4 comments · Fixed by #6563
Closed

Speed up test suite to <20 minutes from >30 minutes #6127

derekbruening opened this issue Jun 8, 2023 · 4 comments · Fixed by #6563

Comments

@derekbruening
Copy link
Contributor

Xref some discussion at #6096 (comment) and #6096 (comment) at adding parallelism to the ci-aarch64-native runner to avoid a backlog from simultaneous PR's waiting on each other.

The latency of the ci-aarch64-native job is also something to look at: when it was Jenkins it was pretty fast, under 10 minutes IIRC. Now you can see it is consistently over 33 minutes and is by far the slowest of our jobs:

https://github.com/DynamoRIO/dynamorio/actions/workflows/ci-aarch64-native.yml

The general goal is to be under 20 minutes. The Windows and Linux x86 jobs have gotten slower over time and are both over 20 minutes now. This issue covers trying to speed all of these up. The ci-aarch64-native seems the highest priority at the longest and because it used to be faster making it seems like there may be some low-hanging fruit there for improvements.

@joshua-warburton
Copy link
Collaborator

I've recently looked into it an the length of the aarch64 tests is almost completely due to the client attach/detach tests all timing out after 10 minutes.

We should be able to reduce this by either completely disabling the de/attach tests, fixing them to actually work, or reducing the timeouts.

@derekbruening
Copy link
Contributor Author

Looks like that is #5740 which says it happened after a Jenkins hardware change.

The client.detach test did not exist until 2 days ago in PR #6513. It has been flaky on x86-64 since being added which I filed as #6536; the failures look like the client.attach_test failures which had happened a few times on the Linux x86-64 suite: #6452. But, those x86-64 failures are not timeouts, so it is not clear any of that is related to #5740.

derekbruening added a commit that referenced this issue Jan 12, 2024
Now that we've enabled ptrace privileges on the a64 testing machine we
can remove the attach/detach tests from the flaky list as they no
longer hang.  The attach test passed 100x in a row for me so it
doesn't seem to be hitting flakes seen on other platforms like #6452.

Issue: #5740, #6558, #6127
Fixes #5740
derekbruening added a commit that referenced this issue Jan 12, 2024
Now that we've enabled ptrace privileges on the a64 testing machine we
can remove the attach/detach tests from the flaky list as they no longer
hang. The attach_test, attach_blocking, and deatch_test each passed 200x
in a row on this machine so it doesn't seem to be hitting flakes seen on
other platforms like #6452.

Issue: #5740, #6558, #6127
Fixes #5740
@derekbruening derekbruening self-assigned this Jan 12, 2024
@derekbruening
Copy link
Contributor Author

You can see the drop from 33 mins down to 8 mins at the point I enabled ptrace: https://github.com/DynamoRIO/dynamorio/actions/workflows/ci-aarch64-native.yml

Maybe the only final action item would be to reduce the timeouts: but we already did that, with the default at 90s. So I'm confused as to why the attach/detach tests would take 10 mins to time out: I don't see timeout overrides for those. Hmm actually it seems to only explicitly set a CTest timeout property for runcmp -- I will fix that, as part of this and #6558.

derekbruening added a commit that referenced this issue Jan 12, 2024
Before, we relied on drrun -s for all test suite timeouts except for
runcmp tests where we set a CTest timeout.  This resulted in the
default 10 minute CTest timeout for all tests, which was the only
timeout for runall tests and caused long suite times on the AArch64
machine which accidentally had no ptrace privileges (#5740, #6558,

Here, we set the CTest time for runall in addition to runcmp, and for
all other tests with no timeout specified (which are presumably
relying on drrun -s) we set a timeout of the drrun timeout plus 30
seconds.

Tested on the attach test:

Before:
```
123: Test timeout computed to be: 1500
```
Now:
```
$ echo 1 | sudo tee /proc/sys/kernel/yama/ptrace_scope; /usr/bin/time ctest -V -R client.attach_test; echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
...
    Start 121: code_api|client.attach_test
...
121: Test timeout computed to be: 90
1/1 Test #121: code_api|client.attach_test ......***Timeout  90.11 sec
...
The following tests FAILED:
	121 - code_api|client.attach_test (Timeout)
...
Command exited with non-zero status 8
1.13user 0.80system 1:30.14elapsed 2%CPU (0avgtext+0avgdata 13196maxresident)k
```

Fixes #6127
Issue: #6127, #6558, #5740
@joshua-warburton
Copy link
Collaborator

Glad to see it. I'm making sure this is set properly on the new AWS runners too.

derekbruening added a commit that referenced this issue Jan 16, 2024
Before, we relied on drrun -s for all test suite timeouts except for
runcmp tests where we set a CTest timeout. This resulted in the default
10 minute CTest timeout for all tests, which was the only timeout for
runall tests and caused long suite times on the AArch64 machine which
accidentally had no ptrace privileges (#5740, #6558,

Here, we set the CTest time for runall in addition to runcmp, and for
all other tests with no timeout specified (which are presumably relying
on drrun -s) we set a timeout of the drrun timeout plus 30 seconds.

Tested on the attach test:

Before:
```
123: Test timeout computed to be: 1500
```
Now:
```
$ echo 1 | sudo tee /proc/sys/kernel/yama/ptrace_scope; /usr/bin/time ctest -V -R client.attach_test; echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
...
    Start 121: code_api|client.attach_test
...
121: Test timeout computed to be: 90
1/1 Test #121: code_api|client.attach_test ......***Timeout  90.11 sec
...
The following tests FAILED:
	121 - code_api|client.attach_test (Timeout)
...
Command exited with non-zero status 8
1.13user 0.80system 1:30.14elapsed 2%CPU (0avgtext+0avgdata 13196maxresident)k
```

The property being set on more tests was confirmed on debug x86-64:
Before:
```
$ grep -c TIMEOUT suite/tests/CTestTestfile.cmake
94
```
After:
```
$ grep -c TIMEOUT suite/tests/CTestTestfile.cmake
463
```
There seem to still be a few missing the property: the ones that don't
go through suite/. There were other efforts to avoid hangs on those such
as PR #6137.

Issue: #6127, #6558, #5740
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants