Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX: Race conditions in test_cli.py when spawning test server processes #4178

Merged

Conversation

reneme
Copy link
Collaborator

@reneme reneme commented Jul 4, 2024

Pull Request Dependencies

Description

The test_cli.py script handles and orchestrates a few long-running processes (namely ./botan tls_server, ./botan tls_client, ./botan tls_proxy). For those tests to work properly, we have to ensure that the server processes are ready before trying to interact with them. Especially on the CI where runtime behavior is notoriously unpredictable. Until now, this was done by invoking time.sleep(1) after launching a server process. This worked most of the time, but introduced unnecessary waiting time while still being potentially racey.

This now uses Python's asyncio module to handle such long-running processes and interact with them. Each CLI server now reports a "Listening for new connections..." string on stdout once its ready to receive connections. The asyncio-based wrapper can wait for this string and thus ensure that the server is ready before the test continues. Also, the wrapper makes sure to timeout on any of those waits (hard-coded 15 seconds) and kill the process if necessary.

Result

Such tests can now rely on their server processes being in a defined state, without sleeping some arbitrary time and hoping for the best. As a side effect, running ./test_cli.py --threads=4 now takes 2.5 seconds on my laptop, instead of the more than 20 seconds it used to take due to the time.sleep() workarounds.

Fixes #4112

@reneme reneme added the enhancement Enhancement or new feature label Jul 4, 2024
@reneme reneme added this to the Botan 3.5.0 milestone Jul 4, 2024
@reneme reneme self-assigned this Jul 4, 2024
@coveralls
Copy link

Coverage Status

coverage: 91.729% (-0.002%) from 91.731%
when pulling 3294602 on Rohde-Schwarz:fix/race_conditions_in_test_cli
into f4c26e4 on randombit:master.

Copy link
Owner

@randombit randombit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice work!

The pylint failures in CI look relevant. If there is anything it's warning about that isn't sensible to change, feel free to locally disable the warning.

This uses Python's async process handling based on the asyncio module
as a replacement for the low-level use of subprocess.Popen along with
arbitrary (and inherently racey) time.sleep() calls.
@reneme reneme force-pushed the fix/race_conditions_in_test_cli branch from 3294602 to 934f3fe Compare July 4, 2024 09:54
@reneme
Copy link
Collaborator Author

reneme commented Jul 4, 2024

In the first run macOS 14 (on Apple Silicon) failed obscurely:

     [...]
     INFO: Ran cli_zfec_tests in 0.06 sec
     INFO: Ran cli_trust_root_tests in 0.54 sec
     INFO: Ran cli_tls_proxy_tests in 5.14 sec
  Ran 234 tests with 0 failures in 6.40 seconds
  Command 'python3 ./src/scripts/test_cli.py --threads=3 ./botan' failed with error code -11

.... looks like the interpreter crashed after the script had finished. 😨
This didn't show up in a second CI run, but it is certainly worrisome. Obviously I cannot reproduce on my MacBook Air...

I'd suggest to just merge and see whether that comes back. Perhaps we'll end up having to update the python interpreter in the build job setup or disable those tests on macOS 14 for some time. 😒

@coveralls
Copy link

Coverage Status

coverage: 91.723% (+0.001%) from 91.722%
when pulling 934f3fe on Rohde-Schwarz:fix/race_conditions_in_test_cli
into a75bb25 on randombit:master.

@randombit
Copy link
Owner

I think the macOS failure is #3991 so unrelated to this change.

@reneme
Copy link
Collaborator Author

reneme commented Jul 4, 2024

Let's hope so. In it goes!

@reneme reneme merged commit 3e65302 into randombit:master Jul 4, 2024
39 checks passed
@reneme reneme deleted the fix/race_conditions_in_test_cli branch July 4, 2024 10:55
@randombit
Copy link
Owner

Couple of Windows builds in the merge are failing, maybe we need more than 15 seconds in practice on CI?

randombit added a commit that referenced this pull request Jul 4, 2024
This was disabled in #3845 due to flakyness, then thought possibly
fixed and enabled again in #4178. However with #4178 merged the
test still occasionally fails. Disable it again pending diagnosis.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancement or new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Somewhat regular failures of gcc sanitizer tls_proxy cli test
3 participants