
TLS-Anvil server tests sporadically fail #3694

Closed
randombit opened this issue Sep 13, 2023 · 7 comments · Fixed by #3773
@randombit
Owner

The test server.tls13.rfc8446.HelloRetryRequest.selectsSameCipherSuiteAllAtOnce fails intermittently:

https://github.com/randombit/botan/actions/runs/6079851755/job/16492923998
https://github.com/randombit/botan/actions/runs/6167596027/job/16738779015

Error: server.tls13.rfc8446.HelloRetryRequest.selectsSameCipherSuiteAllAtOnce - Unexpected result 'PARTIALLY_FAILED'
  
  Class Name: 'de.rub.nds.tlstest.suite.tests.server.tls13.rfc8446.HelloRetryRequest'
  Method Name: 'selectsSameCipherSuiteAllAtOnce'
  
  RFC 8446, Section 4.2.10 Early Data Indication:
  Servers MUST ensure that they negotiate the same cipher suite when receiving a conformant updated ClientHello
  
  Result: PARTIALLY_FAILED (expected STRICTLY_SUCCEEDED)

(cc @FAlbertDev)

@FAlbertDev
Collaborator

I'll look into this!

@FAlbertDev
Collaborator

Well, the connection in this test sometimes runs into a timeout. By the way, another test also failed once: selectsSameCipherSuite in https://github.com/randombit/botan/actions/runs/5915148567/job/16041123636.

I tried both tests locally and had no problems at all, even with a very small timeout of 50 ms configured. Since the CI action runners are sometimes relatively slow, I configured a timeout of 5 seconds per TLS session, which I thought was enough. My only guess is that the GitHub runner takes a nap for whatever reason, taking longer than these 5 seconds to send the respective TLS packets. Since HelloRetryRequest tests take longer due to the additional round trip, the problem only becomes apparent in those tests.

I can't think of any other reason. What do you think? Can GitHub runners really be this slow?
If my guess is correct, I can adjust the timeout to 10 sec or 20 sec, and we should be fine.
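To illustrate the failure mode I suspect, here is a minimal, hypothetical sketch (plain TCP, not TLS-Anvil or Botan code) of how a fixed per-session timeout interacts with a peer that is slow to respond, e.g. because the runner stalls:

```python
# Hypothetical sketch: a server thread delays before answering; the client's
# receive times out if the delay exceeds its configured budget. This mimics a
# CI runner stalling longer than the test harness's per-session timeout.
import socket
import threading
import time

def slow_server(srv, delay):
    conn, _ = srv.accept()
    time.sleep(delay)  # simulated scheduling hiccup on a slow runner
    try:
        conn.sendall(b"ServerHello")
    except OSError:
        pass  # client may have given up and closed already
    conn.close()

def handshake(delay, timeout):
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    t = threading.Thread(target=slow_server, args=(srv, delay))
    t.start()
    cli = socket.create_connection(srv.getsockname(), timeout=timeout)
    try:
        return cli.recv(64)  # raises socket.timeout if the server is too slow
    except socket.timeout:
        return None          # the harness would report this as a failed session
    finally:
        cli.close()
        t.join()
        srv.close()

print(handshake(delay=0.0, timeout=1.0))   # fast server: data arrives in time
print(handshake(delay=0.3, timeout=0.05))  # slow server + small timeout: None
```

Raising the timeout only helps if the runner's stall is genuinely bounded; if the real cause is a race rather than slowness, a larger timeout just hides it less often.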

@reneme
Collaborator

reneme commented Sep 14, 2023

Can GitHub runners really be this slow?

Lags of five seconds sound quite unlikely to me, tbh. It could also be some sort of race condition in the CLI's TLS code. Some very narrow races tend to show up much more frequently on GitHub Actions than on local hosts, because the runners are slower. 🤡

Would be amazing to get a stack trace of the server process when it fails with a timeout. Is that something the TLS-Anvil toolchain supports?

@randombit
Owner Author

Might really be that slow, might be a race in our code - either seems like a plausible explanation to me.

Might be worth trying to run this test while the server process is executed under helgrind (https://valgrind.org/docs/manual/hg-manual.html).

@FAlbertDev
Collaborator

I admit I have some trouble understanding the helgrind output. I ran the tests with the tls_http_server under helgrind (`valgrind --tool=helgrind ./botan tls_http_server <args>`) and got this log, which contains various possible data races.

For experimentation, I also ran the tests against the normal tls_server (`valgrind --tool=helgrind ./botan tls_server <args>`) and likewise got some data races (log). They seem to be within the RSA code, where I found the following:

```cpp
#if defined(BOTAN_HAS_THREAD_UTILS) && !defined(BOTAN_HAS_VALGRIND)
   #define BOTAN_RSA_USE_ASYNC
#endif
```

This means asynchronous RSA is only used when the --with-valgrind configure flag is not set. I did not set the flag because I didn't know about it. I know this flag is used in the nightly valgrind test, which therefore disables async RSA there. Is there a reason why we disable async RSA for these tests?

There are also a bunch of races within the ASIO stuff of the tls_http_server, but it's hard for me to interpret these results...

@reneme
Collaborator

reneme commented Sep 15, 2023

Let's look into it together. Maybe we can ultimately combine that with #3659.

@reneme reneme added the bug label Sep 15, 2023
@reneme reneme added this to the Botan 3.2.0 milestone Sep 15, 2023
@lieser lieser modified the milestones: Botan 3.2.0, Botan 3.3.0 Oct 10, 2023
@FAlbertDev
Collaborator

I misinterpreted the TLS-Anvil log. The tests did not fail due to a timeout, as I thought, but for another reason. It seems the issue is caused not by Botan but by TLS-Anvil. To address it, I opened a PR for TLS-Anvil. I'm optimistic about fixing the problem soon.
