Somewhat regular failures of gcc sanitizer tls_proxy cli test #4112
Comments
I've tried running it locally a few hundred times with sanitizers enabled, no crash |
https://github.com/randombit/botan/actions/runs/9533945244/job/26277782022
|
My guess: The Lines 284 to 298 in d24c2c3
... now: when do we get into this situation? :) |
Another interesting point: apparently, we had issues with this test on Windows about half a year ago. There, we simply disabled it. Perhaps that's the same root cause. Lines 1302 to 1305 in d24c2c3
|
If the TLS session to the client was established (when tls_session_activated() was called), and the connection to the server was also established successfully (ec in the onConnect() callback was not set), but -- in the meantime -- the this pointer was deallocated via std::enable_shared_from_this, we end up in a use-after-free situation. This sporadically appeared in CI but wasn't reproducible locally, see randombit#4112.
So the patch referenced above seems to "work": on the GCC sanitizer job this no longer produces an ASan error, but a failed test instead. I don't think that's coincidental, because it didn't fail on any other CI target.
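To illustrate the hazard described in that commit message, here is a minimal sketch of one common guard: the queued completion handler holds only a `std::weak_ptr` and checks whether the object is still alive before touching it. This is not the patch that was applied to tls_proxy (the thread does not show it), and it uses a plain `std::queue` instead of asio. The names `Session`, `HandlerQueue`, `async_connect` and `run_demo` are hypothetical and exist only for this example.

```cpp
#include <functional>
#include <memory>
#include <queue>
#include <string>
#include <utility>

// Hypothetical stand-in for an event loop: handlers are queued and run later,
// never in the call stack that registered them.
using HandlerQueue = std::queue<std::function<void()>>;

// Hypothetical session object; NOT the tls_proxy implementation.
class Session : public std::enable_shared_from_this<Session> {
   public:
      explicit Session(std::string name) : m_name(std::move(name)) {}

      // Queue a completion handler that holds only a weak reference to *this.
      // Capturing the raw `this` pointer here would be the use-after-free
      // pattern: the handler could run after the last owner is gone.
      void async_connect(HandlerQueue& queue, std::string& log) {
         std::weak_ptr<Session> weak = shared_from_this();
         queue.push([weak, &log]() {
            if(auto self = weak.lock()) {
               // The object is still alive, and `self` keeps it alive
               // for the duration of this handler.
               log += "connected:" + self->m_name;
            } else {
               // The object was destroyed before the handler ran.
               log += "session-gone";
            }
         });
      }

   private:
      std::string m_name;
};

// Registers one handler, optionally drops the only owner, then drains the queue.
std::string run_demo(bool destroy_before_handler) {
   HandlerQueue queue;
   std::string log;

   auto session = std::make_shared<Session>("client-1");
   session->async_connect(queue, log);

   if(destroy_before_handler) {
      session.reset();  // last shared_ptr goes away, Session is destroyed
   }

   while(!queue.empty()) {
      queue.front()();
      queue.pop();
   }
   return log;
}
```

Under these assumptions, `run_demo(false)` returns `"connected:client-1"` and `run_demo(true)` returns `"session-gone"`. Capturing `shared_from_this()` directly in the handler would instead keep the object alive until the handler has run; which of the two is appropriate depends on whether a pending operation should extend the session's lifetime.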
|
Drilling deeper: The ASan stack traces claim that the pointer was freed in Lines 284 to 298 in d24c2c3
Interestingly, when called from the Lines 256 to 282 in d24c2c3
To the best of my knowledge, asio's async operations won't ever call the given callback immediately in the same call stack. Though, the ASan call stack doesn't really support that, given that the deallocation happens with the call stack coming from @randombit Did this start to appear after we switched to Ubuntu 24.04, perhaps with a new boost version? The switch to Ubuntu 24.04 was merged on June 5th, this ticket was created on June 11th. But did we see that before? |
In any case: The
"Server read failed End of file" probably means that the server socket was found closed when a read was attempted. "Read failed Operation canceled" probably points to a timeout of the client. Neither error state spawns another async operation that could hang on to the shared pointer. Likely, some unlucky constellation of the CI build machine stalling, a client timeout and the queuing order of async handlers in "Thread T2" can cause a situation to occur where It's not unthinkable that |
Similarly, the Windows build reports:
... these errors don't seem to produce a non-zero error code from the CLI invocation. Hence, I had to make the printout more aggressive. Edit: it turns out that these error messages are actually somewhat expected, because |
It seems that -- after fixing the actual use-after-free -- we're simply running into a timeout. The test case seems to be running for a ridiculously long 10 seconds on the CI. |
At least on the GCC sanitizer CI job there seems to be an interaction between That said: on Windows it fails regardless with " Read failed The network connection was aborted by the local system", which still sounds like a timeout to me. |
I've been playing Python's |
Nothing but
https://github.com/randombit/botan/actions/runs/9439667258/job/25999341641?pr=4042
I've seen this several times, always on this specific build (not for instance clang)
Probably better logging from the CLI test runner would help