Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Somewhat regular failures of gcc sanitizer tls_proxy cli test #4112

Closed
randombit opened this issue Jun 10, 2024 · 11 comments · Fixed by #4178
Closed

Somewhat regular failures of gcc sanitizer tls_proxy cli test #4112

randombit opened this issue Jun 10, 2024 · 11 comments · Fixed by #4178

Comments

@randombit
Copy link
Owner

Nothing but

ERROR: Unexpected return code from tls_proxy 1

https://github.com/randombit/botan/actions/runs/9439667258/job/25999341641?pr=4042

I've seen this several times, always on this specific build (not for instance clang)

Probably better logging from the CLI test runner would help

@randombit
Copy link
Owner Author

I've tried running it locally a few hundred times with sanitizers enabled, no crash ¯\_(ツ)_/¯

@randombit
Copy link
Owner Author

https://github.com/randombit/botan/actions/runs/9533945244/job/26277782022

2024-06-16T06:42:46.2160516Z    INFO: Stderr: =================================================================
2024-06-16T06:42:46.2162791Z ==16939==ERROR: AddressSanitizer: heap-use-after-free on address 0x515000002890 at pc 0x5601f64f5e52 bp 0x7fd5cbbfeeb0 sp 0x7fd5cbbfeea0
2024-06-16T06:42:46.2164664Z READ of size 8 at 0x515000002890 thread T2
2024-06-16T06:42:46.2169102Z     #0 0x5601f64f5e51 in boost::asio::detail::iterator_connect_op<boost::asio::ip::tcp, boost::asio::any_io_executor, boost::asio::ip::basic_resolver_iterator<boost::asio::ip::tcp>, boost::asio::detail::default_connect_condition, Botan_CLI::(anonymous namespace)::tls_proxy_session::tls_session_activated()::{lambda(boost::system::error_code, boost::asio::ip::basic_resolver_iterator<boost::asio::ip::tcp> const&)#1}>::operator()(boost::system::error_code, int) (/home/runner/work/botan/botan/build/botan+0xb06e51) (BuildId: c57781d1296d22b403cc28959fc40e0ba7e36ad3)
2024-06-16T06:42:46.2178518Z     #1 0x5601f64f7273 in boost::asio::detail::reactive_socket_connect_op<boost::asio::detail::iterator_connect_op<boost::asio::ip::tcp, boost::asio::any_io_executor, boost::asio::ip::basic_resolver_iterator<boost::asio::ip::tcp>, boost::asio::detail::default_connect_condition, Botan_CLI::(anonymous namespace)::tls_proxy_session::tls_session_activated()::{lambda(boost::system::error_code, boost::asio::ip::basic_resolver_iterator<boost::asio::ip::tcp> const&)#1}>, boost::asio::any_io_executor>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) (/home/runner/work/botan/botan/build/botan+0xb08273) (BuildId: c57781d1296d22b403cc28959fc40e0ba7e36ad3)
2024-06-16T06:42:46.2186288Z     #2 0x5601f6312d60 in boost::asio::detail::epoll_reactor::descriptor_state::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) (/home/runner/work/botan/botan/build/botan+0x923d60) (BuildId: c57781d1296d22b403cc28959fc40e0ba7e36ad3)
2024-06-16T06:42:46.2190797Z     #3 0x5601f630d518 in boost::asio::detail::scheduler::do_run_one(boost::asio::detail::conditionally_enabled_mutex::scoped_lock&, boost::asio::detail::scheduler_thread_info&, boost::system::error_code const&) (/home/runner/work/botan/botan/build/botan+0x91e518) (BuildId: c57781d1296d22b403cc28959fc40e0ba7e36ad3)
2024-06-16T06:42:46.2194545Z     #4 0x5601f630e687 in boost::asio::detail::scheduler::run(boost::system::error_code&) (/home/runner/work/botan/botan/build/botan+0x91f687) (BuildId: c57781d1296d22b403cc28959fc40e0ba7e36ad3)
2024-06-16T06:42:46.2197502Z     #5 0x5601f6507214 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<Botan_CLI::TLS_Proxy::go()::{lambda()#1}> > >::_M_run() (/home/runner/work/botan/botan/build/botan+0xb18214) (BuildId: c57781d1296d22b403cc28959fc40e0ba7e36ad3)
2024-06-16T06:42:46.2200087Z     #6 0x7fd5d0ceabb3  (/lib/x86_64-linux-gnu/libstdc++.so.6+0xeabb3) (BuildId: 40b9b0d17fdeebfb57331304da2b7f85e1396ef2)
2024-06-16T06:42:46.2201727Z     #7 0x7fd5d845e109 in asan_thread_start ../../../../src/libsanitizer/asan/asan_interceptors.cpp:234
2024-06-16T06:42:46.2203527Z     #8 0x7fd5d009ca93  (/lib/x86_64-linux-gnu/libc.so.6+0x9ca93) (BuildId: 08134323d00289185684a4cd177d202f39c2a5f3)
2024-06-16T06:42:46.2205220Z     #9 0x7fd5d0129c3b  (/lib/x86_64-linux-gnu/libc.so.6+0x129c3b) (BuildId: 08134323d00289185684a4cd177d202f39c2a5f3)
2024-06-16T06:42:46.2206118Z 
2024-06-16T06:42:46.2206788Z 0x515000002890 is located 16 bytes inside of 464-byte region [0x515000002880,0x515000002a50)
2024-06-16T06:42:46.2207702Z freed by thread T2 here:
2024-06-16T06:42:46.2208669Z     #0 0x7fd5d84fd738 in operator delete(void*, unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:164
2024-06-16T06:42:46.2212981Z     #1 0x5601f64d702f in std::_Sp_counted_ptr_inplace<Botan_CLI::(anonymous namespace)::tls_proxy_session, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_destroy() (/home/runner/work/botan/botan/build/botan+0xae802f) (BuildId: c57781d1296d22b403cc28959fc40e0ba7e36ad3)
2024-06-16T06:42:46.2216341Z     #2 0x5601f612428b in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release_last_use_cold() (/home/runner/work/botan/botan/build/botan+0x73528b) (BuildId: c57781d1296d22b403cc28959fc40e0ba7e36ad3)
2024-06-16T06:42:46.2219047Z     #3 0x5601f61246ab in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (/home/runner/work/botan/botan/build/botan+0x7356ab) (BuildId: c57781d1296d22b403cc28959fc40e0ba7e36ad3)
2024-06-16T06:42:46.2221863Z     #4 0x5601f61416aa in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() (/home/runner/work/botan/botan/build/botan+0x7526aa) (BuildId: c57781d1296d22b403cc28959fc40e0ba7e36ad3)
2024-06-16T06:42:46.2224920Z     #5 0x5601f64f31a3 in Botan_CLI::(anonymous namespace)::tls_proxy_session::server_read(boost::system::error_code const&, unsigned long) (/home/runner/work/botan/botan/build/botan+0xb041a3) (BuildId: c57781d1296d22b403cc28959fc40e0ba7e36ad3)
2024-06-16T06:42:46.2230360Z     #6 0x5601f64f3965 in boost::asio::detail::iterator_connect_op<boost::asio::ip::tcp, boost::asio::any_io_executor, boost::asio::ip::basic_resolver_iterator<boost::asio::ip::tcp>, boost::asio::detail::default_connect_condition, Botan_CLI::(anonymous namespace)::tls_proxy_session::tls_session_activated()::{lambda(boost::system::error_code, boost::asio::ip::basic_resolver_iterator<boost::asio::ip::tcp> const&)#1}>::operator()(boost::system::error_code, int) (/home/runner/work/botan/botan/build/botan+0xb04965) (BuildId: c57781d1296d22b403cc28959fc40e0ba7e36ad3)
2024-06-16T06:42:46.2239196Z     #7 0x5601f64f7273 in boost::asio::detail::reactive_socket_connect_op<boost::asio::detail::iterator_connect_op<boost::asio::ip::tcp, boost::asio::any_io_executor, boost::asio::ip::basic_resolver_iterator<boost::asio::ip::tcp>, boost::asio::detail::default_connect_condition, Botan_CLI::(anonymous namespace)::tls_proxy_session::tls_session_activated()::{lambda(boost::system::error_code, boost::asio::ip::basic_resolver_iterator<boost::asio::ip::tcp> const&)#1}>, boost::asio::any_io_executor>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) (/home/runner/work/botan/botan/build/botan+0xb08273) (BuildId: c57781d1296d22b403cc28959fc40e0ba7e36ad3)
2024-06-16T06:42:46.2246055Z     #8 0x5601f6312d60 in boost::asio::detail::epoll_reactor::descriptor_state::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) (/home/runner/work/botan/botan/build/botan+0x923d60) (BuildId: c57781d1296d22b403cc28959fc40e0ba7e36ad3)
2024-06-16T06:42:46.2250256Z     #9 0x5601f630d518 in boost::asio::detail::scheduler::do_run_one(boost::asio::detail::conditionally_enabled_mutex::scoped_lock&, boost::asio::detail::scheduler_thread_info&, boost::system::error_code const&) (/home/runner/work/botan/botan/build/botan+0x91e518) (BuildId: c57781d1296d22b403cc28959fc40e0ba7e36ad3)
2024-06-16T06:42:46.2253951Z     #10 0x5601f630e687 in boost::asio::detail::scheduler::run(boost::system::error_code&) (/home/runner/work/botan/botan/build/botan+0x91f687) (BuildId: c57781d1296d22b403cc28959fc40e0ba7e36ad3)
2024-06-16T06:42:46.2257152Z     #11 0x5601f6507214 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<Botan_CLI::TLS_Proxy::go()::{lambda()#1}> > >::_M_run() (/home/runner/work/botan/botan/build/botan+0xb18214) (BuildId: c57781d1296d22b403cc28959fc40e0ba7e36ad3)
2024-06-16T06:42:46.2259816Z     #12 0x7fd5d0ceabb3  (/lib/x86_64-linux-gnu/libstdc++.so.6+0xeabb3) (BuildId: 40b9b0d17fdeebfb57331304da2b7f85e1396ef2)
2024-06-16T06:42:46.2260777Z 
2024-06-16T06:42:46.2261004Z previously allocated by thread T0 here:
2024-06-16T06:42:46.2262179Z     #0 0x7fd5d84fc698 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:95
2024-06-16T06:42:46.2264348Z     #1 0x5601f6519416 in Botan_CLI::(anonymous namespace)::tls_proxy_server::make_session() (/home/runner/work/botan/botan/build/botan+0xb2a416) (BuildId: c57781d1296d22b403cc28959fc40e0ba7e36ad3)
2024-06-16T06:42:46.2266895Z     #2 0x5601f651eaf6 in Botan_CLI::TLS_Proxy::go() (/home/runner/work/botan/botan/build/botan+0xb2faf6) (BuildId: c57781d1296d22b403cc28959fc40e0ba7e36ad3)
2024-06-16T06:42:46.2270527Z     #3 0x5601f6121b04 in Botan_CLI::Command::run(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) (/home/runner/work/botan/botan/build/botan+0x732b04) (BuildId: c57781d1296d22b403cc28959fc40e0ba7e36ad3)
2024-06-16T06:42:46.2274019Z     #4 0x5601f6170fe6 in main (/home/runner/work/botan/botan/build/botan+0x781fe6) (BuildId: c57781d1296d22b403cc28959fc40e0ba7e36ad3)
2024-06-16T06:42:46.2275990Z     #5 0x7fd5d002a1c9  (/lib/x86_64-linux-gnu/libc.so.6+0x2a1c9) (BuildId: 08134323d00289185684a4cd177d202f39c2a5f3)
2024-06-16T06:42:46.2277825Z     #6 0x7fd5d002a28a in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2a28a) (BuildId: 08134323d00289185684a4cd177d202f39c2a5f3)
2024-06-16T06:42:46.2279616Z     #7 0x5601f60f2f34 in _start (/home/runner/work/botan/botan/build/botan+0x703f34) (BuildId: c57781d1296d22b403cc28959fc40e0ba7e36ad3)
2024-06-16T06:42:46.2280663Z 
2024-06-16T06:42:46.2280841Z Thread T2 created by T0 here:
2024-06-16T06:42:46.2281851Z     #0 0x7fd5d84f3469 in pthread_create ../../../../src/libsanitizer/asan/asan_interceptors.cpp:245
2024-06-16T06:42:46.2284382Z     #1 0x7fd5d0ceac88 in std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) (/lib/x86_64-linux-gnu/libstdc++.so.6+0xeac88) (BuildId: 40b9b0d17fdeebfb57331304da2b7f85e1396ef2)
2024-06-16T06:42:46.2288338Z     #2 0x5601f6121b04 in Botan_CLI::Command::run(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) (/home/runner/work/botan/botan/build/botan+0x732b04) (BuildId: c57781d1296d22b403cc28959fc40e0ba7e36ad3)
2024-06-16T06:42:46.2292083Z     #3 0x5601f6170fe6 in main (/home/runner/work/botan/botan/build/botan+0x781fe6) (BuildId: c57781d1296d22b403cc28959fc40e0ba7e36ad3)
2024-06-16T06:42:46.2294092Z     #4 0x7fd5d002a1c9  (/lib/x86_64-linux-gnu/libc.so.6+0x2a1c9) (BuildId: 08134323d00289185684a4cd177d202f39c2a5f3)
2024-06-16T06:42:46.2295904Z     #5 0x7fd5d002a28a in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2a28a) (BuildId: 08134323d00289185684a4cd177d202f39c2a5f3)
2024-06-16T06:42:46.2297778Z     #6 0x5601f60f2f34 in _start (/home/runner/work/botan/botan/build/botan+0x703f34) (BuildId: c57781d1296d22b403cc28959fc40e0ba7e36ad3)
2024-06-16T06:42:46.2298879Z 
2024-06-16T06:42:46.2303906Z SUMMARY: AddressSanitizer: heap-use-after-free (/home/runner/work/botan/botan/build/botan+0xb06e51) (BuildId: c57781d1296d22b403cc28959fc40e0ba7e36ad3) in boost::asio::detail::iterator_connect_op<boost::asio::ip::tcp, boost::asio::any_io_executor, boost::asio::ip::basic_resolver_iterator<boost::asio::ip::tcp>, boost::asio::detail::default_connect_condition, Botan_CLI::(anonymous namespace)::tls_proxy_session::tls_session_activated()::{lambda(boost::system::error_code, boost::asio::ip::basic_resolver_iterator<boost::asio::ip::tcp> const&)#1}>::operator()(boost::system::error_code, int)
2024-06-16T06:42:46.2308426Z Shadow bytes around the buggy address:
2024-06-16T06:42:46.2309148Z   0x515000002600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
2024-06-16T06:42:46.2309995Z   0x515000002680: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
2024-06-16T06:42:46.2310817Z   0x515000002700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
2024-06-16T06:42:46.2311788Z   0x515000002780: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 fa
2024-06-16T06:42:46.2312637Z   0x515000002800: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
2024-06-16T06:42:46.2313535Z =>0x515000002880: fd fd[fd]fd fd fd fd fd fd fd fd fd fd fd fd fd
2024-06-16T06:42:46.2314546Z   0x515000002900: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
2024-06-16T06:42:46.2315439Z   0x515000002980: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
2024-06-16T06:42:46.2316330Z   0x515000002a00: fd fd fd fd fd fd fd fd fd fd fa fa fa fa fa fa
2024-06-16T06:42:46.2317195Z   0x515000002a80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
2024-06-16T06:42:46.2318076Z   0x515000002b00: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
2024-06-16T06:42:46.2319015Z Shadow byte legend (one shadow byte represents 8 application bytes):
2024-06-16T06:42:46.2319793Z   Addressable:           00
2024-06-16T06:42:46.2320334Z   Partially addressable: 01 02 03 04 05 06 07 
2024-06-16T06:42:46.2320924Z   Heap left redzone:       fa
2024-06-16T06:42:46.2321581Z   Freed heap region:       fd
2024-06-16T06:42:46.2322084Z   Stack left redzone:      f1
2024-06-16T06:42:46.2322588Z   Stack mid redzone:       f2
2024-06-16T06:42:46.2323098Z   Stack right redzone:     f3
2024-06-16T06:42:46.2323616Z   Stack after return:      f5
2024-06-16T06:42:46.2324157Z   Stack use after scope:   f8
2024-06-16T06:42:46.2324684Z   Global redzone:          f9
2024-06-16T06:42:46.2325214Z   Global init order:       f6
2024-06-16T06:42:46.2325669Z   Poisoned by user:        f7
2024-06-16T06:42:46.2326131Z   Container overflow:      fc
2024-06-16T06:42:46.2326586Z   Array cookie:            ac
2024-06-16T06:42:46.2327072Z   Intra object redzone:    bb
2024-06-16T06:42:46.2327550Z   ASan internal:           fe
2024-06-16T06:42:46.2328016Z   Left alloca redzone:     ca
2024-06-16T06:42:46.2328502Z   Right alloca redzone:    cb
2024-06-16T06:42:46.2328959Z ==16939==ABORTING

@reneme
Copy link
Collaborator

reneme commented Jun 24, 2024

My guess: The onConnect lambda in tls_session_activated() captures this as a bare pointer and is then used as a callback in the asio function async_connect. If, for some reason, this is freed via the shared pointer (that came out of the create factory and that is consistently used via enable_shared_from_this in the other callbacks) and then async_connect comes back without an error code, we would run into the situation suggested by asan.

botan/src/cli/tls_proxy.cpp

Lines 284 to 298 in d24c2c3

void tls_session_activated() override {
auto onConnect = [this](boost::system::error_code ec, const tcp::resolver::iterator& /*endpoint*/) {
if(ec) {
log_error("Server connection", ec);
return;
}
server_read(boost::system::error_code(), 0); // start read loop
proxy_write_to_server({});
};
async_connect(m_server_socket, m_server_endpoints, onConnect);
}
void tls_session_established(const Botan::TLS::Session_Summary& session) override {
m_hostname = session.server_info().hostname();
}

... now: when we do get into this situation? :)

@reneme
Copy link
Collaborator

reneme commented Jun 24, 2024

Another interesting point: Apparently, we had issues with this test on windows about half a year ago. There, we simply disabled it. Perhaps that's the same root-cause.

botan/src/scripts/test_cli.py

Lines 1302 to 1305 in d24c2c3

# In the Windows GitHub CI this sometimes fails, for reasons unknown.
# The connectionn to the TLS proxy is forcibly closed by the remote host.
if not run_socket_tests() or platform.system() == 'Windows' or not check_for_command("tls_proxy"):
return

reneme added a commit to Rohde-Schwarz/botan that referenced this issue Jun 24, 2024
If the TLS session to the client was established (when tls_session_activated()
was called), and the connection to the server was also established successfully
(ec in onConnect() callback was not set); but -- in the mean time -- the this-
pointer was deallocated via std::enable_shared_from_this, we end up in a use-
after free situation.

This sporadically apeeared in CI but wasn't reproducible locally, see randombit#4112.
@reneme
Copy link
Collaborator

reneme commented Jun 24, 2024

So the patch referenced above seems to "work"; on the GCC sanitizer job this seems to no-longer produce an ASan error, but a failed test. I don't think that's coincidental, because it didn't fail on any other CI target.

     INFO: Traceback (most recent call last):
    File "/home/runner/work/botan/botan/source/src/scripts/test_cli.py", line 1666, in run_test
      fn(tmp_dir)
    File "/home/runner/work/botan/botan/source/src/scripts/test_cli.py", line 1366, in cli_tls_proxy_tests
      resp = conn.getresponse()
             ^^^^^^^^^^^^^^^^^^
    File "/usr/lib/python3.12/http/client.py", line 1428, in getresponse
      response.begin()
    File "/usr/lib/python3.12/http/client.py", line 331, in begin
      version, status, reason = self._read_status()
                                ^^^^^^^^^^^^^^^^^^^
    File "/usr/lib/python3.12/http/client.py", line 300, in _read_status
      raise RemoteDisconnected("Remote end closed connection without"
  http.client.RemoteDisconnected: Remote end closed connection without response
  
    ERROR: Test cli_tls_proxy_tests threw exception: Remote end closed connection without response

@reneme
Copy link
Collaborator

reneme commented Jun 24, 2024

Drilling deeper: The ASan stack traces claim that the pointer was freed in server_read() and then the use-after-free occured in the onConnect() lambda. Given that server_read() is called nowhere else then in the lambda itself, this doesn't seem like a race condition where "somewhere else" the shared pointer was freed. Instead, server_read() apparently destroys this and then onConnect tries to call proxy_write_to_server() on it.

botan/src/cli/tls_proxy.cpp

Lines 284 to 298 in d24c2c3

void tls_session_activated() override {
auto onConnect = [this](boost::system::error_code ec, const tcp::resolver::iterator& /*endpoint*/) {
if(ec) {
log_error("Server connection", ec);
return;
}
server_read(boost::system::error_code(), 0); // start read loop
proxy_write_to_server({});
};
async_connect(m_server_socket, m_server_endpoints, onConnect);
}
void tls_session_established(const Botan::TLS::Session_Summary& session) override {
m_hostname = session.server_info().hostname();
}

Interestingly, when called from the onConnect lambda, server_read() will always schedule an async operation (both ec and bytes_transferred are NULL). This async operation will hold a shared-from-this.

botan/src/cli/tls_proxy.cpp

Lines 256 to 282 in d24c2c3

void server_read(const boost::system::error_code& error, size_t bytes_transferred) {
if(error) {
log_error("Server read failed", error);
stop();
return;
}
try {
if(bytes_transferred) {
log_text_message("Server to client", &m_s2p[0], m_s2p.size());
log_binary_message("Server to client", &m_s2p[0], m_s2p.size());
m_tls->send(&m_s2p[0], bytes_transferred);
}
} catch(Botan::Exception& e) {
log_exception("TLS connection failed", e);
stop();
return;
}
m_s2p.resize(readbuf_size);
m_server_socket.async_read_some(boost::asio::buffer(&m_s2p[0], m_s2p.size()),
m_strand.wrap(boost::bind(&tls_proxy_session::server_read,
shared_from_this(),
boost::asio::placeholders::error,
boost::asio::placeholders::bytes_transferred)));
}

To the best of my knowledge, asio's async operations won't ever call the given callback immediately in the same call stack. Though, the ASan call stack doesn't really support that, given that the de-allocations happens with the call stack coming from boost::...::iterator_connect_op.

@randombit Did this start to appear after we switched to Ubuntu 24.04, perhaps a new boost version? The switch to Ubuntu was merged on June 5th, this ticket was created on June 11th. But did we see that before?

@reneme
Copy link
Collaborator

reneme commented Jun 24, 2024

In any case: The ./botan tls_proxy error messages above the ASan stack trace in the build job seem to support this:

  Read failed Operation canceled
  Server read failed End of file
  Read failed Operation canceled
  Server read failed End of file
  Read failed Operation canceled
  Server read failed End of file
  Read failed Operation canceled

"Server read failed End of file" probably means that the server socket was found closed when a read was attempted. "Read failed Operation canceled" probably points to a timeout of the client. Both error states don't spawn another async operation that could hang on to the shared pointer.

Likely, some unlucky constellation of CI build machine stalling, client timeout and queuing-order of async handlers in "Thread T2" can cause a situation to occur where onConnect is called assuming that m_server_socket was successfully connected, but the handler of client_read ran before, detected a timeout ("Read failed Operation canceled"), and closed m_server_socket.

Its not unthinkable, that async_read_some would detect a socket in "closed" state and invoke its handler immediately ("Server read failed End of file"), I guess. Eventually, onConnect() would call proxy_write_to_server on a stale this pointer (as this was destroyed in either of the handlers).

@reneme
Copy link
Collaborator

reneme commented Jun 24, 2024

Similarly the windows build reports:

     INFO: Stdout: Server read failed End of file
  Read failed The network connection was aborted by the local system
  Server read failed End of file
  Read failed The network connection was aborted by the local system
  Server read failed End of file
  Read failed The network connection was aborted by the local system
  Server read failed End of file
  Read failed End of file

... these errors don't seem to produce a non-zero error code from the CLI invocation. Hence, I had to make the print out more aggressive.

Edit: Turns out: These error messages are actually somewhat expected, because ./botan tls_proxxy doesn't have a graceful shutdown routine. Instead it just closes the sockets and lets the handlers come back with errors.

reneme added a commit to Rohde-Schwarz/botan that referenced this issue Jul 2, 2024
If the TLS session to the client was established (when tls_session_activated()
was called), and the connection to the server was also established successfully
(ec in onConnect() callback was not set); but -- in the mean time -- the this-
pointer was deallocated via std::enable_shared_from_this, we end up in a use-
after free situation.

This sporadically apeeared in CI but wasn't reproducible locally, see randombit#4112.
@reneme
Copy link
Collaborator

reneme commented Jul 2, 2024

It seems that -- after fixing the actual use-after-free -- we're simply running into a timeout. The test case seems to be running for a ridiculously long 10 seconds on the CI.

@reneme
Copy link
Collaborator

reneme commented Jul 2, 2024

At least on the gcc sanitizer CI job there seems to be an interaction between cli_tls_socket_tests and cli_tls_proxy_tests. If I disable the former, the latter runs without any problems. Even with an extended connection loop and running the entire test_cli.py multiple times. Before, it failed reliably in this configuration.

That said: on windows it fails regardless with " Read failed The network connection was aborted by the local system", which still sounds like a timeout to me.

@reneme
Copy link
Collaborator

reneme commented Jul 3, 2024

I've been playing Python's asyncio library to replace the racey and low-level subprocess.Popen() and time.sleep() approach. This yielded a much better wrapper for interacting with long-lived processes. Unfortunately, the tests still don't run reliably on CI. 😠

reneme added a commit to Rohde-Schwarz/botan that referenced this issue Jul 4, 2024
If the TLS session to the client was established (when tls_session_activated()
was called), and the connection to the server was also established successfully
(ec in onConnect() callback was not set); but -- in the mean time -- the this-
pointer was deallocated via std::enable_shared_from_this, we end up in a use-
after free situation.

This sporadically appeared in CI but wasn't reproducible locally, see randombit#4112.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants