-
Notifications
You must be signed in to change notification settings - Fork 170
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix SessionWrapper use-after-free crash when tearing down sessions #6676
Conversation
|
Should be fixed now: a test was not working as originally intended which resulted in tearing down the client before the session, which resulted in both |
…tiplex-sessions-crash
To be clear, the Session has always outlived the SessionWrapper by design.
When the SessionWrapper is being finalized it makes the Session inactive
but still owned by the Connection so that the UNBIND/UNBOUND exchange can
happen. In Session::receive_download_message() we check if we’re active
before doing anything else to avoid calling into the potentially destroyed
SessionWrapper. We likely just don’t have a test for that in
Session::receive_error_message(). This isn’t really a race since it’s all
happening on the same thread.
…On Fri, May 26, 2023 at 2:26 AM Michael Wilkerson-Barker < ***@***.***> wrote:
What, How & Why?
There was a race condition between the SessionWrapper and the
ClientImpl::Session being torn down, where the Session could outlive the
SessionWrapper, leading to a use after free error and causing a seg fault.
The fixes to address this issue include:
- Updated the SessionWrapper pointer in Session to be a bind_ptr that
will extend the lifetime of the SessionWrapper until the Session has been
deleted and the SessionWrapper has been finalized.
- Added finalized flag in SessionWrapper and added extra checks to
make sure functions that rely on the SessionWrapper state aren't called
after it has been finalized.
- Added session ident history to ClientImpl::Connection to track the
history of previous sessions that have been opened so a protocol error is
not thrown if a stale binary message is received after the Session has been
deactivated.
- Added checks to the session binary message processing to ignore all
incoming messages, other than UNBOUND, if the Session is not in the
'Active' state.
Unfortunately, this issue was hard to test in core, although the SDKs have
been able to reproduce the issue more easily. I will continue to
investigate adding a test after this is integrated.
Fixes #6656 <#6656>
☑️ ToDos
- 📝 Changelog update
- [ ] 🚦 Tests (or not relevant)
- [ ] C-API, if public C++ API changed.
------------------------------
You can view, comment on, or merge this pull request online at:
#6676
Commit Summary
- c7f7899
<c7f7899>
Changed SessionWrapper pointer to bind_ptr; added session ident history
- 07cf26c
<07cf26c>
Updated changelog
File Changes
(4 files <https://github.com/realm/realm-core/pull/6676/files>)
- *M* CHANGELOG.md
<https://github.com/realm/realm-core/pull/6676/files#diff-06572a96a58dc510037d5efa622f9bec8519bc1beab13c9f251e97e657a9d4ed>
(1)
- *M* src/realm/sync/client.cpp
<https://github.com/realm/realm-core/pull/6676/files#diff-e712d62c0b4f91a4eb4517225f4a5a9c6bbfb7a7d7bf6320565b9b61c62fadce>
(166)
- *M* src/realm/sync/noinst/client_impl_base.cpp
<https://github.com/realm/realm-core/pull/6676/files#diff-706df6d59244f8a0cd22ad8c495b1b60083d88c1061d0b545f8e042c495bdd08>
(100)
- *M* src/realm/sync/noinst/client_impl_base.hpp
<https://github.com/realm/realm-core/pull/6676/files#diff-92701c7e3a08f0515894064c5a84cc17abc9ae1c972c2ada1cc081abf420b858>
(30)
Patch Links:
- https://github.com/realm/realm-core/pull/6676.patch
- https://github.com/realm/realm-core/pull/6676.diff
—
Reply to this email directly, view it on GitHub
<#6676>, or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AABOVUPHMBRQTW6YATLPNBDXIBEJVANCNFSM6AAAAAAYPZEP4E>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
|
Thanks for the feedback @jbreams - with session multiplexing, we were getting download and error messages from the server that were still pending in the network socket after the SessionWrapper had been destroyed, which led to crashes during the SDK testing of the Core build with multiplexing always enabled. |
@@ -1448,6 +1475,10 @@ void Connection::receive_ident_message(session_ident_type session_ident, SaltedF | |||
{ | |||
Session* sess = get_session(session_ident); | |||
if (REALM_UNLIKELY(!sess)) { | |||
if (check_session_history(session_ident)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need this? can't we check the state of the session?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and if the session is not active anymore it's just a no-op
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nvm, I see we don't have a session in this case.. Out of curiosity, is the original error (and closing the connection) an issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It really isn't a client error if there is a pending network message that is received after the session has been deactivated, so there is no need to close the connection and restart any other sessions that are currently active. In this case we just print message (maybe it should be an info and not an error?) and just ignore it.
This is really only needed if session multiplexing is enabled, and doesn't matter if it is disabled, since there will only ever be one session per connection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. I have a suggestion (feel free to ignore). We could change close_due_to_*
(actually disconnect()) to use m_sessions
(or similar) to decide if it's the last session wanting to disconnect, and only then actually close the connection (a bit similar to the logic in one_more_active_unsuspended_session
). We then don't need the new check (and error message), although I agree the new error message makes more sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree - but I think it would be best in a separate PR.
…tiplex-sessions-crash
…tiplex-sessions-crash
@@ -1375,13 +1382,28 @@ void Connection::receive_pong(milliseconds_type timestamp) | |||
m_client.m_roundtrip_time_handler(m_previous_ping_rtt); // Throws | |||
} | |||
|
|||
bool Connection::check_session_history(session_ident_type session_ident) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the actual bug fix, and the m_finalized changes are just to catch invalid states more reliably? I definitely don't understand this code well enough to meaningfully review it, but it superficially looks like it might be replacing a UAF on other data members with a UAF on m_finalized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to prevent any pending messages in the websocket from closing the connection if they come in after the session has been closed. The other part of the fix is to not allow Session to call functions on the SessionWrapper reference if the session is currently deactivated or being deactivated
I am looking at the failing "sync: non-synced metadata table doesn't result in non-additive schema changes" test, but I don't believe these failures are related. |
…tiplex-sessions-crash
src/realm/sync/client.cpp
Outdated
return; | ||
else if (!status.is_ok()) | ||
throw Exception(status); | ||
if (!m_finalized) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this check be done inside the callback below instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We technically don't need it, since we're checking if m_sess == nullptr
before calling on the session pointer
@@ -1375,16 +1382,33 @@ void Connection::receive_pong(milliseconds_type timestamp) | |||
m_client.m_roundtrip_time_handler(m_previous_ping_rtt); // Throws | |||
} | |||
|
|||
Session* Connection::find_and_validate_session(session_ident_type session_ident, std::string_view message) noexcept | |||
{ | |||
if (session_ident != 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could use an early return here to avoid extra indentation
if (!m_force_closed) { | ||
REALM_ASSERT(m_sess); | ||
ClientImpl::Connection& conn = m_sess->get_connection(); | ||
conn.initiate_session_deactivation(m_sess); // Throws | ||
|
||
// Delete the pending bootstrap store since it uses a reference to the logger in m_sess | ||
m_flx_pending_bootstrap_store.reset(); | ||
// Clear the subscription and migration store refs since they are owned by SyncSession | ||
m_flx_subscription_store.reset(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
src/realm/sync/client.cpp
Outdated
REALM_ASSERT(m_actualized); | ||
REALM_ASSERT(m_sess); | ||
m_force_closed = true; | ||
if (!m_force_closed && !m_finalized) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: use an early return
src/realm/sync/client.cpp
Outdated
return; | ||
else if (!status.is_ok()) | ||
throw Exception(status); | ||
if (REALM_LIKELY(!m_finalized && !m_force_closed)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
src/realm/sync/client.cpp
Outdated
return; | ||
else if (!status.is_ok()) | ||
throw Exception(status); | ||
if (REALM_LIKELY(!m_finalized && !m_force_closed)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and here
else if (!status.is_ok()) | ||
throw Exception(status); | ||
// Includes operation_aborted | ||
if (!status.is_ok()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice!
@@ -628,6 +629,9 @@ class ClientImpl::Connection { | |||
// The set of sessions associated with this connection. A session becomes | |||
// associated with a connection when it is activated. | |||
std::map<session_ident_type, std::unique_ptr<Session>> m_sessions; | |||
// Keep track of previously used sessions idents to see if a stale message was | |||
// received for a closed session | |||
std::vector<session_ident_type> m_session_history; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is an error to receive any message after an ERROR
or UNBOUND
. Hence, you could use an std::unordered_set
here and remove the session ident when one of the messages above is received. I think this also helps if the session is revived (the ident would be added again..)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's a good idea - if the client receives either of those from the server, it should never see another message for that session ident. The issue being addressed is when the client closes the session, either through the normal process of sending UNBIND or the Session is force closed. There could still be a complete message waiting to be processed in the event loop that could be passed back to the Connection after the Session has been deactivated and we don't want to cause the Connection to be torn down if session multiplexing is being used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the only "special" case is when the session is force closed (i.e, there is no UNBIND -> UNBOUND exchange), so in that case the session ident would not be removed. But I think it's worth to do the clean-up in all the other cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The session history entry will be removed when the session is closed via the normal path - abrupt closures (e.g. force_close or terminate) will leave the session history in place after the session is deactivated, and this is likely where the stale messages were being received.
test/test_sync.cpp
Outdated
server.stop(); | ||
server_thread.join(); | ||
|
||
// The check is that we reach this point without deadlocking. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would there be a deadlock here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a copy/paste from the test above that was duplicated. Really it should be "deadlocking or crash" - the description of the test describes what this test is intending to verify: a SyncSession in the process of starting up when the SyncClient is torn down. I will update this comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but this is not my area of expertise.
src/realm/sync/client.cpp
Outdated
return m_wrapper.m_virt_path; | ||
} | ||
|
||
const std::string& SessionImpl::get_realm_path() const noexcept | ||
{ | ||
// Can only be called if the session is active or being activated | ||
REALM_ASSERT(m_state == State::Active || m_state == State::Unactivated); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really great to see these extra assertions! Not blocking, but where you can, I prefer to add a bit of info to help debugging in case they are violated.
REALM_ASSERT(m_state == State::Active || m_state == State::Unactivated); | |
REALM_ASSERT_EX(m_state == State::Active || m_state == State::Unactivated, m_state); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! I agree it will be nice to have better debugging. I will update these.
@@ -1775,15 +1901,16 @@ void SessionWrapper::report_progress() | |||
util::Future<std::string> SessionWrapper::send_test_command(std::string body) | |||
{ | |||
if (!m_sess) { | |||
return util::Future<std::string>::make_ready( | |||
Status{ErrorCodes::RuntimeError, "session must be activated to send a test command"}); | |||
return Status{ErrorCodes::RuntimeError, "session must be activated to send a test command"}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we need a Future here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A future will be "copy constructed" in the return using the Status value. The Status
copy constructor in the Future
class calls make_ready
using the status value provided.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it. just noticed the implicit constructor
* Updated release notes * Update catch2 v3.3.2 (#6605) * Make core infer platform and cpu_arch, while bundle_id must be provided by SDK's (#6612) * platform and cpu_arch are inferred by core, bundle_id must be provided by SDK's * update changelog * Return proper value for X86_64 arch Co-authored-by: Christian Melchior <[email protected]> * Get fine-grained platform for Apple devices * Fix tests * small fixes * fix more tests * Fix mistake in changelog --------- Co-authored-by: Christian Melchior <[email protected]> * use consistent rounding, following SERVER-75392 (#6477) * fix entries that went to the wrong change version (#6632) * Special-case main thread runloop scheduler * Improve SectionedResults performance SectionedResults used a std::map in a few places where the keys are a dense range (i.e. always [0..map.size())) and so they can be std::vector instead. The maps keyed on Mixed are now std::unordered_map. Change notifications now report changes as a `std::vector<IndexSet>` rather than `std::map<size_t, IndexSet>`. This is slower and uses more memory when the only sections that changed are near the end of a SectionedResults with a large number of sections, but is much faster if all sections changed or if the sections which changed are early in the SectionedResults. Change notifications now reuse buffers, which increases persistent memory usage slightly but significant reduces allocations. Change notifications for a single section now only compute the changes for that section rather than computing the full changes and then filtering out the changes for other sections. * use static_assert rather than a old home rolled one * fix warning of redefine of CHECK macro * fix unused function warning * silence warnings in bid128_to_string * Introduce BPlusTree::for_all * Prevent program from crashing when removing backlinks * Fix broken snapshot of collection of objects * Fix importing Results with deleted collection The result should be an empty result, not the whole table. * geospatial validation of polygons (#6607) * geospatial validation of polygons * Loop->Ring, added tests * use std::unique * changelog * Benchmark for full-text search * Allow to filter benchmark and run only list of specified names * Add simple benchmark for fulltext search with index * Filter out unresolved links in Dictionary::get_any() * Add support for early exit in BPlusTree::for_all() * Geospatial feedback (#6645) * verify local results match a server query * disallow geowithin on top level tables * fix geo queries with ANY/ALL/NONE * geospatial validation of points * rename GeoCenterSphere -> GeoCircle * review feedback * better testing and fix any/all/none geospatial * format * Geospatial basic queries benchmarks (#6621) * Add basic benchmarks for Geospatial type and queries * Less copying in GeoWithinCompare * Bring back caching of s2 region into Geospatial * remove transaction overhead from measurements * a couple small optimizations * formatting * simplify geospatial query evaluations * changelog --------- Co-authored-by: James Stone <[email protected]> * Updated baas server tag for CI (#6650) * Prepare release * Updated release notes * Access token refresh for websockets was not updating the location metadata (#6631) * Always refresh metadata on app login * Updated changelog * Always update location when requested; fix c_api test * Update test to properly evaluate websocket redirections; added one more test * Updated changelog and fixed compile warning * Added location checks back to test * added mutex locking around location updated state and reworked requesting location update to use flag * clang format and fix incorrect timeout value * Reworked update location logic a bit and removed unused function * Free mutex before calling completion on early exit in init_app_metadata * maybe fix a race in a test (#6651) * Use std::optional to store cached leaves in query nodes (#6653) Our use of aligned_storage was basically a complicated manual version of this. I was hoping this'd have binary size benefits, but it ended up making the library 100 bytes larger instead. Nonetheless, it greatly simplifies things. * Fix a few UBSan failures hit by tests * Avoid performing unaligned reads in Array::get_chunk() * Fix a lock order inversion in tests (#6666) The cycle was DaemonThread::m_running_on_change_mutex => RealmCoordinator::m_realm_mutex => SyncManager::m_mutex => RealmCoordinator::s_coordinator_mutex => DaemonThread::m_running_on_change_mutex, and it happened due to DaemonThread::remove() being called inside RealmCoordinator::clear_cache() while holding s_coordinator_mutex. Fortunately we don't actually need to be doing that. As the cycle required RealmCoordinator::clear_all_caches(), this was only applicable to tests. * Allow geo coordinate numeric argument substitutions (#6663) * allow geo coordinate numeric argument substitutions * review feedback * explicit cast to address warning * Remove catch() clause to prevent truncating stack trace in AsyncOper::do_recycle_and_execute() (#6667) * Fix an assertion failure if an async write callback ran during a write transaction (#6661) Between when the callback after acquiring the write lock is scheduled and when it's invoked a synchronous write transaction can be begun, and if it's not ended before the next time the scheduler gets to run, the scheduled callback will be invoked inside the write. When this happens we want to just do nothing. Ending the synchronous write transaction will take care of rescheduling the async write it preempted. * core release 13.13.0 * Updated release notes * Allocate arguments for lists (#6674) * Small documentation and code fixes (#6672) * Fix crash when opening FLX realm after client reset failure (#6671) * Fix crash when opening FLX realm after client reset failure * Update changelog * Don't superceed pending subscriptions in case of a client reset failure * Add test * Changes after code review * Support sorting based on values from a dictionary (#5311) Co-authored-by: Sebastian Valle <[email protected]> Co-authored-by: James Stone <[email protected]> * Filter out external sources from Eclipse (#6682) Indexer has a hard time dealing with Catch2 * Use cross-compilers instead of CentOS image (#6559) * Use cross-compilers instead of CentOS image * changelog * fix bad merge * refactor toolchain files * clarify useToolchain exception circumstances * Remap github URL to ssh to fix BAAS dependency using https:// (#6685) * core v13.14.0 * Updated release notes * Switch to building with Xcode 14 (#6647) * better fix explanation in the changelog for list of args in the query parser (#6692) * Remove constructor for GeoPoint and GeoPolygon (#6679) Co-authored-by: Mathias Stearn <[email protected]> * Fix failing "sync: non-synced metadata table doesn't result in non-additive schema change" tests (#6697) * Reporting correct error message on HTTP errors for Browser target * User/Server API key provider becomes a single 'API key' provider (#6696) * Allow frozen Realms to be opened with additive schema changes (#6693) * allow frozen Realms to be opened with additive schema changes * lint * strengthen tests and comments * Update src/realm/object-store/shared_realm.cpp Co-authored-by: Thomas Goyne <[email protected]> --------- Co-authored-by: Thomas Goyne <[email protected]> * Reverted minimum swift version to fix failing CI tests (#6706) * core release v13.15.0 * Updated release notes * Fix client reset test with invalid query (#6711) * Fix SessionWrapper use-after-free crash when tearing down sessions (#6676) * Changed SessionWrapper pointer to bind_ptr; added session ident history * Fix teardown if client is destroyed before session * Session no longer holds bind_ptr to SessionWrapper; reverted some changes * Fixed return and updated some comments * Don't process errors if session is shutting down * Added extra checks for session state * Updates from review * Updated some finalized checks * Rolled back some changes * Added output to ASSERTS and moved session history to unordered_set * Remove session history entry on normal close * Updated comment in sync tests * Add [baas] and [local] tags to object store sync tests to identify the tests that rely on BAAS or not (#6710) * Use Columns<Link> when property is Dictionary of links (#6705) If a Dictionary property has links as value type, we can use Columns<Link> to handle the links instead of the basic Columns<Dictionary>. This has the effect that when we compare with a single value, we will optimize to use LinksToNode. So we need to make LinksToNode handle the Dictionary case. When we compare with a list of links, we must ensure that the list is converted to a list obj ObjKeys - which is the type that Column<Link> evaluates to. Use LinksToNode for lists in QueryParser * better changelog message for the fix related to queries with list of arguments (#6717) * Fixes for Emscripten target (Passing header from fetch response. Using Config.path for inMemory Realm) (#6716) * Fixes for Emscripten target: Passing header for fetch response. Passing the RealmConfig.path to be used for inMemory Realm, this is needed for registering SyncSession Co-authored-by: Jørgen Edelbo <[email protected]> * release 13.15.1 * Updated spec.yml to remove User & Server prefix from ApiKey credentials --------- Co-authored-by: James Stone <[email protected]> Co-authored-by: realm-ci <[email protected]> Co-authored-by: Kirill Burtsev <[email protected]> Co-authored-by: Daniel Tabacaru <[email protected]> Co-authored-by: Christian Melchior <[email protected]> Co-authored-by: Thomas Goyne <[email protected]> Co-authored-by: Thomas Goyne <[email protected]> Co-authored-by: Jørgen Edelbo <[email protected]> Co-authored-by: Michael Wilkerson-Barker <[email protected]> Co-authored-by: Nicola Cabiddu <[email protected]> Co-authored-by: Sebastian Valle <[email protected]> Co-authored-by: Yavor Georgiev <[email protected]> Co-authored-by: Ferdinando Papale <[email protected]> Co-authored-by: Mathias Stearn <[email protected]> Co-authored-by: Nabil Hachicha <[email protected]> Co-authored-by: Finn Schiermer Andersen <[email protected]>
What, How & Why?
There is an order of operations condition around the
SessionWrapper
and theClientImpl::Session
being torn down: since the Session is designed to outlive the SessionWrapper, there is a chance of use after free of the SessionWrapper which would lead to a seg fault. When multiplex sessions is enabled, where the connection outlives the session, there is also the possibility of a pending incoming network message being provided to a session that has already been deactivated.The fixes to address this issue include:
ClientImpl::Connection
to track the history of previous sessions that have been opened so a protocol error is not thrown if a stale binary message is received after the Session has been deactivated.Unfortunately, this issue was hard to test in core, although the SDKs have been able to reproduce the issue more easily. I will continue to investigate adding a test after this is integrated.
Fixes #6656
☑️ ToDos
[ ] 🚦 Tests (or not relevant)[ ] C-API, if public C++ API changed.