Fix SessionWrapper use-after-free crash when tearing down sessions #6676

Merged
michael-wb merged 23 commits into master on Jun 13, 2023

Conversation

michael-wb
Contributor

@michael-wb michael-wb commented May 26, 2023

What, How & Why?

There is an order-of-operations issue around tearing down the SessionWrapper and the ClientImpl::Session: since the Session is designed to outlive the SessionWrapper, there is a chance of a use-after-free of the SessionWrapper, which can lead to a segfault. When session multiplexing is enabled, where the connection outlives the session, there is also the possibility of a pending incoming network message being delivered to a session that has already been deactivated.
The fixes to address this issue include:

  • Added a finalized flag to SessionWrapper, along with extra checks to make sure functions that rely on the SessionWrapper state aren't called after it has been finalized (see the sketch after this list).
  • Added session ident history to ClientImpl::Connection to track the history of previous sessions that have been opened so a protocol error is not thrown if a stale binary message is received after the Session has been deactivated.
  • Added checks to the session binary message processing to ignore all incoming messages, other than UNBOUND, if the Session is not in the 'Active' state.
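To illustrate the first item, here is a minimal, self-contained sketch of the finalized-flag guard, assuming hypothetical names (this SessionWrapper and its on_download_message()/finalize() are stand-ins, not the real realm-core types):

#include <cassert>
#include <iostream>
#include <string>

// Stripped-down stand-in for the wrapper: once finalize() has run, callbacks
// that arrive late become no-ops instead of touching state that is going away.
class SessionWrapper {
public:
    void on_download_message(const std::string& payload)
    {
        if (m_finalized) {
            std::cout << "ignoring message for finalized session\n";
            return;
        }
        std::cout << "processing: " << payload << '\n';
    }

    void finalize()
    {
        assert(!m_finalized);
        m_finalized = true;
        // ... release resources owned by the wrapper ...
    }

private:
    bool m_finalized = false;
};

int main()
{
    SessionWrapper wrapper;
    wrapper.on_download_message("changeset 1"); // processed
    wrapper.finalize();
    wrapper.on_download_message("changeset 2"); // ignored instead of crashing
}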

Unfortunately, this issue was hard to test in core, although the SDKs have been able to reproduce the issue more easily. I will continue to investigate adding a test after this is integrated.

Fixes #6656

☑️ ToDos

  • 📝 Changelog update
  • [ ] 🚦 Tests (or not relevant)
  • [ ] C-API, if public C++ API changed.

@cla-bot cla-bot bot added the cla: yes label May 26, 2023
@michael-wb michael-wb self-assigned this May 26, 2023
@danieltabacaru
Collaborator

SyncTests are failing.

@michael-wb
Contributor Author

> SyncTests are failing.

Should be fixed now: a test was not working as originally intended, tearing down the client before the session so that both finalize_before_actualization() and finalize() were called. That test was fixed, and another test was added to cover the original failures where the session is torn down after the client has been stopped.

@jbreams
Contributor

jbreams commented May 26, 2023 via email

@michael-wb
Contributor Author

michael-wb commented May 26, 2023

Thanks for the feedback @jbreams - with session multiplexing, we were getting download and error messages from the server that were still pending in the network socket after the SessionWrapper had been destroyed, which led to crashes during SDK testing of the Core build with multiplexing always enabled.
I've removed the bind_ptr reference from Session, so we don't extend the lifetime of the object, and added more checks to ensure the SessionWrapper isn't used while the Session is being deactivated.

@@ -1448,6 +1475,10 @@ void Connection::receive_ident_message(session_ident_type session_ident, SaltedF
{
Session* sess = get_session(session_ident);
if (REALM_UNLIKELY(!sess)) {
if (check_session_history(session_ident)) {
Collaborator

do we need this? can't we check the state of the session?

Collaborator

and if the session is not active anymore it's just a no-op

Collaborator
@danieltabacaru May 26, 2023

Never mind - I see we don't have a session in this case. Out of curiosity, is the original error (and closing the connection) an issue?

Contributor Author

It really isn't a client error if a pending network message is received after the session has been deactivated, so there is no need to close the connection and restart any other sessions that are currently active. In this case we just log a message (maybe it should be an info and not an error?) and ignore it.
This is really only needed when session multiplexing is enabled; it doesn't matter when it is disabled, since there will only ever be one session per connection.

Collaborator


Got it. I have a suggestion (feel free to ignore). We could change close_due_to_* (actually disconnect()) to use m_sessions (or similar) to decide if it's the last session wanting to disconnect, and only then actually close the connection (a bit similar to the logic in one_more_active_unsuspended_session). We then don't need the new check (and error message), although I agree the new error message makes more sense.

Contributor Author


I agree - but I think it would be best in a separate PR.
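For context, a rough, self-contained sketch of that deferred idea, assuming the connection tracks its sessions in a map keyed by session ident (Connection, add_session and request_disconnect are hypothetical names, not the real realm-core API):

#include <cstdint>
#include <iostream>
#include <map>
#include <memory>

using session_ident_type = std::uint64_t;

struct Session {};

class Connection {
public:
    void add_session(session_ident_type ident)
    {
        m_sessions[ident] = std::make_unique<Session>();
    }

    // A session asks to disconnect. Only close the connection when the request
    // comes from the last remaining session, in the spirit of
    // one_more_active_unsuspended_session.
    void request_disconnect(session_ident_type ident)
    {
        m_sessions.erase(ident);
        if (m_sessions.empty()) {
            std::cout << "last session gone - closing the connection\n";
            // ... actually tear down the socket here ...
        }
        else {
            std::cout << m_sessions.size() << " session(s) still active - keeping the connection\n";
        }
    }

private:
    std::map<session_ident_type, std::unique_ptr<Session>> m_sessions;
};

int main()
{
    Connection conn;
    conn.add_session(1);
    conn.add_session(2);
    conn.request_disconnect(1); // connection stays up
    conn.request_disconnect(2); // now it actually closes
}

As suggested above, the socket would then only be closed once the last interested session has gone away.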

@michael-wb michael-wb marked this pull request as draft May 30, 2023 03:38
src/realm/sync/client.cpp (outdated review thread, resolved)
@@ -1375,13 +1382,28 @@ void Connection::receive_pong(milliseconds_type timestamp)
m_client.m_roundtrip_time_handler(m_previous_ping_rtt); // Throws
}

bool Connection::check_session_history(session_ident_type session_ident)
Member


Is this the actual bug fix, and the m_finalized changes are just to catch invalid states more reliably? I definitely don't understand this code well enough to meaningfully review it, but it superficially looks like it might be replacing a UAF on other data members with a UAF on m_finalized.

Contributor Author


This is to prevent any pending messages in the websocket from closing the connection if they come in after the session has been closed. The other part of the fix is to not allow Session to call functions on the SessionWrapper reference while the session is deactivated or being deactivated.

src/realm/sync/noinst/client_impl_base.cpp (outdated review thread, resolved)
@michael-wb
Contributor Author

I am looking at the failing "sync: non-synced metadata table doesn't result in non-additive schema changes" test, but I don't believe these failures are related.

@michael-wb michael-wb marked this pull request as ready for review June 5, 2023 20:13
return;
else if (!status.is_ok())
throw Exception(status);
if (!m_finalized) {
Collaborator


should this check be done inside the callback below instead?

Contributor Author


We technically don't need it, since we're checking whether m_sess == nullptr before calling anything on the session pointer.

@michael-wb michael-wb requested a review from ironage June 8, 2023 05:47
@@ -1375,16 +1382,33 @@ void Connection::receive_pong(milliseconds_type timestamp)
m_client.m_roundtrip_time_handler(m_previous_ping_rtt); // Throws
}

Session* Connection::find_and_validate_session(session_ident_type session_ident, std::string_view message) noexcept
{
if (session_ident != 0) {
Collaborator


nit: could use an early return here to avoid extra indentation

if (!m_force_closed) {
REALM_ASSERT(m_sess);
ClientImpl::Connection& conn = m_sess->get_connection();
conn.initiate_session_deactivation(m_sess); // Throws

// Delete the pending bootstrap store since it uses a reference to the logger in m_sess
m_flx_pending_bootstrap_store.reset();
// Clear the subscription and migration store refs since they are owned by SyncSession
m_flx_subscription_store.reset();
Collaborator


💯

REALM_ASSERT(m_actualized);
REALM_ASSERT(m_sess);
m_force_closed = true;
if (!m_force_closed && !m_finalized) {
Collaborator


nit: use an early return

return;
else if (!status.is_ok())
throw Exception(status);
if (REALM_LIKELY(!m_finalized && !m_force_closed)) {
Collaborator


same here

return;
else if (!status.is_ok())
throw Exception(status);
if (REALM_LIKELY(!m_finalized && !m_force_closed)) {
Collaborator


and here

else if (!status.is_ok())
throw Exception(status);
// Includes operation_aborted
if (!status.is_ok())
Collaborator


nice!

@@ -628,6 +629,9 @@ class ClientImpl::Connection {
// The set of sessions associated with this connection. A session becomes
// associated with a connection when it is activated.
std::map<session_ident_type, std::unique_ptr<Session>> m_sessions;
// Keep track of previously used sessions idents to see if a stale message was
// received for a closed session
std::vector<session_ident_type> m_session_history;
Collaborator


It is an error to receive any message after an ERROR or UNBOUND. Hence, you could use a std::unordered_set here and remove the session ident when one of those messages is received. I think this also helps if the session is revived (the ident would be added again).

Contributor Author


Yes, that's a good idea - if the client receives either of those from the server, it should never see another message for that session ident. The issue being addressed is when the client closes the session, either through the normal process of sending UNBIND or because the Session is force closed. A complete message could still be waiting in the event loop and be passed back to the Connection after the Session has been deactivated, and we don't want that to tear down the Connection when session multiplexing is in use.

Collaborator


I guess the only "special" case is when the session is force closed (i.e., there is no UNBIND -> UNBOUND exchange), so in that case the session ident would not be removed. But I think it's worth doing the clean-up in all the other cases.

Contributor Author


The session history entry will be removed when the session is closed via the normal path - abrupt closures (e.g. force_close or terminate) will leave the session history in place after the session is deactivated, and this is likely where the stale messages were being received.
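Putting the thread together, here is a compact, self-contained sketch of that bookkeeping, with hypothetical names (activate_session, close_session_cleanly, force_close_session); the real logic lives in ClientImpl::Connection:

#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_set>

using session_ident_type = std::uint64_t;

class Connection {
public:
    void activate_session(session_ident_type ident)
    {
        m_active.insert(ident);
        m_session_history.insert(ident);
    }

    // Clean shutdown: UNBIND was sent and UNBOUND received, so no further
    // messages for this ident are expected - drop the history entry too.
    void close_session_cleanly(session_ident_type ident)
    {
        m_active.erase(ident);
        m_session_history.erase(ident);
    }

    // Abrupt shutdown (force_close/terminate): keep the history entry, since
    // a message for this ident may still be queued in the event loop.
    void force_close_session(session_ident_type ident) { m_active.erase(ident); }

    void receive_message(session_ident_type ident, const std::string& msg)
    {
        if (m_active.count(ident)) {
            std::cout << "session " << ident << " handles: " << msg << '\n';
        }
        else if (m_session_history.count(ident)) {
            std::cout << "stale message for closed session " << ident << " - ignored\n";
        }
        else {
            std::cout << "unknown session " << ident << " - protocol error\n";
        }
    }

private:
    std::unordered_set<session_ident_type> m_active;
    std::unordered_set<session_ident_type> m_session_history;
};

int main()
{
    Connection conn;
    conn.activate_session(7);
    conn.receive_message(7, "DOWNLOAD"); // handled
    conn.force_close_session(7);
    conn.receive_message(7, "DOWNLOAD"); // stale: ignored, connection survives
    conn.receive_message(8, "DOWNLOAD"); // genuinely unknown: protocol error
}

After a clean close the ident disappears from both sets; after a force close it stays in m_session_history, which is exactly the case where a late message should be ignored rather than treated as a protocol error.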

server.stop();
server_thread.join();

// The check is that we reach this point without deadlocking.
Collaborator


Why would there be a deadlock here?

Contributor Author


That is a copy/paste from the test above that was duplicated. Really it should be "deadlocking or crashing" - the test description explains what this test is intended to verify: a SyncSession that is in the process of starting up when the SyncClient is torn down. I will update this comment.

Contributor Author


Updated comments

Contributor
@ironage left a comment


LGTM, but this is not my area of expertise.

return m_wrapper.m_virt_path;
}

const std::string& SessionImpl::get_realm_path() const noexcept
{
// Can only be called if the session is active or being activated
REALM_ASSERT(m_state == State::Active || m_state == State::Unactivated);
Contributor


Really great to see these extra assertions! Not blocking, but where you can, I prefer to add a bit of info to help debugging in case they are violated.

Suggested change
REALM_ASSERT(m_state == State::Active || m_state == State::Unactivated);
REALM_ASSERT_EX(m_state == State::Active || m_state == State::Unactivated, m_state);

Contributor Author


Thank you! I agree it will be nice to have better debugging. I will update these.
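For illustration, a tiny stand-alone example of an assert macro that reports an extra value on failure, in the spirit of REALM_ASSERT_EX (MY_ASSERT_EX here is a hypothetical stand-in, not the realm macro):

#include <cstdio>
#include <cstdlib>

// On failure, print the extra value alongside the condition text before
// aborting, so the violated state shows up in the crash report.
#define MY_ASSERT_EX(cond, extra)                                               \
    do {                                                                        \
        if (!(cond)) {                                                          \
            std::fprintf(stderr, "Assertion failed: %s (extra: %lld)\n", #cond, \
                         static_cast<long long>(extra));                        \
            std::abort();                                                       \
        }                                                                       \
    } while (0)

enum class State { Unactivated, Active, Deactivated };

int main()
{
    State m_state = State::Active;
    MY_ASSERT_EX(m_state == State::Active || m_state == State::Unactivated,
                 static_cast<int>(m_state)); // passes silently

    m_state = State::Deactivated;
    MY_ASSERT_EX(m_state == State::Active || m_state == State::Unactivated,
                 static_cast<int>(m_state)); // fails and reports "extra: 2"
}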

@@ -1775,15 +1901,16 @@ void SessionWrapper::report_progress()
util::Future<std::string> SessionWrapper::send_test_command(std::string body)
{
if (!m_sess) {
return util::Future<std::string>::make_ready(
Status{ErrorCodes::RuntimeError, "session must be activated to send a test command"});
return Status{ErrorCodes::RuntimeError, "session must be activated to send a test command"};
Collaborator


Don't we need a Future here?

Contributor Author


A Future will be constructed in the return statement from the Status value: the Future class has an implicit converting constructor that takes a Status and calls make_ready with the status value provided.

Collaborator


Got it - just noticed the implicit constructor.
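To make the conversion concrete, here is a stripped-down, stand-alone illustration (MiniFuture and this Status are hypothetical stand-ins, not realm::util::Future or realm::Status): returning a Status from a function declared to return the future compiles because the future has a non-explicit constructor taking a Status.

#include <iostream>
#include <string>
#include <utility>
#include <variant>

struct Status {
    int code;
    std::string reason;
};

template <typename T>
class MiniFuture {
public:
    // Non-explicit: allows `return Status{...};` from a function returning MiniFuture<T>.
    MiniFuture(Status s) : m_state(std::move(s)) {}

    static MiniFuture make_ready(T value) { return MiniFuture(std::move(value)); }

    bool is_error() const { return std::holds_alternative<Status>(m_state); }

private:
    explicit MiniFuture(T value) : m_state(std::move(value)) {}
    std::variant<T, Status> m_state;
};

MiniFuture<std::string> send_test_command(bool have_session)
{
    if (!have_session) {
        // Implicitly converted to MiniFuture<std::string> via the Status constructor.
        return Status{1, "session must be activated to send a test command"};
    }
    return MiniFuture<std::string>::make_ready("ok");
}

int main()
{
    std::cout << std::boolalpha << send_test_command(false).is_error() << '\n'; // true
    std::cout << send_test_command(true).is_error() << '\n';                    // false
}

This mirrors why the shortened return in the diff above still yields a ready future holding the error Status.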

@michael-wb michael-wb merged commit 1832986 into master Jun 13, 2023
@michael-wb michael-wb deleted the mwb/fix-multiplex-sessions-crash branch June 13, 2023 13:48
kraenhansen added a commit that referenced this pull request Jun 19, 2023
* Updated release notes

* Update catch2 v3.3.2 (#6605)

* Make core infer platform and cpu_arch, while bundle_id must be provided by SDK's (#6612)

* platform and cpu_arch are inferred by core, bundle_id must be provided by SDK's

* update changelog

* Return proper value for X86_64 arch

Co-authored-by: Christian Melchior <[email protected]>

* Get fine-grained platform for Apple devices

* Fix tests

* small fixes

* fix more tests

* Fix mistake in changelog

---------

Co-authored-by: Christian Melchior <[email protected]>

* use consistent rounding, following SERVER-75392 (#6477)

* fix entries that went to the wrong change version (#6632)

* Special-case main thread runloop scheduler

* Improve SectionedResults performance

SectionedResults used a std::map in a few places where the keys are a dense
range (i.e. always [0..map.size())) and so they can be std::vector instead. The
maps keyed on Mixed are now std::unordered_map.

Change notifications now report changes as a `std::vector<IndexSet>` rather
than `std::map<size_t, IndexSet>`. This is slower and uses more memory when the
only sections that changed are near the end of a SectionedResults with a large
number of sections, but is much faster if all sections changed or if the
sections which changed are early in the SectionedResults. Change notifications
now reuse buffers, which increases persistent memory usage slightly but
significantly reduces allocations.

Change notifications for a single section now only compute the changes for that
section rather than computing the full changes and then filtering out the
changes for other sections.

* use static_assert rather than a old home rolled one

* fix warning of redefine of CHECK macro

* fix unused function warning

* silence warnings in bid128_to_string

* Introduce BPlusTree::for_all

* Prevent program from crashing when removing backlinks

* Fix broken snapshot of collection of objects

* Fix importing Results with deleted collection

The result should be an empty result, not the whole table.

* geospatial validation of polygons (#6607)

* geospatial validation of polygons

* Loop->Ring, added tests

* use std::unique

* changelog

* Benchmark for full-text search

* Allow to filter benchmark and run only list of specified names
* Add simple benchmark for fulltext search with index

* Filter out unresolved links in Dictionary::get_any()

* Add support for early exit in BPlusTree::for_all()

* Geospatial feedback (#6645)

* verify local results match a server query

* disallow geowithin on top level tables

* fix geo queries with ANY/ALL/NONE

* geospatial validation of points

* rename GeoCenterSphere -> GeoCircle

* review feedback

* better testing and fix any/all/none geospatial

* format

* Geospatial basic queries benchmarks (#6621)

* Add basic benchmarks for Geospatial type and queries

* Less copying in GeoWithinCompare

* Bring back caching of s2 region into Geospatial

* remove transaction overhead from measurements

* a couple small optimizations

* formatting

* simplify geospatial query evaluations

* changelog

---------

Co-authored-by: James Stone <[email protected]>

* Updated baas server tag for CI (#6650)

* Prepare release

* Updated release notes

* Access token refresh for websockets was not updating the location metadata (#6631)

* Always refresh metadata on app login
* Updated changelog
* Always update location when requested; fix c_api test
* Update test to properly evaluate websocket redirections; added one more test
* Updated changelog and fixed compile warning
* Added location checks back to test
* added mutex locking around location updated state and reworked requesting location update to use flag
* clang format and fix incorrect timeout value
* Reworked update location logic a bit and removed unused function
* Free mutex before calling completion on early exit in init_app_metadata

* maybe fix a race in a test (#6651)

* Use std::optional to store cached leaves in query nodes (#6653)

Our use of aligned_storage was basically a complicated manual version of this.
I was hoping this'd have binary size benefits, but it ended up making the
library 100 bytes larger instead. Nonetheless, it greatly simplifies things.

* Fix a few UBSan failures hit by tests

* Avoid performing unaligned reads in Array::get_chunk()

* Fix a lock order inversion in tests (#6666)

The cycle was DaemonThread::m_running_on_change_mutex =>
RealmCoordinator::m_realm_mutex  => SyncManager::m_mutex  =>
RealmCoordinator::s_coordinator_mutex  =>
DaemonThread::m_running_on_change_mutex, and it happened due to
DaemonThread::remove() being called inside RealmCoordinator::clear_cache()
while holding s_coordinator_mutex. Fortunately we don't actually need to be doing that.

As the cycle required RealmCoordinator::clear_all_caches(), this was only
applicable to tests.

* Allow geo coordinate numeric argument substitutions (#6663)

* allow geo coordinate numeric argument substitutions

* review feedback

* explicit cast to address warning

* Remove catch() clause to prevent truncating stack trace in AsyncOper::do_recycle_and_execute() (#6667)

* Fix an assertion failure if an async write callback ran during a write transaction (#6661)

Between when the callback after acquiring the write lock is scheduled and when
it's invoked a synchronous write transaction can be begun, and if it's not
ended before the next time the scheduler gets to run, the scheduled callback
will be invoked inside the write. When this happens we want to just do nothing.
Ending the synchronous write transaction will take care of rescheduling the
async write it preempted.

* core release 13.13.0

* Updated release notes

* Allocate arguments for lists (#6674)

* Small documentation and code fixes (#6672)

* Fix crash when opening FLX realm after client reset failure (#6671)

* Fix crash when opening FLX realm after client reset failure

* Update changelog

* Don't superceed pending subscriptions in case of a client reset failure

* Add test

* Changes after code review

* Support sorting based on values from a dictionary (#5311)

Co-authored-by: Sebastian Valle <[email protected]>
Co-authored-by: James Stone <[email protected]>

* Filter out external sources from Eclipse (#6682)

Indexer has a hard time dealing with Catch2

* Use cross-compilers instead of CentOS image (#6559)

* Use cross-compilers instead of CentOS image

* changelog

* fix bad merge

* refactor toolchain files

* clarify useToolchain exception circumstances

* Remap github URL to ssh to fix BAAS dependency using https:// (#6685)

* core v13.14.0

* Updated release notes

* Switch to building with Xcode 14 (#6647)

* better fix explanation in the changelog for list of args in the query parser (#6692)

* Remove constructor for GeoPoint and GeoPolygon (#6679)

Co-authored-by: Mathias Stearn <[email protected]>

* Fix failing "sync: non-synced metadata table doesn't result in non-additive schema change" tests (#6697)

* Reporting correct error message on HTTP errors for Browser target

* User/Server API key provider becomes a single 'API key' provider (#6696)

* Allow frozen Realms to be opened with additive schema changes (#6693)

* allow frozen Realms to be opened with additive schema changes

* lint

* strengthen tests and comments

* Update src/realm/object-store/shared_realm.cpp

Co-authored-by: Thomas Goyne <[email protected]>

---------

Co-authored-by: Thomas Goyne <[email protected]>

* Reverted minimum swift version to fix failing CI tests (#6706)

* core release v13.15.0

* Updated release notes

* Fix client reset test with invalid query (#6711)

* Fix SessionWrapper use-after-free crash when tearing down sessions (#6676)

* Changed SessionWrapper pointer to bind_ptr; added session ident history
* Fix teardown if client is destroyed before session
* Session no longer holds bind_ptr to SessionWrapper; reverted some changes
* Fixed return and updated some comments
* Don't process errors if session is shutting down
* Added extra checks for session state
* Updates from review
* Updated some finalized checks
* Rolled back some changes
* Added output to ASSERTS and moved session history to unordered_set
* Remove session history entry on normal close
* Updated comment in sync tests

* Add [baas] and [local] tags to object store sync tests to identify the tests that rely on BAAS or not (#6710)

* Use Columns<Link> when property is Dictionary of links (#6705)

If a Dictionary property has links as value type, we can use Columns<Link> to handle
the links instead of the basic Columns<Dictionary>. This has the effect that when we
compare with a single value, we will optimize to use LinksToNode. So we need to make
LinksToNode handle the Dictionary case.

When we compare with a list of links, we must ensure that the list is converted to
a list of ObjKeys - which is the type that Columns<Link> evaluates to.

 Use LinksToNode for lists in QueryParser

* better changelog message for the fix related to queries with list of arguments (#6717)

* Fixes for Emscripten target (Passing header from fetch response. Using Config.path for inMemory Realm) (#6716)

* Fixes for Emscripten target: Passing header for fetch response. Passing the RealmConfig.path to be used for inMemory Realm, this is needed for registering SyncSession

Co-authored-by: Jørgen Edelbo <[email protected]>

* release 13.15.1

* Updated spec.yml to remove User & Server prefix from ApiKey credentials

---------

Co-authored-by: James Stone <[email protected]>
Co-authored-by: realm-ci <[email protected]>
Co-authored-by: Kirill Burtsev <[email protected]>
Co-authored-by: Daniel Tabacaru <[email protected]>
Co-authored-by: Christian Melchior <[email protected]>
Co-authored-by: Thomas Goyne <[email protected]>
Co-authored-by: Thomas Goyne <[email protected]>
Co-authored-by: Jørgen Edelbo <[email protected]>
Co-authored-by: Michael Wilkerson-Barker <[email protected]>
Co-authored-by: Nicola Cabiddu <[email protected]>
Co-authored-by: Sebastian Valle <[email protected]>
Co-authored-by: Yavor Georgiev <[email protected]>
Co-authored-by: Ferdinando Papale <[email protected]>
Co-authored-by: Mathias Stearn <[email protected]>
Co-authored-by: Nabil Hachicha <[email protected]>
Co-authored-by: Finn Schiermer Andersen <[email protected]>
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 21, 2024

Successfully merging this pull request may close these issues.

Multiplexing seems to periodically crash the SyncClient during Client Reset