Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix crash when opening FLX realm after client reset failure #6671

Merged
merged 6 commits into from
May 30, 2023
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@
* Access token refresh for websockets was not updating the location metadata ([#6630](https://github.com/realm/realm-core/issues/6630), since v13.9.3)
* Fix several UBSan failures which did not appear to result in functional bugs ([#6649](https://github.com/realm/realm-core/pull/6649)).
* Fix an out-of-bounds read in sectioned results when sectioned are removed by modifying all objects in that section to no longer appear in that section ([#6649](https://github.com/realm/realm-core/pull/6649), since v13.12.0)
* Fixed a potential crash when opening the realm after failing to download a fresh FLX realm during an automatic client reset ([#6494](https://github.com/realm/realm-core/issues/6494), since v12.3.0)

### Breaking changes
* None.
Expand Down
6 changes: 3 additions & 3 deletions src/realm/object-store/sync/sync_session.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -579,12 +579,12 @@ void SyncSession::handle_fresh_realm_downloaded(DBRef db, Status status,
lock.unlock();
if (auto local_subs_store = get_flx_subscription_store()) {
// In DiscardLocal mode, only the active subscription set is preserved
// this means that we have to remove all other subscriptions including later
// This means that we have to remove all other subscriptions including later
// versioned ones.
auto mut_sub = local_subs_store->get_active().make_mutable_copy();
// Mark the new subscription set Complete because there must always exist an active subscription set.
mut_sub.update_state(sync::SubscriptionSet::State::Complete);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears to work fine for this corner case. But I'm a little concerned that we are papering over a real underlying problem here. The subscription is actually not complete, it is in an error state. Is there some invariant of the subscription store that has been violated by setting the active subscription to an error?

If I undo these changes and run the test you added, it looks like the actual problem is that https://github.com/realm/realm-core/blob/master/src/realm/sync/subscriptions.cpp#L727-L730 where an assumption has been made that there is an object with primary key 0. This is probably true in the initial state, but not later on. I wonder if it is possible to hit this in normal flow somehow?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, the assumption is that there is always going to be a subscription with pk==0. If that is true, we should not remove it during a superceed state change. What do you think of these changes instead? https://github.com/realm/realm-core/compare/dt/fix_client_reset_crash...js/flx-client-reset?expand=1

Copy link
Collaborator Author

@danieltabacaru danieltabacaru May 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the assumption is that there is always going to be a subscription with pk==0 if there is no complete subscription. And that makes sense when opening a fresh realm. But once we have a complete subscription, we supercede and hence remove all the previous ones. I had the same idea to keep around the subscription with the primary key 0, but I think it's wrong. We always supercede subscriptions when one is marked Complete (https://github.com/realm/realm-core/blob/master/src/realm/sync/subscriptions.cpp#L393-L398). If one is marked Error, we should still have the last complete one. And that I think should also be the case here. The new subscription is just a copy of the active/complete one, so it should not be in Error state since there is no issue with it.

Copy link
Collaborator Author

@danieltabacaru danieltabacaru May 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I undo these changes and run the test you added, it looks like the actual problem is that https://github.com/realm/realm-core/blob/master/src/realm/sync/subscriptions.cpp#L727-L730 where an assumption has been made that there is an object with primary key 0. This is probably true in the initial state, but not later on. I wonder if it is possible to hit this in normal flow somehow?

I cannot think of any other scenario.

Is there some invariant of the subscription store that has been violated by setting the active subscription to an error?

supercede_prior_to is called only when a subscription is marked complete, while supercede_all_except is the issue here (called only in this specific case)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, what you say about removing the pk==0 subscription makes sense, ignore my suggested change for that.
I am just wary of setting the subscription to complete when that is not the actual state of things, I feel that this may create other edge case bugs. I've tried to come up with a scenario where that matters, but am having a hard time.

Now I am questioning if it is correct to be superceeding subscriptions in this way at all. It may be fine for discard local, but I think it may be wrong for recovery mode in a scenario where the server breaks the schema but then changes it back such that FLX queries start to work again. In that situation we wouldn't want to superceed subscriptions made offline because those should be recovered.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wondered why we don't leave the subscriptions as they are..

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Superceeding the subscriptions will lead to compensating writes. We could superceed them for DiscardLocal and keep them for the recovery mode. I would keep them in all modes (in the worst case we'll create subscriptions for tables or objects that don't exist, but that's not an issue for the server afaik)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are right, and we should just delete https://github.com/realm/realm-core/blob/master/src/realm/object-store/sync/sync_session.cpp#L580-L589 and nuke the method supercede_all_except(). The comment says that the intent is to remove all later versioned (made offline) subscriptions, but that isn't necessary in the case of a client reset at this point in discard mode and it is plain wrong in recovery mode.
I think you can test the recovery scenario with something like this:

  1. Subscribe to a query A and successfully sync with the server.
  2. Go offline and make a subscription B and add an object that falls under that subscription.
  3. Break the server schema so that subscription A becomes an invalid query and terminate/restart the server.
  4. Client reset with Recovery. This should fail but not remove FLX query B
  5. Break the server schema again such that subscription A becomes valid again and terminate/restart the server.
  6. Reconnect the client, it should do a client reset with recovery successfully and subscription B is now active and the object added in step 2 is uploaded to the server.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the test suggestion. I'll give it a try.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at https://github.com/realm/realm-core/blob/master/test/object-store/sync/flx_sync.cpp#L754-L800 for inspiration, but the outcome is not always deterministic. So, I added a new test based on the other test I added for this PR to test for this specific case.

local_subs_store->supercede_all_except(mut_sub);
mut_sub.update_state(sync::SubscriptionSet::State::Error,
util::make_optional<std::string_view>(status.reason()));
std::move(mut_sub).commit();
}
const bool try_again = false;
Expand Down
2 changes: 2 additions & 0 deletions src/realm/sync/subscriptions.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -903,6 +903,8 @@ void SubscriptionStore::supercede_prior_to(TransactionRef tr, int64_t version_id

void SubscriptionStore::supercede_all_except(MutableSubscriptionSet& mut_sub) const
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

{
// 'mut_sub' can only supersede the other subscription sets if it is in Complete state.
REALM_ASSERT_EX(mut_sub.state() == SubscriptionSet::State::Complete, mut_sub.state());
auto version_to_keep = mut_sub.version();
supercede_prior_to(mut_sub.m_tr, version_to_keep);

Expand Down
31 changes: 31 additions & 0 deletions test/object-store/sync/flx_sync.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -40,6 +40,7 @@
#include "realm/sync/client_base.hpp"
#include "realm/sync/config.hpp"
#include "realm/sync/noinst/client_history_impl.hpp"
#include <realm/sync/noinst/client_reset_operation.hpp>
#include "realm/sync/noinst/pending_bootstrap_store.hpp"
#include "realm/sync/noinst/server/access_token.hpp"
#include "realm/sync/protocol.hpp"
Expand Down Expand Up @@ -811,6 +812,36 @@ TEST_CASE("flx: client reset", "[sync][flx][app][client reset]") {
})
->run();
}

SECTION("DiscardLocal: open realm after client reset failure") {
config_local.sync_config->client_resync_mode = ClientResyncMode::DiscardLocal;
auto&& [error_future, error_handler] = make_error_handler();
config_local.sync_config->error_handler = error_handler;

std::string fresh_path = realm::_impl::ClientResetOperation::get_fresh_path_for(config_local.path);
// create a non-empty directory that we'll fail to delete
util::make_dir(fresh_path);
util::File(util::File::resolve("file", fresh_path), util::File::mode_Write);

auto test_reset = reset_utils::make_baas_flx_client_reset(config_local, config_remote, harness.session());
test_reset->run();

// Client reset fails due to sync client not being able to create the fresh realm.
auto sync_error = wait_for_future(std::move(error_future)).get();
CHECK(sync_error.get_system_error() == sync::make_error_code(sync::ClientError::auto_client_reset_failure));

// Open the realm again. This should not crash.
{
auto config_copy = config_local;
auto&& [err_future, err_handler] = make_error_handler();
config_local.sync_config->error_handler = err_handler;

auto realm_post_reset = Realm::get_shared_realm(config_copy);
auto sync_error = wait_for_future(std::move(err_future)).get();
CHECK(sync_error.get_system_error() ==
sync::make_error_code(sync::ClientError::auto_client_reset_failure));
}
}
}

TEST_CASE("flx: creating an object on a class with no subscription throws", "[sync][flx][app]") {
Expand Down