Rewrite the local changesets in-place for client reset recovery #7161
Conversation
Nice improvements across the board 👍
It is a lot of code changes in some dusty corners of the codebase, but I went through all the commits and it looks good to me. Thanks for taking this on, having the FLX recovery as a single commit is a huge win!
- // By marking version 2 as complete version 1 will get superceded and removed.
- CHECK_THROW(store->get_mutable_by_version(1), KeyNotFound);
+ // By marking version 2 as complete version 1 will get superseded and removed.
+ CHECK_EQUAL(store->get_by_version(1).state(), SubscriptionSet::State::Superseded);
v1 is indeed superseded but not removed in this case?
"superseded and removed" is a redundant phrase, as we mark things superseded by removing them. The comment is probably misleading, but the difference in the test is just due to get_by_version()
and get_mutable_by_version()
reporting a superseded version in different ways.
The exposed interface for this type is just a function, so we can slightly improve build time/size by not exposing any of the implementation details.
The MutableSubscriptionSet returned by this didn't support any operations other than `update_state()`, so pushing `update_state()` to `SubscriptionStore` and removing `get_mutable_by_version()` makes it less error-prone.
Rather than discarding the old sync history and creating new commits, update the individual changesets which are being recovered in-place while keeping the original version numbers. This significantly simplifies recovery when pending subscriptions are involved, as they no longer need to be updated at all, and it lets us apply the client reset in a single write transaction, which improves notifications and avoids a lot of problems if another thread has the Realm open while the client reset is happening. It also avoids causing problems for the server, which doesn't like overly large changesets, and sometimes ran into problems due to all of the pending local changesets being merged into one.
…eset on an async open

If the very first open of a flexible sync Realm triggered a client reset, the configuration had an initial subscriptions callback, both before and after reset callbacks, and the initial subscription callback began a read transaction without ending it (which is normally going to be the case), opening the frozen Realm for the after reset callback would trigger a BadVersion exception. The commit to create the initial subscriptions resulted in the `before` Realm being one version out of date, so once it was closed and the read lock released that VersionID was no longer valid and could not be used to obtain new transactions.
These tests were a giant race condition that only worked with fairly specific timing from the server, and in practice didn't actually test the situation they were intending to test.
If DOWNLOAD messages were received while there were unuploaded local changes prior to the reset, the reciprocal history will be more up-to-date than the original changesets and be more likely to recover correctly. If not, this has exactly the old behavior.
I've switched it over to recovering using the reciprocal history, made it clear the reciprocal history afterwards, and added a test showing why each of those is needed.
@@ -973,4 +971,25 @@ int64_t SubscriptionStore::set_active_as_latest(Transaction& wt)
     return version;
 }

+int64_t SubscriptionStore::mark_active_as_complete(Transaction& wt)
you could make this a no-op if the state is already Complete (unless it's possible to set the state and not deliver the notifications)
I think if the active set is already complete then m_pending_notifications should be empty barring someone waiting for a Superseded notification, so the only unnecessary thing we're doing in that scenario is acquiring the lock.
@@ -507,39 +507,16 @@ void wait_for_num_objects_in_atlas(std::shared_ptr<SyncUser> user, const AppSess
     std::chrono::minutes(15), std::chrono::milliseconds(500));
 }

-void trigger_client_reset(const AppSession& app_session)
+void trigger_client_reset(const AppSession& app_session, const SyncSession& sync_session)
don't we need to set development mode back and wait for initial sync to complete as before? (is the client reset endpoint doing all that?)
The client reset endpoint just directly triggers a client reset for a specific client file ident without disabling sync, so none of that is needed.
LGTM. You could mention using the reciprocal history in the PR's description.
"Move RecoverLocalChangesetsHandler fully to the cpp file" is just moving code around with no changes, and the functional changes to client_reset_recovery.cpp are entirely in the final commit.
"Remove SubscriptionStore::get_mutable_by_version()" is adjusting the SubscriptionStore API to eliminate a problem I ran into while writing some tests: the MutableSubscriptionSet it returns doesn't actually support modifying the subscription set. This is actually the correct behavior, as existing subscription sets shouldn't be modified, but it was a misleading interface. I changed it to only expose
update_state()
rather than appearing to allow more than it did.The client reset during async open tests started hitting some very unintended codepaths and started crashing. AFAICT the change in timing resulted in it triggering two client resets, as it relied on stopping sync in the middle of the server sending a batch having some very specific behavior that wasn't actually guaranteed. I updated it to use the client reset endpoint that all other tests use and to immediately disconnect rather than relying on the server disconnecting it. This revealed that the FLX test had failed to test the thing it was trying to test and that very specific scenario was broken.