Rewrite the local changesets in-place for client reset recovery #7161
Conversation
Nice improvements across the board 👍
It is a lot of code changes in some dusty corners of the codebase, but I went through all the commits and it looks good to me. Thanks for taking this on, having the FLX recovery as a single commit is a huge win!
- // By marking version 2 as complete version 1 will get superceded and removed.
- CHECK_THROW(store->get_mutable_by_version(1), KeyNotFound);
+ // By marking version 2 as complete version 1 will get superseded and removed.
+ CHECK_EQUAL(store->get_by_version(1).state(), SubscriptionSet::State::Superseded);
v1 is indeed superseded but not removed in this case?
"superseded and removed" is a redundant phrase, as we mark things superseded by removing them. The comment is probably misleading, but the difference in the test is just due to get_by_version()
and get_mutable_by_version()
reporting a superseded version in different ways.
The exposed interface for this type is just a function, so we can slightly improve build time/size by not exposing any of the implementation details.
The MutableSubscriptionSet returned by this didn't support any operations other than `update_state()`, so pushing `update_state()` to `SubscriptionStore` and removing `get_mutable_by_version()` makes it less error-prone.
Rather than discarding the old sync history and creating new commits, update the individual changesets which are being recovered in-place while keeping the original version numbers. This significantly simplifies recovery when pending subscriptions are involved, as they no longer need to be updated at all, and it lets us apply the client reset in a single write transaction, which improves notifications and avoids a lot of problems if another thread has the Realm open while the client reset is happening. It also avoids causing problems for the server, which doesn't like overly large changesets, and sometimes ran into problems due to all of the pending local changesets being merged into one.
…eset on an async open

If the very first open of a flexible sync Realm triggered a client reset, the configuration had an initial subscriptions callback, both before and after reset callbacks, and the initial subscription callback began a read transaction without ending it (which is normally going to be the case), opening the frozen Realm for the after reset callback would trigger a BadVersion exception. The commit to create the initial subscriptions resulted in the `before` Realm being one version out of date, so once it was closed and the read lock released that VersionID was no longer valid and could not be used to obtain new transactions.
These tests were a giant race condition that only worked with fairly specific timing from the server, and in practice didn't actually test the situation they were intending to test.
If DOWNLOAD messages were received while there were unuploaded local changes prior to the reset, the reciprocal history will be more up-to-date than the original changesets and be more likely to recover correctly. If not, this has exactly the old behavior.
I've switched it over to recovering using the reciprocal history, made it clear the reciprocal history afterwards, and added a test showing why each of those is needed.
@@ -973,4 +971,25 @@ int64_t SubscriptionStore::set_active_as_latest(Transaction& wt)
     return version;
 }

+int64_t SubscriptionStore::mark_active_as_complete(Transaction& wt)
you could make this a no-op if the state is already Complete (unless it's possible to set the state and not deliver the notifications)
I think if the active set is already complete then m_pending_notifications should be empty barring someone waiting for a Superseded notification, so the only unnecessary thing we're doing in that scenario is acquiring the lock.
@@ -507,39 +507,16 @@ void wait_for_num_objects_in_atlas(std::shared_ptr<SyncUser> user, const AppSess
     std::chrono::minutes(15), std::chrono::milliseconds(500));
 }

-void trigger_client_reset(const AppSession& app_session)
+void trigger_client_reset(const AppSession& app_session, const SyncSession& sync_session)
don't we need to set development mode back and wait for initial sync to complete as before? (is the client reset endpoint doing all that?)
The client reset endpoint just directly triggers a client reset for a specific client file ident without disabling sync, so none of that is needed.
LGTM. You could mention using the reciprocal history in the PR's description.
"Move RecoverLocalChangesetsHandler fully to the cpp file" is just moving code around with no changes, and the functional changes to client_reset_recovery.cpp are entirely in the final commit.
"Remove SubscriptionStore::get_mutable_by_version()" is adjusting the SubscriptionStore API to eliminate a problem I ran into while writing some tests: the MutableSubscriptionSet it returns doesn't actually support modifying the subscription set. This is actually the correct behavior, as existing subscription sets shouldn't be modified, but it was a misleading interface. I changed it to only expose
update_state()
rather than appearing to allow more than it did.The client reset during async open tests started hitting some very unintended codepaths and started crashing. AFAICT the change in timing resulted in it triggering two client resets, as it relied on stopping sync in the middle of the server sending a batch having some very specific behavior that wasn't actually guaranteed. I updated it to use the client reset endpoint that all other tests use and to immediately disconnect rather than relying on the server disconnecting it. This revealed that the FLX test had failed to test the thing it was trying to test and that very specific scenario was broken.