Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RCORE-1872 Sync client should allow server bootstrapping at any time #7440

Merged
merged 71 commits into from
Jun 11, 2024

Conversation

michael-wb
Copy link
Contributor

@michael-wb michael-wb commented Mar 8, 2024

What, How & Why?

Add the ability for the server to send a bootstrap to the server at any time, initially needed to support the handle role changes without a client reset. With these changes a single message/single changeset bootstrap for the current query version will be applied like a regular download message, however a multi-message or single message/multi changeset bootstrap for the current query version will be applied like a regular bootstrap.

These changes also include a test to validate the role change bootstrap and verify the local data is updated to reflect the new permissions. Additional tests are forthcoming in a couple of future PRs.

The version of baasaas and baas used with these changes are based off a branch that is pending review and the protocol version was bumped to v14 in order to support these changes without the feature flag.

Fixes #7584, #7326

☑️ ToDos

  • 📝 Changelog update
  • 🚦 Tests (or not relevant)
  • [ ] C-API, if public C++ API changed
  • [ ] bindgen/spec.yml, if public C++ API changed

@cla-bot cla-bot bot added the cla: yes label Mar 8, 2024
@michael-wb michael-wb changed the title Updated bootstrap store to handle server initiated bootstraps RCORE-1872 Sync client should allow server bootstrapping at any time Mar 8, 2024
@michael-wb michael-wb force-pushed the mwb/role-change-bootstraps branch from 7ec0cc0 to ae2574e Compare April 3, 2024 19:07
@michael-wb michael-wb marked this pull request as ready for review April 12, 2024 00:57
@michael-wb michael-wb requested a review from jbreams April 16, 2024 13:12
TEST_CASE("flx: role change bootstrap", "[sync][flx][baas][role_change][bootstrap]") {
const Schema person_schema{{"Person",
{{"_id", PropertyType::ObjectId, Property::IsPrimary{true}},
{"age", PropertyType::Int | PropertyType::Nullable},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the only field in this schema that actually matter are "role". Do we need to make up fake age and firstName/lastName properties? how about just a role and a name property so each object is {_id: ObjectId(), role: "manager", name: "manager-1" }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to remove the extra fields, but actually, I need the extra data so I can ensure adding the employees back to the local data generates multiple bootstrap messages.

test/object-store/sync/flx_sync.cpp Outdated Show resolved Hide resolved
test/object-store/sync/flx_sync.cpp Outdated Show resolved Hide resolved
auto logger = util::Logger::get_default_logger();
FLXSyncTestHarness harness("flx_role_change_bootstrap", {person_schema, {"role", "firstName", "lastName"}});
auto& app_session = harness.session().app_session();
// Enable the role change bootstraps
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are these commented out?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They currently don't work with baasaas and generate an error...

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that because this depends on https://jira.mongodb.org/browse/BAAS-28604, or something else?
If so, can you reference that in here like TODO (BAAS-28604): ... and be sure to link that ticket to RCORE-1872?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - this code depends on using the feature flag instead of the protocol version for now. Once the feature is released, the client and server will coordinate the version bump when it will be merged from the feature branch to master. I added a comment to reflect this.

test/object-store/sync/flx_sync.cpp Outdated Show resolved Hide resolved
test/object-store/sync/flx_sync.cpp Outdated Show resolved Hide resolved
test/object-store/sync/flx_sync.cpp Outdated Show resolved Hide resolved
multi_msg = true;
}
// Verify a bootstrap occurred if multiple messages were received
REQUIRE((!multi_msg || machina.get() == TestState::bs_complete));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't we need to wait for the state machine to reach bs_complete before we try to get this state value? i'm also not sure what multi_msg is supposed to mean/do?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we trying to assert here whether there should be multiple messages or just checking that if there happen to be multiple messages that they end in a complete bootstrap? Can some bootstraps in this test only have one message - how do we enforce that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was using the wait_for_download() to wait for the bootstrap to complete, since the server is not supposed to send the MARK response message until after the role change server initiated bootstrap.

Since the integration of the bootstrap is performed when the final bootstrap message is received, the MARK response is not processed until after the bootstrap has been integrated, so the state machine will be updated to the expected state before the wait_for_download() returns.

test/object-store/sync/flx_sync.cpp Outdated Show resolved Hide resolved

// If this is the first LastInBatch=true message after one or more
// LastInBatch=false messages, this is the end of a bootstrap
if (m_last_download_batch_state == DownloadBatchState::MoreToCome) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wish we didn't have to add this state tracking, but I think I see why we need it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah - it's so we can catch the LastInBatch=false => LastInBatch=true sequence for the server initiated bootstraps. Can you think of a better way?

FLXSyncTestHarness harness("flx_role_change_bootstrap", {person_schema, {"role", "firstName", "lastName"}});
auto& app_session = harness.session().app_session();
// Enable the role change bootstraps
// REQUIRE(app_session.admin_api.set_feature_flag(app_session.server_app_id, "allow_permissions_bootstrap",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we going to need this api in the tests? if not, i'm wondering if we should keep the code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe? - right now, the feature is enabled for v14 on the branch I am using, but it may go back to the feature flag once that branch is merged into master.
These will be no longer needed, and therefore removed, once the feature branch is ready to merge into master.

@@ -4927,6 +4927,255 @@ TEST_CASE("flx: nested collections in mixed", "[sync][flx][baas]") {
CHECK(nested_list.get_any(1) == "foo");
}

TEST_CASE("flx: role change bootstrap", "[sync][flx][baas][role_change][bootstrap]") {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just my 2 cents, but I think the test(s) can be simplified a great deal by doing the following:

  • instead of using register_connection_change_callback you can use the debug hook and check the client receives a 200 error
  • for query version 1 you can count all non-empty download messages (the role change bootstrap essentially) and assert on that (additionally, once all messages are received (last_in_batch=true for BootstrapMessageProcessed) you can check the PendingBootstrapStore stored the right data)
  • finally, you assert the expected data is in the realm
  • if you want multiple bootstrap in a single test, you can optionally pause the session, change the roles, and resume the session to test role changes detection at ident

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good points - thanks for the recommendations

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

didn't mention the state machine, but I'm not sure you need it, do you?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The state machine is being used for two reasons:

  • Wait for the 200 error to be received
  • Verify the order of receiving a server initiated bootstrap with multiple messages, so we track the order/steps of MoreToCome, LastInBatch and then bootstrap processed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't wait_for_download enough for all of this? and you can do the tracking in on_sync_client_event_hook

Copy link
Contributor

@jbreams jbreams left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a hopefully useful comment about using test command futures.

REQUIRE(cur_state == TestState::reconnect_received);
if (auto session = weak_session.lock()) {
logger->trace("ROLE CHANGE: sending PAUSE test command after resumed");
test_command_futures.push_back(pause_download_builder(*session, true));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the test_command_futures vector and associated machinery is just to verify that you've reached the session_resumed state after the test command has been sent. So it might be simpler to let the future do that for you with something like this.

if (send_test_command) {
    REQUIRE(cur_state == TestState::reconnect_received);
    auto session = weak_session.lock();
    REQUIRE(session);
    pause_download_builder.get_async([&](StatusWith<std::string> payload) {
        REQUIRE(payload.is_ok());
        machina.transition_with([&](TestState state) {
              return TestState::session_resumed;
        });
    });
    return std::nullopt;
} else {
    return TestState::session_resumed;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That simplifies things definitely. And the only thing the future callback needs to check for is success; it doesn't need to do anything with the state machine.

@@ -1470,6 +1467,10 @@ inline void ClientImpl::Session::connection_established(bool fast_reconnect)
++m_target_download_mark;
}

// Call SessionResumed before sending the BIND Message to
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understand this comment and why it has to be here (same below)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is primarily here so it doesn't get moved. The role change test needs to be notified of the session resumed/connected prior to sending the BIND message so it can queue up the test command to be sent before the IDENT message. If this notification happens after the BIND message, there isn't enough time for the test to queue up the test command before the IDENT message is sent.

I updated the message to hopefully be more clear.

Copy link
Collaborator

@danieltabacaru danieltabacaru Jun 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I find it a bit hack-ish though. IIUC, there is a scheduling problem because send_test_command posts to the event loop. What if instead we update m_pending_test_commands directly under a lock? And then you can invoke send_test_command from the event loop when sending BIND (you actually don't need the lock if you create a new method only to be invoked from the event loop). Would that work for your tests? @jbreams what do you think of this approach?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but there isn't a good place to do this, unless I put plumbing in the sync session to add directly to the list of test commands. This is the only thing available in the event hook callback functions.

The current approach ensures the callback to add the test command gets posted before the BIND message is sent and the callback for async_write_binary() is run on the event loop to send the next message after the BIND.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw, i think this is fine for now. another approach could be to send the test commands when you get the 200 disconnect since i think we preserve the list of pending test commands across disconnects, but i wouldn't want to hold this project up over it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - the list of test commands sticks around for the lifetime of the ClientImpl::Session and the only time commands are removed are when a response is received or the session object is destroyed.
The current send_test_command() logic checks to make sure the session is currently active, but that is easy enough to update if needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the session only becomes inactive if you pause() it or destroy the realm - i think in the tests we have right now it should survive a disconnect/reconnect.

@@ -60,6 +60,9 @@ namespace sync {
// 13 Support for syncing collections (lists and dictionaries) in Mixed columns and
// collections of Mixed
//
// 14 Support for server initiated bootstraps, including bootstraps for role/
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to update get_current_protocol_version() to return the new version

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was waiting until this feature was about to be merged, but it really doesn't matter, since I'm using a feature branch and the feature is behind a feature flag.


auto setup_harness = [&](FLXSyncTestHarness& harness, TestParams params) {
auto& app_session = harness.session().app_session();
/** TODO: Remove when switching to use Protocol version in RCORE-1972 */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're doing it as part of this pr.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the server hasn't been updated to use the protocol version yet... I believe it is still behind the feature flag

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - it is just controlled by a feature flag


// Add client reset callback to verify a client reset doesn't happen
config.sync_config->notify_before_client_reset = [&](std::shared_ptr<Realm>) {
did_client_reset = true;
Copy link
Collaborator

@danieltabacaru danieltabacaru Jun 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: FAIL() (and no need for did_client_reset)

@michael-wb michael-wb merged commit a988773 into feature/role-change Jun 11, 2024
35 of 39 checks passed
@michael-wb michael-wb deleted the mwb/role-change-bootstraps branch June 11, 2024 17:13
@michael-wb michael-wb linked an issue Jun 20, 2024 that may be closed by this pull request
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 11, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sync client should allow server bootstrapping at any time Bump sync protocol version to v14
4 participants