Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update server version on all entries in the sync history during client reset with recovery #7291

Merged
merged 4 commits into from
Jan 29, 2024

Conversation

danieltabacaru
Copy link
Collaborator

@danieltabacaru danieltabacaru commented Jan 24, 2024

What, How & Why?

When performing recovery after a client reset, the server version of the recovered changesets is updated with the latest server version. The changesets recovered are only the changesets with actual changes, but empty changesets may have been created when a new subscriptions set is committed for example. The server version of such changesets was not updated, leading to Bad server version errors during uploads.

The server version is now updated on all entries in the sync history.

Fixes #7279.

☑️ ToDos

  • 📝 Changelog update
  • 🚦 Tests (or not relevant)
  • [ ] C-API, if public C++ API changed
  • [ ] bindgen/spec.yml, if public C++ API changed

Copy link

coveralls-official bot commented Jan 24, 2024

Pull Request Test Coverage Report for Build daniel.tabacaru_757

  • 0 of 69 (100.0%) changed or added relevant lines in 2 files are covered.
  • 68 unchanged lines in 17 files lost coverage.
  • Overall coverage increased (+0.01%) to 91.851%

Files with Coverage Reduction New Missed Lines %
test/test_query2.cpp 1 99.08%
test/test_util_network_ssl.cpp 1 90.71%
src/realm/object-store/shared_realm.cpp 2 93.03%
src/realm/sync/network/http.hpp 2 83.33%
test/test_lang_bind_helper.cpp 2 93.31%
test/test_sync.cpp 2 94.14%
src/realm/sort_descriptor.cpp 3 93.7%
src/realm/util/future.hpp 3 95.98%
test/test_util_network.cpp 3 97.2%
src/realm/sync/instructions.hpp 4 75.76%
Totals Coverage Status
Change from base Build 1990: 0.01%
Covered Lines: 235211
Relevant Lines: 256080

💛 - Coveralls

@danieltabacaru danieltabacaru marked this pull request as ready for review January 25, 2024 09:48
Copy link
Contributor

@cmelchior cmelchior left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can confirm that with these changes, the errors we saw in the server logs in Kotlin are now gone.

auto i = size_t(version - m_sync_history_base_version);
util::compression::allocate_and_compress_nonportable(arena, changeset, compressed);
m_arrays->changesets.set(i, BinaryData{compressed.data(), compressed.size()}); // Throws
// Server version is updated for *every* entry in the sync history to ensure that server versions don't
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think doing this as two separate loops is a bit clearer (and it's trivially faster, but that doesn't matter):

        for (auto& [changeset, version] : recovered_changesets) {
            uploadable_bytes += changeset.size();
            auto i = size_t(version - m_sync_history_base_version);
            util::compression::allocate_and_compress_nonportable(arena, changeset, compressed);
            m_arrays->changesets.set(i, BinaryData{compressed.data(), compressed.size()}); // Throws
        }
        for (size_t i = 0, j = 0, size = m_arrays->remote_versions.size(); i < size; ++i) {
            m_arrays->remote_versions.set(i, server_version.version);
        }

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially implemented it like that, then changed my mind. I will make the change.

Copy link
Contributor

@cmelchior cmelchior left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some more testing, I'm not sure if this fully fixes the issue.

We are now seeing another test (also asymmetric) that now fails with this:

Error: client file not found. The server has forgotten about the client-side file presented by the client. This is likely due to using a synchronized realm after terminating and re-enabling sync. Please wipe the file on the client to resume synchronization. { sessionIdent: 2, clientFileIdent: 35 } (ProtocolErrorCode=208)

But I do not see how we can trigger a Client Reset as this is the first time the Client is connecting to the server

The local log is attached: log-crash.txt

In the server logs there is a missing file ident as well, but not sure how that is possible:

image

@danieltabacaru
Copy link
Collaborator Author

@cmelchior the client reset seems unrelated. It's odd that the synchronization starts (the server even sends download messages), and then sends that error. maybe the server folks should have a look if you have the logs.

@jbreams
Copy link
Contributor

jbreams commented Jan 25, 2024

FWIW, i think there's a race in your test or the server. You get a client reset error on the main realm and then the sync client tries to start a new realm to do the client reset with

ost:9090/groups/65b2950ece9e509bb746900f/apps/65b29528ce9e509bb7469020/logs?co_id=65b2952fce9e509bb74692cf","shouldClientReset":false,"action":"ClientReset","backoffMaxDelaySec":300,"backoffIntervalSec":2,"backoffMultiplier":2}
2024-Jan-25 06:06:55,210 INFO: [REALM] [Core] Connection[1]: Session[1]: Received: ERROR "Synchronization no longer possible for client-side file: breaking schema change applied" (error_code=217, is_fatal=true, error_action=ClientReset)
2024-Jan-25 06:06:55,210 DEBUG: [REALM] [Core] Connection[1]: Session[1]: Suspended
2024-Jan-25 06:06:55,234 DEBUG: [REALM] [Core] Connection[1]: Session[1]: Sending: UNBIND
2024-Jan-25 06:06:55,236 INFO: [REALM] [Core] Connection[2]: Session[2]: Binding '/Users/cm/Realm/realm-kotlin/packages/test-sync/mongodb-realm/test-app-flx-pflva/65b2952cce9e509bb74690ce/default.realm.fresh' to ''
...
2024-Jan-25 06:06:55,254 INFO: [REALM] [Core] Connection[2]: Connected to app services with request id: "65b2952fce9e509bb7469383"
2024-Jan-25 06:06:55,254 DEBUG: [REALM] [Core] Connection[2]: Will emit a ping in 24029 milliseconds
2024-Jan-25 06:06:55,254 DEBUG: [REALM] [Core] Connection[2]: Session[2]: Sending: BIND(session_ident=2, need_client_file_ident=true, is_subserver=false, json_data="{"schemaVersion":0,"sessionReason":1}")
2024-Jan-25 06:06:55,273 DEBUG: [REALM] [Core] Connection[2]: Session[2]: Received: IDENT(client_file_ident=35, client_file_ident_salt=3873997105579513191)
2024-Jan-25 06:06:55,274 DEBUG: [REALM] [Core] Connection[2]: Session[2]: Sending: IDENT(client_file_ident=35, client_file_ident_salt=3873997105579513191, scan_server_version=0, scan_client_version=0, latest_server_version=0, latest_server_version_salt=0, query_version=0, query_size=2, query="{}")

Then that second realm that's going to do the client reset gets disconnected a few times.

2024-Jan-25 06:06:55,274 DEBUG: [REALM] [Core] Connection[2]: Session[2]: Sending: MARK(request_ident=1)
2024-Jan-25 06:06:55,285 INFO: [REALM] [Core] Connection[2]: Closing the websocket with error code=WebSocket: OK, message='', was_clean=true
2024-Jan-25 06:06:57,286 ERROR: [REALM] [Core] Reading failed: End of input
2024-Jan-25 06:06:57,286 INFO: [REALM] [Core] Connection[2]: Closing the websocket with error code=WebSocket: Read Error, message='End of input', was_clean=false
2024-Jan-25 06:06:57,286 INFO: [REALM] [Core] Connection[2]: Connection closed due to transient error: ConnectionClosed: End of input
2024-Jan-25 06:06:57,286 DEBUG: [REALM] [Core] Connection[2]: Allowing reconnection in 942 milliseconds
2024-Jan-25 06:06:58,229 INFO: [REALM] [Core] Connection[2]: Connecting to 'ws://localhost:9090/api/client/v2.0/app/test-app-flx-pflva/realm-sync'
...
2024-Jan-25 06:06:58,238 DEBUG: [REALM] [Core] Connection[2]: Session[2]: Sending: BIND(session_ident=2, need_client_file_ident=false, is_subserver=false, json_data="{"schemaVersion":0,"sessionReason":1}")
2024-Jan-25 06:06:58,238 DEBUG: [REALM] [Core] Connection[2]: Session[2]: Sending: IDENT(client_file_ident=35, client_file_ident_salt=3873997105579513191, scan_server_version=0, scan_client_version=0, latest_server_version=0, latest_server_version_salt=0, query_ver
sion=0, query_size=2, query="{}")
2024-Jan-25 06:06:58,238 DEBUG: [REALM] [Core] Connection[2]: Session[2]: Sending: MARK(request_ident=2)
2024-Jan-25 06:06:58,598 INFO: [REALM] [Core] Connection[2]: Closing the websocket with error code=WebSocket: OK, message='', was_clean=true
2024-Jan-25 06:07:00,599 ERROR: [REALM] [Core] Reading failed: End of input
2024-Jan-25 06:07:00,599 INFO: [REALM] [Core] Connection[2]: Closing the websocket with error code=WebSocket: Read Error, message='End of input', was_clean=false
2024-Jan-25 06:07:00,599 INFO: [REALM] [Core] Connection[2]: Connection closed due to transient error: ConnectionClosed: End of input
2024-Jan-25 06:07:00,599 DEBUG: [REALM] [Core] Connection[2]: Allowing reconnection in 1903 milliseconds
2024-Jan-25 06:07:02,503 INFO: [REALM] [Core] Connection[2]: Connecting to 'ws://localhost:9090/api/client/v2.0/app/test-app-flx-pflva/realm-sync'
...
2024-Jan-25 06:07:02,524 DEBUG: [REALM] [Core] Connection[2]: Session[2]: Sending: BIND(session_ident=2, need_client_file_ident=false, is_subserver=false, json_data="{"schemaVersion":0,"sessionReason":1}")
2024-Jan-25 06:07:02,525 DEBUG: [REALM] [Core] Connection[2]: Session[2]: Sending: IDENT(client_file_ident=35, client_file_ident_salt=3873997105579513191, scan_server_version=0, scan_client_version=0, latest_server_version=0, latest_server_version_salt=0, query_ver
sion=0, query_size=2, query="{}")
2024-Jan-25 06:07:02,525 DEBUG: [REALM] [Core] Connection[2]: Session[2]: Sending: MARK(request_ident=3)
2024-Jan-25 06:07:02,534 TRACE: [REALM] [Core] Connection[2]: Error message encoded as json: {"message":"Client tried to connect using flexible sync before initial sync is complete","isRecoveryModeDisabled":false,"tryAgain":true,"logURL":"http://localhost:9090/grou
ps/65b2950ece9e509bb746900f/apps/65b29528ce9e509bb7469020/logs?co_id=65b29536ce9e509bb7469430","shouldClientReset":false,"action":"Transient","backoffMaxDelaySec":300,"backoffIntervalSec":2,"backoffMultiplier":2}
2024-Jan-25 06:07:02,534 INFO: [REALM] [Core] Connection[2]: Session[2]: Received: ERROR "Client tried to connect using flexible sync before initial sync is complete" (error_code=229, is_fatal=false, error_action=Transient)
2024-Jan-25 06:07:02,534 DEBUG: [REALM] [Core] Connection[2]: Session[2]: Suspended
2024-Jan-25 06:07:02,534 DEBUG: [REALM] [Core] Connection[2]: Session[2]: Will attempt to resume session after 1737 milliseconds
2024-Jan-25 06:07:02,534 DEBUG: [REALM] [Core] Connection[2]: Session[2]: Sending: UNBIND
2024-Jan-25 06:07:02,534 INFO: [REALM] [Core] Connection[2]: Disconnected
2024-Jan-25 06:07:04,272 DEBUG: [REALM] [Core] Connection[2]: Session[2]: Resumed

And then you get a client reset error for the fresh realm being downloaded and that causes the whole test to fail...

2024-Jan-25 06:07:04,280 DEBUG: [REALM] [Core] Connection[2]: Session[2]: Sending: BIND(session_ident=2, need_client_file_ident=false, is_subserver=false, json_data="{"schemaVersion":0,"sessionReason":1}")
2024-Jan-25 06:07:04,280 DEBUG: [REALM] [Core] Connection[2]: Session[2]: Sending: IDENT(client_file_ident=35, client_file_ident_salt=3873997105579513191, scan_server_version=0, scan_client_version=0, latest_server_version=0, latest_server_version_salt=0, query_version=0, query_size=2, query="{}")
2024-Jan-25 06:07:04,280 DEBUG: [REALM] [Core] Connection[2]: Session[2]: Sending: MARK(request_ident=4)
2024-Jan-25 06:07:04,284 TRACE: [REALM] [Core] Connection[2]: Error message encoded as json: {"message":"Bad client file identifier (IDENT)","isRecoveryModeDisabled":false,"tryAgain":false,"logURL":"http://localhost:9090/groups/65b2950ece9e509bb746900f/apps/65b29528ce9e509bb7469020/logs?co_id=65b29538ce9e509bb74694b6","shouldClientReset":true,"action":"ClientReset","backoffMaxDelaySec":300,"backoffIntervalSec":2,"backoffMultiplier":2}
2024-Jan-25 06:07:04,284 INFO: [REALM] [Core] Connection[2]: Session[2]: Received: ERROR "Bad client file identifier (IDENT)" (error_code=208, is_fatal=true, error_action=ClientReset)
2024-Jan-25 06:07:04,284 DEBUG: [REALM] [Core] Connection[2]: Session[2]: Suspended
2024-Jan-25 06:07:04,293 DEBUG: [REALM] [Core] Connection[2]: Session[2]: Sending: UNBIND
2024-Jan-25 06:07:04,293 DEBUG: [REALM] [Core] Connection[1]: Session[1]: Initiating deactivation
2024-Jan-25 06:07:04,293 DEBUG: [REALM] [Core] Connection[1]: Session[1]: Deactivation completed
2024-Jan-25 06:07:04,293 ERROR: [REALM] [Sdk] Client reset: manual reset required for Realm in '/Users/cm/Realm/realm-kotlin/packages/test-sync/mongodb-realm/test-app-flx-pflva/65b2952cce9e509bb74690ce/default.realm'
2024-Jan-25 06:07:04,293 ERROR: [REALM] [Sdk] Client reset: manual reset required for Realm in '/Users/cm/Realm/realm-kotlin/packages/test-sync/mongodb-realm/test-app-flx-pflva/65b2952cce9e509bb74690ce/default.realm.fresh'
2024-Jan-25 06:07:04,293 DEBUG: [REALM] [Core] Connection[2]: Session[2]: Initiating deactivation
2024-Jan-25 06:07:04,293 DEBUG: [REALM] [Core] Connection[1]: Destroying connection object
2024-Jan-25 06:07:04,293 DEBUG: [REALM] [Core] Connection[2]: Session[2]: Deactivation completed
2024-Jan-25 06:07:04,293 INFO: [REALM] [Core] Connection[2]: Disconnected
2024-Jan-25 06:07:04,293 DEBUG: [REALM] [Core] Connection[2]: Destroying connection object
2024-Jan-25 06:07:04,294 TRACE: [REALM] [Sdk] WriterRealm[/Users/cm/Realm/realm-kotlin/packages/test-sync/mongodb-realm/test-app-flx-pflva/65b2952cce9e509bb74690ce/default.realm}] CLOSE-ACTIVE VersionId(version=12)
2024-Jan-25 06:07:04,294 TRACE: [REALM] [Sdk] WriterRealm[/Users/cm/Realm/realm-kotlin/packages/test-sync/mongodb-realm/test-app-flx-pflva/65b2952cce9e509bb74690ce/default.realm}] CLOSE-ACTIVE VersionId(version=8)
2024-Jan-25 06:07:04,294 INFO: [REALM] [Sdk] Realm closed: WriterRealm[/Users/cm/Realm/realm-kotlin/packages/test-sync/mongodb-realm/test-app-flx-pflva/65b2952cce9e509bb74690ce/default.realm}]
2024-Jan-25 06:07:04,294 TRACE: [REALM] [Sdk] NotifierRealm[/Users/cm/Realm/realm-kotlin/packages/test-sync/mongodb-realm/test-app-flx-pflva/65b2952cce9e509bb74690ce/default.realm}] CLOSE-ACTIVE VersionId(version=12)
2024-Jan-25 06:07:04,295 INFO: [REALM] [Sdk] Realm closed: NotifierRealm[/Users/cm/Realm/realm-kotlin/packages/test-sync/mongodb-realm/test-app-flx-pflva/65b2952cce9e509bb74690ce/default.realm}]
2024-Jan-25 06:07:04,295 INFO: [REALM] [Sdk] Realm closed: RealmImpl[/Users/cm/Realm/realm-kotlin/packages/test-sync/mongodb-realm/test-app-flx-pflva/65b2952cce9e509bb74690ce/default.realm}]

Are you sure the test is waiting for sync to be terminated/restarted before trying to client reset?

@danieltabacaru
Copy link
Collaborator Author

danieltabacaru commented Jan 25, 2024

@jbreams Interesting. This is exactly what @cmelchior is seeing. It could be I am triggering the client reset too early (although I would have expected a different error). I'll take a look (the test is pretty much a copy of a different one). LE: Christian's test has nothing to do with my test.

@jbreams
Copy link
Contributor

jbreams commented Jan 25, 2024

Yeah, those log lines were pulled from @cmelchior's log file he attached in this ticket.

@cmelchior
Copy link
Contributor

cmelchior commented Jan 26, 2024

I reran the test and pulled both server and local logs. Now I saw a new error: io.realm.kotlin.mongodb.exceptions.SyncException: [Logic][InvalidSchemaChange(2019)] Invalid schema change (UPLOAD): error finalizing schema changes: failed to update 13 app schemas with new property. A concurrent change to the app has been detected and the client will need to try again. If this error occurs consistently, the app may be in a corrupted state due to conflicting concurrent development mode schema changes, and you may need to terminate and restart sync before trying again Logs: http://localhost:9090/groups/65b35754c6e66dac4d8a243b/apps/65b35767c6e66dac4d8a244c/logs?co_id=65b3576ec6e66dac4d8a2729.

Logs attached

log-crash.txt
baas_server.log

EDIT: I deleted my local baas install and rebuilt it from scratch, then the test succeded. Will try to experiment with the test setup

@danieltabacaru
Copy link
Collaborator Author

@tkaye407 / @kmorkos can you please take a look at the server logs?

@cmelchior
Copy link
Contributor

Same test, from a fresh BAAS but with platform networkin enabled. This one just ends up hanging with no error
baas_server.log.zip
log-crash.txt

@kmorkos
Copy link
Contributor

kmorkos commented Jan 26, 2024

Same test, from a fresh BAAS but with platform networkin enabled. This one just ends up hanging with no error baas_server.log.zip log-crash.txt

I don't see any requests to the server after the request to revoke the session which succeeds, so this all looks fine from the server's perspective.

@danieltabacaru
Copy link
Collaborator Author

@kmorkos Can you have a look at the other logs where Invalid schema change (UPLOAD): error finalizing schema changes: failed to update 13 app schemas with new property is received?

@kmorkos
Copy link
Contributor

kmorkos commented Jan 26, 2024

@kmorkos Can you have a look at the other logs where Invalid schema change (UPLOAD): error finalizing schema changes: failed to update 13 app schemas with new property is received?

Those are transient. It just means that the test was concurrently making dev mode additive schema changes from a device while also doing other stuff with the app via the API. We could make that retry internally, but those are not caused by the changes here so should not need to block this change IMO.

@cmelchior
Copy link
Contributor

Hmm, I only saw the error once so maybe that makes sense.

I do not want to hold up this PR for too long, but something has definitely broken the Kotlin test suite after we updated to Core 13.25.0 and it is surfacing as upload listeners not being triggered which ends up timing out our tests. But I should probably just create a new issue for that.

@jbreams
Copy link
Contributor

jbreams commented Jan 26, 2024

@cmelchior , which test is failing specifically? Can you link it here?

@cmelchior
Copy link
Contributor

We are seeing quite a lot of different ones, but the ones I have been focusing on is around asymmetric sync.

Specifically these to: https://github.com/realm/realm-kotlin/blob/ec7189244d7d1c4f7c5544fdfc0a00a4cf9ceef4/packages/test-sync/src/commonTest/kotlin/io/realm/kotlin/test/mongodb/common/AsymmetricSyncTests.kt#L133 and https://github.com/realm/realm-kotlin/blob/ec7189244d7d1c4f7c5544fdfc0a00a4cf9ceef4/packages/test-sync/src/commonTest/kotlin/io/realm/kotlin/test/mongodb/common/AsymmetricSyncTests.kt#L307

There I managed to reliably reproduce a error by starting a fresh BAAS, then running the single test with Platform networking enabled.

Without platform networking these tests seem to work, but then we have others failing. So I'm not sure it is related to this.

Our tests will detect if the test-app they need is present and if not, use the Admin API to create it. I suspect that this is somehow causing issues as adding artificial delays to our tests after creating the app seems to make some of the errors go away.

The weird thing is that all of this worked prior to Core 13.25.0, so not sure what changed there.

@cmelchior
Copy link
Contributor

I created #7297 where I tried to describe it a bit more. I'll do some more testing and add any new logs to that issue instead.

Copy link
Contributor

@michael-wb michael-wb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes and the test look good

@danieltabacaru danieltabacaru merged commit 5533505 into master Jan 29, 2024
38 checks passed
@danieltabacaru danieltabacaru deleted the dt/client_reset_recovery branch January 29, 2024 23:15
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 21, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Empty changesets are not updated accordingly during the recovery phase of a client reset
6 participants