MongoDB Realm Cloud sync fails with "Bad sync process received (6)" error #3508

duncangroenewald · 2021-01-16T21:22:50Z

Goals

Load data into a synced MongoDB Realm App

Expected Results

Script loads data into the synced MongoDB Realm App and data is synced with MongoDB Atlas and can be downloaded by another Realm client application

Actual Results

Script runs and data load completes (into the local realm file) but sync process fails - see errors below.

Restarting the script in query mode (does not load data but queries the record count for each object type) appears to resume the sync process.

The sync performance also seems much faster than it was when running the same script yesterday, so I am not sure whether some server-side improvements have been made recently as well. The sync completed successfully after the script was restarted once - previous attempts have seen multiple "Bad sync progress received" failures and required multiple restarts of the client script.

I also tested with a new client application and this appears to now be working - the client app successfully downloads the synced realm and reports the correct record counts.

Jan 17 2021 08:13:03 : SYNC Connection[1]: Session[1]: Upload compaction: original size = 594357, compacted size = 594357
Jan 17 2021 08:13:04 : SYNC Connection[1]: Download message compression: is_body_compressed = 0, compressed_body_size=0, uncompressed_body_size=0
Jan 17 2021 08:13:04 : SYNC Connection[1]: Session[1]: Received: DOWNLOAD(download_server_version=42, download_client_version=45, latest_server_version=42, latest_server_version_salt=9174491270143759137, upload_client_version=45, upload_server_version=2, downloadable_bytes=0, num_changesets=0, ...)
Jan 17 2021 08:13:04 : SYNC Using already open Realm file: /Users/duncangroenewald/Development/RealmMigrationMongoDB/mongodb-realm/makespace-development-flekz/600142ae1e85d603d186062f/s_default.realm
Jan 17 2021 08:13:04 : SYNC Using already open Realm file: /Users/duncangroenewald/Development/RealmMigrationMongoDB/mongodb-realm/makespace-development-flekz/600142ae1e85d603d186062f/s_default.realm
UPLOAD: [35952426] 83090339
Jan 17 2021 08:13:04 : SYNC Connection[1]: Session[1]: Progress handler called, downloaded = 4940, downloadable(total) = 4940, uploaded = 35952426, uploadable = 83090339, reliable_download_progress = 1, snapshot version = 170
Jan 17 2021 08:13:04 : SYNC Using already open Realm file: /Users/duncangroenewald/Development/RealmMigrationMongoDB/mongodb-realm/makespace-development-flekz/600142ae1e85d603d186062f/s_default.realm
Jan 17 2021 08:13:04 : SYNC Connection[1]: Session[1]: Sending: UPLOAD(progress_client_version=141, progress_server_version=7, locked_server_version=42, num_changesets=1)
Jan 17 2021 08:13:04 : SYNC Connection[1]: Session[1]: Fetching changeset for upload (client_version=141, server_version=7, changeset_size=355116, origin_timestamp=190761040945, origin_file_ident=0)
Jan 17 2021 08:13:04 : SYNC Connection[1]: Session[1]: Changeset: 3F 00 0A 55 73 65 72 43 68 61 6E 67 65 3F 01 07 41 70 70 55 73 65 72 3F 02 24 42 33 38 38 33 43 43 37 2D 44 32 35 46 2D 34 32 32 45 2D 41 44 46 34 2D 33 37 44 44 34 33 
...
Jan 17 2021 08:13:04 : SYNC Connection[1]: Session[1]: Upload compaction: original size = 355116, compacted size = 355116
Jan 17 2021 08:13:04 : SYNC Using already open Realm file: /Users/duncangroenewald/Development/RealmMigrationMongoDB/mongodb-realm/makespace-development-flekz/600142ae1e85d603d186062f/s_default.realm
Jan 17 2021 08:13:04 : SYNC Connection[1]: Session[1]: Sending: UPLOAD(progress_client_version=170, progress_server_version=42, locked_server_version=42, num_changesets=2)
Jan 17 2021 08:13:04 : SYNC Connection[1]: Session[1]: Fetching changeset for upload (client_version=142, server_version=7, changeset_size=3180, origin_timestamp=190761040955, origin_file_ident=0)
Jan 17 2021 08:13:04 : SYNC Connection[1]: Session[1]: Changeset: 3F 00 09 55 73 65 72 47 72 6F 75 70 3F 01 24 37 38 44 30 44 33 37 33 2D 45 35 36 41 2D 34 30 38 44 2D 39 32 38 30 2D 34 30 38 34 38 33 38 36 37 34 30 45 3F 02 05 75 73 
... 
Jan 17 2021 08:13:04 : SYNC Connection[1]: Session[1]: Upload compaction: original size = 3180, compacted size = 3180
Jan 17 2021 08:13:04 : SYNC Connection[1]: Session[1]: Fetching changeset for upload (client_version=143, server_version=7, changeset_size=20302, origin_timestamp=190761040973, origin_file_ident=0)
Jan 17 2021 08:13:04 : SYNC Connection[1]: Session[1]: Changeset: 3F 00 06 56 65 6E 64 6F 72 3F 01 24 37 39 34 30 46 43 
...
Jan 17 2021 08:13:04 : SYNC Connection[1]: Session[1]: Upload compaction: original size = 20302, compacted size = 20302
Jan 17 2021 08:13:10 : SYNC Connection[1]: Download message compression: is_body_compressed = 0, compressed_body_size=0, uncompressed_body_size=0
Jan 17 2021 08:13:10 : SYNC Connection[1]: Session[1]: Received: DOWNLOAD(download_server_version=43, download_client_version=44, latest_server_version=43, latest_server_version_salt=6151725616758194912, upload_client_version=45, upload_server_version=2, downloadable_bytes=0, num_changesets=0, ...)
Jan 17 2021 08:13:10 : SYNC Connection[1]: Session[1]: Bad sync progress received (6)
Jan 17 2021 08:13:10 : SYNC Connection[1]: Connection closed due to error
Jan 17 2021 08:13:10 : SYNC Connection[1]: Allowing reconnection in 2754499 milliseconds
Jan 17 2021 08:13:10 : SYNC Connection[1]: Session[1]: Initiating deactivation
Jan 17 2021 08:13:10 : SYNC Connection[1]: Session[1]: Deactivation completed
Jan 17 2021 08:13:10 : SYNC Closing Realm file: /Users/duncangroenewald/Development/RealmMigrationMongoDB/mongodb-realm/makespace-development-flekz/600142ae1e85d603d186062f/s_default.realm
Jan 17 2021 08:13:10 : SYNC Connection[1]: Destroying connection object
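
One way to look for the anomaly in a log like the one above is to compare the progress counters in consecutive DOWNLOAD messages. The sketch below is only an illustration: it assumes these counters should never decrease within a session (an assumption, not a documented rule), and parseDownload and findRegressions are hypothetical helper names, not part of the Realm SDK.

```javascript
// Assumption: these counters are expected to be non-decreasing within a session.
const MONOTONIC = [
  "download_server_version",
  "download_client_version",
  "latest_server_version",
  "upload_client_version",
  "upload_server_version",
];

// Extracts the counters of interest from a "Received: DOWNLOAD(...)" log line.
function parseDownload(line) {
  const match = line.match(/Received: DOWNLOAD\((.*)\)/);
  if (!match) return null;
  const fields = {};
  for (const [, key, value] of match[1].matchAll(/(\w+)=(\d+)/g)) {
    if (MONOTONIC.includes(key)) fields[key] = Number(value);
  }
  return fields;
}

// Returns every counter that went backwards between consecutive DOWNLOAD messages.
function findRegressions(lines) {
  const regressions = [];
  let previous = null;
  for (const line of lines) {
    const current = parseDownload(line);
    if (!current) continue; // not a DOWNLOAD message
    if (previous) {
      for (const key of MONOTONIC) {
        if (key in previous && key in current && current[key] < previous[key]) {
          regressions.push({ field: key, from: previous[key], to: current[key] });
        }
      }
    }
    previous = current;
  }
  return regressions;
}

// The two DOWNLOAD messages from the log above (08:13:04 and 08:13:10).
const sampleLines = [
  "Session[1]: Received: DOWNLOAD(download_server_version=42, download_client_version=45, latest_server_version=42, latest_server_version_salt=9174491270143759137, upload_client_version=45, upload_server_version=2, downloadable_bytes=0, num_changesets=0, ...)",
  "Session[1]: Received: DOWNLOAD(download_server_version=43, download_client_version=44, latest_server_version=43, latest_server_version_salt=6151725616758194912, upload_client_version=45, upload_server_version=2, downloadable_bytes=0, num_changesets=0, ...)",
];

// Reports download_client_version going from 45 to 44.
console.log(findRegressions(sampleLines));
```

On these two lines the only counter that moves backwards is download_client_version (45, then 44), immediately before the "Bad sync progress received (6)" line. Whether that is the exact condition behind code 6 is not confirmed anywhere in this log.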

There appear to be no errors in the MongoDB Realm App logs

and the last write details

Steps to Reproduce

See issue #3503 (comment)

Code Sample

See issue #3503 (comment)

Note that I have raised a support case; the script and sample data are attached to it.

Version of Realm and Tooling

Realm JS SDK Version: 10.1.3
Node or React Native: 12.20.0
Client OS & Version: macOS11 (M1) rosetta
Which debugger for React Native: None


duncangroenewald · 2021-01-19T01:09:31Z

FYI - I was unable to initialise a new client yesterday as it was failing to open the realm and start the initial sync (download data to the client). The client was sitting in an active state this morning and suddenly started working. Deleting the client and restarting it worked in subsequent attempts so perhaps there was a cloud sync service problem that was fixed today.

I am going to delete the Realm Cloud App and rerun the load client to see if anything has changed during the sync of the client loaded data (i.e. upload to cloud service).

fronck · 2021-01-22T10:53:37Z

@duncangroenewald Thanks for bringing this to our attention. I have reached out to our backend/sync teams to dig into this.
In the meanwhile, please let us know if you are able to resolve the issue using the steps you outlined.

Would you have any logs from the cloud side from the time that you saw the error?

duncangroenewald · 2021-01-22T19:35:05Z

@fronck - every time I run the test I seem to get a different result. As far as I can recall there was no error showing in the log but I do see a lot of errors relating to "error integrating changeset ....".

There are a lot of these errors:

Jan 23 2021 06:30:22 : SYNC Connection[1]: Connection closed due to error
Jan 23 2021 06:30:22 : SYNC Connection[1]: Connected to endpoint '52.64.157.195:443' (from '10.0.1.171:54993')

with server side logs looking like this:

So far I have not had much success with reliable syncing - sync seems to crash before the client has finished uploading all changesets. I am assuming Sync is closer to an alpha state than a beta state.

duncangroenewald · 2021-01-22T19:57:11Z

Here is another set of logs from running the test this morning.

Jan 23 2021 06:47:15 : UPLOAD: [9945943] 83729363
Jan 23 2021 06:47:20 : updating links in: UserGroup
Jan 23 2021 06:47:20 : UPLOAD: [9945943] 83732543
Jan 23 2021 06:47:20 : UPLOAD: [9945943] 83732543
Jan 23 2021 06:47:25 : updating links in: Vendor
Jan 23 2021 06:47:25 : UPLOAD: [9945943] 83752845
Jan 23 2021 06:47:25 : UPLOAD: [9945943] 83752845
Jan 23 2021 06:47:30 : Links update completed
Jan 23 2021 06:47:30 : Enter to quit
> Jan 23 2021 06:48:20 : SYNC Connection[1]: Connection closed due to error
Jan 23 2021 06:48:20 : SYNC Connection[1]: Connected to endpoint '52.64.157.195:443' (from '10.0.1.171:55121')
Jan 23 2021 06:48:55 : UPLOAD: [11368974] 83752845
Jan 23 2021 06:50:26 : SYNC Connection[1]: Connection closed due to error
Jan 23 2021 06:50:26 : SYNC Connection[1]: Connected to endpoint '52.64.157.195:443' (from '10.0.1.171:55159')
Jan 23 2021 06:51:28 : UPLOAD: [11368974] 83752845
Jan 23 2021 06:53:14 : SYNC Connection[1]: Connection closed due to error
Jan 23 2021 06:53:14 : SYNC Connection[1]: Connected to endpoint '52.64.157.195:443' (from '10.0.1.171:55332')
Jan 23 2021 06:54:20 : UPLOAD: [13165862] 83752845
Jan 23 2021 06:54:31 : SYNC Connection[1]: Session[1]: Bad sync progress received (6)
Jan 23 2021 06:54:31 : SYNC Connection[1]: Connection closed due to error

The JavaScript client stalls at this point - or at least, no more console output is generated by the sync progress report callback.

Server logs:

But it seems the client is still syncing data to the server even though the client no longer creates any log output to indicate the upload status. And some minutes later there is a disconnect.

jbreams · 2021-01-22T21:09:21Z

Hello @duncangroenewald, thank you for these logs and updates. The Bad sync progress received errors indicate that the client has received a download message where the sync session's point in history (how much it has uploaded and downloaded to and from the sync server) appears to be corrupted. I've been going through the backend debug logs to try to correlate the logs you've provided with behavior on the server. A few things stand out:

A number of writes that take a really long time to complete - on the order of minutes - and looking at some of the backend debugging logs there appear to be a number of retries due to write conflicts.
Your app is currently running on the M0 shared tier of Atlas, so it's possible you have a noisy neighbor in the shared tier that is exacerbating slow-downs and write conflicts.
There appear to be multiple connections to the same realm partition at the same time - and it's possible they are conflicting with each other and potentially corrupting each other's sync state. I found 11 connections to the same fileIdent between 19:27:50 and 19:53:40 (UTC) - those seem to match up to the errors in the log snippet from your last update.

At this point I think you're hitting a real bug somewhere on our end - I have a working theory that there's a bad interaction between the sync client's keepalive PING/PONG message timeouts and long-running upload integration attempts. Unfortunately the sync logs are not at a high enough verbosity to see exactly what's going on on the client side.

Could you set the sync client's log level to "all", rerun your migration script, and upload the whole output to this secure uploader link? You can do this by adding Realm.App.Sync.setLogLevel(app, "all"); right after you call Realm.App.Sync.setLogger(app, (level, message) => console.log("SYNC: [" + level + "] " + message));.
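
For reference, the two calls mentioned above can be wrapped in a small helper. This is a minimal sketch, not the migration script itself: enableVerboseSyncLogging and its sink parameter are hypothetical names, and the Realm module is passed in as an argument so the snippet runs without the SDK installed.

```javascript
// Hypothetical helper. `realmModule` is whatever require("realm") returns,
// `app` is the Realm.App instance, and `sink` receives each formatted line.
function enableVerboseSyncLogging(realmModule, app, sink = console.log) {
  // Install the logger first, as described in the comment above.
  realmModule.App.Sync.setLogger(app, (level, message) =>
    sink("SYNC: [" + level + "] " + message)
  );
  // Then raise the sync client's verbosity to the most detailed level.
  realmModule.App.Sync.setLogLevel(app, "all");
}

// Usage in the real script (the app id is a placeholder):
//   const Realm = require("realm");
//   const app = new Realm.App({ id: "<your-app-id>" });
//   enableVerboseSyncLogging(Realm, app);
```

Passing a sink that appends to a file instead of console.log is one way to capture the whole output for upload.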

Thanks for your patience on this!

duncangroenewald · 2021-01-22T23:30:27Z

@jbreams - I have uploaded the js script, a RealmSwift sample app and the source data file as part of a support request so you could set it up and run it yourself. In the meantime I will run it again with log level to all and save the output to a file.

Note that I have to terminate sync and delete the Atlas database and then re-enable sync on the Realm App to clear out the data.

I can try setting up another tier of Atlas but I did try that at one stage and got the same result if I remember. In a few months of trying I think I have never managed to get a complete sync without having to restart the client after these "Bad sync state" errors.

Ideally you should try running the script on your side to see if you can replicate the issues. Feel free to use my Atlas account - just let me know so I don't try using it at the same time.

For now I will rerun and save the log output from the client for you.

jbreams · 2021-01-23T01:31:37Z

@duncangroenewald, I've actually run the sample JS script you provided with the source data file several times without seeing this error before asking you for more log output. What geographical region are you running your script and app in? When I re-ran your script to try to troubleshoot this problem, I ran it against an M0 shared tier Atlas cluster in the us-east-1 region like yours, but I'm pretty close to there and have pretty low latency - so maybe that's a change I should make to get a more accurate reproduction. Also, when you set up your Realm app, did you select a global or local deployment?

Either way, if this error happens very consistently for you, getting a log with trace-level log output would be very helpful to correlate the problems you're seeing with the debug logging we have on the backend.

duncangroenewald · 2021-01-23T07:59:18Z

OK, I just uploaded the log file to the link above.

Lots of server errors like this

"Error:

Failed to integrate download after attempting the maximum number of tries: retryable error while committing integrated changesets: (WriteConflict) WriteConflict error: this operation conflicted with another operation. Please retry your operation or multi-document transaction.
Source:

Error syncing MongoDB write
Logs:
[
"Realm partition: "default"",
"timed out after 10 attempts to integrate the downloaded changesets (1m44.758195785s)",
"retryable error while committing integrated changesets: (WriteConflict) WriteConflict error: this operation conflicted with another operation. Please retry your operation or multi-document transaction."
]"

I am running on macOS 11 M1 Rosetta so will do the same test on Intel just to be sure - I did in the past and got similar errors, if not the same ones but will retest to be sure.

duncangroenewald · 2021-01-23T08:01:13Z

Oh and I am based in Melbourne Australia.

duncangroenewald · 2021-01-23T08:56:22Z

Same errors when running on macOS 11 Intel.

duncangroenewald · 2021-01-23T19:48:55Z

There seem to be a lot of server side errors related to syncing with MongoDB long after the client has been terminated.

jbreams · 2021-01-24T00:08:50Z

@duncangroenewald, thank you for uploading your logs and giving some details about where you are geographically; they definitely filled in a lot of holes in what's going on here. I think your current configuration is essentially the worst-case scenario in terms of latency. Because your Realm app is deployed "globally" - this was an option that was selected when the Realm app was first created - your database server and app server are as physically far apart as it is possible to be on the earth - with your Atlas cluster in Northern VA in the US and your app server in Sydney.

Integrating uploads requires a bunch of database round trips to check for conflicts and then do conflict resolution. Looking at the logs you uploaded, the fastest round-trip time from your migrate script to the sync server is ~24ms (which implies you're physically close to the sync server), but as soon as the sync server has to do any db ops the round-trip time goes up to ~5000ms at a minimum. This very high latency and long upload processing time is interacting with a bug and some bad assumptions on the client and server about how to deal with timeouts.

The sequence of events is: the migrate script uploads a whole bunch of changes to integrate and also sends a PING message to the server - that takes a very long time since it involves a ton of db round trips around the world. In the meantime, the client times out waiting for a PONG response from the server because the server has been spending all its time sending db ops around the planet. So the client decides the server is down and opens a new connection and starts uploading again. However the server is still working through the backlog of messages that were sent from the original connection, and to compound things the new messages from the new connection actually have conflicts with the messages from the original connection so that each uploaded set of data must be retried a number of times before succeeding. This cycle continues until an upload message from the original connection succeeds and is out-of-order with respect to an upload message from the new connection and that corrupts the state of the connection which causes the Bad sync progress received error.
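
The timing side of this sequence can be illustrated with a toy calculation. Only the ~5000 ms database round trip comes from this thread. The keepalive timeout, the number of round trips per changeset, the queue length and the same-region latency are all made-up values for illustration. The model also assumes the PONG is sent only after the queued uploads have been integrated.

```javascript
// Toy model of the keepalive race. Every number passed in is an illustrative
// assumption; none is a real Realm client or server setting.
function pongDelayMs({ dbRoundTripMs, dbRoundTripsPerChangeset, queuedChangesets }) {
  // Assumed: the server answers the PING only after integrating the uploads
  // queued ahead of it, and each changeset costs several database round trips.
  return dbRoundTripMs * dbRoundTripsPerChangeset * queuedChangesets;
}

// True when the client would stop waiting for the PONG and reconnect.
function clientGivesUp(scenario, pongTimeoutMs) {
  return pongDelayMs(scenario) > pongTimeoutMs;
}

const pongTimeoutMs = 60000; // hypothetical keepalive timeout
const workload = { dbRoundTripsPerChangeset: 5, queuedChangesets: 3 }; // hypothetical

const globalApp = { ...workload, dbRoundTripMs: 5000 }; // ~5000 ms, from the thread
const localApp = { ...workload, dbRoundTripMs: 20 }; // assumed same-region latency

console.log(pongDelayMs(globalApp), clientGivesUp(globalApp, pongTimeoutMs)); // 75000 true
console.log(pongDelayMs(localApp), clientGivesUp(localApp, pongTimeoutMs)); // 300 false
```

With these made-up numbers the far-apart deployment keeps the PING waiting longer than the timeout, so the client reconnects while the first connection's uploads are still being processed. The same workload with a nearby database finishes well inside it.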

The errors about "timed out after 10 attempts to integrate downloaded changesets" and "NoSuchTransaction" also seem to stem from very high latency between the app server and Atlas cluster.

This is definitely a bug in both the sync server and sync client that we'll work to address ASAP. I'm also proposing internally that we add some warnings or totally disallow this specific configuration since it's definitely a very easy foot gun. In the meantime, I think if you deleted and re-created your Realm app as a Local app instead of a Global app (so that the Atlas cluster and app server have very low latency between each other), a lot of your problems might just go away. You'd have high latency between you and the app server, but each operation on the app server should be able to complete much more quickly. Alternatively you could launch an Atlas cluster in the Sydney AWS region, but then it wouldn't be on the free tier.

Hopefully this is clear - let me know if you have any more questions, and thanks again for your patience.

duncangroenewald · 2021-01-24T01:56:02Z

That makes sense. I will try configuring the local app and see how that goes and also test a Sydney region cluster.

Just did a quick test with a local app and the initial load completes in about 30 seconds so that's looking promising.

Some warnings on the global deployment might be a good idea.

duncangroenewald · 2021-01-27T06:50:47Z

There still seem to be some issues with server errors showing up after connecting the client. The client appears to be working but further testing is required to tell if these errors have any impact on the client functions or data.

Details are attached to the support request.

kneth · 2021-09-01T08:42:21Z

A fix might be realm/realm-core#4878

sync-by-unito · 2021-09-13T14:22:13Z

➤ Jonathan Reams commented:

I think this may have fallen through the cracks of scheduling and got conflated with several other issues. Is this still an issue? Do we need to do any further work here?

bmunkholm · 2021-09-14T07:33:54Z

As there was already a support case, I'm assuming it was handled through that channel and therefore closing this @duncangroenewald. In case of further issues let us know and we will reopen.

realm-probot bot added the O-Community label Jan 16, 2021

jbreams self-assigned this Jan 25, 2021

bmunkholm closed this as completed Sep 14, 2021

github-actions bot locked as resolved and limited conversation to collaborators Mar 16, 2024
