MongoDB Realm Cloud sync fails with "Bad sync process received (6)" error #3508
Comments
FYI - I was unable to initialise a new client yesterday as it was failing to open the realm and start the initial sync (download data to the client). The client was sitting in an active state this morning and suddenly started working. Deleting the client and restarting it worked in subsequent attempts so perhaps there was a cloud sync service problem that was fixed today. I am going to delete the Realm Cloud App and rerun the load client to see if anything has changed during the sync of the client loaded data (i.e. upload to cloud service). |
@duncangroenewald Thanks for bringing this to our attention. I have reached out to our backend/sync teams to dig into this. Would you have any logs from the cloud side from the time that you saw the error? |
@fronck - every time I run the test I seem to get a different result. As far as I can recall there was no error showing in the log but I do see a lot of errors relating to "error integrating changeset ....". There are a lot of these errors:
with server side logs looking like this: So far I have not had much success with reliable syncing - sync seems to crash before the client has completed uploading all change sets. I am assuming Sync is more like in Alpha state than in Beta state. |
Here is another set of logs from running the test this morning.
The JavaScript client appears to stall at this point, or at least no more console output is generated by the sync progress report callback. The client still seems to be syncing data to the server, even though it no longer produces any log output to indicate the upload status. Some minutes later there is a disconnect. |
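To tell a genuine stall apart from a progress callback that has merely gone quiet, one option is to record when the callback last fired and how much data was still outstanding. The following is a minimal sketch, not the script from this issue: `createUploadProgressTracker` and its field names are hypothetical, and it assumes the callback receives `(transferred, transferable)` as with Realm JS's `syncSession.addProgressNotification`.

```javascript
// Hypothetical helper: remembers the last progress event so a separate timer
// can decide whether the upload has gone silent while data is still pending.
function createUploadProgressTracker(now = Date.now) {
  let lastEventAt = null;
  let transferredBytes = 0;
  let transferableBytes = 0;

  // Intended to be passed as the progress callback, for example:
  //   realm.syncSession.addProgressNotification(
  //     "upload", "reportIndefinitely", tracker.onProgress);
  function onProgress(transferred, transferable) {
    lastEventAt = now();
    transferredBytes = transferred;
    transferableBytes = transferable;
  }

  // Reports how long the callback has been silent and whether that silence
  // counts as a stall (silent for at least stallAfterMs with bytes pending).
  function status(stallAfterMs) {
    const pendingBytes = transferableBytes - transferredBytes;
    const silentForMs = lastEventAt === null ? null : now() - lastEventAt;
    const stalled =
      pendingBytes > 0 && silentForMs !== null && silentForMs >= stallAfterMs;
    return { pendingBytes, silentForMs, stalled };
  }

  return { onProgress, status };
}
```

Calling `tracker.status(60000)` from a `setInterval` would then log a warning whenever a minute passes without a progress event while bytes remain to upload; the 60-second threshold is an arbitrary choice.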
Hello @duncangroenewald, thank you for these logs and updates. The
At this point I think you're hitting a real bug somewhere on our end - I have a working theory that there's a bad interaction between the sync client's keepalive PING/PONG message timeouts and long-running upload integration attempts. Unfortunately the sync logs are not at a high enough verbosity to see exactly what's going on on the client side. Could you set the sync client's log level to "all", rerun your migration script, and upload the whole output to this secure uploader link? You can do this by adding Thanks for your patience on this! |
@jbreams - I have uploaded the js script, a RealmSwift sample app and the source data file as part of a support request so you could set it up and run it yourself. In the meantime I will run it again with log level to all and save the output to a file. Note that I have to terminate sync and delete the Atlas database and then re-enable sync on the Realm App to clear out the data. I can try setting up another tier of Atlas but I did try that at one stage and got the same result if I remember. In a few months of trying I think I have never managed to get a complete sync without having to restart the client after these "Bad sync state" errors. Ideally you should try running the script on your side to see if you can replicate the issues. Feel free to use my Atlas account - just let me know so I don't try using it at the same time. For now I will rerun and save the log output from the client for you. |
@duncangroenewald, I've actually run the sample JS script you provided with the source data file several times without seeing this error before asking you for more log output. What geographical region are you running your script and app in? When I re-ran your script to try to troubleshoot this problem, I ran it against an M0 shared tier Atlas cluster in the us-east-1 region like yours, but I'm pretty close to there and have pretty low latency - so maybe that's a change I should make to get a more accurate reproduction. Also, when you set up your Realm app, did you select a global or local deployment? Either way, if this error happens very consistently for you, getting a log with trace-level log output would be very helpful to correlate the problems you're seeing with the debug logging we have on the backend. |
OK, I just uploaded the log file to the link above. Lots of server errors like this "Error: Failed to integrate download after attempting the maximum number of tries: retryable error while committing integrated changesets: (WriteConflict) WriteConflict error: this operation conflicted with another operation. Please retry your operation or multi-document transaction. Error syncing MongoDB write I am running on macOS 11 M1 Rosetta so will do the same test on Intel just to be sure - I did in the past and got similar errors, if not the same ones but will retest to be sure. |
Oh and I am based in Melbourne Australia. |
Same errors when running on macOS 11 Intel. |
@duncangroenewald, thank you for uploading your logs and giving some details about where you are geographically; they definitely filled in a lot of holes in what's going on here. I think your current configuration is basically the worst-case scenario in terms of latency. Because your Realm app is deployed "globally" (an option that was selected when the Realm app was first created), your database server and app server are about as physically far apart as it is possible to be on the earth, with your Atlas cluster in Northern VA in the US and your app server in Sydney. Integrating uploads requires a bunch of database round trips to check for conflicts and then do conflict resolution. Looking at the logs you uploaded, the fastest round-trip time from your migrate script to the sync server is ~24ms (which implies you're physically close to the sync server), but as soon as the sync server has to do any db ops the round-trip time goes up to ~5000ms at a minimum. This very high latency and long upload processing time is interacting with a bug and some bad assumptions on the client and server about how to deal with timeouts. The sequence of events is: the migrate script uploads a whole bunch of changes to integrate and also sends a PING message to the server. Integrating those changes takes a very long time since it involves a ton of db round trips around the world. In the meantime, the client times out waiting for a PONG response from the server because the server has been spending all its time sending db ops around the planet. So the client decides the server is down, opens a new connection, and starts uploading again. However, the server is still working through the backlog of messages that were sent from the original connection, and to compound things the new messages from the new connection actually conflict with the messages from the original connection, so each uploaded set of data must be retried a number of times before succeeding.
This cycle continues until an upload message from the original connection succeeds and is out-of-order with respect to an upload message from the new connection and that corrupts the state of the connection which causes the The errors about "timed out after 10 attempts to integrate downloaded changesets" and "NoSuchTransaction" also seems to stem from very high latency between the app server and atlas cluster. This is definitely a bug in both the sync server and sync client that we'll work to address ASAP. I'm also proposing internally that we add some warnings or totally disallow this specific configuration since it's definitely a very easy foot gun. In the meantime, I think if you deleted and re-created your Realm app as a Local app instead of a Global app (so that the atlas cluster and app server have very low latency between each other), a lot of your problems might just go away. You'd have high latency between you and the app server, but each operation on the app server should be able to complete much more quickly. Alternatively you could launch an Atlas cluster in the Sydney AWS region, but then it wouldn't be on the free tier. Hopefully this is clear - let me know if you have any more questions, and thanks again for your patience. |
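The timeout interaction described above can be illustrated with some rough arithmetic. This is a simplified sketch, not the sync client's actual logic: it assumes the PONG is only sent once every queued upload has been integrated, and the function name `pongWait`, the queue length, the keepalive timeout, and the "local" figures are all hypothetical. Only the ~24ms and ~5000ms round-trip times come from the logs discussed in this thread.

```javascript
// Simplified model: the client sends a batch of uploads plus a PING, and the
// PONG comes back only after the server has worked through the uploads.
function pongWait({ clientRoundTripMs, queuedUploads, perUploadMs, pongTimeoutMs }) {
  const waitMs = queuedUploads * perUploadMs + clientRoundTripMs;
  return { waitMs, timedOut: waitMs > pongTimeoutMs };
}

// Global deployment: client close to the app server, database far away.
const globalApp = pongWait({
  clientRoundTripMs: 24,  // from the logs in this thread
  queuedUploads: 30,      // hypothetical backlog size
  perUploadMs: 5000,      // from the logs: minimum once db ops are involved
  pongTimeoutMs: 120000,  // hypothetical keepalive timeout
});

// Local deployment: client far from the app server, database next to it.
const localApp = pongWait({
  clientRoundTripMs: 250, // hypothetical Melbourne to US latency
  queuedUploads: 30,      // same hypothetical backlog
  perUploadMs: 100,       // hypothetical low-latency integration time
  pongTimeoutMs: 120000,  // same hypothetical keepalive timeout
});

console.log("global:", globalApp);
console.log("local:", localApp);
```

Under these assumed numbers the global deployment keeps the client waiting for about 150 seconds and trips the timeout, while the local deployment answers in a few seconds despite the slower link between client and app server.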
That makes sense. I will try configuring the local app and see how that goes and also test a Sydney region cluster. Just did a quick test with a local app and the initial load completes in about 30 seconds so that's looking promising. Some warnings on the global deployment might be a good idea. |
There still seem to be some issues with server errors showing up after connecting the client. The client appears to be working but further testing is required to tell if these errors have any impact on the client functions or data. Details are attached to the support request. |
A fix might be realm/realm-core#4878 |
➤ Jonathan Reams commented: I think this may have fallen through the cracks of scheduling and got conflated with several other issues. Is this still an issue? Do we need to do any further work here? |
As there was already a support case, I'm assuming it was handled through that channel and therefore closing this @duncangroenewald. In case of further issues let us know and we will reopen. |
Goals
Load data into a synced MongoDB Realm App
Expected Results
Script loads data into the synced MongoDB Realm App and data is synced with MongoDB Atlas and can be downloaded by another Realm client application
Actual Results
Script runs and data load completes (into the local realm file) but sync process fails - see errors below.
Restarting the script in query mode (does not load data but queries the record count for each object type) appears to resume the sync process.
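The script's query mode is not shown in this issue. As a rough sketch of what counting records per object type could look like with Realm JS, the hypothetical helper below iterates over `realm.schema` and reads the length of `realm.objects(name)` for each type. Skipping embedded types is an assumption on the basis that they cannot be queried directly.

```javascript
// Hypothetical helper: returns { TypeName: count } for every object type in
// an already-open Realm instance.
function countObjectsByType(realm) {
  const counts = {};
  for (const objectSchema of realm.schema) {
    // Assumption: embedded types cannot be queried on their own, so skip them.
    if (objectSchema.embedded) {
      continue;
    }
    counts[objectSchema.name] = realm.objects(objectSchema.name).length;
  }
  return counts;
}
```

Running this before and after a sync, or on two different clients, gives a quick way to compare record counts per type.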
The sync performance also seems much faster than it was when running the same script yesterday, so I am not sure whether some server-side improvements have been made recently as well. The sync completed successfully after the script was restarted once; previous attempts have seen multiple "Bad sync process received" failures and required multiple restarts of the client script.
I also tested with a new client application and this appears to now be working - the client app successfully downloads the synced realm and reports the correct record counts.
There appear to be no errors in the MongoDB Realm App logs
and the last write details
Steps to Reproduce
See issue #3503 (comment)
Code Sample
See issue #3503 (comment)
Note that I have raised a support case, and the full script and sample data are attached to it.
Version of Realm and Tooling