RUST-1064 Retry on handshake failure #598

abr-egn · 2022-03-17T19:49:12Z

This updates operation logic to retry if a connection fails with a network or authentication error. The test sync for this also brought a split to legacy and unified; none of the tests in legacy were modified.

kmahar

it looks like from the ticket there are also new retryable writes and txns tests to sync (though we are maybe supposed to skip the txn tests for now)

src/client/executor.rs

patrickfreed

LGTM! glad to see the existing pool cleared error retry mechanism ended up being the same for the more general one

patrickfreed · 2022-03-18T21:09:58Z

From looking at the test failures, it looks like a load balancer test (the "errors during the initial connection hello are ignored" test in sdam-error-handling.yml) needs to be updated, though I'm not sure it actually ever was as part of the spec changes. We may need to file a DRIVERS ticket about that.

kmahar · 2022-03-21T16:42:07Z

From looking at the test failures, it looks like a load balancer test (the "errors during the initial connection hello are ignored" test in sdam-error-handling.yml) needs to be updated, though I'm not sure it actually ever was as part of the spec changes. We may need to file a DRIVERS ticket about that.

it seems strange to me that Python, Java, and .NET are all still passing that test -- at least I'm guessing they are based on the fact they have merged in PRs and closed their tickets for this. I suggest asking in #driver-devs to try to figure out why only we have run into this so far

abr-egn · 2022-03-21T17:22:08Z

My theory is that I need to refine the Rust driver error-handling machinery to make a more precise distinction between error types.

abr-egn

Updated, PTAL.

abr-egn · 2022-03-23T15:47:39Z

src/client/auth/sasl.rs

-                .unwrap_or("Authentication failure"),
-        )),
+        Some(_) => {
+            let source = bson::from_bson::<CommandResponse<CommandErrorBody>>(Bson::Document(


When implementing this, I was a little surprised to learn that the SASL commands use bespoke parsing and validation rather than going through Serde integration like the rest; if there's no strong reason for that, it might be a good candidate for future cleanup.

Yeah, a lot of the auth code is some of the oldest in the driver and actually predates the operation layer and much of the serde usage that exists today. I think it would definitely benefit from being cleaned up and updated. Could you file a ticket for that?

Filed RUST-1238.

abr-egn · 2022-03-23T15:50:17Z

src/client/executor.rs

@@ -331,16 +331,24 @@ impl Client {
            Ok(c) => c,
            Err(mut err) => {
                err.add_labels_and_update_pin(None, &mut session, None)?;
+                if err.is_read_retryable() {


This seems a little loopy but:

the spec mandates that the RETRYABLE_WRITE_ERROR label be added for retryable errors from authentication and

is_write_retryable() is implemented as a label check, so it can't be used to add the label here in the first place

Note that the csharp logic, which is called out in the spec as a reference implementation, ends up doing the same thing.

err.add_labels_and_update_pin calls through to should_add_retryable_write_label, which should take care of this for us, but it doesn't currently for two reasons:

we always pass in None for retryability

This can be fixed by creating a new Retryability variant called Always or something

more importantly, we don't have a connection and therefore don't know the max wire version of the server, so we don't know if we're supposed to add labels to it or should leave it to the server to do so

@matthewdale points out the same issue as experienced in the Go driver here.

I think what we ideally need here is a single set of criteria for retrying handshake errors that is independent of whether the operation is a read or write or what the server version might be. It feels like it'll end up just being network errors + the union of the pre-4.4 retryable write + retryable read error codes. but I'm not sure.

With all that being said, I think that the solution you have here seems good until the spec gets that all fleshed out.

should we be checking here if the client has retryWrites set to true before we add the label?

spec: "When the driver encounters a network error establishing an initial connection to a server,
it MUST add a RetryableWriteError label to that error if the MongoClient performing
the operation has the retryWrites configuration option set to true."

and I see C# checks context.RetryRequested there.

do any of the new tests cover retryWrites=false? (I'm not seeing any that do)

I guess either way it seems like we only actually do retry if the client has it set given the check in get_op_retryability, but I'm wondering if there could be any ill effects or user confusion from adding the label unconditionally here -- the retryable writes spec discusses that a bit here

Good catch - updated to check for retryWrites here.

abr-egn · 2022-03-23T15:53:42Z

src/client/executor.rs

@@ -824,35 +832,49 @@ impl Client {
    }

    /// Returns the retryability level for the execution of this operation.
-    async fn get_retryability<T: Operation>(


It turns out this didn't need to be async.

patrickfreed

The code looks good to me as-is, but I'm wondering if we should hold off for a bit pending some further spec changes that might come as a result of the issues we've faced here. I left a comment on DRIVERS-746 about it.

patrickfreed · 2022-03-23T18:55:13Z

src/client/auth/sasl.rs

-                .unwrap_or("Authentication failure"),
-        )),
+        Some(_) => {
+            let source = bson::from_bson::<CommandResponse<CommandErrorBody>>(Bson::Document(


Yeah, a lot of the auth code is some of the oldest in the driver and actually predates the operation layer and much of the serde usage that exists today. I think it would definitely benefit from being cleaned up and updated. Could you file a ticket for that?

patrickfreed · 2022-03-23T19:27:58Z

src/client/executor.rs

@@ -331,16 +331,24 @@ impl Client {
            Ok(c) => c,
            Err(mut err) => {
                err.add_labels_and_update_pin(None, &mut session, None)?;
+                if err.is_read_retryable() {


err.add_labels_and_update_pin calls through to should_add_retryable_write_label, which should take care of this for us, but it doesn't currently for two reasons:

we always pass in None for retryability

This can be fixed by creating a new Retryability variant called Always or something

more importantly, we don't have a connection and therefore don't know the max wire version of the server, so we don't know if we're supposed to add labels to it or should leave it to the server to do so

@matthewdale points out the same issue as experienced in the Go driver here.

I think what we ideally need here is a single set of criteria for retrying handshake errors that is independent of whether the operation is a read or write or what the server version might be. It feels like it'll end up just being network errors + the union of the pre-4.4 retryable write + retryable read error codes. but I'm not sure.

With all that being said, I think that the solution you have here seems good until the spec gets that all fleshed out.

abr-egn requested review from patrickfreed, kmahar and isabelatkinson March 17, 2022 19:49

kmahar reviewed Mar 17, 2022

View reviewed changes

src/client/executor.rs Show resolved Hide resolved

patrickfreed approved these changes Mar 18, 2022

View reviewed changes

abr-egn force-pushed the RUST-1064/retry-on-handshake-failure branch from 908e469 to e9b1ce7 Compare March 23, 2022 15:42

abr-egn commented Mar 23, 2022

View reviewed changes

patrickfreed reviewed Mar 23, 2022

View reviewed changes

patrickfreed approved these changes Mar 24, 2022

View reviewed changes

kmahar approved these changes Mar 24, 2022

View reviewed changes

isabelatkinson approved these changes Mar 24, 2022

View reviewed changes

abr-egn added 6 commits March 25, 2022 11:22

update connection retry to any network error

0a87587

sync tests

cb344f0

also retry on auth

4cb0055

sync retryable writes test

370ab20

bring retry logic into spec

5cf1932

check for retry_writes before ading retryable label

7749c95

abr-egn force-pushed the RUST-1064/retry-on-handshake-failure branch from da1c1d3 to 7749c95 Compare March 25, 2022 15:22

abr-egn merged commit 2f12793 into mongodb:master Mar 25, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RUST-1064 Retry on handshake failure #598

RUST-1064 Retry on handshake failure #598

abr-egn commented Mar 17, 2022

kmahar left a comment

patrickfreed left a comment

patrickfreed commented Mar 18, 2022

kmahar commented Mar 21, 2022

abr-egn commented Mar 21, 2022

abr-egn left a comment

abr-egn Mar 23, 2022

patrickfreed Mar 23, 2022

abr-egn Mar 24, 2022

abr-egn Mar 23, 2022

patrickfreed Mar 23, 2022

kmahar Mar 23, 2022

abr-egn Mar 24, 2022

abr-egn Mar 23, 2022

patrickfreed left a comment

patrickfreed Mar 23, 2022

patrickfreed Mar 23, 2022

RUST-1064 Retry on handshake failure #598

RUST-1064 Retry on handshake failure #598

Conversation

abr-egn commented Mar 17, 2022

kmahar left a comment

Choose a reason for hiding this comment

patrickfreed left a comment

Choose a reason for hiding this comment

patrickfreed commented Mar 18, 2022

kmahar commented Mar 21, 2022

abr-egn commented Mar 21, 2022

abr-egn left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

patrickfreed left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment