Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RUST-1064 Retry on handshake failure #598

Merged
merged 6 commits into from
Mar 25, 2022

Conversation

abr-egn
Copy link
Contributor

@abr-egn abr-egn commented Mar 17, 2022

RUST-1064

This updates operation logic to retry if a connection fails with a network or authentication error. The test sync for this also brought a split to legacy and unified; none of the tests in legacy were modified.

Copy link
Contributor

@kmahar kmahar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like from the ticket there are also new retryable writes and txns tests to sync (though we are maybe supposed to skip the txn tests for now)

Copy link
Contributor

@patrickfreed patrickfreed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! glad to see the existing pool cleared error retry mechanism ended up being the same for the more general one

@patrickfreed
Copy link
Contributor

From looking at the test failures, it looks like a load balancer test (the "errors during the initial connection hello are ignored" test in sdam-error-handling.yml) needs to be updated, though I'm not sure it actually ever was as part of the spec changes. We may need to file a DRIVERS ticket about that.

@kmahar
Copy link
Contributor

kmahar commented Mar 21, 2022

From looking at the test failures, it looks like a load balancer test (the "errors during the initial connection hello are ignored" test in sdam-error-handling.yml) needs to be updated, though I'm not sure it actually ever was as part of the spec changes. We may need to file a DRIVERS ticket about that.

it seems strange to me that Python, Java, and .NET are all still passing that test -- at least I'm guessing they are based on the fact they have merged in PRs and closed their tickets for this. I suggest asking in #driver-devs to try to figure out why only we have run into this so far

@abr-egn
Copy link
Contributor Author

abr-egn commented Mar 21, 2022

My theory is that I need to refine the Rust driver error-handling machinery to make a more precise distinction between error types.

@abr-egn abr-egn force-pushed the RUST-1064/retry-on-handshake-failure branch from 908e469 to e9b1ce7 Compare March 23, 2022 15:42
Copy link
Contributor Author

@abr-egn abr-egn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, PTAL.

.unwrap_or("Authentication failure"),
)),
Some(_) => {
let source = bson::from_bson::<CommandResponse<CommandErrorBody>>(Bson::Document(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When implementing this, I was a little surprised to learn that the SASL commands use bespoke parsing and validation rather than going through Serde integration like the rest; if there's no strong reason for that, it might be a good candidate for future cleanup.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, a lot of the auth code is some of the oldest in the driver and actually predates the operation layer and much of the serde usage that exists today. I think it would definitely benefit from being cleaned up and updated. Could you file a ticket for that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed RUST-1238.

@@ -331,16 +331,24 @@ impl Client {
Ok(c) => c,
Err(mut err) => {
err.add_labels_and_update_pin(None, &mut session, None)?;
if err.is_read_retryable() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a little loopy but:

  • the spec mandates that the RETRYABLE_WRITE_ERROR label be added for retryable errors from authentication and
  • is_write_retryable() is implemented as a label check, so it can't be used to add the label here in the first place

Note that the csharp logic, which is called out in the spec as a reference implementation, ends up doing the same thing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

err.add_labels_and_update_pin calls through to should_add_retryable_write_label, which should take care of this for us, but it doesn't currently for two reasons:

  • we always pass in None for retryability
    • This can be fixed by creating a new Retryability variant called Always or something
  • more importantly, we don't have a connection and therefore don't know the max wire version of the server, so we don't know if we're supposed to add labels to it or should leave it to the server to do so

I think what we ideally need here is a single set of criteria for retrying handshake errors that is independent of whether the operation is a read or write or what the server version might be. It feels like it'll end up just being network errors + the union of the pre-4.4 retryable write + retryable read error codes. but I'm not sure.

With all that being said, I think that the solution you have here seems good until the spec gets that all fleshed out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we be checking here if the client has retryWrites set to true before we add the label?

spec: "When the driver encounters a network error establishing an initial connection to a server,
it MUST add a RetryableWriteError label to that error if the MongoClient performing
the operation has the retryWrites configuration option set to true."

and I see C# checks context.RetryRequested there.

do any of the new tests cover retryWrites=false? (I'm not seeing any that do)

I guess either way it seems like we only actually do retry if the client has it set given the check in get_op_retryability, but I'm wondering if there could be any ill effects or user confusion from adding the label unconditionally here -- the retryable writes spec discusses that a bit here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch - updated to check for retryWrites here.

@@ -824,35 +832,49 @@ impl Client {
}

/// Returns the retryability level for the execution of this operation.
async fn get_retryability<T: Operation>(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out this didn't need to be async.

Copy link
Contributor

@patrickfreed patrickfreed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good to me as-is, but I'm wondering if we should hold off for a bit pending some further spec changes that might come as a result of the issues we've faced here. I left a comment on DRIVERS-746 about it.

.unwrap_or("Authentication failure"),
)),
Some(_) => {
let source = bson::from_bson::<CommandResponse<CommandErrorBody>>(Bson::Document(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, a lot of the auth code is some of the oldest in the driver and actually predates the operation layer and much of the serde usage that exists today. I think it would definitely benefit from being cleaned up and updated. Could you file a ticket for that?

@@ -331,16 +331,24 @@ impl Client {
Ok(c) => c,
Err(mut err) => {
err.add_labels_and_update_pin(None, &mut session, None)?;
if err.is_read_retryable() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

err.add_labels_and_update_pin calls through to should_add_retryable_write_label, which should take care of this for us, but it doesn't currently for two reasons:

  • we always pass in None for retryability
    • This can be fixed by creating a new Retryability variant called Always or something
  • more importantly, we don't have a connection and therefore don't know the max wire version of the server, so we don't know if we're supposed to add labels to it or should leave it to the server to do so

I think what we ideally need here is a single set of criteria for retrying handshake errors that is independent of whether the operation is a read or write or what the server version might be. It feels like it'll end up just being network errors + the union of the pre-4.4 retryable write + retryable read error codes. but I'm not sure.

With all that being said, I think that the solution you have here seems good until the spec gets that all fleshed out.

@abr-egn abr-egn force-pushed the RUST-1064/retry-on-handshake-failure branch from da1c1d3 to 7749c95 Compare March 25, 2022 15:22
@abr-egn abr-egn merged commit 2f12793 into mongodb:master Mar 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants