-
Notifications
You must be signed in to change notification settings - Fork 171
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RUST-1064 Retry on handshake failure #598
RUST-1064 Retry on handshake failure #598
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks like from the ticket there are also new retryable writes and txns tests to sync (though we are maybe supposed to skip the txn tests for now)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! glad to see the existing pool cleared error retry mechanism ended up being the same for the more general one
From looking at the test failures, it looks like a load balancer test (the "errors during the initial connection hello are ignored" test in |
it seems strange to me that Python, Java, and .NET are all still passing that test -- at least I'm guessing they are based on the fact they have merged in PRs and closed their tickets for this. I suggest asking in #driver-devs to try to figure out why only we have run into this so far |
My theory is that I need to refine the Rust driver error-handling machinery to make a more precise distinction between error types. |
908e469
to
e9b1ce7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, PTAL.
.unwrap_or("Authentication failure"), | ||
)), | ||
Some(_) => { | ||
let source = bson::from_bson::<CommandResponse<CommandErrorBody>>(Bson::Document( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When implementing this, I was a little surprised to learn that the SASL commands use bespoke parsing and validation rather than going through Serde integration like the rest; if there's no strong reason for that, it might be a good candidate for future cleanup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, a lot of the auth code is some of the oldest in the driver and actually predates the operation layer and much of the serde usage that exists today. I think it would definitely benefit from being cleaned up and updated. Could you file a ticket for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed RUST-1238.
src/client/executor.rs
Outdated
@@ -331,16 +331,24 @@ impl Client { | |||
Ok(c) => c, | |||
Err(mut err) => { | |||
err.add_labels_and_update_pin(None, &mut session, None)?; | |||
if err.is_read_retryable() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems a little loopy but:
- the spec mandates that the
RETRYABLE_WRITE_ERROR
label be added for retryable errors from authentication and is_write_retryable()
is implemented as a label check, so it can't be used to add the label here in the first place
Note that the csharp logic, which is called out in the spec as a reference implementation, ends up doing the same thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
err.add_labels_and_update_pin
calls through to should_add_retryable_write_label
, which should take care of this for us, but it doesn't currently for two reasons:
- we always pass in
None
for retryability- This can be fixed by creating a new
Retryability
variant calledAlways
or something
- This can be fixed by creating a new
- more importantly, we don't have a connection and therefore don't know the max wire version of the server, so we don't know if we're supposed to add labels to it or should leave it to the server to do so
- @matthewdale points out the same issue as experienced in the Go driver here.
I think what we ideally need here is a single set of criteria for retrying handshake errors that is independent of whether the operation is a read or write or what the server version might be. It feels like it'll end up just being network errors + the union of the pre-4.4 retryable write + retryable read error codes. but I'm not sure.
With all that being said, I think that the solution you have here seems good until the spec gets that all fleshed out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we be checking here if the client has retryWrites
set to true before we add the label?
spec: "When the driver encounters a network error establishing an initial connection to a server,
it MUST add a RetryableWriteError label to that error if the MongoClient performing
the operation has the retryWrites configuration option set to true."
and I see C# checks context.RetryRequested
there.
do any of the new tests cover retryWrites=false? (I'm not seeing any that do)
I guess either way it seems like we only actually do retry if the client has it set given the check in get_op_retryability
, but I'm wondering if there could be any ill effects or user confusion from adding the label unconditionally here -- the retryable writes spec discusses that a bit here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch - updated to check for retryWrites
here.
@@ -824,35 +832,49 @@ impl Client { | |||
} | |||
|
|||
/// Returns the retryability level for the execution of this operation. | |||
async fn get_retryability<T: Operation>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It turns out this didn't need to be async.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks good to me as-is, but I'm wondering if we should hold off for a bit pending some further spec changes that might come as a result of the issues we've faced here. I left a comment on DRIVERS-746 about it.
.unwrap_or("Authentication failure"), | ||
)), | ||
Some(_) => { | ||
let source = bson::from_bson::<CommandResponse<CommandErrorBody>>(Bson::Document( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, a lot of the auth code is some of the oldest in the driver and actually predates the operation layer and much of the serde usage that exists today. I think it would definitely benefit from being cleaned up and updated. Could you file a ticket for that?
src/client/executor.rs
Outdated
@@ -331,16 +331,24 @@ impl Client { | |||
Ok(c) => c, | |||
Err(mut err) => { | |||
err.add_labels_and_update_pin(None, &mut session, None)?; | |||
if err.is_read_retryable() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
err.add_labels_and_update_pin
calls through to should_add_retryable_write_label
, which should take care of this for us, but it doesn't currently for two reasons:
- we always pass in
None
for retryability- This can be fixed by creating a new
Retryability
variant calledAlways
or something
- This can be fixed by creating a new
- more importantly, we don't have a connection and therefore don't know the max wire version of the server, so we don't know if we're supposed to add labels to it or should leave it to the server to do so
- @matthewdale points out the same issue as experienced in the Go driver here.
I think what we ideally need here is a single set of criteria for retrying handshake errors that is independent of whether the operation is a read or write or what the server version might be. It feels like it'll end up just being network errors + the union of the pre-4.4 retryable write + retryable read error codes. but I'm not sure.
With all that being said, I think that the solution you have here seems good until the spec gets that all fleshed out.
da1c1d3
to
7749c95
Compare
RUST-1064
This updates operation logic to retry if a connection fails with a network or authentication error. The test sync for this also brought a split to
legacy
andunified
; none of the tests inlegacy
were modified.