Add logging to catch testing race-condition #3615
Conversation
I am confused by the disparity between what this code does and what the linked code does -- they seem to handle "connection refused" errors very differently from one another.
pkg/block/s3/retyer.go
Outdated
// and whether it is safe to retry -
// https://github.com/aws/aws-sdk-go/pull/2926#issuecomment-553637658.
//
// In lakeFS all operations with s3 (read, write, list) are considered idempotent,
Would it make sense to ensure that only one of these operations is being performed? Otherwise someone might add a new operation and not think to avoid it being retried.
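To make that suggestion concrete, one option is an explicit allow-list that the retryer consults before opting in; the map, operation names, and helper below are illustrative only, not identifiers from this PR:

```go
// Hypothetical guard: only operations we have decided are safe to re-run get
// the extra retry behaviour, so a newly added S3 call is not retried by
// accident. The operation names here are examples, not an exhaustive list.
var retryableOps = map[string]bool{
	"GetObject":     true, // read
	"PutObject":     true, // write
	"ListObjectsV2": true, // list
}

func isRetryableOperation(name string) bool {
	return retryableOps[name]
}
```

A custom ShouldRetry could then check isRetryableOperation(req.Operation.Name) before applying the extra retry.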
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not an exhaustive list; other operations may also benefit from it. I'll update the comment to reflect that.
Notice that even before this change, the caller must always assume that a retry may happen. This change only adds one more case, where a specific TCP error is retried.
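As a rough sketch of the shape being discussed, assuming aws-sdk-go v1 (the type and helper names here are illustrative, not necessarily the ones in this PR):

```go
package s3

import (
	"errors"
	"syscall"

	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/aws/client"
	"github.com/aws/aws-sdk-go/aws/request"
)

// retryer keeps the SDK's default retry policy and adds exactly one more
// case: requests that failed on a TCP connection reset are retried too.
type retryer struct {
	client.DefaultRetryer
}

func (r retryer) ShouldRetry(req *request.Request) bool {
	if isConnectionResetErr(req.Error) {
		return true
	}
	return r.DefaultRetryer.ShouldRetry(req)
}

// isConnectionResetErr unwraps an awserr.Error if possible and checks for
// ECONNRESET on the underlying OS error.
func isConnectionResetErr(err error) bool {
	var awsErr awserr.Error
	if errors.As(err, &awsErr) {
		err = awsErr.OrigErr()
	}
	return errors.Is(err, syscall.ECONNRESET)
}
```

Wiring it into the S3 client would then go through something like request.WithRetryer on the AWS config.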
pkg/block/s3/retyer.go
Outdated
// github.com/aws/aws-sdk-go/aws/request/connection_reset_error.go. This is
// unfortunate but the only solution until the SDK exposes a specialized error
// code or type for this class of errors.
return err != nil && strings.Contains(err.Error(), "read: connection reset")
This condition is not only strange, it also looks like the opposite of isConnectionReset. For some reason, AWS considers a "connection reset" to be a connectionReset only if it is not a "read: connection reset".
I don't understand it.
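For context, the SDK's check reads roughly like this (a paraphrase of connection_reset_error.go based on the behaviour described above; the exact strings may differ between SDK versions):

```go
package request

import "strings"

// Paraphrase of aws-sdk-go's isErrConnectionReset: a "read: connection reset"
// is explicitly excluded, while other "connection reset" messages are treated
// as a retryable reset.
func isErrConnectionReset(err error) bool {
	if strings.Contains(err.Error(), "read: connection reset") {
		return false
	}
	return strings.Contains(err.Error(), "connection reset")
}
```

which is what makes the PR's condition (retry exactly on "read: connection reset") read like the opposite.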
Yep - I think the linked code is confusing. A more appropriate name for it would be isErrConnectionResetAndShouldRetry.
Can you do something like what this StackOverflow answer suggests? It should work if AWS errors wrap OS errors (in the golang sense) -- and if it is good, it is much better than looking at the error message.
Also worried about the "read:" prefix.
Had a look at AWSErr, and it seems not to wrap. But you can try to cast to an AWSError, and if that succeeds call OrigErr. I don't see OrigErr returning the BatchError case except for errors from the S3 manager, so it's probably safe to ignore that.
So something like
if awsErr, ok := err.(awserr.Error); ok {
	// unwrap to the underlying OS error before checking for ECONNRESET
	err = awsErr.OrigErr()
}
return errors.Is(err, syscall.ECONNRESET)
might work!!?!
Actually, can you explain what scenario this retrier fixes in the flaky test? Why is the service under test stopping before the test is done?
pkg/block/s3/retyer.go
Outdated
@@ -0,0 +1,41 @@
package s3 |
You have a typo in the filename
@itaiad200 Did you see the TestPyramidWriteFile failure?
Thanks!
- Note that logging adds a lot of synchronization that can destroy many race conditions.
- I would add some field to the loggers passed in (say test: true or something) to make it easier to see their output (a sketch follows below).

Rather than push it onto master, could we perhaps just run it 30 times and see if we get a failure with interesting logs?
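A sketch of the second bullet, assuming the lakeFS logging package exposes a logrus-style WithField (the field name and value are arbitrary):

```go
// Hypothetical: tag the logger handed to the code under test so its lines
// are easy to pick out of the CI output.
logger := logging.Default().WithField("test", true)

sut := WRFile{
	File:   fh,
	logger: logger,
}
```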
@@ -29,7 +28,8 @@ func TestPyramidWriteFile(t *testing.T) {
 	abortCalled := false
 	var storeCtx context.Context
 	sut := WRFile{
-		File: fh,
+		File:   fh,
+		logger: logging.Default(),
 		store: func(innerCtx context.Context, _ string) error {
weird gofmt.
Aligned to the right 🤷♂️
Nope, the context at this location where I added logs is the one I'm afraid we're cancelling prematurely.
It's the Esti test that has flakiness, not
Reenabling and adding logging for #3428
I retried the test >10 times and it didn't fail. Esti's history shows that it failed twice in the last 150 runs, so we need it enabled (with better logging) if we want to catch that race condition.
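For reference, repeated local runs can be driven with the standard test flags, e.g. go test -race -count=30 -run TestPyramidWriteFile ./pkg/pyramid/... (the package path here is assumed).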