
Add logging to catch testing race-condition #3615

Merged: 7 commits merged into master on Jul 11, 2022

Conversation

@itaiad200 (Contributor) commented on Jul 5, 2022

Reenabling and adding logging for #3428

I retried the test >10 times and it didn't fail. Esti's history shows that it failed twice in the last 150 runs, so we need it enabled (with better logging) if we want to catch that race condition.

@itaiad200 itaiad200 added the bug, area/system-tests, exclude-changelog, and team/versioning-engine labels on Jul 5, 2022
@itaiad200 itaiad200 requested review from N-o-Z and arielshaqed July 5, 2022 11:25
@arielshaqed (Contributor) left a comment

I am confused by the disparity between what this code does and what the linked code does -- they seem to handle "connection refused" errors very differently from one another.

// and whether it is safe to retry -
// https://github.com/aws/aws-sdk-go/pull/2926#issuecomment-553637658.
//
// In lakeFS all operations with s3 (read, write, list) are considered idempotent,
Contributor:

Would it make sense to ensure that only one of these operations is being performed? Otherwise someone might add a new operation and not think to avoid it being retried.

Contributor (Author):

That's not an exhaustive list; other operations may also benefit from it. I'll update the comment to reflect that.

Notice that even before this change, the caller must always assume that a retry may happen. This change only adds one more case where a specific TCP error is retried.
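For context on where such a check plugs in, here is a rough sketch of a custom aws-sdk-go v1 retryer. The type name and the wiring below are illustrative assumptions, not necessarily how lakeFS actually configures its S3 client:

package s3

import (
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/client"
	"github.com/aws/aws-sdk-go/aws/request"
)

// connResetRetryer (hypothetical name) embeds the SDK's default retryer and
// additionally retries requests that failed with a connection reset.
type connResetRetryer struct {
	client.DefaultRetryer
}

// ShouldRetry first checks for a connection reset, then defers to the default
// policy. This is only safe because the retried operations are idempotent.
func (d connResetRetryer) ShouldRetry(req *request.Request) bool {
	if isErrConnectionReset(req.Error) {
		return true
	}
	return d.DefaultRetryer.ShouldRetry(req)
}

// isErrConnectionReset mirrors the string-based check under review here.
func isErrConnectionReset(err error) bool {
	return err != nil && strings.Contains(err.Error(), "read: connection reset")
}

// newConfigWithRetryer shows one way to attach the retryer to a client config.
func newConfigWithRetryer() *aws.Config {
	return request.WithRetryer(aws.NewConfig(), connResetRetryer{
		DefaultRetryer: client.DefaultRetryer{NumMaxRetries: 5},
	})
}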

// github.com/aws/aws-sdk-go/aws/request/connection_reset_error.go. This is
// unfortunate but the only solution until the SDK exposes a specialized error
// code or type for this class of errors.
return err != nil && strings.Contains(err.Error(), "read: connection reset")
Contributor:

This condition is not only strange; it also looks like the opposite of isConnectionReset. For some reason, AWS considers a "connection reset" to be a connectionReset only if it is not a "read: connection reset".

I don't understand it.

Contributor (Author):

Yep - I think the linked code is confusing. A more appropriate function name would be isErrConnectionResetAndShouldRetry.

Contributor:

Can you do something like what this StackOverflow answer suggests? It should work if AWS errors wrap OS errors (in the golang sense) -- and if it is good, it is much better than looking at the error message.

Also worried about the "read:" prefix.

Contributor:

Had a look at awserr, and it seems not to wrap. But you can try to cast to an awserr.Error, and if that succeeds, call OrigErr. I don't see OrigErr returning the BatchError case except for errors from the S3 manager, so it's probably safe to ignore that.

So something like

if awsErr, ok := err.(awserr.Error); ok {
	err = awsErr.OrigErr()
}
return errors.Is(err, syscall.ECONNRESET)

might work!!?!
}
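Putting that suggestion together, a minimal sketch of the check (the name follows the isErrConnectionResetAndShouldRetry suggestion above; it assumes the underlying net error wraps syscall.ECONNRESET, which holds since Go 1.13):

package s3

import (
	"errors"
	"syscall"

	"github.com/aws/aws-sdk-go/aws/awserr"
)

// isErrConnectionResetAndShouldRetry unwraps an awserr.Error by hand (it does
// not implement Unwrap, so errors.Is cannot see through it) and then checks
// for ECONNRESET instead of matching on the error message text.
func isErrConnectionResetAndShouldRetry(err error) bool {
	if awsErr, ok := err.(awserr.Error); ok {
		err = awsErr.OrigErr()
	}
	return errors.Is(err, syscall.ECONNRESET)
}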

@itaiad200 itaiad200 requested a review from arielshaqed July 5, 2022 13:48
@arielshaqed (Contributor) left a comment

Actually, can you explain what scenario this retrier fixes in the flaky test? Why is the service under test stopping before the test is done?


@@ -0,0 +1,41 @@
package s3
Member:

You have a typo in the filename

@itaiad200 itaiad200 force-pushed the 3428-import-flaky branch from 51b8285 to 72782d7 on July 7, 2022 13:39
@itaiad200 itaiad200 force-pushed the 3428-import-flaky branch from 72782d7 to 4704935 on July 7, 2022 18:26
@itaiad200 itaiad200 changed the title Fix flaky import test by retrying on connection reset by peer Add logging to catch testing race-condition Jul 10, 2022
@itaiad200 itaiad200 requested review from arielshaqed and N-o-Z July 10, 2022 08:14
@N-o-Z (Member) commented on Jul 10, 2022

@itaiad200 Did you see the TestPyramidWriteFile failure?

@arielshaqed (Contributor) left a comment

Thanks!

  1. Note that logging adds a lot of synchronization that can destroy many race conditions.
  2. I would add some field to the loggers passed in (say test: true or something) to make it easier to see their result.

Rather than push it onto master, could we perhaps just run it 30 times and see if we get a failure with interesting logs?
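For point 2, a rough sketch of what tagging the injected logger could look like (this assumes the logging package exposes a logrus-style WithField; the field name is just an example):

// Hypothetical: tag the logger handed to the code under test so its lines
// are easy to filter out of the combined test output.
logger := logging.Default().WithField("test", true)

sut := WRFile{
	File:   fh,
	logger: logger,
	// ... remaining fields as in the diff below ...
}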

@@ -29,7 +28,8 @@ func TestPyramidWriteFile(t *testing.T) {
abortCalled := false
var storeCtx context.Context
sut := WRFile{
-	File: fh,
+	File:   fh,
+	logger: logging.Default(),
store: func(innerCtx context.Context, _ string) error {
Contributor:

weird gofmt.

Contributor (Author):

Aligned to the right 🤷‍♂️

@itaiad200 (Contributor, Author)

@itaiad200 Did you see the TestPyramidWriteFile failure?

Nope. The location where I added logs is the context that I'm afraid we're cancelling prematurely.

@itaiad200 (Contributor, Author)

Thanks!

  1. Note that logging adds a lot of synchronization that can destroy many race conditions.
  2. I would add some field to the loggers passed in (say test: true or something) to make it easier to see their result.

Rather than push it onto master, could we perhaps just run it 30 times and see if we get a failure with interesting logs?

It's the Esti test that's flaky, not TestPyramidWriteFile. I wrote about the previous runs (2 failures in ~150 runs); I think this log will help us catch it next time.

@itaiad200 itaiad200 merged commit f1ce8e3 into master Jul 11, 2022