
Add logging to catch testing race-condition #3615

Merged: 7 commits merged into master on Jul 11, 2022

Conversation

@itaiad200 (Contributor) commented on Jul 5, 2022

Reenabling and adding logging for #3428

I retried the test >10 times and it didn't fail. Esti's history shows that it failed twice in the last 150 runs, so we need it enabled (with better logging) if we want to catch that race condition.

@itaiad200 itaiad200 added the bug, area/system-tests, exclude-changelog, and team/versioning-engine labels on Jul 5, 2022
@itaiad200 itaiad200 requested review from N-o-Z and arielshaqed July 5, 2022 11:25
@arielshaqed (Contributor) left a comment

I am confused by the disparity between what this code does and what the linked code does -- they seem to handle "connection refused" errors very differently from one another.

// and whether it is safe to retry -
// https://github.com/aws/aws-sdk-go/pull/2926#issuecomment-553637658.
//
// In lakeFS all operations with s3 (read, write, list) are considered idempotent,
Contributor:

Would it make sense to ensure that only one of these operations is being performed? Otherwise someone might add a new operation and not think to avoid it being retried.

Contributor (Author):

That's not an exhaustive list; other operations may also benefit from it. I'll update the comment to reflect that.

Notice that even before this change, the caller must always assume that a retry may happen. This change only adds one more case where a specific TCP error is retried.
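For context on where such a check plugs in, here is a rough sketch of a custom aws-sdk-go v1 retryer. The type name and the wiring below are illustrative assumptions, not necessarily how lakeFS actually configures its S3 client:

package s3

import (
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/client"
	"github.com/aws/aws-sdk-go/aws/request"
)

// connResetRetryer (hypothetical name) embeds the SDK's default retryer and
// additionally retries requests that failed with a connection reset.
type connResetRetryer struct {
	client.DefaultRetryer
}

// ShouldRetry first checks for a connection reset, then defers to the default
// policy. This is only safe because the retried operations are idempotent.
func (d connResetRetryer) ShouldRetry(req *request.Request) bool {
	if isErrConnectionReset(req.Error) {
		return true
	}
	return d.DefaultRetryer.ShouldRetry(req)
}

// isErrConnectionReset mirrors the string-based check under review here.
func isErrConnectionReset(err error) bool {
	return err != nil && strings.Contains(err.Error(), "read: connection reset")
}

// newConfigWithRetryer shows one way to attach the retryer to a client config.
func newConfigWithRetryer() *aws.Config {
	return request.WithRetryer(aws.NewConfig(), connResetRetryer{
		DefaultRetryer: client.DefaultRetryer{NumMaxRetries: 5},
	})
}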

// github.com/aws/aws-sdk-go/aws/request/connection_reset_error.go. This is
// unfortunate but the only solution until the SDK exposes a specialized error
// code or type for this class of errors.
return err != nil && strings.Contains(err.Error(), "read: connection reset")
Contributor:

This condition is not only strange; it also looks like the opposite of isConnectionReset. For some reason, AWS considers a "connection reset" to be a connectionReset only if it is not a "read: connection reset".

I don't understand it.

Contributor (Author):

Yep - I think the linked code is confusing. A more appropriate function name would be isErrConnectionResetAndShouldRetry.

Contributor:

Can you do something like what this StackOverflow answer suggests? It should work if AWS errors wrap OS errors (in the golang sense) -- and if it is good, it is much better than looking at the error message.

Also worried about the "read:" prefix.

Contributor:

Had a look at awserr, and it seems not to wrap. But you can try to cast to an awserr.Error, and if that succeeds, call OrigErr. I don't see OrigErr returning the BatchError case except for errors from the S3 manager, so it's probably safe to ignore that.

So something like

if awsErr, ok := err.(awserr.Error); ok {
	err = awsErr.OrigErr()
}
return errors.Is(err, syscall.ECONNRESET)

might work!!?!
}
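Putting that suggestion together, a minimal sketch of the check (the name follows the isErrConnectionResetAndShouldRetry suggestion above; it assumes the underlying net error wraps syscall.ECONNRESET, which holds since Go 1.13):

package s3

import (
	"errors"
	"syscall"

	"github.com/aws/aws-sdk-go/aws/awserr"
)

// isErrConnectionResetAndShouldRetry unwraps an awserr.Error by hand (it does
// not implement Unwrap, so errors.Is cannot see through it) and then checks
// for ECONNRESET instead of matching on the error message text.
func isErrConnectionResetAndShouldRetry(err error) bool {
	if awsErr, ok := err.(awserr.Error); ok {
		err = awsErr.OrigErr()
	}
	return errors.Is(err, syscall.ECONNRESET)
}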

@itaiad200 itaiad200 requested a review from arielshaqed July 5, 2022 13:48
@arielshaqed (Contributor) left a comment

Actually, can you explain what scenario this retrier fixes in the flaky test? Why is the service under test stopping before the test is done?


@@ -0,0 +1,41 @@
package s3
Member:

You have a typo in the filename

@itaiad200 itaiad200 force-pushed the 3428-import-flaky branch from 51b8285 to 72782d7 on July 7, 2022 13:39
@itaiad200 itaiad200 force-pushed the 3428-import-flaky branch from 72782d7 to 4704935 on July 7, 2022 18:26
@itaiad200 itaiad200 changed the title Fix flaky import test by retrying on connection reset by peer Add logging to catch testing race-condition Jul 10, 2022
@itaiad200 itaiad200 requested review from arielshaqed and N-o-Z July 10, 2022 08:14
@N-o-Z (Member) commented on Jul 10, 2022

@itaiad200 Did you see the TestPyramidWriteFile failure?

@arielshaqed (Contributor) left a comment

Thanks!

  1. Note that logging adds a lot of synchronization that can destroy many race conditions.
  2. I would add some field to the loggers passed in (say test: true or something) to make it easier to see their result.

Rather than push it onto master, could we perhaps just run it 30 times and see if we get a failure with interesting logs?
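For point 2, a rough sketch of what tagging the injected logger could look like (this assumes the logging package exposes a logrus-style WithField; the field name is just an example):

// Hypothetical: tag the logger handed to the code under test so its lines
// are easy to filter out of the combined test output.
logger := logging.Default().WithField("test", true)

sut := WRFile{
	File:   fh,
	logger: logger,
	// ... remaining fields as in the diff below ...
}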

@@ -29,7 +28,8 @@ func TestPyramidWriteFile(t *testing.T) {
abortCalled := false
var storeCtx context.Context
sut := WRFile{
-	File: fh,
+	File:   fh,
+	logger: logging.Default(),
store: func(innerCtx context.Context, _ string) error {
Contributor:

weird gofmt.

Contributor (Author):

Aligned to the right 🤷‍♂️

@itaiad200 (Contributor, Author)

@itaiad200 Did you see the TestPyramidWriteFile failure?

Nope. The location where I added logs is the context that I'm afraid we're cancelling prematurely.

@itaiad200 (Contributor, Author)

Thanks!

  1. Note that logging adds a lot of synchronization that can destroy many race conditions.
  2. I would add some field to the loggers passed in (say test: true or something) to make it easier to see their result.

Rather than push it onto master, could we perhaps just run it 30 times and see if we get a failure with interesting logs?

It's the Esti test that's flaky, not TestPyramidWriteFile. I wrote about the previous runs (2 failures in ~150 runs); I think this log will help us catch it next time.

@itaiad200 itaiad200 merged commit f1ce8e3 into master Jul 11, 2022