Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use errors.Is() in IsErrEOF() #8698

Merged
merged 5 commits into from
Mar 16, 2021
Merged

Conversation

pierreca
Copy link
Contributor

IsErrEOF returns false when it should return true in a couple of cases:

  1. if the error has been wrapped in another error (for example, if EOF
    is wrapped in an RPC error)
  2. if the error has been created from an Error field in an RPC response
    (as it is the case in CallWithCodec in the net-rpc-msgpackrpc package
    for example)

IsErrEOF returns false when it should return true in a couple of cases:

1. if the error has been wrapped in another error (for example, if EOF
is wrapped in an RPC error)
2. if the error has been created from an Error field in an RPC response
(as it is the case in CallWithCodec in the net-rpc-msgpackrpc package
for example)
@hashicorp-cla
Copy link

hashicorp-cla commented Sep 17, 2020

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

@dnephin dnephin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this PR!

I think using errors.Is is the right thing to do here, but some tests for this change would be great! A table test with some different cases seems appropriate.

One concern inline below.

lib/eof.go Outdated
return true
}

errStr := err.Error()
if strings.Contains(errStr, yamuxStreamClosed) ||
strings.Contains(errStr, yamuxSessionShutdown) {
strings.Contains(errStr, yamuxSessionShutdown) ||
strings.HasSuffix(errStr, io.EOF.Error()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory this seems reasonable, but I think because EOF.Error() is such a short string (EOF), there is a pretty significant chance this incorrectly matches a a different error.

A suffix of : EOF might be a bit safer, but I'm not sure that is necessary either. I think we should be encouraging all code to use wrapping and not rely on string comparison. Ideally we would remove the need for these other strings.Contains() as well.

I think we should not add this strings.HasSuffix, and fix any code that does not properly wrap errors to do so.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree on principle and long-term that's what i'd want too.

To give some context on the string comparison, the code that motivated it is here:
https://github.com/hashicorp/net-rpc-msgpackrpc/blob/master/client.go#L33

In this case, the error comes in the Response.Error field as a string, and the string is copied into an error object. In the specific case I am seeing the string is: rpc error: EOF.
This error created by this line of code does not unwrap correctly as an EOF and as far as I can tell, cannot be identified as an EOF other than by string comparison. Maybe there's a quick fix I don't know about and if that's the case I'll happily submit another PR for it, but if there isn't i'm afraid the bigger fix involves changing how the RPC client works, and If I understand the situation correctly, this is used in many places in many hashicorp projects. Which makes this change really scary for a first time external contributor.

Thoughts?

Copy link
Contributor

@dnephin dnephin Sep 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes I see the problem. Do you know where the rpc error: prefix comes from? I'm not seeing anywhere that would append an EOF.

I think if there are specific cases like this that we know can not be unwrapped, we could first check that the error is a specific type, then check the string, something like

if errors.Is(err, rpc.ServerError{}) && strings.HasSuffix(err.Error(), ': EOF') {}

I think if we can track down the specific cases where this is a problem we can at least be very struct about the comparison.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this idea, I'll send an update.

The scenario i'm testing is on long-poll requests where a user of the API is connected to a nomad client, the client is forwarding the RPC to a server, and while the request is blocked, the leader is killed. In that specific case, sometimes, I see this wrapped error as string. I think it may have to do with how the requests are forwarded between the client, its server, and potentially the cluster leader. I'm still learning how that codepath works, I've only been staring at it for two days :)

What can I do that would help narrow this down? I can repro on a fairly regular basis, I just need help orienting myself in the code and adding logs where we need them to confirm the problematic code path(s).

Copy link
Contributor Author

@pierreca pierreca Sep 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added the ServerError type assertion: I hope it is restrictive enough and solves at least one of the failure mode I'm trying to get past (explained below, step 3). I'm not out of the woods yet but if this gets checked in, I hope it'll help improve robustness.

Long explanation
I am starting to understand that the problem I'm trying to solve goes deeper and lies in the forwarding chain of the RPC requests in a nomad cluster. Here's the scenario:

  1. api user makes a blocking request (wait on an index) -> nomad client -> nomad server (leader)
  2. leader dies/is killed while the request is waiting. Returns an EOF error
  3. with this fix: client retries and changes server: client -> nomad server(follower) (without the fix, the client would not understand it received an EOF error and would not retry, that's the improvement)
  4. nomad server (follower) -> nomad server(leader) connection refused: the leader (that died in step 2) is down and new leader election hasn't happened yet
  5. follower does not retry and passes the error (connection refused) down to the client in the response.Error field
  6. client: gets wrapped version of the connection refused error and since it's not an EOF error, passes it down to the caller <- that's my new problem :)

From the caller perspective this is difficult, if not impossible, to handle correctly: the caller shouldn't be expected to understand (and recover from) RPC forwarding chain errors. There are a finite number of scenarios like this (depending whether the client is connected to a follower or a leader, how fast leader election goes, and which server goes down, the one doing the forwarding or the leader). I think this should be entirely handled within the RPC chain between the nomad client and servers. retries and HA are hard problems though :)

This being said, I don't think it's realistic to expect that all cases can be covered with IsErrEOF(), which is already a little bit extended past EOF (the yamuxError, and now some ServerErrors provided they end with : EOF). The follow up investigation is probably directly in the nomad RPC code.

In conclusion, I still think this is a net improvement and we've pushed this PR as far as it should go, happy to make more changes as required to get it merged, this is a move in the right direction :)

@dnephin dnephin added the theme/internal-cleanup Used to identify tech debt, testing improvements, code refactoring, and non-impactful optimization label Sep 17, 2020
lib/eof.go Outdated
return true
}

errStr := err.Error()
if strings.Contains(errStr, yamuxStreamClosed) ||
strings.Contains(errStr, yamuxSessionShutdown) {
strings.Contains(errStr, yamuxSessionShutdown) ||
strings.HasSuffix(errStr, io.EOF.Error()) {
Copy link
Contributor

@dnephin dnephin Sep 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes I see the problem. Do you know where the rpc error: prefix comes from? I'm not seeing anywhere that would append an EOF.

I think if there are specific cases like this that we know can not be unwrapped, we could first check that the error is a specific type, then check the string, something like

if errors.Is(err, rpc.ServerError{}) && strings.HasSuffix(err.Error(), ': EOF') {}

I think if we can track down the specific cases where this is a problem we can at least be very struct about the comparison.

lib/eof_test.go Outdated
)

func TestErrIsEOF(t *testing.T) {
t.Parallel()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know a lot of the tests use t.Parallel, but it can cause problems sometimes so I've been trying to remove it where possible. I think in this case, since these tests run so fast, it's actually slower to spin up the goroutine to run the tests in parallel, than it would be to run it without parallel.

Suggested change
t.Parallel()

lib/eof_test.go Outdated

for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
t.Parallel()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
t.Parallel()

lib/eof_test.go Outdated
require.True(t, IsErrEOF(tt.err))
})
t.Run(fmt.Sprintf("Wrapped %s", tt.name), func(t *testing.T) {
t.Parallel()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
t.Parallel()

lib/eof_test.go Outdated
require.True(t, IsErrEOF(fmt.Errorf("test: %w", tt.err)))
})
t.Run(fmt.Sprintf("String suffix is %s", tt.name), func(t *testing.T) {
t.Parallel()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
t.Parallel()

@pierreca
Copy link
Contributor Author

@dnephin would you mind giving this another look and let me know if these last commits address everything you wanted?

Copy link
Contributor

@dnephin dnephin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this implementation is looking good, but I realized I still may not fully understand the specific problem this is fixing. I need to read over your detailed comment above again more carefully to make sure I understand. My main concern is that this IsErrEOF function is called from a few places, and I'm not sure if this change will be correct for all of the places.

lib/eof.go Outdated Show resolved Hide resolved
@pierreca
Copy link
Contributor Author

@dnephin no worries, I understand, i've been on the maintainer side for years in the past. it's hard. happy to jump in a slack conversation or something like that if it makes it easier.

The bottom line is that I have found that the nomad retry logic is broken in rpc forwarding scenarios and that part of the problem is not all EOF errors aren't properly identified as such. What this change does is widen the scope of what gets recognized as EOF to include wrapped io.EOF errors and EOF errors wrapped in ServerError that are sent in the RPC response body.

What are the risks?

IDK what your test coverage looks like but if there are tests I can write to help you make sure this doesn't break, just point me where. I think there is one or two more changes needed in nomad after that to fix retries in RPC forwarding chains and that'd be a win for a lot of users :)

@pierreca
Copy link
Contributor Author

pierreca commented Oct 5, 2020

@dnephin ping? anything else I can do to help get this checked in? we hit errors related to nomad not retrying when it should very consistently. there's more work that needs to happen to fix the problems but this is at the root of a lot of missed retries.

@pierreca
Copy link
Contributor Author

@dnephin still hoping to upstream this, lmk if I can help address any doubt that this is a good fix.

Copy link
Contributor

@dnephin dnephin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the fix!

One suggestion for some additional test cases, but not blocking.

Comment on lines +18 to +23
{name: "EOF", err: io.EOF},
{name: "Wrapped EOF", err: fmt.Errorf("test: %w", io.EOF)},
{name: "yamuxStreamClosed", err: yamux.ErrStreamClosed},
{name: "yamuxSessionShutdown", err: yamux.ErrSessionShutdown},
{name: "ServerError(___: EOF)", err: rpc.ServerError(fmt.Sprintf("rpc error: %s", io.EOF.Error()))},
{name: "Wrapped ServerError(___: EOF)", err: fmt.Errorf("rpc error: %w", rpc.ServerError(fmt.Sprintf("rpc error: %s", io.EOF.Error())))},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple tests that are expected to return false might be good as well, to check that errors which are not EOF are not matched.

@dnephin dnephin merged commit 931023f into hashicorp:master Mar 16, 2021
@hashicorp-ci
Copy link
Contributor

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/338197.

@hashicorp-ci
Copy link
Contributor

🍒✅ Cherry pick of commit 931023f onto release/1.9.x succeeded!

hashicorp-ci pushed a commit that referenced this pull request Mar 16, 2021
@hashicorp-ci
Copy link
Contributor

🍒✅ Cherry pick of commit 931023f onto release/1.8.x succeeded!

hashicorp-ci pushed a commit that referenced this pull request Mar 16, 2021
@hashicorp-ci
Copy link
Contributor

🍒✅ Cherry pick of commit 931023f onto release/1.7.x succeeded!

hashicorp-ci pushed a commit that referenced this pull request Mar 16, 2021
dizzyup pushed a commit that referenced this pull request Apr 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
theme/internal-cleanup Used to identify tech debt, testing improvements, code refactoring, and non-impactful optimization type/bug Feature does not function as expected
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants