Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: canceled should always be set to true when cancel a watch request #373

Merged
merged 3 commits into from
Dec 6, 2024

Conversation

nic-6443
Copy link
Contributor

@nic-6443 nic-6443 commented Dec 4, 2024

Anytime the server attempts to cancel a watch request, it should set the canceled field to true, regardless of whether err is empty or not. Let err only affect the reason field.

@nic-6443 nic-6443 requested a review from a team as a code owner December 4, 2024 10:07
@brandond
Copy link
Member

brandond commented Dec 4, 2024

Thanks! Can you provide any information on what issue this is causing so that we can validate the fix?

@nic-6443
Copy link
Contributor Author

nic-6443 commented Dec 5, 2024

Thanks! Can you provide any information on what issue this is causing so that we can validate the fix?

Sure, the problem I encountered is as follows: I used kine as a replacement for etcd in the https://github.com/apache/apisix project, using postgres as the backend database. When performing maintenance operations on postgres, it needs to be restarted. If kine happens to receive a watch request at this time, this request will be canceled by kine due to SQL execution failure:

rev, kvs, err := l.log.After(ctx, prefix, revision, 0)
error log:

time="2024-09-24T20:20:23Z" level=error msg="Failed to list / for revision 3095: FATAL: terminating connection due to administrator command (SQLSTATE 57P01)"

Although kine cancels this request, it does not set the canceled field to true, causing abnormal client behavior. The response returned by kine is:

{"result":{"header":{},"watch_id":"1"}}

I have checked the native etcd code and found that when canceling a watch request, they always set canceled to true. I think we should stay consistent with etcd.

@brandond
Copy link
Member

brandond commented Dec 5, 2024

That is interesting. This suggests that the error is not being propagated up from the failed list call, since err != nil is false and the cancelReason is empty. Are you able to run kine with --debug and find the output of the WATCH CANCEL id= log when this happens?

@nic-6443
Copy link
Contributor Author

nic-6443 commented Dec 5, 2024

That is interesting. This suggests that the error is not being propagated up from the failed list call, since err != nil is false and the cancelReason is empty. Are you able to run kine with --debug and find the output of the WATCH CANCEL id= log when this happens?

this is the log:

TRAC[0022] WATCH CANCEL id=1, reason=, compactRev=0

I have considered whether to return the cause of the error to the client, such as FATAL: terminating connection due to administrator command (SQLSTATE 57P01), but considering that this is internal server information, it seems inappropriate to return it to the client. If you think it is necessary to pass this error message through to the client, I can consider recording the error in server.WatchResult and returning it in the cancel response.

@brandond
Copy link
Member

brandond commented Dec 5, 2024

I don't know if we need to expose the raw error message from the sql client to the etcd client, but it does seem correct to return AN error.

@nic-6443
Copy link
Contributor Author

nic-6443 commented Dec 5, 2024

Understood, then I will return a generic database error message.

@nic-6443
Copy link
Contributor Author

nic-6443 commented Dec 5, 2024

@brandond send a error message to client now, since this error only occurs when there is an exception in the database, it is difficult to write automated tests.
Therefore, I verified that the change works locally by injecting errors into the query function.

❯ curl https://localhost:7943/v3/watch --cert /tmp/tls.crt --key /tmp/tls.key -k -X POST --data '
{"create_request":{"key":"L2FwaXNpeC8=","start_revision":61,"range_end":"L2FwaXNpeDA="}}
'
{"result":{"header":{},"watch_id":"1","created":true}}
{"result":{"header":{},"watch_id":"1","canceled":true,"cancel_reason":"failed to execute query in database"}}

Copy link
Member

@brandond brandond left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, one nit!

wr.CompactRevision = compact
wr.CurrentRevision = rev
} else {
errc <- errors.New("failed to execute query in database")
Copy link
Member

@brandond brandond Dec 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Add ErrWatchCancelled = rpctypes.ErrGRPCWatchCanceled in pkg/server/types.go and just return that here as a generic watch failure message. We try to return the same errors as upstream whenever possible, even if the cause isn't necessarily the same.

If you can think of a better error from https://pkg.go.dev/go.etcd.io/etcd/api/v3/v3rpc/rpctypes to use please feel free.

Copy link
Contributor Author

@nic-6443 nic-6443 Dec 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, great, I think ErrGRPCUnhealthy is better to express the failure come from system level.

{"result":{"header":{},"watch_id":"1","created":true}}
{"result":{"header":{},"watch_id":"1","canceled":true,"cancel_reason":"rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}}

@brandond brandond changed the title fix: canceled should always be set totrue when cancel a watch request fix: canceled should always be set to true when cancel a watch request Dec 6, 2024
@brandond brandond merged commit ffc8345 into k3s-io:master Dec 6, 2024
3 checks passed
@nic-6443 nic-6443 deleted the nic/fix-cancel-response branch December 6, 2024 06:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants