
Remove infinite loop in doSerialize #10218

Merged (1 commit, Jan 9, 2019)

Conversation

@horkhe (Contributor) commented Oct 26, 2018

Problem

If authentication is enabled, then changing the permissions of a role while there are active readers/writers authenticated with that role can cause etcd nodes to consume 100% CPU.

Root cause

Consider the following function:

// doSerialize handles the auth logic, with permissions checked by "chk", for
// a serialized request "get". Returns a non-nil error on authentication failure.
func (s *EtcdServer) doSerialize(ctx context.Context, chk func(*auth.AuthInfo) error, get func()) error {
	for {
		ai, err := s.AuthInfoFromCtx(ctx)
		if err != nil {
			return err
		}
		if ai == nil {
			// chk expects non-nil AuthInfo; use empty credentials
			ai = &auth.AuthInfo{}
		}
		if err = chk(ai); err != nil {
			if err == auth.ErrAuthOldRevision {
				continue
			}
			return err
		}
		// fetch response for serialized request
		get()
	// empty credentials or current auth info means no need to retry
		if ai.Revision == 0 || ai.Revision == s.authStore.Revision() {
			return nil
		}
	// to avoid a TOCTOU error, a retry of the request is required
	}
}

Synopsis: Once chk(ai) fails with auth.ErrAuthOldRevision, it will always fail with the same error regardless of how many times you retry.

Details: ai is retrieved from the incoming request's auth token and carries the revision number from the time the connection was authenticated, so this number never changes while inside the loop. On the other hand, chk returns auth.ErrAuthOldRevision whenever the ai revision differs from the revision in s.authStore, and the store revision is incremented every time a role is updated. So the ai revision is constant while the store revision can only increase; hence the check will always fail.

Solution

Just return the error and let the client deal with it by re-authenticating.
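The proposed change can be illustrated with a minimal, hypothetical Go sketch (AuthInfo, doSerialize, and the revision parameter below are simplified stand-ins, not the actual etcd code): once the token's revision differs from the store's, the request fails immediately instead of spinning in the loop.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrAuthOldRevision mirrors auth.ErrAuthOldRevision from etcd's auth package.
var ErrAuthOldRevision = errors.New("etcdserver: revision of auth store is old")

// AuthInfo is a simplified stand-in for auth.AuthInfo: it carries the
// auth-store revision captured when the client's token was issued.
type AuthInfo struct{ Revision uint64 }

// doSerialize models the fixed control flow: run the permission check once
// and surface ErrAuthOldRevision to the caller instead of retrying, since
// the token's revision can never catch up with the store inside the loop.
func doSerialize(ai *AuthInfo, storeRevision uint64, get func()) error {
	if ai == nil {
		ai = &AuthInfo{} // chk expects non-nil AuthInfo; use empty credentials
	}
	// A stale token (non-zero revision differing from the store's) can only
	// be refreshed by re-authenticating, so fail fast.
	if ai.Revision != 0 && ai.Revision != storeRevision {
		return ErrAuthOldRevision
	}
	get() // fetch the response for the serialized request
	return nil
}

func main() {
	// Token issued at store revision 3; the store has since moved to 5:
	// the request is rejected instead of looping at 100% CPU.
	fmt.Println(doSerialize(&AuthInfo{Revision: 3}, 5, func() {}))
}
```

Per the review discussion below, the merged change still keeps a post-get() revision comparison to guard against concurrent auth-store updates; this sketch omits it for brevity.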

@gyuho-bot

Welcome @horkhe! It looks like this is your first PR to etcd-io/etcd 🎉🎉

@gyuho-bot

Hi @horkhe. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@codecov-io commented Oct 26, 2018

Codecov Report

Merging #10218 into master will decrease coverage by 0.14%.
The diff coverage is 63.63%.


@@            Coverage Diff             @@
##           master   #10218      +/-   ##
==========================================
- Coverage   71.64%    71.5%   -0.15%     
==========================================
  Files         390      390              
  Lines       36369    36367       -2     
==========================================
- Hits        26056    26003      -53     
- Misses       8493     8548      +55     
+ Partials     1820     1816       -4
Impacted Files Coverage Δ
etcdserver/v3_server.go 79.08% <63.63%> (+0.59%) ⬆️
proxy/grpcproxy/register.go 69.44% <0%> (-13.89%) ⬇️
auth/simple_token.go 77.23% <0%> (-9.76%) ⬇️
etcdserver/apply.go 80.36% <0%> (-8.52%) ⬇️
pkg/netutil/netutil.go 63.11% <0%> (-6.56%) ⬇️
pkg/testutil/recorder.go 77.77% <0%> (-3.71%) ⬇️
clientv3/leasing/cache.go 87.77% <0%> (-3.34%) ⬇️
proxy/httpproxy/director.go 80% <0%> (-2.86%) ⬇️
clientv3/leasing/kv.go 89.03% <0%> (-1.67%) ⬇️
etcdserver/api/v2http/client.go 85.51% <0%> (-1.21%) ⬇️
... and 21 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ae25c5e...91e583c.

@jingyih (Contributor) commented Oct 26, 2018

/lgtm

@gyuho-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jingyih
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: jpbetz

If they are not already assigned, you can assign the PR to them by writing /assign @jpbetz in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@xiang90 removed the lgtm label Oct 26, 2018
@xiang90 (Contributor) commented Oct 26, 2018

Just return the error and let the client deal with it by re-authenticating.

I would like someone to explore whether we can retry correctly on the server side. If there is no easy way to do it, we can let the client handle it.

@gyuho Only people on the maintainer list should be able to put the lgtm label. It seems that everyone in the org can do this today, which is not what we want.

@gyuho (Contributor) commented Oct 26, 2018

@xiang90 Yeah, I think the /lgtm label is open to all reviewers. I will see if we can tweak it.

@xiang90 (Contributor) commented Oct 27, 2018

@xiang90 Oh, I see: we actually separate the approve and lgtm labels like k8s does... then it is fine...

@horkhe (Contributor, Author) commented Oct 27, 2018

@xiang90 do you want me to investigate possible server-side retries? Honestly, I do not understand why you need to check whether the version is different at all, unless you are targeting the case where a role had some permissions revoked. In that case, rather than returning auth.ErrAuthOldRevision, chk should confirm that the accessed range is still allowed. The downside is that this extra verification would need to be repeated on every subsequent request, since the auth token would still contain the old version. Making clients re-authenticate avoids the extra check, and it is the simplest possible solution IMHO.

@horkhe (Contributor, Author) commented Oct 27, 2018

I looked a bit more, and it turned out that chk calls auth.(*authStore).isOpPermitted, which checks whether the accessed range is allowed. But first it checks whether the version is correct. I do not understand why you need to check the version when you check the range anyway. It seems to me that as long as the accessed range is allowed, it should not matter what the version was when the session was authenticated. Or am I missing something?

@mitake (Contributor) commented Oct 28, 2018

@horkhe thanks for your PR. It seems that continuing in the case of auth.ErrAuthOldRevision is incorrect. The motivation for checking the version is denying tokens that were issued with old auth information. In this case, the ideal behaviour would be returning the error code and letting the client either authenticate again or report the error.

In addition, the version number validation in this line (https://github.com/etcd-io/etcd/pull/10218/files#diff-a5a4bca15b031f18356513fe1382c3c7L560) should still be performed, because doSerialize() can be executed concurrently with auth store updates.

@horkhe (Contributor, Author) commented Oct 28, 2018

@mitake my question is: why does it matter what the version is? IMHO, as long as the range check passes, it should be irrelevant what the role version had been when the session was authenticated, shouldn't it?

@jingyih removed their assignment Oct 30, 2018
@horkhe (Contributor, Author) commented Oct 30, 2018

By the way, we checked: the client does not re-authenticate on auth.ErrAuthOldRevision. So it also needs to be fixed.

@mitake (Contributor) commented Oct 31, 2018

@horkhe sorry for my late reply.

IMHO, as long as the range check passes, it should be irrelevant what the role version had been when the session was authenticated, shouldn't it?

The motivation for the version number validation is a little complicated. I'd like to share it with you, hopefully this weekend. Is that ok?

@horkhe (Contributor, Author) commented Oct 31, 2018

@mitake sure, whenever it is convenient for you. Now that we are running a patched etcd version, we are not in a rush :-).

@mitake (Contributor) commented Nov 11, 2018

@horkhe Sorry for my late reply. I'd like to explain why the version number validation is required.
The motivation behind the validation mechanism is rejecting operations that are authorized with stale information. Assume the following schedule: 1. a client X calls the Authenticate() RPC and gets a token; 2. another client Y changes the password of X and updates the auth store; 3. X issues a KV request (Range(), or a Txn with read ops) with the token.

If every operation were executed in a serialized manner, nothing bad would happen. The state machine of etcd basically executes its commands serially, so no problems seem possible. However, we have no guarantee about the order of steps 2 and 3. This is because, to avoid auth overhead (especially bcrypt's hash calculation), etcd executes step 2 outside the state machine loop. So X's token can be stale in step 3 if the authentication in step 1 is executed in parallel with step 2.

To avoid serving requests with a token based on a stale auth store state, the version number check is required. The staleness can be detected by comparing the version number in the token (the auth store's version when the token was generated) with the current version number of the auth store. If they differ, the token is invalid.

Probably the best way of handling this is simply rejecting a request with a stale token and letting the client authenticate again. What do you think?
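The schedule above can be modeled with a small, hypothetical Go sketch (authStore, Authenticate, ChangePassword, and isTokenFresh are simplified illustrations, not etcd's real API): even though the password change does not touch key permissions, it bumps the store revision and invalidates the previously issued token.

```go
package main

import "fmt"

// authStore is a deliberately tiny model of etcd's auth store for this
// discussion (hypothetical, not the real API): every update, even a
// password change that leaves key permissions untouched, bumps revision.
type authStore struct{ revision uint64 }

// Authenticate issues a token embedding the current store revision (step 1).
func (s *authStore) Authenticate() (tokenRev uint64) { return s.revision }

// ChangePassword updates the store outside the state machine loop (step 2).
func (s *authStore) ChangePassword() { s.revision++ }

// isTokenFresh is the staleness half of an isOpPermitted-style check:
// the range-permission check (elided here) only runs for fresh tokens.
func (s *authStore) isTokenFresh(tokenRev uint64) bool {
	return tokenRev == s.revision
}

func main() {
	s := &authStore{revision: 7}
	tok := s.Authenticate() // step 1: client X gets a token at revision 7
	s.ChangePassword()      // step 2: client Y changes X's password (revision 8)
	// step 3: X's KV request arrives with the stale token and is rejected,
	// even though X's range permissions never changed.
	fmt.Println(s.isTokenFresh(tok)) // false
}
```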

@horkhe (Contributor, Author) commented Nov 12, 2018

@mitake if I understand you correctly, the point of version checking is to reject access to data via sessions authenticated with old passwords? In that case, version checking is indeed needed.

The solution proposed in this PR solves the first part: it rejects requests made via sessions with stale auth tokens. But the client does not handle those errors internally; it just returns them to end users. So the client code also needs to be updated. I suggest you guys develop the solution on the client side; I do not know that code well enough.
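The client-side handling left open here could look roughly like the following sketch (withReauth is a hypothetical helper, not part of clientv3): on ErrAuthOldRevision, re-authenticate once to refresh the token, then retry the original request.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrAuthOldRevision mirrors the server-side error discussed in this thread.
var ErrAuthOldRevision = errors.New("etcdserver: revision of auth store is old")

// withReauth is a hypothetical helper sketching the client-side fix:
// when a request fails with ErrAuthOldRevision, refresh the token by
// re-authenticating once, then retry the original request.
func withReauth(do func() error, reauth func() error) error {
	err := do()
	if err == ErrAuthOldRevision {
		if aerr := reauth(); aerr != nil {
			return aerr
		}
		return do() // retry with the fresh token
	}
	return err // nil or an unrelated error: no retry
}

func main() {
	calls := 0
	stale := true
	do := func() error {
		calls++
		if stale {
			return ErrAuthOldRevision
		}
		return nil
	}
	reauth := func() error { stale = false; return nil }
	fmt.Println(withReauth(do, reauth), calls) // <nil> 2
}
```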

@mitake (Contributor) commented Nov 12, 2018

@horkhe yes, but you should not remove the validation starting from this line: https://github.com/etcd-io/etcd/pull/10218/files#diff-a5a4bca15b031f18356513fe1382c3c7L559

@horkhe (Contributor, Author) commented Nov 12, 2018

@mitake I can put it back, but honestly I see no point. Consider this: if the version of the store changes a nanosecond after the second check, how is that different? The second check only rejects the handful of reads that happen between the first and second checks, within an interval of a few milliseconds. I would say: why bother? Any complication, albeit small, is still a complication... but it is your call, and I will put it back.

Once chk(ai) fails with auth.ErrAuthOldRevision, it always will, regardless
of how many times you retry. So the error is better returned to fail the
pending request and make the client re-authenticate.
@horkhe (Contributor, Author) commented Nov 26, 2018

@mitake are there any more comments that I need to address?

@mitake (Contributor) commented Nov 26, 2018

@horkhe sorry for my late reply. Yeah, it is a very corner case. But I think having the validation mechanism makes reasoning safer, because it keeps the linearizable semantics. It makes the behavior of etcd predictable; e.g. without it, a combination of an auth config update and a network partition could allow reading keys with stale permissions (the schedule would be: 1. a client issues Authenticate() successfully; 2. an etcd node is partitioned from the majority of nodes; 3. the auth config is updated on the majority of nodes; 4. the client issues a serializable read to the partitioned node). In addition, its runtime cost is very cheap.

I'll review your recent change later. Sorry for keeping you waiting.

@mitake (Contributor) commented Jan 4, 2019

@horkhe sorry for my very late reply... I think this can be merged. The failed CI wouldn't be related to this change. I reran Travis.

@horkhe (Contributor, Author) commented Jan 4, 2019

@mitake thank you.

@xiang90 (Contributor) commented Jan 9, 2019

CI failure is not related. So merging.

@xiang90 merged commit 2063b35 into etcd-io:master Jan 9, 2019
@horkhe (Contributor, Author) commented Jan 10, 2019

@xiang90 thanks a lot. Now the client code needs to be fixed to reconnect on auth.ErrAuthOldRevision. Should I create an issue for that?

@xiang90 (Contributor) commented Jan 10, 2019

@horkhe Can you fix it directly?

@horkhe (Contributor, Author) commented Jan 15, 2019

@xiang90 sorry, but I am not familiar with that part of the code; it has been a while since I looked into this, and I have moved on since then.

@mitake (Contributor) commented Jan 15, 2019

@horkhe @xiang90 I can handle that part. Opened a new issue here: #10408

@horkhe (Contributor, Author) commented Jan 15, 2019

Thank you very much @mitake.
