Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

etcdserver: update corruption check #14510

Closed
wants to merge 4 commits into from
Closed

Conversation

ahrtr
Copy link
Member

@ahrtr ahrtr commented Sep 23, 2022

  1. Added one new field "HashRevision" into HashKVResponse. The field is populated when responding to leader or client's requests. It's the revision up to which the returned hash is calculated. Please see a72a92f
  2. Currently the leader uses its latest revision to get peers' hashes, it isn't correct, instead it should use the same revision which it uses to calculate the hash locally. Although it has the same results, but it isn't semantically correct. Please see f5590b1

Signed-off-by: Benjamin Wang [email protected]

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

cc @serathius @spzala @ptabor

@codecov-commenter
Copy link

codecov-commenter commented Sep 23, 2022

Codecov Report

Merging #14510 (123920d) into main (54f9483) will decrease coverage by 0.23%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main   #14510      +/-   ##
==========================================
- Coverage   75.41%   75.17%   -0.24%     
==========================================
  Files         457      457              
  Lines       37271    37263       -8     
==========================================
- Hits        28109    28014      -95     
- Misses       7399     7474      +75     
- Partials     1763     1775      +12     
Flag Coverage Δ
all 75.17% <100.00%> (-0.24%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
server/etcdserver/api/v3rpc/maintenance.go 76.00% <100.00%> (+0.65%) ⬆️
server/etcdserver/corrupt.go 88.62% <100.00%> (-0.82%) ⬇️
server/etcdserver/api/membership/store.go 50.00% <0.00%> (-10.00%) ⬇️
client/v3/namespace/watch.go 87.87% <0.00%> (-6.07%) ⬇️
server/storage/mvcc/watchable_store.go 85.14% <0.00%> (-5.80%) ⬇️
server/etcdserver/api/v3rpc/watch.go 84.29% <0.00%> (-3.85%) ⬇️
server/etcdserver/txn/util.go 75.47% <0.00%> (-3.78%) ⬇️
server/etcdserver/api/v3rpc/util.go 70.96% <0.00%> (-3.23%) ⬇️
server/etcdserver/api/v3rpc/interceptor.go 74.47% <0.00%> (-3.13%) ⬇️
client/v3/experimental/recipes/double_barrier.go 71.87% <0.00%> (-3.13%) ⬇️
... and 17 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@ahrtr ahrtr marked this pull request as draft September 23, 2022 05:11
@ahrtr ahrtr force-pushed the corrupt_check_rev branch 3 times, most recently from c07ef77 to 9017ad3 Compare September 23, 2022 05:31
Run
1. ./script/genproto.sh
2. ./scripts/update_proto_annotations.sh

Signed-off-by: Benjamin Wang <[email protected]>
1. get peer's hash using the same revision as the value used by leader;
2. populate HashRevision when responding to leader or client's request.

Signed-off-by: Benjamin Wang <[email protected]>
@ahrtr ahrtr changed the title etcdserver: get peer's hash using the same revision used by the leader etcdserver: update corruption check Sep 23, 2022
@ahrtr ahrtr marked this pull request as ready for review September 23, 2022 06:27
@ahrtr
Copy link
Member Author

ahrtr commented Sep 23, 2022

The PR is ready for review. Please review this PR commit by commit. cc @serathius @spzala @ptabor

@serathius
Copy link
Member

serathius commented Sep 23, 2022

  1. Added one new field "HashRevision" into HashKVResponse. The field is populated when responding to leader or client's requests. It's the revision up to which the returned hash is calculated. Please see a72a92f

Not sure I understand the motivation. Revision is already included in HashKVRequest, so adding it to HashKVResponse seems redundant or could lead to bugs. In what cases you expect that revision in response will not match the request? For me client should always return hash for revision that was requested or return error.

  1. Currently the leader uses its latest revision to get peers' hashes, it isn't correct, instead it should use the same revision which it uses to calculate the hash locally. Although it has the same results, but it isn't semantically correct. Please see f5590b1

Ok, here I think I get your point. PeriodicCheck pases 0 to HashByRev and makes a assumption that it will return latestRevision=hash.Revision. I don't remember why I didn't clean this up, but It makes sense.

@ahrtr
Copy link
Member Author

ahrtr commented Sep 23, 2022

Not sure I understand the motivation. Revision is already included in HashKVRequest, so adding it to HashKVResponse seems redundant or could lead to bugs. In what cases you expect that revision in response will not match the request? For me client should always return hash for revision that was requested or return error.

When the client doesn't set the rev (defaults to 0) at all, then it doesn't know what's the revision up to which the hash is calculated. See example below, latest revision ( 8 in this example) isn't necessary the revision related to the hash.

$ etcdctl endpoint hashkv -w json | jq
[
  {
    "Endpoint": "127.0.0.1:2379",
    "HashKV": {
      "header": {
        "cluster_id": 14841639068965180000,
        "member_id": 10276657743932975000,
        "revision": 8,
        "raft_term": 4
      },
      "hash": 2695297616,
      "compact_revision": -1
    }
  }
]

@serathius
Copy link
Member

I can agree that calling HashKV with revision 0 doesn't make sense currently. However I'm not sure we should fix it. Checking hash values is not something I would expect users to do or be interested. Users should only care about their system being consistent, not what hash etcd currently has. I was also surprised that etcdctl has command for that. I would treat it as a more of internal, then user facing API.

It's true that existing consistency checking mechanism is not perfect, and in some cases users would want to manually the confirm the consistency. I however would say this is bad user experience.

Instead of making the internal API more user friendly, we should invest into making consistency checks better so user will never need to use HashKV by themselves. Polishing the API doesn't make sense if no user will use it. I also think that for v3.6 we should invest into raft driven consistency checking, which would make this API obsolete.

@ahrtr
Copy link
Member Author

ahrtr commented Sep 23, 2022

Thanks for the feedback.

However I'm not sure we should fix it. Checking hash values is not something I would expect users to do or be interested. Users should only care about their system being consistent, not what hash etcd currently has. I was also surprised that etcdctl has command for that. I would treat it as a more of internal, then user facing API.

Points from my side:

  1. We can't make any assumption on how users use it and whether they are interested. Please note that users can also create customized application to use the clientv3 directly.
  2. Personally I think it's a useful debug tool (I mean the check HashKV API/functionality) for developers and probably for advanced users.
  3. No matter which direction we will follow in the next step (deprecate it or continue to support it (I prefer to continue to support it)), I suggest to fix it because it's an issue to me for now.

What do you think? @serathius @spzala @ptabor

@ahrtr
Copy link
Member Author

ahrtr commented Sep 28, 2022

Do you have any comments or concerns on this PR? @serathius @spzala @ptabor

@serathius
Copy link
Member

  1. We can't make any assumption on how users use it and whether they are interested. Please note that users can also create customized application to use the clientv3 directly.

Don't agree. Users can do many things, but it doesn't mean we need to support every possibility. We need to make assumptions to provide best support for intended paths. It would only result in maintainers spreading too thin. I think that making the feature work is much more important than polishing an internal API.

Personally I think it's a useful debug tool (I mean the check HashKV API/functionality) for developers and probably for advanced users.

It works as well without HashByRev(0). It requires additional call to API to get latest revision, so we could improve user experience, however I don't think it's worth to improve UX for internal API.

No matter which direction we will follow in the next step (deprecate it or continue to support it (I prefer to continue to support it)), I suggest to fix it because it's an issue to me for now.

Don't understand the statement. Please clarify what issue it poses. My understanding that this PR is just a cleanup of API and code. There is no bug that this fixes.

@ahrtr
Copy link
Member Author

ahrtr commented Sep 28, 2022

Thanks for the feedback. Please see my response below.

It works as well without HashByRev(0). It requires additional call to API to get latest revision, so we could improve user experience, however I don't think it's worth to improve UX for internal API.

HashKV is a public API, which was added more than 5 years ago. The related command etcdctl endpoint hashkv was also added more than 5 years ago. Both of them are public. I am not sure whether we are talking about the same thing.

I also think we should keep supporting the API.

Don't understand the statement. Please clarify what issue it poses.

The default rev value of command etcdctl endpoint hashkv is 0. When users run command something like etcdctl endpoint hashkv -w table --cluster, since etcdctl just gets members' hashes one by one, so if there are any new K/V data in between, then the users will get an inconsistent hashes (which is actually expected), accordingly they may take wrong action to replace the "problematic" member. This is the issue I talked about. Of course, uses can clearly set a rev to resolve this. But this is definitely a nice to have enhancement/fix to me.

$ etcdctl endpoint hashkv -w table --cluster
+------------------------+------------+
|        ENDPOINT        |    HASH    |
+------------------------+------------+
|  http://127.0.0.1:2379 | 1084519789 |
| http://127.0.0.1:22379 | 1084519789 |
| http://127.0.0.1:32379 | 1084519789 |
+------------------------+------------+

@ahrtr
Copy link
Member Author

ahrtr commented Sep 29, 2022

@ahrtr ahrtr closed this Sep 29, 2022
@serathius
Copy link
Member

I can agree that etcdctl endpoint hashkv -w table --cluster is not very useful, however adding revision to response doesn't improve this as user will want to query all members about the same RV.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants