-
Notifications
You must be signed in to change notification settings - Fork 9.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Plans for v3.5.3 release #13894
Comments
@serathius not sure if you want to add #13895 as well. |
I think we need 13838 as well, because it will make the reproduction of the data inconsistency issue easier. |
Hey @ahrtr, It would be great to have automatic way to reproduce this issue, however this is not required immediate release. #13838 includes changes that might negatively impact overall stability of functional tests making it hard to merge. I don't plan to make this code mergeable, to avoid investing to much time into functional testing as we might want to change our overall approach to correctness testing after the data inconsistency issue. The data inconsistency issue should be fairly tested with #13838, but I don't see merging this change as a blocking for v3.5.3. |
@serathius is it possible to include #13460 "Support multiple values for allowed client and peer TLS identities" ? |
@riyazcom Even though not supporting multiple client & peer TLS identities is a limitation of etcd, it is not a bug and introducing it would be basically be an additional feature. Based on that I don't think we should back port. |
what is the manual test plan for reproducing this ? is there a k8s centric test we can run ? I was doing a combination of running e2es while rebooting nodes, and didnt see this occur. I think it requires unrestricted API QoS , constrained etcd memory, and a few other tricky knobs. |
@jayunit100 please read the original issue #13766 (comment), it includes description of reproduction and link to PR with code #13838. Let's continue the discussion there. |
@serathius #13692 doesn't seems required in 3.5.3. I am adding triage comments there. |
13666 is just a question instead of an issue. I already provided answer and closed it. Definitely it isn't required in 3.5.3. |
Please also remove 13690 from the list per the discussion in the PR. |
Just closed 13848 with comment 13848#issuecomment-1093449532. |
Hi ETCD maintainers, As the community has alerted the customers not use the ETCD 3.5.[0-2] in production env due to the issue data inconsistency, but is there any plan to speed up the 3.5.3 release considering some products has already blocked by the issue? |
@yuyangbj Thanks for your interest in v3.5.3 release. Apart of just "fixing" the issue, we also need to make sure we do proper diligence to test and qualify the release. We would really not want to make a mistake here and be forced to send another alert. Unfortunately this takes time and with only 3-4 people working on the release it will take us some time. Data inconsistency issue was non-trivial and it took us longer time to figure out a proper fix. Fortunately we finally managed to merge the fix last Friday, so we can start qualifying the release and decide to skip some minor issues proposed here. Best thing I can recommend for products blocked on the release is stop waiting passively and get involved into the release proces. We need more people to test, document and qualify the release. |
@serathius Apologies!! |
New issue that might impact v3.5.3 release. Please someone take a look #13922 |
Thanks @endocrimes, agree it's not a critical release blocker, however for now let's assume it will go in, as qualification will take some time. |
@serathius 👍 - what's necessary for qualification? (not sure if there's a test process that gets followed or if setting up some soak/scale tests out-of-tree would be helpful for example) |
That's great question, apparently v3.5 release haven't followed the same qualification as previous releases (#13839 (comment)). With maintainer rotation we lost knowlegde about processes previously executed manually, no mention of qualification testing in https://etcd.io/docs/v3.3/dev-internal/release/. I would be great to get idea what kind of tests were done before. @xiang90 @jpbetz any recommendation about the testing? |
In addition to the qualification mentioned before, would be good to confirm that two data inconsistency issues found were fixed. Instructions for their reproduction: |
@serathius Have we in the past have some method to try a k8s (presubmit?) testing with a branch of etcd? (curious, if so, then we could try that?) |
We run v3.5.0 in K8s presubmits for long time (we still are), however they didn't find any issues we fixed for v3.5.3 so I expect they are not good qualification method. |
@serathius I'm also bumping the jepsen etcd tests for 3.5.x too. (Starting with running them against 3.4.x and 3.5.2 as a baseline). I don't think they're high scale enough tests to catch the corruption issues we hit here, but potentially a nice thing to get running semi continuously somewhere. |
Thanks Marek @serathius for the prompt response. And I understand everything needs a complete tests. I do not want to push the release, so I am sorry if it makes a misunderstanding like it. And our downstream team has actively involved to reproduce the data inconsistency issue, will update to here if we have test results. |
Just resolved in pull/13930, it should be a legacy issue (or an old regression) instead of a regression. |
Current plan for qualification:
We would be happy to accept any other ideas to make sure the release is stable and no new data corruption issues are found. |
@shalinmangar in #13839 (comment) you mentioned that you have experience with jepsen. It would be great if you can help @endocrimes as we cannot reproduce the data inconsistencies from v3.5.2 there. |
Qualification work is almost finished, I would like to prepare to start the release process. First let's collect acts from people involved: @ptabor @spzala @ahrtr @endocrimes |
So current state of jepsen testing:
Future plan:
Update: 18:14 Watch and Lock tests failed on my first run - investigating whether it was a jepsen issue or etcd issue. |
That list looks complete to me. Thank you @serathius and @ahrtr. I think that these items are below the bar (but calling them explicitly):
@serathius: Out of curiosity: Have you been running the repro test for several hours with patched 3.5.2 or just the O(10)min test ? |
Jepsen Watch test failures seem to be a jepsen issue, but I haven't run into those issues in any of the previous version runs, so I'm... curious. Moving on to lock tests for now. |
Locks (leases) seem to be consistently flaky - The lease ends up closing early, and thus the test fails. Gonna investigate further after dinner. |
Gonna look into this with fresh eyes tomorrow - getting a headache - but it feels like potentially a clojure/client issue |
Sooo close!! |
I think it's worth to mention @serathius 's PR pull/13927. Ran the load test against mounted RAM filesystem might be an easier way to reproduce the data inconsistent issue. I will give it a try.
Please note that I recently fixed a lease related issue, see pull/13932, which isn't included in 3.5.2. Do you still see flaky on release-3.5 (3.5.3)? If yes, please file an issue, and let's discuss it there.
@serathius already submitted a PR for this, and I will take a closer review today & tomorrow.
Yes, it's a good point. The consistent_index should never decrease based on current design/implementation. But the term might be. I am not totally convinced that the PR pull/13903 fixes the issue issues/13514, because raft should reject a message with old term. |
@serathius awesome, thanks so much! It looks good to me. If we are running more tests, that's great, but otherwise, from the positive feedback on the data inconsistency fix and the tests you and @ahrtr ran, +1 for release work. We can iterate over the release with any new findings, post-release feedback and test improvements under discussion. |
I run:
|
You are right that raft should never allow a message with an old term, however #13514 covers not officially supported downgrade flow, that went around downgrade check, and showed us that term in db cannot be trusted in v3.5. With #13903 goal was not to fix downgrades, but to remove incorrect |
We might need to get this issue 13937 sorted out before releasing 3.5.3. |
@ahrtr Is this a new problem in v3.5? Based on the original issue it seems to be present in v3.3. For v3.5.3 our goal should be addressing new issues that were introduced in v3.5 to ensure there are not regressions when upgrading from v3.4 to v3.5. Fact that there is a long standing issue is worrying, however it should not block v3.5.3 release. |
Happy to help. I see @endocrimes has already made great progress. Is there an issue for Jepsen testing? or should we communicate over the Etcd-dev mailing list? |
Talked with @endocrimes, lease flakes should not impact v3.5.3 release. |
w00t! sounds good @serathius |
ok, after 4 hours, with breaks for restoring sanity and eating, I managed to successfully run release script. |
My immediate feeling is it should be legacy issue instead of a regression. Will take a closer look today. I don't think it's a blocker for 3.5.3 either. |
This turns out to be a regression. But the good news is it should not be a big problem. Please see the explanation in pull/13942 |
Same as previous release #13518. let's agree what work should be included in release and track the work. The list presented below will be evolving as community decides to cut/include some fixes.
Issues/PRs that should be included for v3.5.3:
Issues/PRs not required for v3.5.3:
ETCD_INITIAL_CLUSTER
not handled correctly #13757Please feel free to suggest new additional backports/issues worth investigating so we can discuss if they should be included.
The text was updated successfully, but these errors were encountered: