-
Notifications
You must be signed in to change notification settings - Fork 742
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure only primary sender drives slot ownership updates #754
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## unstable #754 +/- ##
============================================
+ Coverage 70.25% 70.28% +0.03%
============================================
Files 112 112
Lines 60590 60587 -3
============================================
+ Hits 42567 42586 +19
+ Misses 18023 18001 -22
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question from the issue (repeating here):
Is it any problematic to handle topology update from replica with higher config epoch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change seems to make sense but I don't fully understand the fix and how it solves the problem. The essence of the problem can be seen in the diff screenshot in #753? Please update the PR description to independently describe the change, rather than only referring to the issue.
It's harder to review the fix when there's refactoring in the same commit. It's easier if the refacoring and the acutal fix are at least in two separate commits.
(I imagine Oran Agra would have rejected the renaming of variables, to avoid any kind of unnecessary changes.)
Yeah I think it will work but I feel that it might involve a bigger change. This change on the other hand is more scoped, relatively speaking. We can definitely explore this idea separately/incrementally. |
Yes - replicas updating the slot ownership on behalf of their primaries is the regression (or behavior change) introduced by #445.
I don't agree with the "unnecessary changes". All these changes are either a continuation/fine-tune of #445 or for a good reason of improving readability. I have gone through this code path more times than I can count but I still get lost easily because the naming is just so confusing. A blank statement of "unnecessary changes" is just saying "I don't know what I am doing" - sorry for the unnecessarily strong response :) |
Sorry for causing frustration! I know you've very familiar with this code. More so than the rest of us. With "unnecessary changes", I just meant changes that don't change the logic, i.e. refactoring, not strictly necessary to fix the bug. I didn't say I'm against it. Only that I'd prefer them in separate commits, since it would have made it easier for me to read the diff and spot the actual behaviour change. The diff is +68 −82, while the actual fix is ~5 lines. Even the title "ensure only primary sender drives..." seemed a bit cryptic to me at first, but of course it's obvious to anyone who has this code fresh in memory. (My though processes involved "What's a primary sender again... Hm.. right, it's when we're receiving a packet on the cluster bus from a primary.") I'm trying to make reviewing easier for myself by requiring more from the contributors. Not sure if it's sane but I want to be able to handle more PRs faster without being burnt out. I should know you're in the same position. |
No worries - we are all doing our jobs.
I disagree. I don't consider this a refactoring, which I agree is better handled in a separate PR. There is a reason why I introduced the regression in the first place. Our discussions on the names show the exact problem of this code not having established clear mental concepts and this was partly the reason why I introduced the regression in the first place. Making these changes in a separate PR loses the exact context of why these changes are needed. And refactoring PRs don't normally get the right amount of scrutinization, which is what I want for this PR. I am all for any nitpick that could help improve the code quality in such a critical part of the code base.
I believe I have explained the issue to my best ability in #753, which I linked to this PR as well. Have you got a chance to check it out? I have a tendency to split the bug and PR. Will make sure to carry the context over for this (and future) PR. |
Signed-off-by: Ping Xie <[email protected]>
Signed-off-by: Ping Xie <[email protected]>
Signed-off-by: Ping Xie <[email protected]>
Signed-off-by: Ping Xie <[email protected]>
Signed-off-by: Ping Xie <[email protected]>
Signed-off-by: Ping Xie <[email protected]>
I have updated the names per recommendation. I think they look clearer now. @zuiderkwast PTAL. Also I think there is another (0-day) bug. We never check the sender's config epoch before accepting its claim of being a primary. I will address it in a separate PR :). https://github.com/valkey-io/valkey/blob/unstable/src/cluster_legacy.c#L3119 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! These "sender-claimed" names definitely make things more clear.
Also I think there is another (0-day) bug. We never check the sender's config epoch before accepting its claim of being a primary. I will address it in a separate PR :).
OK. I wouldn't mind having it in the same PR (just a separate commit within the PR for easier reading). But a separate PR is also good to if we want to backport it or something.
I don't consider this a refactoring, which I agree is better handled in a separate PR.
I never said separate PR. That would be excessive. I just meant separate commits within the same PR. It might have made it easier to review, though this may just have been me trying to find excuses for not understanding the fix.
I believe I have explained the issue to my best ability in #753, which I linked to this PR as well. Have you got a chance to check it out? I have a tendency to split the bug and PR. Will make sure to carry the context over for this (and future) PR.
Yes, the issue is very good. The problem description and the description of the change are not the same things though. In the PR I think it's good to describe the actual change. The problem description doesn't need to be repeated in detail in the PR when it's explained in the issue.
For the PR description, I'd be happy just what you wrote under "Fix:", and possibly mentioning the additional changes to make it easier to understand the diff. No need to update it now, but I'm copy-pasting some sentences here just to show what I mean, what I think is enough, yet useful for reviewing and also enough to have in the commit log after merging:
Fixes a regression introduced in PR #445, which allowed a message from a replica
to update the slot ownership of its primary. The regression results in a
replicaof cycle, causing server crashes due to the cycle detection assert. The
fix restores the previous behavior where only primary senders can trigger
clusterUpdateSlotsConfigWith.
Additional changes:
* Handling of primaries without slots is obsoleted by new handling of when a
sender that was a replica announces that it is now a primary.
* Replication loop detection code is unchanged but shifted downwards.
* Some variables are renamed for better readability and some are introduced to
avoid repeated memcmp() calls.
Thanks @zuiderkwast!
Let me start a new PR on this one. It is probably easier that way.
Got it. My bad. I misunderstood. Reviewing commits is a good point.
Make sense. I can do that too. |
I want it in a separate PR. I'm going to fairly strongly disagree with Viktor that I think in this specific case the refactor made the change harder for me to follow. I think that is because I'm more familiar with the code, outside of the newly introduced variables I found everything harder to follow. |
Yeah, allowing replicas to update slot ownership on behalf of their primaries is still the change that makes me feel uncomfortable. In the past the algorithm was to just trust primaries as much as possible. |
This is likely proximity bias :). I would suggest also reviewing the function without diff'ing line by line, and then back to the diff. This is how I come to being at peace with it. Let me know what you think afterwards. |
Everyone is biased. |
I'll stand behind my original statement that the refactoring made evaluating the technical changes more difficult, and would still prefer it in a separate PR. I also think that is a better broad decision to take as a team. The problem with a separate commit is that we squash and merge, and lose the commit. In some cases you are introducing new concepts, and for that the new names make sense. But in other it just seems like you are renaming variables. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the change makes sense. We probably should avoid taking input from the replica about three state of its primaries as much as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I can do a separate PR in the future.
Will clean this PR up in a bit. |
Signed-off-by: Ping Xie <[email protected]>
Signed-off-by: Ping Xie <[email protected]>
Fixes a regression introduced in PR #445, which allowed a message from a replica
to update the slot ownership of its primary. The regression results in a
replicaof
cycle, causing server crashes due to the cycle detection assert. Thefix restores the previous behavior where only primary senders can trigger
clusterUpdateSlotsConfigWith
.Additional changes:
sender that was a replica announces that it is now a primary.
avoid repeated memcmp() calls.
Fixes #753.