Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure only primary sender drives slot ownership updates #754

Merged
merged 9 commits into from
Jul 16, 2024

Conversation

PingXie
Copy link
Member

@PingXie PingXie commented Jul 7, 2024

Fixes a regression introduced in PR #445, which allowed a message from a replica
to update the slot ownership of its primary. The regression results in a
replicaof cycle, causing server crashes due to the cycle detection assert. The
fix restores the previous behavior where only primary senders can trigger
clusterUpdateSlotsConfigWith.

Additional changes:

  • Handling of primaries without slots is obsoleted by new handling of when a
    sender that was a replica announces that it is now a primary.
  • Replication loop detection code is unchanged but shifted downwards.
  • Some variables are renamed for better readability and some are introduced to
    avoid repeated memcmp() calls.

Fixes #753.

Copy link

codecov bot commented Jul 7, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.28%. Comparing base (1a8bd04) to head (a1f0bc0).

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #754      +/-   ##
============================================
+ Coverage     70.25%   70.28%   +0.03%     
============================================
  Files           112      112              
  Lines         60590    60587       -3     
============================================
+ Hits          42567    42586      +19     
+ Misses        18023    18001      -22     
Files Coverage Δ
src/cluster_legacy.c 86.00% <100.00%> (+0.35%) ⬆️

... and 11 files with indirect coverage changes

@PingXie PingXie self-assigned this Jul 7, 2024
Copy link
Collaborator

@hpatro hpatro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question from the issue (repeating here):

Is it any problematic to handle topology update from replica with higher config epoch?

Copy link
Contributor

@zuiderkwast zuiderkwast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change seems to make sense but I don't fully understand the fix and how it solves the problem. The essence of the problem can be seen in the diff screenshot in #753? Please update the PR description to independently describe the change, rather than only referring to the issue.

It's harder to review the fix when there's refactoring in the same commit. It's easier if the refacoring and the acutal fix are at least in two separate commits.

(I imagine Oran Agra would have rejected the renaming of variables, to avoid any kind of unnecessary changes.)

@PingXie
Copy link
Member Author

PingXie commented Jul 10, 2024

Is it any problematic to handle topology update from replica with higher config epoch?

Yeah I think it will work but I feel that it might involve a bigger change. This change on the other hand is more scoped, relatively speaking. We can definitely explore this idea separately/incrementally.

@PingXie
Copy link
Member Author

PingXie commented Jul 10, 2024

The change seems to make sense but I don't fully understand the fix and how it solves the problem. The essence of the problem can be seen in the diff screenshot in #753? Please update the PR description to independently describe the change, rather than only referring to the issue.

Yes - replicas updating the slot ownership on behalf of their primaries is the regression (or behavior change) introduced by #445.

It's harder to review the fix when there's refactoring in the same commit. It's easier if the refacoring and the acutal fix are at least in two separate commits.

(I imagine Oran Agra would have rejected the renaming of variables, to avoid any kind of unnecessary changes.)

I don't agree with the "unnecessary changes". All these changes are either a continuation/fine-tune of #445 or for a good reason of improving readability. I have gone through this code path more times than I can count but I still get lost easily because the naming is just so confusing. A blank statement of "unnecessary changes" is just saying "I don't know what I am doing" - sorry for the unnecessarily strong response :)

@zuiderkwast
Copy link
Contributor

It's harder to review the fix when there's refactoring in the same commit. It's easier if the refacoring and the acutal fix are at least in two separate commits.
(I imagine Oran Agra would have rejected the renaming of variables, to avoid any kind of unnecessary changes.)

I don't agree with the "unnecessary changes". All these changes are either a continuation/fine-tune of #445 or for a good reason of improving readability. I have gone through this code path more times than I can count but I still get lost easily because the naming is just so confusing.

Sorry for causing frustration! I know you've very familiar with this code. More so than the rest of us.

With "unnecessary changes", I just meant changes that don't change the logic, i.e. refactoring, not strictly necessary to fix the bug. I didn't say I'm against it. Only that I'd prefer them in separate commits, since it would have made it easier for me to read the diff and spot the actual behaviour change. The diff is +68 −82, while the actual fix is ~5 lines.

Even the title "ensure only primary sender drives..." seemed a bit cryptic to me at first, but of course it's obvious to anyone who has this code fresh in memory. (My though processes involved "What's a primary sender again... Hm.. right, it's when we're receiving a packet on the cluster bus from a primary.")

I'm trying to make reviewing easier for myself by requiring more from the contributors. Not sure if it's sane but I want to be able to handle more PRs faster without being burnt out. I should know you're in the same position.

@PingXie
Copy link
Member Author

PingXie commented Jul 11, 2024

Sorry for causing frustration! I know you've very familiar with this code. More so than the rest of us.

No worries - we are all doing our jobs.

With "unnecessary changes", I just meant changes that don't change the logic, i.e. refactoring, not strictly necessary to fix the bug. I didn't say I'm against it. Only that I'd prefer them in separate commits, since it would have made it easier for me to read the diff and spot the actual behaviour change. The diff is +68 −82, while the actual fix is ~5 lines.

I disagree. I don't consider this a refactoring, which I agree is better handled in a separate PR. There is a reason why I introduced the regression in the first place. Our discussions on the names show the exact problem of this code not having established clear mental concepts and this was partly the reason why I introduced the regression in the first place. Making these changes in a separate PR loses the exact context of why these changes are needed. And refactoring PRs don't normally get the right amount of scrutinization, which is what I want for this PR. I am all for any nitpick that could help improve the code quality in such a critical part of the code base.

Even the title "ensure only primary sender drives..." seemed a bit cryptic to me at first, but of course it's obvious to anyone who has this code fresh in memory. (My though processes involved "What's a primary sender again... Hm.. right, it's when we're receiving a packet on the cluster bus from a primary.")

I'm trying to make reviewing easier for myself by requiring more from the contributors. Not sure if it's sane but I want to be able to handle more PRs faster without being burnt out. I should know you're in the same position.

I believe I have explained the issue to my best ability in #753, which I linked to this PR as well. Have you got a chance to check it out? I have a tendency to split the bug and PR. Will make sure to carry the context over for this (and future) PR.

@PingXie
Copy link
Member Author

PingXie commented Jul 11, 2024

I have updated the names per recommendation. I think they look clearer now. @zuiderkwast PTAL.

Also I think there is another (0-day) bug. We never check the sender's config epoch before accepting its claim of being a primary. I will address it in a separate PR :).

https://github.com/valkey-io/valkey/blob/unstable/src/cluster_legacy.c#L3119

Copy link
Contributor

@zuiderkwast zuiderkwast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! These "sender-claimed" names definitely make things more clear.

Also I think there is another (0-day) bug. We never check the sender's config epoch before accepting its claim of being a primary. I will address it in a separate PR :).

OK. I wouldn't mind having it in the same PR (just a separate commit within the PR for easier reading). But a separate PR is also good to if we want to backport it or something.

I don't consider this a refactoring, which I agree is better handled in a separate PR.

I never said separate PR. That would be excessive. I just meant separate commits within the same PR. It might have made it easier to review, though this may just have been me trying to find excuses for not understanding the fix.

I believe I have explained the issue to my best ability in #753, which I linked to this PR as well. Have you got a chance to check it out? I have a tendency to split the bug and PR. Will make sure to carry the context over for this (and future) PR.

Yes, the issue is very good. The problem description and the description of the change are not the same things though. In the PR I think it's good to describe the actual change. The problem description doesn't need to be repeated in detail in the PR when it's explained in the issue.

For the PR description, I'd be happy just what you wrote under "Fix:", and possibly mentioning the additional changes to make it easier to understand the diff. No need to update it now, but I'm copy-pasting some sentences here just to show what I mean, what I think is enough, yet useful for reviewing and also enough to have in the commit log after merging:

Fixes a regression introduced in PR #445, which allowed a message from a replica
to update the slot ownership of its primary. The regression results in a
replicaof cycle, causing server crashes due to the cycle detection assert. The
fix restores the previous behavior where only primary senders can trigger
clusterUpdateSlotsConfigWith.

Additional changes:

* Handling of primaries without slots is obsoleted by new handling of when a
  sender that was a replica announces that it is now a primary.
* Replication loop detection code is unchanged but shifted downwards.
* Some variables are renamed for better readability and some are introduced to
  avoid repeated memcmp() calls.

@PingXie
Copy link
Member Author

PingXie commented Jul 12, 2024

Thanks @zuiderkwast!

OK. I wouldn't mind having it in the same PR (just a separate commit within the PR for easier reading). But a separate PR is also good to if we want to backport it or something.

Let me start a new PR on this one. It is probably easier that way.

I never said separate PR. That would be excessive. I just meant separate commits within the same PR. It might have made it easier to review, though this may just have been me trying to find excuses for not understanding the fix.

Got it. My bad. I misunderstood. Reviewing commits is a good point.

The problem description and the description of the change are not the same things though. In the PR I think it's good to describe the actual change. The problem description doesn't need to be repeated in detail in the PR when it's explained in the issue.

For the PR description, I'd be happy just what you wrote under "Fix:", and possibly mentioning the additional changes to make it easier to understand the diff. No need to update it now, but I'm copy-pasting some sentences here just to show what I mean, what I think is enough, yet useful for reviewing and also enough to have in the commit log after merging

Make sense. I can do that too.

@madolson
Copy link
Member

OK. I wouldn't mind having it in the same PR (just a separate commit within the PR for easier reading). But a separate PR is also good to if we want to backport it or something.

I want it in a separate PR. I'm going to fairly strongly disagree with Viktor that I think in this specific case the refactor made the change harder for me to follow. I think that is because I'm more familiar with the code, outside of the newly introduced variables I found everything harder to follow.

@madolson
Copy link
Member

madolson commented Jul 12, 2024

Yes - replicas updating the slot ownership on behalf of their primaries is the regression (or behavior change) introduced by #445.

Yeah, allowing replicas to update slot ownership on behalf of their primaries is still the change that makes me feel uncomfortable. In the past the algorithm was to just trust primaries as much as possible.

@PingXie
Copy link
Member Author

PingXie commented Jul 12, 2024

I think that is because I'm more familiar with the code, outside of the newly introduced variables I found everything harder to follow

This is likely proximity bias :). I would suggest also reviewing the function without diff'ing line by line, and then back to the diff. This is how I come to being at peace with it. Let me know what you think afterwards.

@madolson
Copy link
Member

This is likely proximity bias

Everyone is biased.

@madolson
Copy link
Member

Let me know what you think afterwards.

I'll stand behind my original statement that the refactoring made evaluating the technical changes more difficult, and would still prefer it in a separate PR. I also think that is a better broad decision to take as a team. The problem with a separate commit is that we squash and merge, and lose the commit.

In some cases you are introducing new concepts, and for that the new names make sense. But in other it just seems like you are renaming variables.

Copy link
Member

@madolson madolson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the change makes sense. We probably should avoid taking input from the replica about three state of its primaries as much as possible.

Copy link
Collaborator

@hpatro hpatro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM.

@PingXie @madolson Could we converge and close the PR out? I will further merge unstable to #573 and get it ready.

@PingXie
Copy link
Member Author

PingXie commented Jul 15, 2024

I'll stand behind my original statement that the refactoring made evaluating the technical changes more difficult, and would still prefer it in a separate PR. I also think that is a better broad decision to take as a team. The problem with a separate commit is that we squash and merge, and lose the commit.

Sure. I can do a separate PR in the future.

Could we converge and close the PR out?

Will clean this PR up in a bit.

Signed-off-by: Ping Xie <[email protected]>
@PingXie PingXie merged commit 66d0f7d into valkey-io:unstable Jul 16, 2024
20 checks passed
@PingXie PingXie deleted the replicaof-cycle branch July 16, 2024 20:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regression from PR #445 Incorrectly Allows Slot Ownership Updates via Replica
4 participants