
Only primary with slots has the right to mark a node as failed #634

Merged
merged 7 commits into unstable from mark_fail on Jun 17, 2024

Conversation

enjoy-binbin
Member

@enjoy-binbin enjoy-binbin commented Jun 12, 2024

In markNodeAsFailingIfNeeded we count needed_quorum and failures;
needed_quorum is half of cluster->size plus one, and cluster->size
is the number of primary nodes that own slots. But when counting
failures, we did not check whether the reporting primary owns slots.

Only primaries that own slots have the right to vote. This PR adds a new
clusterNodeIsVotingPrimary helper to formalize that concept.
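
For illustration, here is a minimal sketch of the new helper and the corrected counting; helper and field names (clusterNodeIsMaster, numslots, fail_reports, clusterNodeFailReport) are assumptions and may differ from the merged code in src/cluster_legacy.c:

    /* Sketch only, not the merged implementation. A "voting primary" is a
     * primary that owns at least one slot, i.e. a node counted in
     * cluster->size. */
    static inline int clusterNodeIsVotingPrimary(clusterNode *n) {
        return clusterNodeIsMaster(n) && n->numslots > 0;
    }

    /* Inside markNodeAsFailingIfNeeded: count only reports from voting
     * primaries toward the quorum. */
    int needed_quorum = server.cluster->size / 2 + 1;
    int failures = 0;
    listIter li;
    listNode *ln;
    listRewind(node->fail_reports, &li);
    while ((ln = listNext(&li)) != NULL) {
        clusterNodeFailReport *report = ln->value;
        if (!clusterNodeIsVotingPrimary(report->node)) continue; /* the fix */
        failures++;
    }
    if (clusterNodeIsVotingPrimary(myself)) failures++; /* our own PFAIL flag */
    if (failures < needed_quorum) return; /* quorum not reached; stay PFAIL */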

Release notes:

bugfix where nodes not in the quorum group might spuriously mark nodes as failed

Signed-off-by: Binbin <[email protected]>
@enjoy-binbin enjoy-binbin requested review from PingXie and madolson June 12, 2024 08:00

codecov bot commented Jun 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.16%. Comparing base (5d9d418) to head (acfdbb8).
Report is 2 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #634      +/-   ##
============================================
- Coverage     70.19%   70.16%   -0.03%     
============================================
  Files           110      110              
  Lines         60049    60050       +1     
============================================
- Hits          42149    42135      -14     
- Misses        17900    17915      +15     
Files                   Coverage Δ
src/cluster_legacy.c    85.80% <100.00%> (+0.17%) ⬆️

... and 7 files with indirect coverage changes

Contributor

@zuiderkwast zuiderkwast left a comment


This makes sense. Thanks!

The cluster spec says that a majority of primaries need to flag a node as PFAIL for it to become FAIL. We need to update that text: it should count only the primaries with the right to vote, which are the ones counted in the cluster size, as you mentioned.

I think this bug can let a minority of voting primaries, with help from empty primaries, trigger a failover election. If this minority is in a netsplit, the failover can't succeed and the replicas will try to get elected over and over.
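
As a worked illustration of that scenario (hypothetical numbers, not taken from the PR):

    /* Hypothetical cluster: 3 slot-owning primaries + 2 empty primaries.
     * cluster->size = 3, so needed_quorum = 3/2 + 1 = 2.
     * Pre-fix, a partition holding 1 voting primary and 2 empty primaries
     * could reach failures = 2 reports + its own PFAIL flag = 3 >= 2 and
     * mark a node FAIL. Replicas then start elections, but winning a
     * failover vote still requires a majority of voting primaries, so the
     * elections retry indefinitely until the partition heals. */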

@CharlesChen888
Member

Technically, nodes with or without slots are equally effective at failure detection, so I don't think it is a good idea to just ignore the votes of nodes with no slots.

Perhaps we can change the definition of needed_quorum to half of the total count of primary nodes plus one. In that case, we don't need to make sure slots are distributed among an odd number of nodes; we can distribute slots among an even number of nodes and add one extra node to ensure voting is effective.
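
A quick arithmetic sketch of that alternative definition (hypothetical numbers):

    /* Proposed alternative: count ALL primaries, not only slot owners.
     * Hypothetical: 4 primaries share the slots evenly, plus 1 empty
     * primary -> 5 primaries total, needed_quorum = 5/2 + 1 = 3.
     * The even slot distribution still yields an odd-sized voter set,
     * so no odd-numbered slot distribution is required. */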

@enjoy-binbin
Member Author

I remember @zuiderkwast had a voting-replicas idea, if I remember it correctly. It is a good idea for an empty primary to have the right to vote, but I think that is another topic; in this PR I just want to fix the definition (count against the cluster size).

we don't need to make sure slots are distributed among an odd number of nodes; we can distribute slots among an even number of nodes and add one extra node to ensure voting is effective.

In Tencent Cloud we have a similar design: we flag a node as an arbiter node, an empty primary (no slots) that has the right to vote. So in a one-shard cluster (or in non-cluster mode, where we converted the primary-replica setup into a single-shard cluster with 16384 slots) we do this: one shard (16384 slots) plus two arbiter nodes (0 slots, with the right to vote), so the quorum can be met to do the failover. I can publish it if you are interested in adding it, or we can just add a new configuration that gives empty primaries the right to vote.

@madolson
Member

Perhaps we can change the definition of needed_quorum to half of the total count of primary nodes plus one. In that case, we don't need to make sure slots are distributed among an odd number of nodes; we can distribute slots among an even number of nodes and add one extra node to ensure voting is effective.

Once a cluster is in steady state, nodes can be added without consensus, which means nodes could disagree about the quorum size at a given epoch while nodes are being added. We need a way to identify that a node is part of the cluster config at a given epoch, which is done today when we migrate a slot into the new primary.

I can publish it if you are interested in adding it, or we can just add a new configuration that gives empty primaries the right to vote.

I would rather have some type of config that indicates a node is able to participate in the quorum and it's attached to the epoch. I don't like the idea of having primaries without slots just arbitrarily participate. It's really no different than just letting everyone vote.

Member

@madolson madolson left a comment


Agree with Viktor's minor suggestion, but the general change seems good to me.

@madolson madolson added the release-notes This issue should get a line item in the release notes label Jun 13, 2024
Signed-off-by: Binbin <[email protected]>
@PingXie
Member

PingXie commented Jun 13, 2024

We need a way to identify that a node is part of the cluster config at a given epoch, which is done today when we migrate a slot into the new primary.

I don't think we currently consult the epochs when computing the quorum size. The dynamic-quorum issue is a real problem in the current cluster design/implementation (even if we count only non-empty primaries).

I would rather have some type of config that indicates a node is able to participate in the quorum and it's attached to the epoch. I don't like the idea of having primaries without slots just arbitrarily participate. It's really no different than just letting everyone vote.

+1. I like this idea. We should explore further next :-)

Member

@PingXie PingXie left a comment


LGTM in general.

Signed-off-by: Binbin <[email protected]>

Co-authored-by: Ping Xie <[email protected]>
Contributor

@zuiderkwast zuiderkwast left a comment


Good fix! Good test case!

What to mention in release notes? Shall we mark it as a bugfix?

@madolson
Member

I don't think we currently consult the epochs when computing the quorum size.

We do. We include a node in the quorum if it has slots and has an epoch greater than our own. On the flip side, the primary receiving the request will compare epochs to see if it can vote.
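
For context, a simplified sketch of the epoch checks a primary performs before granting a failover vote, modeled on clusterSendFailoverAuthIfNeeded; the names and exact conditions here are assumptions and may differ from the current source:

    /* Only voting primaries grant votes. */
    if (nodeIsSlave(myself) || myself->numslots == 0) return;
    /* Refuse requests with a stale epoch. */
    if (requestCurrentEpoch < server.cluster->currentEpoch) return;
    /* Never vote twice in the same epoch. */
    if (server.cluster->lastVoteEpoch == server.cluster->currentEpoch) return;
    /* ...further checks (slot config epochs, vote delay), then grant... */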

@madolson
Member

What to mention in release notes? Shall we mark it as a bugfix?

Mention that it's a bugfix where nodes not in the quorum group might spuriously mark nodes as failed.

@madolson madolson merged commit db6d3c1 into valkey-io:unstable Jun 17, 2024
18 checks passed
@enjoy-binbin enjoy-binbin deleted the mark_fail branch June 17, 2024 03:49
PingXie added a commit to PingXie/valkey that referenced this pull request Jul 8, 2024
PingXie added a commit to PingXie/valkey that referenced this pull request Jul 9, 2024
PingXie added a commit to PingXie/valkey that referenced this pull request Jul 9, 2024
PingXie added a commit that referenced this pull request Jul 10, 2024
Labels
cluster release-notes This issue should get a line item in the release notes
Projects
Status: Backported
Status: Done
5 participants