Add auto deregistration of offline participants after timeout #2932
Hi @junkaixue, sorry for the huge delay in responding to feedback. I had two on-call rotations, then vacation, and then got sick. Please take a look at this when you can. Ideally, we would like this to be incorporated into the upcoming release if possible.
Only concern is the drop instance API. Feel free to discuss with @xyuanlu
long deregisterDelay = clusterConfig.getParticipantDeregistrationTimeout();
long stageStartTime = System.currentTimeMillis();
Set<String> participantsToDeregister = new HashSet<>();
long nextDeregisterTime = Long.MAX_VALUE;
Use -1 instead of max long? It may have an overflow issue.
Updated to use -1. The DelayRebalance logic utilizes Long.MAX_VALUE. We should consider updating it in the future.
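To make the sentinel discussion above concrete, here is a minimal, Helix-independent sketch of the timeout check with -1 standing for "no deregistration pending". It is not the code from this PR: the class name DeregistrationSketch, the Result holder, the evaluate method, and the offlineSince map (participant name to the time it went offline, in milliseconds) are all hypothetical names chosen for illustration, and the <= comparison is an assumption.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; none of these names come from the Helix code base. */
final class DeregistrationSketch {
  /** Sentinel meaning "no future deregistration is pending". */
  static final long NO_PENDING = -1L;

  /** Simple holder for the outcome of one evaluation pass. */
  static final class Result {
    final Set<String> participantsToDeregister = new HashSet<>();
    long nextDeregisterTime = NO_PENDING;
  }

  /**
   * @param offlineSince assumed input: participant name -> time (ms) it went offline
   * @param deregisterDelay configured timeout in ms
   * @param now stage start time in ms
   */
  static Result evaluate(Map<String, Long> offlineSince, long deregisterDelay, long now) {
    Result result = new Result();
    for (Map.Entry<String, Long> entry : offlineSince.entrySet()) {
      long deregisterTime = entry.getValue() + deregisterDelay;
      if (deregisterTime <= now) {
        // Timeout already elapsed: this participant is due for removal.
        result.participantsToDeregister.add(entry.getKey());
      } else if (result.nextDeregisterTime == NO_PENDING
          || deregisterTime < result.nextDeregisterTime) {
        // Track the earliest future deadline; the sentinel check replaces
        // the "start from Long.MAX_VALUE and take the minimum" pattern.
        result.nextDeregisterTime = deregisterTime;
      }
    }
    return result;
  }
}
```

With the sentinel, a caller checks nextDeregisterTime != NO_PENDING before scheduling a follow-up, rather than doing arithmetic on Long.MAX_VALUE.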
LGTM. Only comments are for styles.
LGTM!
@junkaixue Could you please give this PR another review whenever you have time?
CI failed due to flaky test #2906
Issues
New feature that allows the controller to purge participants that have been offline for longer than a user-defined timeout.
Description
Participants can automatically join a Helix cluster when they start up. However, when they permanently go down, they must be manually removed or purged by an external workflow in order to actually leave the cluster. These stale participants can have a significant negative impact on the cluster in at least 2 ways:
MAX_OFFLINE_INSTANCES_ALLOWED
- If this cluster-level config is exceeded, then the cluster will be put into maintenance mode.
CRUSHED Calculations
- CRUSHED only guarantees evenness when all nodes in a cluster are online. The more offline nodes in the cluster, the larger the max degree of unevenness that is possible.
This causes Helix's view of the cluster's health and the actual health of the cluster to diverge.
Code Changes:
- Added PARTICIPANT_DEREGISTRATION_ENABLED and PARTICIPANT_DEREGISTRATION_TIMEOUT properties to ClusterConfig.
- Updated ZkTestBase with addParticipant and dropParticipant methods to be leveraged across different test classes.
- Updated TestAddResourceWhenRequireDelayedRebalanceOverwrite.java and TestForceKillInstance.java to leverage these changes.
Tests
The following tests are written for this issue:
TestParticipantDeregistrationStage.java
The following is the result of the "mvn test" command on the appropriate module:
Changes that Break Backward Compatibility (Optional)
N/A. The feature is optional and is off by default.