Add auto deregistration of offline participants after timeout #2932

GrantPSpencer · 2024-10-01T18:30:58Z

Issues

My PR addresses the following Helix issues and references them in the PR description:
New feature that allows the controller to purge participants that have been offline for longer than a user-defined timeout.

Description

Here are some details about my PR, including screenshots of any UI changes:
Participants can automatically join a Helix cluster when they start up. However, when they permanently go down, they must be manually removed or purged by an external workflow in order to actually leave the cluster. These stale participants can have a significant negative impact on the cluster in at least two ways:

MAX_OFFLINE_INSTANCES_ALLOWED - If this cluster level config is exceeded, then the cluster will be put into maintenance mode.
CRUSHED Calculations - CRUSHED only guarantees evenness when all nodes in a cluster are online. The more offline nodes in the cluster, the larger the max degree of unevenness that is possible.
This causes Helix's view of the cluster's health and the actual health of the cluster to diverge.

Code Changes:

Added PARTICIPANT_DEREGISTRATION_ENABLED and PARTICIPANT_DEREGISTRATION_TIMEOUT properties to ClusterConfig.
Added ParticipantDeregistrationStage to handle the logic of removing participants that have been offline longer than the configured timeout.
- This stage also schedules a follow-up onDemandRebalance so that participants that are currently offline but have not yet exceeded the deregistration timeout can be deregistered later.
Updated ZkTestBase with addParticipant and dropParticipant methods to be leveraged across different test classes.
- Subsequently changed TestAddResourceWhenRequireDelayedRebalanceOverwrite.java and TestForceKillInstance.java to leverage these changes.
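
The stage logic described above can be sketched roughly as follows. This is a minimal, simplified model and not the code in ParticipantDeregistrationStage.java: the class `DeregistrationSketch`, the `Result` holder, the `evaluate` method, and the `offlineSince` map are all hypothetical names introduced for illustration, and treating a participant as eligible exactly when the timeout is reached is an assumption.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of the deregistration decision; not the Helix implementation.
class DeregistrationSketch {
  // Outcome of one pass: who can be removed now, and when to run again (-1 = no follow-up needed).
  static final class Result {
    final Set<String> toDeregister = new HashSet<>();
    long nextDeregisterTime = -1;
  }

  /**
   * @param offlineSince offline participant name -> time (ms) at which it went offline
   * @param timeoutMs    configured deregistration timeout in ms
   * @param now          current time in ms
   */
  static Result evaluate(Map<String, Long> offlineSince, long timeoutMs, long now) {
    Result result = new Result();
    for (Map.Entry<String, Long> entry : offlineSince.entrySet()) {
      long deregisterTime = entry.getValue() + timeoutMs;
      if (deregisterTime <= now) {
        // Offline for at least the timeout: eligible for removal in this pass.
        result.toDeregister.add(entry.getKey());
      } else if (result.nextDeregisterTime == -1
          || deregisterTime < result.nextDeregisterTime) {
        // Still within the timeout: remember the earliest time a follow-up run is needed.
        result.nextDeregisterTime = deregisterTime;
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, Long> offlineSince = new HashMap<>();
    offlineSince.put("host_1", 1_000L); // offline for 9s at now = 10s
    offlineSince.put("host_2", 9_000L); // offline for 1s at now = 10s
    Result result = evaluate(offlineSince, 5_000L, 10_000L);
    System.out.println("deregister now: " + result.toDeregister);
    System.out.println("next check at: " + result.nextDeregisterTime);
  }
}
```

In this sketch, the earliest pending deadline is what a follow-up rebalance would be scheduled for; a value of -1 means no offline participant is waiting on the timeout.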

Tests

The following tests are written for this issue:
TestParticipantDeregistrationStage.java
The following is the result of the "mvn test" command on the appropriate module:

$ mvn test -o -Dtest=TestParticipantDeregistrationStage -pl=helix-core

[INFO] Tests run: 5, Failures: 0, Errors: 0, Skipped: 0
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  01:32 min
[INFO] Finished at: 2025-01-27T14:19:05-08:00
[INFO] ------------------------------------------------------------------------

Changes that Break Backward Compatibility (Optional)

My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:

N/A. The feature is optional and is off by default.

Commits

My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters (not including Jira issue reference)
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Code Quality

My diff has been formatted using helix-style.xml
(helix-style-intellij.xml if IntelliJ IDE is used)

helix-core/src/main/java/org/apache/helix/controller/GenericHelixController.java

helix-core/src/main/java/org/apache/helix/model/ClusterConfig.java

helix-core/src/main/java/org/apache/helix/controller/stages/ParticipantDeregistrationStage.java

GrantPSpencer · 2024-11-21T00:31:58Z

Hi @junkaixue, sorry for the huge delay in responding to feedback. I had two on-call rotations, then vacation, and then got sick. Please take a look at this when you can. Ideally, we would like this to be incorporated into the upcoming release if possible.

junkaixue

My only concern is the drop instance API. Feel free to discuss with @xyuanlu.

junkaixue · 2024-12-03T20:16:46Z

helix-core/src/main/java/org/apache/helix/controller/stages/ParticipantDeregistrationStage.java

+    long deregisterDelay = clusterConfig.getParticipantDeregistrationTimeout();
+    long stageStartTime = System.currentTimeMillis();
+    Set<String> participantsToDeregister = new HashSet<>();
+    long nextDeregisterTime = Long.MAX_VALUE;


Use -1 instead of Long.MAX_VALUE? It may have an overflow issue.

GrantPSpencer (Jan 28, 2025): Updated to use -1. The DelayRebalance logic utilizes Long.MAX_VALUE; we should consider updating it in the future.
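
One way an overflow concern of this kind can surface is sketched below. This is an illustration only: the class `SentinelSketch` and the helper `delayUntil` are hypothetical and do not appear in the PR, and the exact scenario the reviewer had in mind is not stated in the thread.

```java
// Hypothetical illustration of why Long.MAX_VALUE is a fragile "nothing scheduled" sentinel.
class SentinelSketch {
  // Returns the delay until the follow-up run, or -1 when nothing is pending.
  static long delayUntil(long nextDeregisterTime, long now) {
    if (nextDeregisterTime == -1) {
      return -1; // sentinel is checked explicitly, so it never enters arithmetic
    }
    return Math.max(0, nextDeregisterTime - now);
  }

  public static void main(String[] args) {
    long sentinel = Long.MAX_VALUE;

    // Plain long arithmetic wraps around silently when the sentinel leaks into a sum.
    long wrapped = sentinel + 1_000L;
    System.out.println("wrapped is negative: " + (wrapped < 0));

    // Math.addExact makes the same overflow visible as an exception.
    try {
      Math.addExact(sentinel, 1_000L);
    } catch (ArithmeticException e) {
      System.out.println("addExact threw: " + e.getMessage());
    }

    // With -1, the "nothing scheduled" case is handled before any arithmetic happens.
    System.out.println("delay with sentinel: " + delayUntil(-1, 10_000L));
    System.out.println("delay with deadline: " + delayUntil(14_000L, 10_000L));
  }
}
```

The trade-off is that -1 must be checked explicitly wherever the value is compared, since it no longer sorts after every real timestamp the way Long.MAX_VALUE does.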

xyuanlu · 2025-01-19T05:24:00Z

LGTM. My only comments are about style.
Also please update the PR description.

xyuanlu

LGTM!

GrantPSpencer · 2025-01-30T01:37:37Z

@junkaixue Could you please give this PR another review whenever you have time?

GrantPSpencer · 2025-01-30T01:37:58Z

CI failed due to flaky test #2906

junkaixue reviewed Oct 1, 2024

GrantPSpencer force-pushed the participant-auto-deregistration branch from 7d1c222 to ea94999 Compare October 4, 2024 21:25

junkaixue reviewed Oct 7, 2024

Add auto deregistration of offline participants after timeout

8d2be71

GrantPSpencer force-pushed the participant-auto-deregistration branch from ea94999 to 8e2e7ae Compare November 21, 2024 00:07

respond reviewer feedback

45190e9

GrantPSpencer force-pushed the participant-auto-deregistration branch from 8e2e7ae to 45190e9 Compare November 21, 2024 17:39

junkaixue reviewed Dec 3, 2024

GrantPSpencer mentioned this pull request Jan 13, 2025

Add atomic recursive delete to ZK client and use for drop instance #2994

Merged

xyuanlu reviewed Jan 19, 2025

GrantPSpencer added 5 commits January 27, 2025 14:13

cleanup logging

9463094

cleanup logging2

02b7c4b

respond feedback

c3e8aea

use -1 istead of long max value

fd3dd4b

formatting

bce0ccf

xyuanlu approved these changes Jan 29, 2025
