Add auto deregistration of offline participants after timeout #2932

Open
wants to merge 7 commits into
base: master
from

GrantPSpencer
Contributor

@GrantPSpencer GrantPSpencer commented Oct 1, 2024

Issues

  • My PR addresses the following Helix issues and references them in the PR description:
    New feature that lets the controller purge participants that have been offline for longer than a user-defined timeout.

Description

  • Here are some details about my PR, including screenshots of any UI changes:
    Participants can automatically join a Helix cluster when they start up. However, when they go down permanently, they must be removed manually or purged by an external workflow before they actually leave the cluster. These stale participants can have a significant negative impact on the cluster in at least two ways:
  • MAX_OFFLINE_INSTANCES_ALLOWED - If this cluster-level config is exceeded, the cluster is put into maintenance mode.
  • CRUSHED calculations - CRUSHED only guarantees evenness when all nodes in a cluster are online. The more offline nodes there are in the cluster, the greater the maximum possible unevenness.
    As a result, Helix's view of the cluster's health diverges from the cluster's actual health.

Code Changes:

  • Added PARTICIPANT_DEREGISTRATION_ENABLED and PARTICIPANT_DEREGISTRATION_TIMEOUT properties to ClusterConfig.
  • Added a ParticipantDeregistration stage that removes participants that have been offline for longer than the configured timeout.
    • This stage also schedules a follow-up onDemandRebalance so that participants that are currently offline but have not yet exceeded the deregistration timeout are deregistered once they do.
  • Updated ZkTestBase with addParticipant and dropParticipant methods to be leveraged across different test classes.
    • Subsequently changed TestAddResourceWhenRequireDelayedRebalanceOverwrite.java and TestForceKillInstance.java to leverage these changes.
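
A minimal sketch of the selection logic described for the ParticipantDeregistration stage, written against plain Java collections rather than Helix types. The class name `DeregistrationSketch`, the `evaluate` method, the `offlineSince` map, and the choice to treat a participant as expired once `now` reaches its deadline are all assumptions made for illustration; this is not the code in the PR.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; none of these names come from Helix.
final class DeregistrationSketch {

  /** Outcome of one pass: who to drop now, and when to look again (-1 if no follow-up is needed). */
  static final class Result {
    public final Set<String> toDeregister;
    public final long nextDeregisterTime;

    Result(Set<String> toDeregister, long nextDeregisterTime) {
      this.toDeregister = toDeregister;
      this.nextDeregisterTime = nextDeregisterTime;
    }
  }

  /**
   * @param offlineSince map of offline participant name to the time (ms) it went offline
   * @param timeoutMs    deregistration timeout in milliseconds
   * @param now          time (ms) at which the stage runs
   */
  static Result evaluate(Map<String, Long> offlineSince, long timeoutMs, long now) {
    Set<String> toDeregister = new HashSet<>();
    long next = -1L; // -1 means "no follow-up scheduled yet"
    for (Map.Entry<String, Long> entry : offlineSince.entrySet()) {
      long deadline = entry.getValue() + timeoutMs;
      if (deadline <= now) {
        // Assumption: reaching the deadline counts as exceeding the timeout.
        toDeregister.add(entry.getKey());
      } else if (next == -1L || deadline < next) {
        // Still within the timeout: remember the earliest future deadline.
        next = deadline;
      }
    }
    return new Result(toDeregister, next);
  }

  public static void main(String[] args) {
    Map<String, Long> offlineSince = new HashMap<>();
    offlineSince.put("a", 0L);   // deadline 500, already passed at now = 1000
    offlineSince.put("b", 900L); // deadline 1400, still in the future
    Result result = evaluate(offlineSince, 500L, 1000L);
    System.out.println(result.toDeregister);       // prints [a]
    System.out.println(result.nextDeregisterTime); // prints 1400
  }
}
```

In this sketch the returned `nextDeregisterTime` is the point at which a follow-up pass would be worth scheduling; how the real stage triggers its onDemandRebalance is not shown here.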

Tests

  • The following tests are written for this issue:
    TestParticipantDeregistrationStage.java

  • The following is the result of the "mvn test" command on the appropriate module:

$ mvn test -o -Dtest=TestParticipantDeregistrationStage -pl=helix-core

[INFO] Tests run: 5, Failures: 0, Errors: 0, Skipped: 0
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  01:32 min
[INFO] Finished at: 2025-01-27T14:19:05-08:00
[INFO] ------------------------------------------------------------------------

Changes that Break Backward Compatibility (Optional)

  • My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:

N/A. The feature is optional and is off by default.

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

@GrantPSpencer GrantPSpencer force-pushed the participant-auto-deregistration branch from 7d1c222 to ea94999 Compare October 4, 2024 21:25
@GrantPSpencer GrantPSpencer force-pushed the participant-auto-deregistration branch from ea94999 to 8e2e7ae Compare November 21, 2024 00:07
@GrantPSpencer
Contributor Author

Hi @junkaixue, sorry for the long delay in responding to feedback. I had two on-call rotations, then vacation, and then got sick. Please take a look at this when you can. Ideally, we would like this to be incorporated into the upcoming release.

@GrantPSpencer GrantPSpencer force-pushed the participant-auto-deregistration branch from 8e2e7ae to 45190e9 Compare November 21, 2024 17:39
Contributor

@junkaixue junkaixue left a comment

My only concern is the drop instance API. Feel free to discuss it with @xyuanlu.

long deregisterDelay = clusterConfig.getParticipantDeregistrationTimeout();
long stageStartTime = System.currentTimeMillis();
Set<String> participantsToDeregister = new HashSet<>();
long nextDeregisterTime = Long.MAX_VALUE;
Contributor

Use -1 instead of Long.MAX_VALUE? It may have an overflow issue.

Contributor Author

Updated to use -1. The DelayRebalance logic uses Long.MAX_VALUE; we should consider updating it in the future.
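
As a generic illustration of the overflow concern with a Long.MAX_VALUE sentinel (not a claim about where it would occur in this PR), the sketch below shows that any arithmetic on the sentinel wraps silently, while a -1 sentinel forces an explicit check. The class `SentinelSketch`, the constant `NOT_SCHEDULED`, and the helper `earliest` are hypothetical names.

```java
// Hypothetical illustration only; these names are not part of Helix.
final class SentinelSketch {
  static final long NOT_SCHEDULED = -1L;

  /** Returns the earlier of the two times, treating -1 as "nothing scheduled yet". */
  static long earliest(long current, long candidate) {
    return (current == NOT_SCHEDULED || candidate < current) ? candidate : current;
  }

  public static void main(String[] args) {
    // Adding any positive delay to Long.MAX_VALUE wraps around to a negative value.
    long wrapped = Long.MAX_VALUE + 1_000L;
    System.out.println(wrapped < 0); // prints true

    // With -1 as the sentinel, the "unset" case is handled by an explicit comparison.
    long next = NOT_SCHEDULED;
    next = earliest(next, 5_000L);
    next = earliest(next, 3_000L);
    System.out.println(next); // prints 3000
  }
}
```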

@xyuanlu
Contributor

xyuanlu commented Jan 19, 2025

LGTM. My only comments are about style.
Also, please update the PR description.

Contributor

@xyuanlu xyuanlu left a comment

LGTM!

@GrantPSpencer
Contributor Author

@junkaixue Could you please give this PR another review whenever you have time?

@GrantPSpencer
Contributor Author

CI failed due to flaky test #2906
