Add atomic recursive delete to ZK client and use for drop instance #2994

Merged
merged 6 commits into from
Jan 28, 2025

@GrantPSpencer GrantPSpencer commented Jan 13, 2025

Issues

  • My PR addresses the following Helix issues and references them in the PR description:

Feature warranted by discussion from PR #2932
Discussion here: #2932 (comment)

Description

  • Here are some details about my PR, including screenshots of any UI changes:

This PR adds an atomic recursive delete method, deleteRecursivelyAtomic, and uses it when dropping or purging instances from the cluster. The method either fully succeeds or fully fails, so a partially deleted instance can no longer leave the cluster in an invalid state. It behaves like the existing deleteRecursively method, with the addition of atomicity. Two signatures are available: one deletes a single path and its children, and the other deletes multiple paths and all of their children. Atomicity across different paths is needed because dropping an instance must itself be atomic.
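To illustrate the ordering problem behind an all-or-nothing recursive delete, here is a minimal sketch, not the code from this PR. It assumes an in-memory map as a stand-in for the client's child lookups, and the class name, method names, and example paths are all hypothetical. It lists every node under the given roots in post-order, so each child comes before its parent. A real client would presumably turn each planned path into a delete op and submit the whole list in a single multi() request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: plans the order of deletes for an all-or-nothing
// recursive delete. It does not talk to ZooKeeper.
class AtomicDeletePlanner {

  // children: in-memory stand-in for child lookups (parent path -> child names).
  // roots: the paths to delete together with all of their descendants.
  static List<String> planDeletes(Map<String, List<String>> children, List<String> roots) {
    List<String> ordered = new ArrayList<>();
    for (String root : roots) {
      collectPostOrder(children, root, ordered);
    }
    return ordered;
  }

  // Post-order walk: a node is added only after all of its descendants,
  // because a node that still has children cannot be deleted.
  private static void collectPostOrder(
      Map<String, List<String>> children, String path, List<String> out) {
    for (String child : children.getOrDefault(path, List.of())) {
      collectPostOrder(children, path + "/" + child, out);
    }
    out.add(path);
  }

  public static void main(String[] args) {
    // Illustrative paths only; they are not taken from the PR.
    Map<String, List<String>> tree = Map.of(
        "/CLUSTER/INSTANCES/host_1", List.of("MESSAGES", "CURRENTSTATES"),
        "/CLUSTER/INSTANCES/host_1/CURRENTSTATES", List.of("session_0"));
    List<String> plan = planDeletes(
        tree, List.of("/CLUSTER/INSTANCES/host_1", "/CLUSTER/CONFIGS/host_1"));
    plan.forEach(System.out::println);
  }
}
```

Planning across several roots in one list is what allows a multi-path delete to be submitted as one transaction rather than one transaction per path.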

The FederatedZkClient implementation throws an IllegalArgumentException when deleteRecursivelyAtomic is called on two paths that live in different ZK realms. Atomicity cannot be guaranteed across realms because each realm is served by a separate ZkClient, so the delete would require two separate calls.
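The realm check can be sketched as follows. This is an assumption-laden illustration rather than the FederatedZkClient code: the `RealmGuard` class, the `requireSingleRealm` method, and the `realmOf` resolver are hypothetical names, and the realm strings are made up.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

// Hypothetical sketch of a pre-flight check before a multi-path atomic delete.
class RealmGuard {

  // realmOf: assumed resolver from a path to the realm that owns it.
  // Returns the single realm shared by all paths, or throws if they span several.
  static String requireSingleRealm(List<String> paths, Function<String, String> realmOf) {
    if (paths.isEmpty()) {
      throw new IllegalArgumentException("No paths given");
    }
    Set<String> realms = new TreeSet<>();
    for (String path : paths) {
      realms.add(realmOf.apply(path));
    }
    if (realms.size() > 1) {
      // One transaction cannot span two clients, so refuse up front.
      throw new IllegalArgumentException(
          "Cannot delete atomically across realms: " + realms);
    }
    return realms.iterator().next();
  }

  public static void main(String[] args) {
    Function<String, String> realmOf =
        p -> p.startsWith("/clusterA") ? "zk-realm-1" : "zk-realm-2";
    System.out.println(
        requireSingleRealm(List.of("/clusterA/INSTANCES/h1", "/clusterA/CONFIGS/h1"), realmOf));
  }
}
```

Failing before any delete is issued keeps the all-or-nothing contract intact: the caller never ends up with one realm cleaned and the other untouched.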

Tests

  • The following tests are written for this issue:

testDropInstance in TestZkHelixAdmin.java
testDeleteRecursivelyAtomic in TestRawZkClient.java

  • The following is the result of the "mvn test" command on the appropriate module:

Ran manual CI on my personal fork and saw no new failing tests introduced.

Changes that Break Backward Compatibility (Optional)

  • My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:

N/A

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

try {
multi(ops);
}
catch (Exception e) {
Contributor

  1. I feel like we should check the multi result. Even if an exception is thrown, multi may or may not have succeeded.
  2. I think catching specific exception classes might be better than a global catch here, with a corresponding error message logged for each. For example, if we get an InterruptedException, we know there is a concurrent edit of the paths.

related doc:
OpResult: https://www.javadoc.io/static/org.apache.zookeeper/zookeeper/3.7.1/org/apache/zookeeper/OpResult.DeleteResult.html
Multi(): https://www.javadoc.io/doc/org.apache.zookeeper/zookeeper/3.7.1/org/apache/zookeeper/ZooKeeper.html#multi-java.lang.Iterable-

Contributor Author

  1. Added a check for any ErrorResults in the response from the multi() call. The class only provides int values that correspond to specific ZK KeeperExceptions. We can convert each int value to a KeeperException.Code and include the list of codes when we log and throw the error.
  2. The goal with the global catch was to follow the same pattern as the original deleteRecursively method, which catches everything and throws a ZkClientException. Do you think specific error catching is necessary in this case? My understanding is that the original exception type (let's say InterruptedException) will still be printed to the log and will still be preserved as the cause of the new ZkClientException. The main drawback is that callers of this API will not be able to handle specific failures themselves; they will only be able to catch and respond to ZkClientException.

Thank you for the feedback @xyuanlu 🙏
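The trade-off in point 2 of the reply above can be shown with a small self-contained sketch. `ClientException` here is a stand-in defined for illustration, not Helix's ZkClientException, and `runWrapped` is a hypothetical helper. It shows that a global catch still preserves the original exception as the cause, while callers only ever see the wrapper type.

```java
// Hypothetical sketch of the "global catch and rethrow" pattern.
class WrappedFailureDemo {

  // Stand-in for the client's exception type, defined here so the
  // sketch compiles on its own.
  static class ClientException extends RuntimeException {
    ClientException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  // An operation that may throw any checked exception.
  interface Action {
    void run() throws Exception;
  }

  // Runs the action and wraps any failure, keeping the original as the cause.
  static void runWrapped(String description, Action action) {
    try {
      action.run();
    } catch (Exception e) {
      throw new ClientException("Failed to " + description, e);
    }
  }

  public static void main(String[] args) {
    try {
      runWrapped("delete paths atomically", () -> {
        throw new InterruptedException("simulated interruption");
      });
    } catch (ClientException e) {
      // The caller can only catch the wrapper, but the cause is still reachable.
      System.out.println(e.getMessage() + " (cause: " + e.getCause() + ")");
    }
  }
}
```

A caller that needs to react to a specific failure has to inspect getCause(), which is the drawback the reply describes.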

xyuanlu commented Jan 23, 2025

Generally LGTM. Please address final comments.

@GrantPSpencer
Contributor Author

Pull request approved by: @xyuanlu
Commit message: Add atomic recursive delete to ZK client and use for drop instance

@xyuanlu xyuanlu merged commit 892fc27 into apache:master Jan 28, 2025
3 of 4 checks passed