Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GOBBLIN-1744] Improve handling of null value edge cases when querying Helix #3603

Merged

Conversation

homatthew
Copy link
Contributor

@homatthew homatthew commented Nov 18, 2022

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots (if applicable):

I made 2 helix null checks in 2 separate spots

(1) HelixAssignedParticipantCheck


In production, we've seen that the helix assigned participant check failed due to helix issues but not due to a split brain. When helix returns null, this actually means that the data does not exist. This is an unexpected case and we can assume that Helix itself is having issues (i.e. not a Gobblin side issue).

I am adding this log because if the Helix assigned participant check fails, this is most likely a Helix issue but it's not immediately obvious what the exact issue is. I've added 2 likely scenarios we've seen internally as common scenarios where oncall has seen this as the rootcause.

(2) HelixUtils#getWorkflowIdsFromJobNames(HelixManager helixManager, Collection jobNames)


kafka-streaming-replanner-tracking INFO - Caused by: java.lang.NullPointerException
kafka-streaming-replanner-tracking INFO - at org.apache.gobblin.cluster.HelixUtils.getWorkflowIdsFromJobNames(HelixUtils.java:327)
kafka-streaming-replanner-tracking INFO - at org.apache.gobblin.prototype.kafka.replanner.GobblinHelixClusterJob.getWorkflowIdFromJobName(GobblinHelixClusterJob.java:142)
kafka-streaming-replanner-tracking INFO - at org.apache.gobblin.prototype.kafka.replanner.GobblinHelixClusterJob.getJobActive(GobblinHelixClusterJob.java:175)
kafka-streaming-replanner-tracking INFO - ... 54 more

This is a similar case where Helix returns a null value. This can be caused when this util is called during a replanner / restart of the helix workflow. It can also be caused by a helix data consistency issue. The code doesn't expect a null and will fail with NPE. It is much better to fail with a more descriptive exception. The code calling this util already retries, so this is to avoid any confusion about if this is an infra issue or a bug in our code.

I've also made a small change to how we fetch the workflow configs. In the original implementation, we call the getWorkflowConfig() again after getting the workflow map. I think this is causing some weird inconsistent state. We get a workflow and then if it is somehow deleted by the time we call getWorkflowConfig(), then we end up with a null value.

There's really no reason to split up the call since they do the same thing and we are just being wasteful by making extra zookeeper reads.

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

HelixAssignedParticipantCheckTest

The existing test in helix assigned participant check triggers this because it returns a null participant from mock helix. I've attached the corresponding log output that we should see.

2022-11-17 18:23:02 PST ERROR [pool-35-thread-1] org.apache.gobblin.cluster.HelixAssignedParticipantCheck 143 - The current assigned participant is null. This implies that 
		(a)Helix failed to write to zookeeper, which is often caused by lack of compression leading / exceeding zookeeper jute max buffer size (Default 1MB)
		(b)Helix reassigned the task (unlikely if this current task has been running without issue. Helix does not have code for reassigning "running" tasks)
Note: This logic is true as of Helix version 1.0.2 and ZK version 3.6

This test is not run in CI so I attached a screenshot below
image

HelixUtilsTest

I've also added tests in the HelixUtilsTest. I mock Helix API responses and replicate a workflow -> job -> task DAG.
image

org.apache.gobblin.cluster.GobblinHelixUnexpectedStateException: Received null workflow config from Helix. We should not see any null configs when reading all workflows. workflowId=null-workflow-to-throw-exception
	at org.apache.gobblin.cluster.HelixUtils.getWorkflowIdsFromJobNames(HelixUtils.java:410)
	at org.apache.gobblin.cluster.HelixUtilsTest.testGetWorkunitIdForJobNamesWithInvalidHelixState(HelixUtilsTest.java:149)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
...

ClusterIntegrationTest
The cluster integration test tests the replanner and job clean up. The workflowIdFromJobName is used in the replanner to check the helix workflow to restart. So I am pretty sure I didn't break anything. Not sure if this integration test runs in the CI so I've added a screenshot of the tests passing.
image

GobblinHelixJobSchedulerTest
image

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

@codecov-commenter
Copy link

codecov-commenter commented Nov 18, 2022

Codecov Report

Merging #3603 (8ce90a9) into master (cb2668e) will increase coverage by 1.61%.
The diff coverage is 56.25%.

@@             Coverage Diff              @@
##             master    #3603      +/-   ##
============================================
+ Coverage     46.55%   48.16%   +1.61%     
+ Complexity    10649     7828    -2821     
============================================
  Files          2129     1474     -655     
  Lines         83295    58094   -25201     
  Branches       9281     6684    -2597     
============================================
- Hits          38777    27982   -10795     
+ Misses        40954    27507   -13447     
+ Partials       3564     2605     -959     
Impacted Files Coverage Δ
...gobblin/cluster/HelixAssignedParticipantCheck.java 0.00% <0.00%> (ø)
.../cluster/GobblinHelixUnexpectedStateException.java 100.00% <100.00%> (ø)
...in/java/org/apache/gobblin/cluster/HelixUtils.java 44.44% <100.00%> (+1.26%) ⬆️
.../gobblin/cluster/GobblinHelixTaskStateTracker.java 62.50% <0.00%> (-6.25%) ⬇️
...a/org/apache/gobblin/cluster/GobblinHelixTask.java 62.36% <0.00%> (-4.31%) ⬇️
...ava/org/apache/gobblin/writer/FsWriterMetrics.java
...ice/modules/orchestration/AzkabanClientStatus.java
...n/kafka/writer/AbstractKafkaDataWriterBuilder.java
.../modules/flowgraph/FlowGraphConfigurationKeys.java
...org/apache/gobblin/kafka/tool/KafkaCheckpoint.java
... and 653 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@homatthew homatthew force-pushed the mh-nullcheck-replanner-GOBBLIN-1744 branch 2 times, most recently from ad1f16f to 4528cb8 Compare November 18, 2022 05:47
@homatthew homatthew changed the title [GOBBLIN-1744] Improve logging in null cases when querying from Helix [GOBBLIN-1744] Improve handling of null value edge cases when querying Helix Nov 18, 2022
@homatthew homatthew force-pushed the mh-nullcheck-replanner-GOBBLIN-1744 branch from 4528cb8 to de633cc Compare November 18, 2022 05:55
}

return isAssignedParticipant;
if (participant == null) {
Copy link
Contributor Author

@homatthew homatthew Nov 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please read the PR description to understand why these null checks are needed in the PR and how I tested these changes. These pieces of code are important and critical pieces with a fair amount of nuance related to helix.

@homatthew homatthew force-pushed the mh-nullcheck-replanner-GOBBLIN-1744 branch 2 times, most recently from 369c7f9 to ea6d6e8 Compare November 18, 2022 06:54
@homatthew homatthew marked this pull request as ready for review November 18, 2022 07:27
Map<String, WorkflowConfig> workflowConfigMap = taskDriver.getWorkflows();
for (String workflow : workflowConfigMap.keySet()) {
WorkflowConfig workflowConfig = taskDriver.getWorkflowConfig(workflow);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This previous version made 2 calls to to zookeeper when only 1 is necessary. Doing 2 calls makes it more likely to see inconsistent state.

Copy link
Contributor

@ZihanLi58 ZihanLi58 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me. If there is anything that's not expected from helix side, we should throw it out in util method, it's the caller's responsibility to decide how to handle the exception here

for (Map.Entry<String, WorkflowConfig> entry : workflowConfigMap.entrySet()) {
String workflow = entry.getKey();
WorkflowConfig workflowConfig = entry.getValue();
if (workflowConfig == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, if this is not expected, throw exception instead of just log it out

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made change to throw exception. I made the choice to create a new custom exception because there are other places where make API calls to ZK / Helix, and I want to use this exception to explicitly say there is a Helix issue.

The caller of this API (Job scheduler) is using a retryer that will automatically retry this API call again if there is an exception.

@homatthew homatthew force-pushed the mh-nullcheck-replanner-GOBBLIN-1744 branch 2 times, most recently from cc5ec87 to 7ed8432 Compare January 10, 2023 23:31
@homatthew homatthew force-pushed the mh-nullcheck-replanner-GOBBLIN-1744 branch from 7ed8432 to 8ce90a9 Compare January 11, 2023 00:02
Copy link
Contributor

@ZihanLi58 ZihanLi58 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thanks for the detail context here!

@ZihanLi58 ZihanLi58 merged commit 6aaf485 into apache:master Jan 11, 2023
phet added a commit to phet/gobblin that referenced this pull request Jan 18, 2023
* upstream/master:
  upgrade hadoop version (apache#3621)
  [GOBBLIN-1744] Improve handling of null value edge cases when querying Helix (apache#3603)
  Handle d2 markup case when not only announcing leader (apache#3622)
  Update Gobblin OSS Slack channel link to a never-expire link (apache#3620)
  [GOBBLIN-1750] Add schemas for observability events in GaaS (apache#3610)
  [GOBBLIN-1757]Refactor manifest, add reader/writer and iterator for efficient reading (apache#3618)
  Disable flaky HiveMaterializerTest on CI/CD (apache#3619)
  [GOBBLIN-1754] Fixes for mysql store change monitors (apache#3615)
  [GOBBLIN-1756] Fix the issue that we skipping flows for multihop jobs (apache#3617)
  [GOBBLIN-1752] Fix race condition where FSTemplateCatalog would update at the same t… (apache#3612)
  [GOBBLIN-1753] Migrate DB connection pool from o.a.commons.dbcp/dbcp2 to HikariCP (apache#3613)
  [GOBBLIN-1748] Add logs to debug multi-hop flows creation, progression, and cleanup (apache#3608)
Will-Lo added a commit to Will-Lo/incubator-gobblin that referenced this pull request Jan 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants