[GOBBLIN-1744] Improve handling of null value edge cases when querying Helix #3603

homatthew · 2022-11-18T02:39:48Z

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

My PR addresses the following Gobblin JIRA issues and references them in the PR title. For example, "[GOBBLIN-1744] My Gobblin PR"
- https://issues.apache.org/jira/browse/GOBBLIN-1744

Description

Here are some details about my PR, including screenshots (if applicable):

I made 2 helix null checks in 2 separate spots

(1) HelixAssignedParticipantCheck

In production, we've seen that the helix assigned participant check failed due to helix issues but not due to a split brain. When helix returns null, this actually means that the data does not exist. This is an unexpected case and we can assume that Helix itself is having issues (i.e. not a Gobblin side issue).

I am adding this log because if the Helix assigned participant check fails, this is most likely a Helix issue but it's not immediately obvious what the exact issue is. I've added 2 likely scenarios we've seen internally as common scenarios where oncall has seen this as the rootcause.

(2) HelixUtils#getWorkflowIdsFromJobNames(HelixManager helixManager, Collection jobNames)

kafka-streaming-replanner-tracking INFO - Caused by: java.lang.NullPointerException
kafka-streaming-replanner-tracking INFO - at org.apache.gobblin.cluster.HelixUtils.getWorkflowIdsFromJobNames(HelixUtils.java:327)
kafka-streaming-replanner-tracking INFO - at org.apache.gobblin.prototype.kafka.replanner.GobblinHelixClusterJob.getWorkflowIdFromJobName(GobblinHelixClusterJob.java:142)
kafka-streaming-replanner-tracking INFO - at org.apache.gobblin.prototype.kafka.replanner.GobblinHelixClusterJob.getJobActive(GobblinHelixClusterJob.java:175)
kafka-streaming-replanner-tracking INFO - ... 54 more

This is a similar case where Helix returns a null value. This can be caused when this util is called during a replanner / restart of the helix workflow. It can also be caused by a helix data consistency issue. The code doesn't expect a null and will fail with NPE. It is much better to fail with a more descriptive exception. The code calling this util already retries, so this is to avoid any confusion about if this is an infra issue or a bug in our code.

I've also made a small change to how we fetch the workflow configs. In the original implementation, we call the getWorkflowConfig() again after getting the workflow map. I think this is causing some weird inconsistent state. We get a workflow and then if it is somehow deleted by the time we call getWorkflowConfig(), then we end up with a null value.

There's really no reason to split up the call since they do the same thing and we are just being wasteful by making extra zookeeper reads.

Tests

My PR adds the following unit tests OR does not need testing for this extremely good reason:

HelixAssignedParticipantCheckTest

The existing test in helix assigned participant check triggers this because it returns a null participant from mock helix. I've attached the corresponding log output that we should see.

2022-11-17 18:23:02 PST ERROR [pool-35-thread-1] org.apache.gobblin.cluster.HelixAssignedParticipantCheck 143 - The current assigned participant is null. This implies that 
		(a)Helix failed to write to zookeeper, which is often caused by lack of compression leading / exceeding zookeeper jute max buffer size (Default 1MB)
		(b)Helix reassigned the task (unlikely if this current task has been running without issue. Helix does not have code for reassigning "running" tasks)
Note: This logic is true as of Helix version 1.0.2 and ZK version 3.6

This test is not run in CI so I attached a screenshot below

HelixUtilsTest

I've also added tests in the HelixUtilsTest. I mock Helix API responses and replicate a workflow -> job -> task DAG.

org.apache.gobblin.cluster.GobblinHelixUnexpectedStateException: Received null workflow config from Helix. We should not see any null configs when reading all workflows. workflowId=null-workflow-to-throw-exception
	at org.apache.gobblin.cluster.HelixUtils.getWorkflowIdsFromJobNames(HelixUtils.java:410)
	at org.apache.gobblin.cluster.HelixUtilsTest.testGetWorkunitIdForJobNamesWithInvalidHelixState(HelixUtilsTest.java:149)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
...

ClusterIntegrationTest
The cluster integration test tests the replanner and job clean up. The workflowIdFromJobName is used in the replanner to check the helix workflow to restart. So I am pretty sure I didn't break anything. Not sure if this integration test runs in the CI so I've added a screenshot of the tests passing.

GobblinHelixJobSchedulerTest

Commits

My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

codecov-commenter · 2022-11-18T02:54:34Z

Codecov Report

Merging #3603 (8ce90a9) into master (cb2668e) will increase coverage by 1.61%.
The diff coverage is 56.25%.

@@             Coverage Diff              @@
##             master    #3603      +/-   ##
============================================
+ Coverage     46.55%   48.16%   +1.61%     
+ Complexity    10649     7828    -2821     
============================================
  Files          2129     1474     -655     
  Lines         83295    58094   -25201     
  Branches       9281     6684    -2597     
============================================
- Hits          38777    27982   -10795     
+ Misses        40954    27507   -13447     
+ Partials       3564     2605     -959

Impacted Files	Coverage Δ
...gobblin/cluster/HelixAssignedParticipantCheck.java	`0.00% <0.00%> (ø)`
.../cluster/GobblinHelixUnexpectedStateException.java	`100.00% <100.00%> (ø)`
...in/java/org/apache/gobblin/cluster/HelixUtils.java	`44.44% <100.00%> (+1.26%)`	⬆️
.../gobblin/cluster/GobblinHelixTaskStateTracker.java	`62.50% <0.00%> (-6.25%)`	⬇️
...a/org/apache/gobblin/cluster/GobblinHelixTask.java	`62.36% <0.00%> (-4.31%)`	⬇️
...ava/org/apache/gobblin/writer/FsWriterMetrics.java
...ice/modules/orchestration/AzkabanClientStatus.java
...n/kafka/writer/AbstractKafkaDataWriterBuilder.java
.../modules/flowgraph/FlowGraphConfigurationKeys.java
...org/apache/gobblin/kafka/tool/KafkaCheckpoint.java
... and 653 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

homatthew · 2022-11-18T06:08:53Z

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/HelixAssignedParticipantCheck.java

-          }
-
-          return isAssignedParticipant;
+        if (participant == null) {


Please read the PR description to understand why these null checks are needed in the PR and how I tested these changes. These pieces of code are important and critical pieces with a fair amount of nuance related to helix.

homatthew · 2022-11-18T17:32:09Z

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/HelixUtils.java

    Map<String, WorkflowConfig> workflowConfigMap = taskDriver.getWorkflows();
-    for (String workflow : workflowConfigMap.keySet()) {
-      WorkflowConfig workflowConfig = taskDriver.getWorkflowConfig(workflow);


This previous version made 2 calls to to zookeeper when only 1 is necessary. Doing 2 calls makes it more likely to see inconsistent state.

ZihanLi58

Overall looks good to me. If there is anything that's not expected from helix side, we should throw it out in util method, it's the caller's responsibility to decide how to handle the exception here

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/HelixAssignedParticipantCheck.java

ZihanLi58 · 2022-11-28T22:35:48Z

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/HelixUtils.java

+    for (Map.Entry<String, WorkflowConfig> entry : workflowConfigMap.entrySet()) {
+      String workflow = entry.getKey();
+      WorkflowConfig workflowConfig = entry.getValue();
+      if (workflowConfig == null) {


Same here, if this is not expected, throw exception instead of just log it out

Made change to throw exception. I made the choice to create a new custom exception because there are other places where make API calls to ZK / Helix, and I want to use this exception to explicitly say there is a Helix issue.

The caller of this API (Job scheduler) is using a retryer that will automatically retry this API call again if there is an exception.

… / zk

ZihanLi58

+1, thanks for the detail context here!

* upstream/master: upgrade hadoop version (apache#3621) [GOBBLIN-1744] Improve handling of null value edge cases when querying Helix (apache#3603) Handle d2 markup case when not only announcing leader (apache#3622) Update Gobblin OSS Slack channel link to a never-expire link (apache#3620) [GOBBLIN-1750] Add schemas for observability events in GaaS (apache#3610) [GOBBLIN-1757]Refactor manifest, add reader/writer and iterator for efficient reading (apache#3618) Disable flaky HiveMaterializerTest on CI/CD (apache#3619) [GOBBLIN-1754] Fixes for mysql store change monitors (apache#3615) [GOBBLIN-1756] Fix the issue that we skipping flows for multihop jobs (apache#3617) [GOBBLIN-1752] Fix race condition where FSTemplateCatalog would update at the same t… (apache#3612) [GOBBLIN-1753] Migrate DB connection pool from o.a.commons.dbcp/dbcp2 to HikariCP (apache#3613) [GOBBLIN-1748] Add logs to debug multi-hop flows creation, progression, and cleanup (apache#3608)

… querying Helix (apache#3603)" This reverts commit 6aaf485.

homatthew force-pushed the mh-nullcheck-replanner-GOBBLIN-1744 branch 2 times, most recently from ad1f16f to 4528cb8 Compare November 18, 2022 05:47

homatthew changed the title ~~[GOBBLIN-1744] Improve logging in null cases when querying from Helix~~ [GOBBLIN-1744] Improve handling of null value edge cases when querying Helix Nov 18, 2022

homatthew force-pushed the mh-nullcheck-replanner-GOBBLIN-1744 branch from 4528cb8 to de633cc Compare November 18, 2022 05:55

homatthew commented Nov 18, 2022

View reviewed changes

homatthew force-pushed the mh-nullcheck-replanner-GOBBLIN-1744 branch 2 times, most recently from 369c7f9 to ea6d6e8 Compare November 18, 2022 06:54

homatthew marked this pull request as ready for review November 18, 2022 07:27

homatthew commented Nov 18, 2022

View reviewed changes

ZihanLi58 requested changes Nov 28, 2022

View reviewed changes

[GOBBLIN-1744] Improve logging in null cases when querying from Helix

e1c4196

homatthew force-pushed the mh-nullcheck-replanner-GOBBLIN-1744 branch 2 times, most recently from cc5ec87 to 7ed8432 Compare January 10, 2023 23:31

* throws a descriptive exception when seeing invalid state from helix…

8ce90a9

… / zk

homatthew force-pushed the mh-nullcheck-replanner-GOBBLIN-1744 branch from 7ed8432 to 8ce90a9 Compare January 11, 2023 00:02

ZihanLi58 approved these changes Jan 11, 2023

View reviewed changes

ZihanLi58 merged commit 6aaf485 into apache:master Jan 11, 2023

Will-Lo added a commit to Will-Lo/incubator-gobblin that referenced this pull request Jan 23, 2023

Revert "[GOBBLIN-1744] Improve handling of null value edge cases when…

c519952

… querying Helix (apache#3603)" This reverts commit 6aaf485.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[GOBBLIN-1744] Improve handling of null value edge cases when querying Helix #3603

[GOBBLIN-1744] Improve handling of null value edge cases when querying Helix #3603

homatthew commented Nov 18, 2022 •

edited

Loading

codecov-commenter commented Nov 18, 2022 •

edited

Loading

homatthew Nov 18, 2022 •

edited

Loading

homatthew Nov 18, 2022

ZihanLi58 left a comment

ZihanLi58 Nov 28, 2022

homatthew Jan 10, 2023

ZihanLi58 left a comment

[GOBBLIN-1744] Improve handling of null value edge cases when querying Helix #3603

[GOBBLIN-1744] Improve handling of null value edge cases when querying Helix #3603

Conversation

homatthew commented Nov 18, 2022 • edited Loading

JIRA

Description

(1) HelixAssignedParticipantCheck

(2) HelixUtils#getWorkflowIdsFromJobNames(HelixManager helixManager, Collection jobNames)

Tests

Commits

codecov-commenter commented Nov 18, 2022 • edited Loading

Codecov Report

homatthew Nov 18, 2022 • edited Loading

Choose a reason for hiding this comment

homatthew Nov 18, 2022

Choose a reason for hiding this comment

ZihanLi58 left a comment

Choose a reason for hiding this comment

ZihanLi58 Nov 28, 2022

Choose a reason for hiding this comment

homatthew Jan 10, 2023

Choose a reason for hiding this comment

ZihanLi58 left a comment

Choose a reason for hiding this comment

homatthew commented Nov 18, 2022 •

edited

Loading

codecov-commenter commented Nov 18, 2022 •

edited

Loading

homatthew Nov 18, 2022 •

edited

Loading