Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GOBBLIN-1773] Fix bugs in quota manager #3636

Merged
merged 7 commits into from
Feb 9, 2023

Conversation

ZihanLi58
Copy link
Contributor

@ZihanLi58 ZihanLi58 commented Feb 7, 2023

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots (if applicable):
    We used to check the quota when we accept flow for ad-hoc flow and run immediately flow. But we only use the same execution id for ad-hoc flow. So for a run immediately flow, we will check quota once with a random execution id, and then really run the job with a different execution id and forget the first one. This will cause us to double-check the quota for a run immediately flow and never release one of them.
    In this PR:
  1. We change the behavior to only check the quota upon accepting the request when it's an adhoc job, and for run-immediate flow, we won't check the quota. If quota exceeds, we will fail the first job but still keep the config and schedule the job in the future
  2. When we see there is duplicate flow running and skip the exception, if it is an adhoc flow, we will release the quota for corresponding execution.
  3. We change the order where to make sure when increase/decrease quota, we use the same order to update the key value to avoid transaction deadlock
  4. We use shared data source to avoid too many thread issue

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    Unit tests

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

@codecov-commenter
Copy link

codecov-commenter commented Feb 7, 2023

Codecov Report

Merging #3636 (b9e6fab) into master (9f8ab24) will increase coverage by 6.43%.
The diff coverage is 55.10%.

@@             Coverage Diff              @@
##             master    #3636      +/-   ##
============================================
+ Coverage     40.16%   46.59%   +6.43%     
- Complexity     3544    10673    +7129     
============================================
  Files           791     2133    +1342     
  Lines         33285    83568   +50283     
  Branches       3699     9292    +5593     
============================================
+ Hits          13368    38938   +25570     
- Misses        18601    41067   +22466     
- Partials       1316     3563    +2247     
Impacted Files Coverage Δ
...org/apache/gobblin/metrics/InnerMetricContext.java 55.67% <0.00%> (ø)
...ervice/modules/flow/BaseFlowToJobSpecCompiler.java 55.00% <0.00%> (ø)
...rvice/modules/orchestration/DagManagerMetrics.java 87.25% <0.00%> (ø)
...in/service/modules/orchestration/Orchestrator.java 52.73% <28.57%> (ø)
...e/modules/orchestration/MysqlUserQuotaManager.java 47.67% <62.50%> (ø)
...blin/service/modules/orchestration/DagManager.java 83.19% <62.96%> (ø)
.../runtime/dag_action_store/MysqlDagActionStore.java 80.00% <100.00%> (ø)
.../modules/scheduler/GobblinServiceJobScheduler.java 63.03% <100.00%> (ø)
.../gobblin/cluster/GobblinHelixTaskStateTracker.java 62.50% <0.00%> (-6.25%) ⬇️
...in/java/org/apache/gobblin/cluster/HelixUtils.java 44.44% <0.00%> (-1.67%) ⬇️
... and 1343 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@ZihanLi58 ZihanLi58 changed the title Gobblin 1773 [GOBBLIN-1773] Fix bugs in quota manager Feb 7, 2023
// This block should be reachable only for the first execution for the adhoc flows (flows that either do not have a schedule or have runImmediately=true.
if (!this.warmStandbyEnabled && !jobConfig.containsKey(ConfigurationKeys.JOB_SCHEDULE_KEY)) {
// This block should be reachable only for the execution for the adhoc flows
// For flow that has scheduler but run-immediately set to be true, we won't check teh quota as we will use a different execution id later
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo the

@@ -330,8 +330,9 @@ public AddSpecResponse onAddSpec(Spec addedSpec) {

// Check quota limits against run immediately flows or adhoc flows before saving the schedule
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets update this to be adhoc flows only

@@ -330,8 +330,9 @@ public AddSpecResponse onAddSpec(Spec addedSpec) {

// Check quota limits against run immediately flows or adhoc flows before saving the schedule
// In warm standby mode, this quota check will happen on restli API layer when we accept the flow
Copy link
Contributor

@umustafi umustafi Feb 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

afaik FlowConfigV2ResourceLocalHandler and FlowCatalog are not doing quota check on adding the spec. Where do we do the check in API layer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In warmStandyby mode, scheduler is not the listener when we add flow spec to flowCatalog. But compiler is the listener as defined here. And we check quota here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see did not notice the difference in warm standby mode.

@@ -67,7 +68,8 @@ public MysqlDagActionStore(Config config) throws IOException {
this.tableName = ConfigUtils.getString(config, ConfigurationKeys.STATE_STORE_DB_TABLE_KEY,
ConfigurationKeys.DEFAULT_STATE_STORE_DB_TABLE);

this.dataSource = MysqlStateStore.newDataSource(config);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we deem this function not safe to use as it does not read from the shared threadpool, would it be appropriate to make MySqlStateStore.newDataSource package private?

@@ -247,6 +253,13 @@ public void orchestrate(Spec spec) throws Exception {
+ "concurrent executions are disabled for this flow.", flowGroup, flowName);
conditionallyUpdateFlowGaugeSpecState(spec, CompiledState.SKIPPED);
Instrumented.markMeter(this.skippedFlowsMeter);
if (!((FlowSpec)spec).getConfigAsProperties().containsKey(ConfigurationKeys.JOB_SCHEDULE_KEY)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can come in a later PR but we should have a static function to check if a FlowSpec is adhoc or not in the FlowSpec class

@ZihanLi58 ZihanLi58 merged commit 8cb523d into apache:master Feb 9, 2023
phet added a commit to phet/gobblin that referenced this pull request Feb 13, 2023
* upstream/master:
  [GOBBLIN-1774] Util for detecting non optional uniontypes Hive tables (apache#3632)
  [GOBBLIN-1773] Fix bugs in quota manager (apache#3636)
  [GOBBLIN-1782] Fix Merge State for Flow Pending Resume statuses (apache#3639)
  [GOBBLIN-1755] Support extended ACLs and sticky bit for file based distcp (apache#3616)
  [GOBBLIN-1780] Refactor/rename YarnServiceIT to YarnServiceTest (apache#3637)
  [GOBBLIN-1778] Add house keeping thread in DagManager to periodically sync in memory state with mysql table (apache#3635)
  Register gauge metrics for change monitors (apache#3634)
phet added a commit to phet/gobblin that referenced this pull request Mar 24, 2023
* upstream/master:
  [GOBBLIN-1774] Util for detecting non optional uniontypes Hive tables (apache#3632)
  [GOBBLIN-1773] Fix bugs in quota manager (apache#3636)
  [GOBBLIN-1782] Fix Merge State for Flow Pending Resume statuses (apache#3639)
  [GOBBLIN-1755] Support extended ACLs and sticky bit for file based distcp (apache#3616)
  [GOBBLIN-1780] Refactor/rename YarnServiceIT to YarnServiceTest (apache#3637)
  [GOBBLIN-1778] Add house keeping thread in DagManager to periodically sync in memory state with mysql table (apache#3635)
  Register gauge metrics for change monitors (apache#3634)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants