[GOBBLIN-1656] Return a http status 503 on GaaS when quota is exceeded for user or flowgroup #3516

Merged 14 commits on Jul 25, 2022

@Will-Lo Will-Lo commented May 31, 2022

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots (if applicable):

GaaS supports user and flowgroup quotas. When a user creates a flow that runs immediately (adhoc or scheduled), the flow should not run if the quota is exceeded (this is already built in). In addition, the user's request should fail and return a distinct HTTP status code (503 in this PR) so that clients know the flow will not run and can handle the quota-exceeded scenario.

When a user updates a scheduled flowConfig that runs immediately -> the flow configuration update is rejected and the job does not run

When a user creates an adhoc flow -> the job will be rejected entirely
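
As a rough illustration of how a client might react to the new status code, here is a minimal sketch. The class `FlowSubmitResponseHandler`, the `Action` enum, and the choice to retry later on 503 are all assumptions made for this example; they are not part of Gobblin or of this PR.

```java
// Hypothetical client-side helper; none of these names exist in Gobblin.
final class FlowSubmitResponseHandler {

  enum Action { PROCEED, RETRY_LATER, FIX_REQUEST, REPORT_ERROR }

  /** Maps the HTTP status of a flow create/update call to a client action. */
  static Action actionFor(int httpStatus) {
    if (httpStatus >= 200 && httpStatus < 300) {
      return Action.PROCEED;
    }
    if (httpStatus == 503) {
      // Per this PR, 503 signals that the user or flowgroup quota is exceeded
      // and the flow will not run. Retrying later is an assumed client policy.
      return Action.RETRY_LATER;
    }
    if (httpStatus >= 400 && httpStatus < 500) {
      // For example, a flow configuration that failed to compile.
      return Action.FIX_REQUEST;
    }
    return Action.REPORT_ERROR;
  }
}
```

A caller would pass the status of the response it received, for example `FlowSubmitResponseHandler.actionFor(503)`.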

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

codecov-commenter commented May 31, 2022

Codecov Report

Merging #3516 (4b80123) into master (2983607) will decrease coverage by 0.00%.
The diff coverage is 35.71%.

@@             Coverage Diff              @@
##             master    #3516      +/-   ##
============================================
- Coverage     46.69%   46.68%   -0.01%     
- Complexity    10431    10438       +7     
============================================
  Files          2082     2083       +1     
  Lines         81467    81516      +49     
  Branches       9091     9100       +9     
============================================
+ Hits          38037    38058      +21     
- Misses        39924    39949      +25     
- Partials       3506     3509       +3     
Impacted Files Coverage Δ
...obblin/service/FlowConfigResourceLocalHandler.java 16.00% <0.00%> (-2.19%) ⬇️
...blin/service/FlowConfigV2ResourceLocalHandler.java 0.00% <0.00%> (ø)
...apache/gobblin/runtime/api/MutableSpecCatalog.java 86.66% <ø> (ø)
...pache/gobblin/runtime/api/SpecCatalogListener.java 66.66% <ø> (ø)
...gobblin/service/modules/core/GitConfigMonitor.java 81.35% <0.00%> (-1.70%) ⬇️
...in/service/modules/core/GobblinServiceManager.java 50.00% <0.00%> (-0.21%) ⬇️
...ache/gobblin/exception/QuotaExceededException.java 0.00% <0.00%> (ø)
...ache/gobblin/runtime/spec_catalog/FlowCatalog.java 47.97% <33.33%> (-1.69%) ⬇️
...service/modules/orchestration/DagManagerUtils.java 89.28% <75.00%> (-1.10%) ⬇️
...ervice/modules/orchestration/UserQuotaManager.java 71.84% <75.00%> (+2.61%) ⬆️
... and 18 more

@Will-Lo Will-Lo closed this Jun 2, 2022
@Will-Lo Will-Lo reopened this Jun 2, 2022

if (isCompileSuccessful(responseMap)) {
if (isCompileSuccessful(schedulerResponse.getValue())) {
Contributor

When the quota is exceeded, why is the compile marked as successful? It's a little confusing to read the code.

Contributor Author

Maybe there's better terminology here:
The way I interpret compilation is that the flow configuration can compile (the source, destination, and any required parameters exist).
A flow can pass the compilation step but still fail a resource validation check. That doesn't mean the flow was improperly compiled or that the inputs were incorrect; it means the user already has too many flows in the system.

Contributor

Can we add comments to explain this? Just to make the code easier to read.

Contributor

a large part of this is the tunneling I describe above, but if you actually do need additional info to support isExplain/hasExplain, you could insert it into (or wrapping around) the QuotaExceededException
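
The distinction drawn in this thread, between a flow that fails to compile and a flow that compiles but is refused by the quota check, could be made explicit with a dedicated result type. The following is only a sketch: `SubmissionCheck`, `Outcome`, and the `runningFlows >= quota` rule are assumptions for illustration and do not reflect how the PR or Gobblin implements the check.

```java
// Hypothetical illustration; these names are not taken from the PR.
final class SubmissionCheck {

  enum Outcome { COMPILE_FAILED, QUOTA_EXCEEDED, ACCEPTED }

  /**
   * Evaluates the two checks in order: compilation first, then quota.
   * A flow can compile successfully and still be refused by the quota check.
   */
  static Outcome evaluate(boolean compiled, int runningFlows, int quota) {
    if (!compiled) {
      return Outcome.COMPILE_FAILED;
    }
    // Assumed rule: the quota counts as exceeded once the running count reaches it.
    if (runningFlows >= quota) {
      return Outcome.QUOTA_EXCEEDED;
    }
    return Outcome.ACCEPTED;
  }
}
```

With an enum like this, a caller can switch on the outcome instead of inferring state from a "compilation successful" flag.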

Contributor

@phet phet left a comment

great usability improvement to give notice synchronously upon API submit, william! I propose a more explicit way to relay state, both for compile-time checking and simplicity in understanding (since amorphous types are not).

Comment on lines 113 to 120
uniqueRequesters = RequesterService.deserialize(serializedRequesters)
.stream()
.map(ServiceRequester::getName)
.distinct()
.collect(Collectors.toList());
} catch (IOException e) {
throw new RuntimeException("Could not process requesters due to ", e);
}
Contributor

since this just calls a static in RequesterService, it seems a potentially appropriate abstraction to expose as DagManagerUtils.getDistinctServiceRequesters (throwing the unchecked exception).

BTW, I don't immediately perceive the justification for:

    } catch (RuntimeException e) {
      throw new IOException(e);
    }

in RequesterService.deserialize

(and anyway is perhaps RuntimeException merely catching for its derivation JsonParseException... not sure what other kind would arise within GSON...)

Contributor Author

@Will-Lo Will-Lo Jun 27, 2022

The line above, return new ObjectMapper().readValue(jsonList, mapType); throws an IOException as well, so we'd wrap that into a RuntimeException? I think that's appropriate given that it's not likely handleable outside of logging.
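
A minimal sketch of the helper shape discussed in this thread: deserialize, de-duplicate, and convert the checked `IOException` into an unchecked one at a single place. The `deserialize` method below is a stand-in that parses a comma-separated string; the real `RequesterService.deserialize` works on a different serialized format, and the class name `RequesterNames` is hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper; not the Gobblin implementation.
final class RequesterNames {

  /** Stand-in for RequesterService.deserialize: parses "a, b, c" and rejects blank input. */
  static List<String> deserialize(String serialized) throws IOException {
    if (serialized == null || serialized.trim().isEmpty()) {
      throw new IOException("empty requester list");
    }
    return Arrays.stream(serialized.split(","))
        .map(String::trim)
        .collect(Collectors.toList());
  }

  /** Returns the distinct requester names, rethrowing I/O failures as unchecked. */
  static List<String> distinctRequesters(String serialized) {
    try {
      // Returning from inside the try avoids a separately declared local variable.
      return deserialize(serialized).stream()
          .distinct()
          .collect(Collectors.toList());
    } catch (IOException e) {
      throw new UncheckedIOException("Could not process requesters", e);
    }
  }
}
```

`UncheckedIOException` is used here because it keeps the original `IOException` as its typed cause; a plain `RuntimeException`, as in the PR, would serve the same purpose.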

@@ -322,6 +326,19 @@ public AddSpecResponse onAddSpec(Spec addedSpec) {
return new AddSpecResponse<>(response);
}

// Check quota limits against run immediately flows or adhoc flows before saving the schedule
if (!jobConfig.containsKey(ConfigurationKeys.JOB_SCHEDULE_KEY) || PropertiesUtils.getPropAsBoolean(jobConfig, ConfigurationKeys.FLOW_RUN_IMMEDIATELY, "false")) {
Contributor

for scheduled AND runImmediately flows, seems we would only scheduleJob() when quota finds it permitted to run immediately (right now). are those semantics intended?

Contributor Author

This wouldn't apply to scheduled flows that do not have runImmediately enabled. If a scheduled flow is intended to be run immediately, then it goes through this check.

Contributor

I was trying to ascertain whether a scheduled+runImmediately flow that gets rejected because the quota is exceeded would still be scheduled (e.g. for the next run according to its schedule, even though running immediately is skipped). is that so?

Contributor Author

The scheduled + runImmediately flow would be rejected completely here since the check is done right after compilation and before the job is added to the scheduled map.

If an existing flow is updated to be runImmediately and exceeds quota, the runImmediately change would be rejected (along with any other modifications made to the flow). The flow would retain its previous state and then run on its old schedule.

If a flow that doesn't exist has runImmediately and exceeds quota, it is rejected from the system entirely.

So the current model in the PR is that any changes that involve running a flow immediately when the quota is exceeded are rejected. I think it's possible to move to a future state where changes to a flow are not dependent on the flow's execution schedule, but it would require a lot of refactoring. I think it can be grouped with other changes we talked about, such as the faster feedback when running adhoc flows. A possible future state would be one where users can save flows independently of whether or not the flow is scheduled, and the execution schedule is treated as entirely separate (similar to ADF and some other systems).
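
The acceptance rule described in this reply can be sketched as a small predicate. This is an illustration under assumptions: `QuotaCheckGate` is a hypothetical class, the two key strings are placeholders standing in for `ConfigurationKeys.JOB_SCHEDULE_KEY` and `ConfigurationKeys.FLOW_RUN_IMMEDIATELY`, and the quota state is passed in as a boolean rather than looked up.

```java
import java.util.Properties;

// Hypothetical sketch of the rule discussed above; not the PR's code.
final class QuotaCheckGate {

  // Placeholder key names, assumed for this example only.
  static final String SCHEDULE_KEY = "job.schedule";
  static final String RUN_IMMEDIATELY_KEY = "flow.runImmediately";

  /** True for adhoc flows (no schedule) and for scheduled flows flagged to run immediately. */
  static boolean runsOnSubmit(Properties jobConfig) {
    boolean adhoc = !jobConfig.containsKey(SCHEDULE_KEY);
    boolean runImmediately =
        Boolean.parseBoolean(jobConfig.getProperty(RUN_IMMEDIATELY_KEY, "false"));
    return adhoc || runImmediately;
  }

  /** A submission is rejected only when it would run now and the quota is exceeded. */
  static boolean accept(Properties jobConfig, boolean quotaExceeded) {
    return !(runsOnSubmit(jobConfig) && quotaExceeded);
  }
}
```

Under this rule a purely scheduled flow is always accepted, while a scheduled flow with run-immediately set is rejected as a whole when the quota is exceeded, which is the behavior questioned in the follow-up comment.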

Contributor

it feels potentially confusing that the flow would not be scheduled at all (for the future) just because the quota is exceeded now. wouldn't it be safe to scheduleJob anyway (even if we don't believe the first, runImmediately execution would succeed)?


@Will-Lo Will-Lo force-pushed the quota_fail_on_create branch 2 times, most recently from b82212f to 726e13a Compare June 27, 2022 19:39
@@ -88,7 +101,7 @@ public CreateKVResponse createFlowConfig(FlowConfig flowConfig, boolean triggerL
addSpecResponse != null && addSpecResponse.getValue() != null ? StringEscapeUtils.escapeJson(addSpecResponse.getValue()) : "");
flowConfig.setProperties(props);
httpStatus = HttpStatus.S_200_OK;
} else if (Boolean.parseBoolean(responseMap.getOrDefault(ServiceConfigKeys.COMPILATION_SUCCESSFUL, new AddSpecResponse<>("false")).getValue().toString())) {
} else if (Boolean.parseBoolean(response.getValue())) {
Contributor

because of the raw type, it may be slightly safer to keep the (seemingly superfluous) .toString() from the original. in fact, you could possibly not add line 92 and preserve this one as in the orig.

static List<String> getDistinctUniqueRequesters(String serializedRequesters) {
List<String> uniqueRequesters;
try {
uniqueRequesters = RequesterService.deserialize(serializedRequesters)
Contributor

no biggie, but you could place the return statement within the try

@@ -48,7 +48,7 @@ public interface MutableSpecCatalog extends SpecCatalog {
* on adding a {@link Spec} to the {@link SpecCatalog}. The key for each entry is the name of the {@link SpecCatalogListener}
* and the value is the result of the the action taken by the listener returned as an instance of {@link AddSpecResponse}.
* */
Map<String, AddSpecResponse> put(Spec spec);
Map<String, AddSpecResponse> put(Spec spec) throws Throwable;
Contributor

the only newly added exception it seems could be thrown here is QuotaExceededException... if correct, why not make that the type mentioned in the throws clause? why go all the way down to Throwable?

Contributor Author

@Will-Lo Will-Lo Jun 29, 2022

It's a bit convoluted here: the QuotaExceededException gets thrown in a callback and is caught by CallbackResult, which stores any exception thrown as a Throwable. I wanted to leverage that by accessing it through getCause(). Classes that rely on the CallbackResult responses would have to throw the generic Throwable class unless I cast the exception back into QuotaExceededException, but I want to avoid doing that so that compilation errors can be thrown in this fashion in the future as well.

In particular in this segment:

      if (response.getValue().getFailures().size() > 0) {
        for (Map.Entry<SpecCatalogListener, CallbackResult<AddSpecResponse>> entry : response.getValue().getFailures().entrySet()) {
          throw entry.getValue().getError().getCause();
        }
        return responseMap;
      }
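
For comparison, the narrower alternative raised in the review (declaring only the quota exception rather than `Throwable`) could look roughly like the sketch below. Everything here is hypothetical: `CallbackFailures`, its nested `QuotaExceededException`, and the plain `List<Throwable>` standing in for the failures held by `CallbackResult` are stand-ins for illustration, not the Gobblin types.

```java
import java.util.List;

// Hypothetical sketch; the nested exception is a stand-in for Gobblin's QuotaExceededException.
final class CallbackFailures {

  static final class QuotaExceededException extends Exception {
    QuotaExceededException(String message) {
      super(message);
    }
  }

  /**
   * Rethrows the first recorded failure, if any. Quota errors keep their
   * specific type, JVM errors propagate untouched, and everything else is
   * wrapped in an unchecked exception, so the signature needs no Throwable.
   */
  static void rethrowFirst(List<Throwable> errors) throws QuotaExceededException {
    for (Throwable error : errors) {
      // Callback wrappers usually carry the real failure as their cause.
      Throwable cause = error.getCause() != null ? error.getCause() : error;
      if (cause instanceof QuotaExceededException) {
        throw (QuotaExceededException) cause;
      }
      if (cause instanceof Error) {
        throw (Error) cause;
      }
      throw new RuntimeException("Listener callback failed", cause);
    }
  }
}
```

The trade-off noted in the reply still applies: each new checked failure type (for example a future compilation exception) would need its own branch and its own entry in the throws clause.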

@@ -48,7 +48,7 @@ public AddSpecCallback(Spec addedSpec) {
_addedSpec = addedSpec;
}

@Override public AddSpecResponse apply(SpecCatalogListener listener) {
public AddSpecResponse apply(SpecCatalogListener listener) {
Contributor

curious what changed to lead you to remove @Override... are you just omitting what is optional or are you somehow actually no longer making an override? (I recommend @Override to catch errors.)

Contributor Author

Ah, sorry, that was a mistake. I was changing a lot of function definitions while trying to figure out the errors, and this slipped through.

Contributor

@ZihanLi58 ZihanLi58 left a comment

+1

Contributor

@phet phet left a comment

great to see the cleaner approach of letting Java's built-in catch RTTI act for you... now just one improvement is called for, to avoid hiding events up to and including ultra-severe and even irrecoverable system errors such as OOM, etc.

if (e instanceof QuotaExceededException) {
throw new RestLiServiceException(HttpStatus.S_503_SERVICE_UNAVAILABLE, e.getMessage());
}
// TODO: Compilation errors should fall under throwable exceptions as well instead of checking for strings
Contributor

I realize this is just a request handler, so not hugely consequential to catch all exceptions, but please at least just log whatever you find here, so it's not a silent/squelched failure we have a hard time determining to be happening one future day.
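
The two review points in this round (do not swallow severe JVM errors, and log anything unexpected) can be sketched as an ordered set of catch clauses. This is a simplified stand-in, not the handler from the PR: `CreateFlowHandler`, `SpecStore`, and the nested `QuotaExceededException` are hypothetical, plain `int` status codes replace `RestLiServiceException`, and `java.util.logging` is used only to keep the example self-contained.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical request handler sketch; names and types are assumptions.
final class CreateFlowHandler {

  private static final Logger LOG = Logger.getLogger(CreateFlowHandler.class.getName());

  static final class QuotaExceededException extends Exception {
    QuotaExceededException(String message) {
      super(message);
    }
  }

  /** Stand-in for the catalog call that may reject a flow. */
  interface SpecStore {
    void put(String flowName) throws Exception;
  }

  /** Returns the HTTP status the handler would report for this submission. */
  static int create(SpecStore store, String flowName) {
    try {
      store.put(flowName);
      return 200;
    } catch (QuotaExceededException e) {
      // The specific catch comes first, so no instanceof check is needed.
      return 503;
    } catch (Exception e) {
      // Catching Exception rather than Throwable lets errors such as
      // OutOfMemoryError propagate instead of being hidden here.
      LOG.log(Level.SEVERE, "Failed to add flow " + flowName, e);
      return 500;
    }
  }
}
```

For example, `CreateFlowHandler.create(name -> { throw new CreateFlowHandler.QuotaExceededException("quota"); }, "myFlow")` takes the 503 branch, while any other exception is logged and reported as 500.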

if (e instanceof QuotaExceededException) {
throw new RestLiServiceException(HttpStatus.S_503_SERVICE_UNAVAILABLE, e.getMessage());
}
// TODO: Compilation errors should fall under throwable exceptions as well instead of checking for strings
Contributor

same here

throw new RestLiServiceException(HttpStatus.S_503_SERVICE_UNAVAILABLE, e.getMessage());
}
} catch (Throwable e) {
Contributor

again

Contributor

@phet phet left a comment

looking good!

@@ -181,6 +183,8 @@ public UpdateResponse updateFlowConfig(FlowId flowId, FlowConfig flowConfig, boo
throw new RestLiServiceException(HttpStatus.S_503_SERVICE_UNAVAILABLE, e.getMessage());
} catch (Throwable e) {
// TODO: Compilation errors should fall under throwable exceptions as well instead of checking for strings
Contributor

likely not major, but I'm unclear specifically what this TODO is suggesting ought to change in the future.

Contributor Author

I can make a ticket for it, but right now the flow compilation process is really descriptive about what is missing, and users would see a generic HTTP status 400 error with a "Path does not exist" message due to failing compilation. I think if we used exceptions rather than that compilation check, it would be easier to differentiate and reason about.

@Will-Lo Will-Lo force-pushed the quota_fail_on_create branch from 9513a3a to d7e92ea Compare July 21, 2022 20:17
Contributor

@ZihanLi58 ZihanLi58 left a comment

+1

@ZihanLi58 ZihanLi58 merged commit 9ce1e65 into apache:master Jul 25, 2022
jack-moseley pushed a commit to jack-moseley/gobblin that referenced this pull request Aug 24, 2022
…d for user or flowgroup (apache#3516)

* Add e2e tests and set http response code for quota exceeded

* cleanup

* Fix checkstyle test

* Improve guard against schedule change if quota is exceeded

* Fix bug relating to exception propagation and scheduler not checking quota due to current attempt number

* Address review comments

* Refactor based on review feedback

* Fix test

* Cleanup around handling responses from callbacks in GaaS API

* Fix checkstyle

* catch quotaexceededexception instead of checking type explicitly

* Log other errors and throw 500

* Fix checkstyle dead store

* Fix checkstyle again
phet added a commit to phet/gobblin that referenced this pull request Sep 12, 2022
* upstream/master: (124 commits)
  [GOBBLIN-1699] Log progress of reducer task for visibility with slow compaction jobs apache#3552
  fix helix job wait completion bug when job goes to STOPPING state (apache#3556)
  [GOBBLIN-1695] Fix: Failure to add spec executors doesn't block deployment (apache#3551)
  [GOBBLIN-1701] Replace jcenter with either maven central or gradle plugin portal (apache#3554)
  [GOBBLIN-1700] Remove unused coveralls-gradle-plugin dependency
  add MysqlUserQuotaManager (apache#3545)
  [GOBBLIN-1689] Decouple compiler from scheduler in warm standby mode (apache#3544)
  Add GMCE topic explicitly to hive commit event (apache#3547)
  [GOBBLIN-1678] Refactor git flowgraph component to be extensible (apache#3536)
  [GOBBLIN-1690] Added logging to ORC writer
  Allow all iceberg exceptions to be fault tolerant (apache#3541)
  Guard against exists fs call as well (apache#3538)
  Add error handling for timeaware finder to handle scenarios where fil… (apache#3537)
  [GOBBLIN-1675] Add pagination for GaaS on server side (apache#3533)
  [GOBBLIN-1672] Refactor metrics from DagManager into its own class, add metrics per … (apache#3532)
  [GOBBLIN-1677] Fix timezone property to read from key correctly (apache#3535)
  [Gobblin-931] Fix typo in gobblin CLI usage (apache#3530)
  [GOBBLIN-1671] : Fix gobblin.sh script to add external jars as colon separated to HADOOP_CLASSPATH (apache#3531)
  [GOBBLIN-1656] Return a http status 503 on GaaS when quota is exceeded for user or flowgroup (apache#3516)
  [GOBBLIN-1669] Clean up TimeAwareRecursiveCopyableDataset to support seconds in time… (apache#3528)
  [GOBBLIN-1670] Remove rat tasks and unneeded checkstyles blocking build pipeline (apache#3529)
  [GOBBLIN-1668] Add audit counts for iceberg registration (apache#3527)
  [GOBBLIN-1667] Create new predicate - ExistingPartitionSkipPredicate (apache#3526)
  Calculate requested container count based on adding allocated count and outstanding ContainerRequests in Yarn (apache#3524)
  make the requestedContainerCountMap correctly update the container count (apache#3523)
  Fix running counts for retried flows (apache#3520)
  Allow table to flush after write failure (apache#3522)
  [GOBBLIN-1652]Add more log in the KafkaJobStatusMonitor in case it fails to process one GobblinTrackingEvent (apache#3513)
  Make Yarn container and helix instance allocation group by tag (apache#3519)
  [GOBBLIN-1657] Update completion watermark on change_property in IcebergMetadataWriter (apache#3517)
  [GOBBLIN-1654] Add capacity floor to avoid aggressively requesting resource and small files. (apache#3515)
  [GOBBLIN-1653] Shorten job name length if it exceeds 255 characters (apache#3514)
  [GOBBLIN-1650] Implement flowGroup quotas for the DagManager (apache#3511)
  [GOBBLIN-1648] Complete use of JDBC `DataSource` 'read-only' validation query by incorporating where previously omitted (apache#3509)
  Add config to set close timeout in HiveRegister (apache#3512)
  add an API in AbstractBaseKafkaConsumerClient to list selected topics (apache#3501)
  [GOBBLIN-1649] Revert gobblin-1633 (apache#3510)
  [GOBBLIN-1639] Prevent metrics reporting if configured, clean up workunit count metric (apache#3500)
  [GOBBLIN-1647] Add hive commit GTE to HiveMetadataWriter (apache#3508)
  [GOBBLIN-1633] Fix compaction actions on job failure not retried if compaction succeeds (apache#3494)
  [GOBBLIN-1646] Revert yarn container / helix tag group changes (apache#3507)
  [GOBBLIN-1641] Add meter for sla exceeded flows (apache#3502)
  GOBBLIN-1644 (apache#3506)
  [GOBBLIN-1645]Change the prefix of dagManager heartbeat to make it consistent with other metrics (apache#3505)
  Fix bug when shrinking the container in Yarn service (apache#3504)
  [GOBBLIN-1637] Add writer, operation, and partition info to failed metadata writer events (apache#3498)
  [GOBBLIN-1638] Fix unbalanced running count metrics due to Azkaban failures (apache#3499)
  [GOBBLIN-1634] Add retries on flow sla kills (apache#3495)
  [GOBBLIN-1620]Make yarn container allocation group by helix tag (apache#3487)
  [GOBBLIN-1636] Close DatasetCleaner after clean task (apache#3497)
  [GOBBLIN-1635] Avoid loading env configuration when using config store to improve the performance (apache#3496)
  use user supplied props to create FileSystem in DatasetCleanerTask (apache#3483)
  [GOBBLIN-1619] WriterUtils.mkdirsWithRecursivePermission contains race condition and puts unnecessary load on filesystem (apache#3477)
  use data node aliases to figure out data node names before using DMAS (apache#3493)
  [GOBBLIN-1630] Remove flow level metrics for adhoc flows (apache#3491)
  [GOBBLIN-1631]Emit heartbeat for dagManagerThread (apache#3492)
  [GOBBLIN-1624] Refactor quota management, fix various bugs in accounting of running … (apache#3481)
  [GOBBLIN-1613] Add metadata writers field to GMCE schema (apache#3490)
  Update README.md
  [GOBBLIN-1629] Make GobblinMCEWriter be able to catch error when calculating hive specs (apache#3489)
  Add/fix some fields of MetadataWriterFailureEvent (apache#3485)
  [GOBBLIN-1627] provide option to convert datanodes names (apache#3484)
  Add coverage for edge cases when table paths do not exist, check parents (apache#3482)
  [GOBBLIN-1616] Add close connection logic in salseforceSource (apache#3486)
  [GOBBLIN-1621] Make HelixRetriggeringJobCallable emit job skip event when job is dropped due to previous job is running (apache#3478)
  [GOBBLIN-1623] Fix NPE when try to close RestApiConnector (apache#3480)
  Clear bad mysql packages from cache in CI/CD machines (apache#3479)
  [GOBBLIN-1617] pass configurations to some HadoopUtils APIs (apache#3475)
  [GOBBLIN-1616] Make RestApiConnector be able to close the connection finally (apache#3474)
  add config to set log level for any class (apache#3473)
  Fix bug where partitioned tables would always return the wrong equality in paths (apache#3472)
  [GOBBLIN-1602] Change hive table location and partition check to validate using FS r… (apache#3459)
  Don't flush on change_property operation (apache#3467)
  Fix case where error GTE is incorrectly sent from MCE writer (apache#3466)
  partial rollback of PR 3464 (apache#3465)
  [GOBBLIN-1604] Throw exception if there are no allocated requests due to lack of res… (apache#3461)
  [GOBBLIN-1603] Throws error if configured when encountering an IO exception while co… (apache#3460)
  [GOBBLIN-1606] change DEFAULT_GOBBLIN_COPY_CHECK_FILESIZE value (apache#3464)
  Upgraded dropwizard metrics library version from 3.2.3 -> 4.1.2 and added a new wrapper class on dropwizard Timer.Context class to handle the code compatibility as the newer version of this class implements AutoClosable instead of Closable. (apache#3463)
  [GOBBLIN-1605] Fix mysql ubuntu download 404 not found for Github Actions CI/CD (apache#3462)
  [GOBBLIN-1601] implement ChangePermissionCommitStep (apache#3457)
  [GOBBLIN-1598]Fix metrics already exist issue in dag manager (apache#3454)
  [GOBBLIN-1597] Add error handling in dagmanager to continue if dag fails to process,… (apache#3452)
  GOBBLIN-1579 Fail job on hive existing target table location mismatch (apache#3433)
  [GOBBLIN-1596] Ignore already exists exception if the table has already been created… (apache#3451)
  [GOBBLIn-1595]Fix the dead lock during hive registration (apache#3450)
  Add guard in DagManager for improperly formed SLA (apache#3449)
  [GOBBLIN-1588] Send failure events for write failures when watermark is advanced in MCE writer (apache#3441)
  [GOBBLIN-1593] Fix bugs in dag manager about metric reporting and job status monitor (apache#3448)
  Fix bug in `JobSpecSerializer` of inadequately preventing access errors (within `MysqlJobCatalog`) (apache#3447)
  [GOBBLIN-1583] Add System level job start SLA (apache#3437)
  [GOBBLIN-1592] Make hive copy be able to apply filter on directory (apache#3446)
  [GOBBLIN-1585]GaaS (DagManager) keep retrying a failed job beyond max attempt number (apache#3439)
  [GOBBLIN-1590] Add low/high watermark information in event emitted by Gobblin cluster (apache#3443)
  [HotFix]Try to fix the mysql dependency issue in Github action (apache#3445)
  Lazily initialize FileContext and do not store a handle of it so it can be GC'ed when required (apache#3444)
  [GOBBLIN-1584] Add replace record logic for Mysql writer (apache#3438)
  Bump up code cov version (apache#3440)
  [GOBBLIN-1581] Iterate over Sql ResultSet in Only the Forward Direction (apache#3435)
  [GOBBLIN-1575] use reference count in helix manager, so that connect/disconnect are called once and at the right time (apache#3427)
  ...
phet added a commit to phet/gobblin that referenced this pull request Sep 19, 2022
* upstream/master: (124 commits)
  add config to set log level for any class (apache#3473)
  Fix bug where partitioned tables would always return the wrong equality in paths (apache#3472)
  [GOBBLIN-1602] Change hive table location and partition check to validate using FS r… (apache#3459)
  Don't flush on change_property operation (apache#3467)
  Fix case where error GTE is incorrectly sent from MCE writer (apache#3466)
  partial rollback of PR 3464 (apache#3465)
  [GOBBLIN-1604] Throw exception if there are no allocated requests due to lack of res… (apache#3461)
  [GOBBLIN-1603] Throws error if configured when encountering an IO exception while co… (apache#3460)
  [GOBBLIN-1606] change DEFAULT_GOBBLIN_COPY_CHECK_FILESIZE value (apache#3464)
  Upgraded dropwizard metrics library version from 3.2.3 -> 4.1.2 and added a new wrapper class on dropwizard Timer.Context class to handle the code compatibility as the newer version of this class implements AutoClosable instead of Closable. (apache#3463)
  [GOBBLIN-1605] Fix mysql ubuntu download 404 not found for Github Actions CI/CD (apache#3462)
  [GOBBLIN-1601] implement ChangePermissionCommitStep (apache#3457)
  [GOBBLIN-1598]Fix metrics already exist issue in dag manager (apache#3454)
  [GOBBLIN-1597] Add error handling in dagmanager to continue if dag fails to process,… (apache#3452)
  GOBBLIN-1579 Fail job on hive existing target table location mismatch (apache#3433)
  [GOBBLIN-1596] Ignore already exists exception if the table has already been created… (apache#3451)
  [GOBBLIn-1595]Fix the dead lock during hive registration (apache#3450)
  Add guard in DagManager for improperly formed SLA (apache#3449)
  [GOBBLIN-1588] Send failure events for write failures when watermark is advanced in MCE writer (apache#3441)
  [GOBBLIN-1593] Fix bugs in dag manager about metric reporting and job status monitor (apache#3448)
  Fix bug in `JobSpecSerializer` of inadequately preventing access errors (within `MysqlJobCatalog`) (apache#3447)
  [GOBBLIN-1583] Add System level job start SLA (apache#3437)
  [GOBBLIN-1592] Make hive copy be able to apply filter on directory (apache#3446)
  [GOBBLIN-1585]GaaS (DagManager) keep retrying a failed job beyond max attempt number (apache#3439)
  [GOBBLIN-1590] Add low/high watermark information in event emitted by Gobblin cluster (apache#3443)
  [HotFix]Try to fix the mysql dependency issue in Github action (apache#3445)
  Lazily initialize FileContext and do not store a handle of it so it can be GC'ed when required (apache#3444)
  [GOBBLIN-1584] Add replace record logic for Mysql writer (apache#3438)
  Bump up code cov version (apache#3440)
  [GOBBLIN-1581] Iterate over Sql ResultSet in Only the Forward Direction (apache#3435)
  [GOBBLIN-1575] use reference count in helix manager, so that connect/disconnect are called once and at the right time (apache#3427)
  ...