Define basics for collecting Iceberg metadata for the current snapshot #3559
…ugin portal (apache#3554) * remove jcentral * Use gradle plugin portal for shadow * Use maven central in all other cases
…yment (apache#3551) * Allow first time failure to authenticate with Azkaban to fail silently * Fix findbugs report * Refactor azkaban authentication into function. Call on init and if session_id is null when adding a flow * Add handling for fetchSession throwing an exception * Add logging when fails on constructor and initialization, but continue to local deploy * Revert changes for azkabanSpecProducer, but quiet log instead of throw in constructor * Fixed vars * Revert changes on azkabanSpecProducer * clean up error throwing * revert function checking changes * Reformat file * Clean up function * Format file for try/catch * Allow first time failure to authenticate with Azkaban to fail silently * Fix findbugs report * Refactor azkaban authentication into function. Call on init and if session_id is null when adding a flow * Fixed rebase * Fixed rebase * Revert changes for azkabanSpecProducer, but quiet log instead of throw in constructor * Add whitespace back
…ache#3556) address comments update stoppingStateEndTime with currentTime update test cases
…compaction jobs apache#3552 * before starting reduce * after first record is reduced * after reducing every 1000 records Co-authored-by: Urmi Mustafi <[email protected]>
[GOBBLIN-1700] Remove unused coveralls-gradle-plugin dependency
* upstream/master: (124 commits) [GOBBLIN-1699] Log progress of reducer task for visibility with slow compaction jobs apache#3552 fix helix job wait completion bug when job goes to STOPPING state (apache#3556) [GOBBLIN-1695] Fix: Failure to add spec executors doesn't block deployment (apache#3551) [GOBBLIN-1701] Replace jcenter with either maven central or gradle plugin portal (apache#3554) [GOBBLIN-1700] Remove unused coveralls-gradle-plugin dependency add MysqlUserQuotaManager (apache#3545) [GOBBLIN-1689] Decouple compiler from scheduler in warm standby mode (apache#3544) Add GMCE topic explicitly to hive commit event (apache#3547) [GOBBLIN-1678] Refactor git flowgraph component to be extensible (apache#3536) [GOBBLIN-1690] Added logging to ORC writer Allow all iceberg exceptions to be fault tolerant (apache#3541) Guard against exists fs call as well (apache#3538) Add error handling for timeaware finder to handle scenarios where fil… (apache#3537) [GOBBLIN-1675] Add pagination for GaaS on server side (apache#3533) [GOBBLIN-1672] Refactor metrics from DagManager into its own class, add metrics per … (apache#3532) [GOBBLIN-1677] Fix timezone property to read from key correctly (apache#3535) [Gobblin-931] Fix typo in gobblin CLI usage (apache#3530) [GOBBLIN-1671] : Fix gobblin.sh script to add external jars as colon separated to HADOOP_CLASSPATH (apache#3531) [GOBBLIN-1656] Return a http status 503 on GaaS when quota is exceeded for user or flowgroup (apache#3516) [GOBBLIN-1669] Clean up TimeAwareRecursiveCopyableDataset to support seconds in time… (apache#3528) [GOBBLIN-1670] Remove rat tasks and unneeded checkstyles blocking build pipeline (apache#3529) [GOBBLIN-1668] Add audit counts for iceberg registration (apache#3527) [GOBBLIN-1667] Create new predicate - ExistingPartitionSkipPredicate (apache#3526) Calculate requested container count based on adding allocated count and outstanding ContainerRequests in Yarn (apache#3524) make the requestedContainerCountMap 
correctly update the container count (apache#3523) Fix running counts for retried flows (apache#3520) Allow table to flush after write failure (apache#3522) [GOBBLIN-1652]Add more log in the KafkaJobStatusMonitor in case it fails to process one GobblinTrackingEvent (apache#3513) Make Yarn container and helix instance allocation group by tag (apache#3519) [GOBBLIN-1657] Update completion watermark on change_property in IcebergMetadataWriter (apache#3517) [GOBBLIN-1654] Add capacity floor to avoid aggressively requesting resource and small files. (apache#3515) [GOBBLIN-1653] Shorten job name length if it exceeds 255 characters (apache#3514) [GOBBLIN-1650] Implement flowGroup quotas for the DagManager (apache#3511) [GOBBLIN-1648] Complete use of JDBC `DataSource` 'read-only' validation query by incorporating where previously omitted (apache#3509) Add config to set close timeout in HiveRegister (apache#3512) add an API in AbstractBaseKafkaConsumerClient to list selected topics (apache#3501) [GOBBLIN-1649] Revert gobblin-1633 (apache#3510) [GOBBLIN-1639] Prevent metrics reporting if configured, clean up workunit count metric (apache#3500) [GOBBLIN-1647] Add hive commit GTE to HiveMetadataWriter (apache#3508) [GOBBLIN-1633] Fix compaction actions on job failure not retried if compaction succeeds (apache#3494) [GOBBLIN-1646] Revert yarn container / helix tag group changes (apache#3507) [GOBBLIN-1641] Add meter for sla exceeded flows (apache#3502) GOBBLIN-1644 (apache#3506) [GOBBLIN-1645]Change the prefix of dagManager heartbeat to make it consistent with other metrics (apache#3505) Fix bug when shrinking the container in Yarn service (apache#3504) [GOBBLIN-1637] Add writer, operation, and partition info to failed metadata writer events (apache#3498) [GOBBLIN-1638] Fix unbalanced running count metrics due to Azkaban failures (apache#3499) [GOBBLIN-1634] Add retries on flow sla kills (apache#3495) [GOBBLIN-1620]Make yarn container allocation group by helix tag (apache#3487) 
[GOBBLIN-1636] Close DatasetCleaner after clean task (apache#3497) [GOBBLIN-1635] Avoid loading env configuration when using config store to improve the performance (apache#3496) use user supplied props to create FileSystem in DatasetCleanerTask (apache#3483) [GOBBLIN-1619] WriterUtils.mkdirsWithRecursivePermission contains race condition and puts unnecessary load on filesystem (apache#3477) use data node aliases to figure out data node names before using DMAS (apache#3493) [GOBBLIN-1630] Remove flow level metrics for adhoc flows (apache#3491) [GOBBLIN-1631]Emit heartbeat for dagManagerThread (apache#3492) [GOBBLIN-1624] Refactor quota management, fix various bugs in accounting of running … (apache#3481) [GOBBLIN-1613] Add metadata writers field to GMCE schema (apache#3490) Update README.md [GOBBLIN-1629] Make GobblinMCEWriter be able to catch error when calculating hive specs (apache#3489) Add/fix some fields of MetadataWriterFailureEvent (apache#3485) [GOBBLIN-1627] provide option to convert datanodes names (apache#3484) Add coverage for edge cases when table paths do not exist, check parents (apache#3482) [GOBBLIN-1616] Add close connection logic in salseforceSource (apache#3486) [GOBBLIN-1621] Make HelixRetriggeringJobCallable emit job skip event when job is dropped due to previous job is running (apache#3478) [GOBBLIN-1623] Fix NPE when try to close RestApiConnector (apache#3480) Clear bad mysql packages from cache in CI/CD machines (apache#3479) [GOBBLIN-1617] pass configurations to some HadoopUtils APIs (apache#3475) [GOBBLIN-1616] Make RestApiConnector be able to close the connection finally (apache#3474) add config to set log level for any class (apache#3473) Fix bug where partitioned tables would always return the wrong equality in paths (apache#3472) [GOBBLIN-1602] Change hive table location and partition check to validate using FS r… (apache#3459) Don't flush on change_property operation (apache#3467) Fix case where error GTE is incorrectly sent from 
MCE writer (apache#3466) partial rollback of PR 3464 (apache#3465) [GOBBLIN-1604] Throw exception if there are no allocated requests due to lack of res… (apache#3461) [GOBBLIN-1603] Throws error if configured when encountering an IO exception while co… (apache#3460) [GOBBLIN-1606] change DEFAULT_GOBBLIN_COPY_CHECK_FILESIZE value (apache#3464) Upgraded dropwizard metrics library version from 3.2.3 -> 4.1.2 and added a new wrapper class on dropwizard Timer.Context class to handle the code compatibility as the newer version of this class implements AutoClosable instead of Closable. (apache#3463) [GOBBLIN-1605] Fix mysql ubuntu download 404 not found for Github Actions CI/CD (apache#3462) [GOBBLIN-1601] implement ChangePermissionCommitStep (apache#3457) [GOBBLIN-1598]Fix metrics already exist issue in dag manager (apache#3454) [GOBBLIN-1597] Add error handling in dagmanager to continue if dag fails to process,… (apache#3452) GOBBLIN-1579 Fail job on hive existing target table location mismatch (apache#3433) [GOBBLIN-1596] Ignore already exists exception if the table has already been created… (apache#3451) [GOBBLIn-1595]Fix the dead lock during hive registration (apache#3450) Add guard in DagManager for improperly formed SLA (apache#3449) [GOBBLIN-1588] Send failure events for write failures when watermark is advanced in MCE writer (apache#3441) [GOBBLIN-1593] Fix bugs in dag manager about metric reporting and job status monitor (apache#3448) Fix bug in `JobSpecSerializer` of inadequately preventing access errors (within `MysqlJobCatalog`) (apache#3447) [GOBBLIN-1583] Add System level job start SLA (apache#3437) [GOBBLIN-1592] Make hive copy be able to apply filter on directory (apache#3446) [GOBBLIN-1585]GaaS (DagManager) keep retrying a failed job beyond max attempt number (apache#3439) [GOBBLIN-1590] Add low/high watermark information in event emitted by Gobblin cluster (apache#3443) [HotFix]Try to fix the mysql dependency issue in Github action (apache#3445) Lazily 
initialize FileContext and do not store a handle of it so it can be GC'ed when required (apache#3444) [GOBBLIN-1584] Add replace record logic for Mysql writer (apache#3438) Bump up code cov version (apache#3440) [GOBBLIN-1581] Iterate over Sql ResultSet in Only the Forward Direction (apache#3435) [GOBBLIN-1575] use reference count in helix manager, so that connect/disconnect are called once and at the right time (apache#3427) ...
…een task runner / application master for Dynamic work unit allocation (apache#3539) * [GOBBLIN-1673] Schema for dynamic work unit message * [GOBBLIN-1683] Dynamic Work Unit messaging abstractions
* upstream/master: [GOBBLIN-1673][GOBBLIN-1683] Skeleton code for handling messages between task runner / application master for Dynamic work unit allocation (apache#3539)
...ement/src/main/java/org/apache/gobblin/data/management/copy/iceberg/IcebergSnapshotInfo.java
private final List<String> manifestFilePaths;
private final List<List<String>> manifestListedFilePaths; // NOTE: order parallels that of `manifestFilePaths`
If these objects are logically linked and their order must be kept in sync, would it make sense to use a list of a new class, say `IcebergManifestFiles`? Each object would contain a manifest file path and the list of file paths for that manifest, and it could expose a method such as `getAllPaths()` to fetch all the files.
Makes sense. I added a helper POJO class.
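For readers following the thread, the suggestion can be sketched roughly as below. This is only an illustration under stated assumptions: the class name `IcebergManifestFiles` and the method `getAllPaths()` are taken from the review comment, while the constructor, the other accessors, and the ordering of the returned paths are hypothetical and not necessarily what the PR merged.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper named after the reviewer's suggestion; not the class actually merged.
final class IcebergManifestFiles {
  private final String manifestFilePath;
  private final List<String> listedFilePaths;

  IcebergManifestFiles(String manifestFilePath, List<String> listedFilePaths) {
    this.manifestFilePath = manifestFilePath;
    // Defensive, read-only copy: the manifest and its listed files can no longer drift apart,
    // which was the risk with two parallel lists kept in matching order.
    this.listedFilePaths = Collections.unmodifiableList(new ArrayList<>(listedFilePaths));
  }

  String getManifestFilePath() {
    return manifestFilePath;
  }

  List<String> getListedFilePaths() {
    return listedFilePaths;
  }

  /** Returns the manifest file path followed by every file path that manifest lists. */
  List<String> getAllPaths() {
    List<String> all = new ArrayList<>(listedFilePaths.size() + 1);
    all.add(manifestFilePath);
    all.addAll(listedFilePaths);
    return all;
  }
}
```

With such a type, the two parallel fields could collapse into a single `List<IcebergManifestFiles>`, so the pairing is enforced by construction rather than by a comment.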
Codecov Report
@@             Coverage Diff              @@
##             master    #3559      +/-   ##
============================================
- Coverage     46.81%   46.78%   -0.03%
  Complexity    10514    10514
============================================
  Files          2095     2099       +4
  Lines         81945    81985      +40
  Branches       9129     9131       +2
============================================
- Hits          38361    38357       -4
- Misses        40040    40086      +46
+ Partials       3544     3542       -2
* upstream/master: [GOBBLIN-1710] Codecov should be optional in CI and not fail Github Actions (apache#3562) Define basics for collecting Iceberg metadata for the current snapshot (apache#3559) [GOBBLIN-1698] Fast fail during work unit generation based on config. (apache#3542)
apache#3559) * [GOBBLIN-1701] Replace jcenter with either maven central or gradle plugin portal (apache#3554) * remove jcentral * Use gradle plugin portal for shadow * Use maven central in all other cases * [GOBBLIN-1695] Fix: Failure to add spec executors doesn't block deployment (apache#3551) * Allow first time failure to authenticate with Azkaban to fail silently * Fix findbugs report * Refactor azkaban authentication into function. Call on init and if session_id is null when adding a flow * Add handling for fetchSession throwing an exception * Add logging when fails on constructor and initialization, but continue to local deploy * Revert changes for azkabanSpecProducer, but quiet log instead of throw in constructor * Fixed vars * Revert changes on azkabanSpecProducer * clean up error throwing * revert function checking changes * Reformat file * Clean up function * Format file for try/catch * Allow first time failure to authenticate with Azkaban to fail silently * Fix findbugs report * Refactor azkaban authentication into function. 
Call on init and if session_id is null when adding a flow * Fixed rebase * Fixed rebase * Revert changes for azkabanSpecProducer, but quiet log instead of throw in constructor * Add whitespace back * fix helix job wait completion bug when job goes to STOPPING state (apache#3556) address comments update stoppingStateEndTime with currentTime update test cases * [GOBBLIN-1699] Log progress of reducer task for visibility with slow compaction jobs apache#3552 * before starting reduce * after first record is reduced * after reducing every 1000 records Co-authored-by: Urmi Mustafi <[email protected]> * Define basics for collecting Iceberg metadata for the current snapshot * [GOBBLIN-1673][GOBBLIN-1683] Skeleton code for handling messages between task runner / application master for Dynamic work unit allocation (apache#3539) * [GOBBLIN-1673] Schema for dynamic work unit message * [GOBBLIN-1683] Dynamic Work Unit messaging abstractions * Address review comments * Correct import order Co-authored-by: Matthew Ho <[email protected]> Co-authored-by: Andy Jiang <[email protected]> Co-authored-by: Hanghang Nate Liu <[email protected]> Co-authored-by: umustafi <[email protected]> Co-authored-by: Urmi Mustafi <[email protected]> Co-authored-by: William Lo <[email protected]>
…one flow execution (#3558) * address comments * use connectionmanager when httpclient is not cloesable * [GOBBLIN-1706]Add DagActionStore to store the action to kill/resume one flow execution * add new flow execution handler which use DagactionStore to persist dag actions and let other host get the info * Make dag manager integrate with the dag action store * address comments * address comments * fix typo and add comments * [GOBBLIN-1699] Log progress of reducer task for visibility with slow compaction jobs #3552 * before starting reduce * after first record is reduced * after reducing every 1000 records Co-authored-by: Urmi Mustafi <[email protected]> * [GOBBLIN-1673][GOBBLIN-1683] Skeleton code for handling messages between task runner / application master for Dynamic work unit allocation (#3539) * [GOBBLIN-1673] Schema for dynamic work unit message * [GOBBLIN-1683] Dynamic Work Unit messaging abstractions * [GOBBLIN-1698] Fast fail during work unit generation based on config. (#3542) * fast fail during work unit generation based on config. 
* [GOBBLIN-1690] Added logging to ORC writer Closes #3543 from rdsr/master * [GOBBLIN-1678] Refactor git flowgraph component to be extensible (#3536) * Refactor git flowgraph component to be extensible * Move files to appropriate modules * Cleanup and add javadocs * Cleanup, add missing javadocs * Address review and import order * Fix findbugs * Use java sort instead of collections * Add GMCE topic explicitly to hive commit event (#3547) * [GOBBLIN-1689] Decouple compiler from scheduler in warm standby mode (#3544) * address comments * use connectionmanager when httpclient is not cloesable * [GOBBLIN-1689] Decouple compiler from scheduler in warm standby mode * add orchestor as listener before service start * fix code style * address comments * fix test case to test orchestor as one listener of flow spec * remove unintentional change * remove unused import * address comments * fix typo Co-authored-by: Zihan Li <[email protected]> * fast fail during work unit generation based on config. Co-authored-by: Meeth Gala <[email protected]> Co-authored-by: Ratandeep <[email protected]> Co-authored-by: William Lo <[email protected]> Co-authored-by: Jack Moseley <[email protected]> Co-authored-by: Zihan Li <[email protected]> Co-authored-by: Zihan Li <[email protected]> * Define basics for collecting Iceberg metadata for the current snapshot (#3559) * [GOBBLIN-1701] Replace jcenter with either maven central or gradle plugin portal (#3554) * remove jcentral * Use gradle plugin portal for shadow * Use maven central in all other cases * [GOBBLIN-1695] Fix: Failure to add spec executors doesn't block deployment (#3551) * Allow first time failure to authenticate with Azkaban to fail silently * Fix findbugs report * Refactor azkaban authentication into function. 
Call on init and if session_id is null when adding a flow * Add handling for fetchSession throwing an exception * Add logging when fails on constructor and initialization, but continue to local deploy * Revert changes for azkabanSpecProducer, but quiet log instead of throw in constructor * Fixed vars * Revert changes on azkabanSpecProducer * clean up error throwing * revert function checking changes * Reformat file * Clean up function * Format file for try/catch * Allow first time failure to authenticate with Azkaban to fail silently * Fix findbugs report * Refactor azkaban authentication into function. Call on init and if session_id is null when adding a flow * Fixed rebase * Fixed rebase * Revert changes for azkabanSpecProducer, but quiet log instead of throw in constructor * Add whitespace back * fix helix job wait completion bug when job goes to STOPPING state (#3556) address comments update stoppingStateEndTime with currentTime update test cases * [GOBBLIN-1699] Log progress of reducer task for visibility with slow compaction jobs #3552 * before starting reduce * after first record is reduced * after reducing every 1000 records Co-authored-by: Urmi Mustafi <[email protected]> * Define basics for collecting Iceberg metadata for the current snapshot * [GOBBLIN-1673][GOBBLIN-1683] Skeleton code for handling messages between task runner / application master for Dynamic work unit allocation (#3539) * [GOBBLIN-1673] Schema for dynamic work unit message * [GOBBLIN-1683] Dynamic Work Unit messaging abstractions * Address review comments * Correct import order Co-authored-by: Matthew Ho <[email protected]> Co-authored-by: Andy Jiang <[email protected]> Co-authored-by: Hanghang Nate Liu <[email protected]> Co-authored-by: umustafi <[email protected]> Co-authored-by: Urmi Mustafi <[email protected]> Co-authored-by: William Lo <[email protected]> * [GOBBLIN-1710] Codecov should be optional in CI and not fail Github Actions (#3562) * [GOBBLIN-1711] Replace Jcenter with 
maven central (#3566) * [GOBBLIN-1697]Have a separate resource handler to rely on CDC stream to do message forwarding (#3549) * address comments * use connectionmanager when httpclient is not cloesable * fix test case to test orchestor as one listener of flow spec * remove unintentional change * [GOBBLIN-1697]Have a separate resource handler to rely on CDC stream to do message forwarding * fix compilation error * address comments * address comments * address comments * update outdated javadoc Co-authored-by: Zihan Li <[email protected]> * [GOBBLIN-1709] Create Iceberg Datasets Finder, Iceberg Dataset and FileSet to generate Copy Entities to support Distcp for Iceberg (#3560) * initial commit for iceberg distcp. * adding copy entity helper and icerbeg distcp template and test case. * Adding unit tests and refactoring method definitions for an Iceberg dataset. * resolve conflicts after cleaning history * update iceberg dataset and finder to include javadoc * addressed comments on PR and aligned code check style * renamed vars, added logging and updated javadoc * update dataset descriptor with ternary operation and rename fs to sourceFs * added source and target fs and update iceberg dataset finder constructor * Update source and dest dataset methods as protected and add req args constructor * change the order of attributes for iceberg dataset finder ctor * update iceberg dataset methods with correct source and target fs Co-authored-by: Meeth Gala <[email protected]> * [GOBBLIN-1707] Add `IcebergTableTest` unit test (#3564) * Add `IcebergTableTest` unit test * Fixup comment and indentation * Minor correction of `Long` => `Integer` * Correct comment * [GOBBLIN-1711] Replace Jcenter with maven central (#3566) * Minor rename of local var Co-authored-by: Matthew Ho <[email protected]> * [GOBBLIN-1708] Improve TimeAwareRecursiveCopyableDataset to lookback only into datefolders that match range (#3563) * Check datetime range validity prior to recursing * Remove unused 
packages * Remove extra line * Reformat function * Check string prior to parsing * removed unused import * Change checkpathdatetimevalidity to use available localdatetime library parsing functions * Change to isempty * Modify check path to be flexible * Update javadoc * Add unit tests and refactor * change bind class as GOBBLIN-1697 get merged Co-authored-by: Zihan Li <[email protected]> Co-authored-by: umustafi <[email protected]> Co-authored-by: Urmi Mustafi <[email protected]> Co-authored-by: Matthew Ho <[email protected]> Co-authored-by: meethngala <[email protected]> Co-authored-by: Meeth Gala <[email protected]> Co-authored-by: Ratandeep <[email protected]> Co-authored-by: William Lo <[email protected]> Co-authored-by: Jack Moseley <[email protected]> Co-authored-by: Kip Kohn <[email protected]> Co-authored-by: Andy Jiang <[email protected]> Co-authored-by: Hanghang Nate Liu <[email protected]>
Dear Gobblin maintainers,
Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
JIRA
Description
Toward eventual full DistCp support for Iceberg: walk an Iceberg table's metadata to determine all physical file paths, for both metadata and actual data. At present this yields the full contents; subsequently we'll calculate a delta against previously copied data.
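As a rough illustration of the idea, and not the PR's implementation, the sketch below gathers every path reachable from one snapshot and then subtracts paths already copied. The class `SnapshotPathsSketch`, both method names, and the plain `Map` standing in for Iceberg's manifest structures are all hypothetical; a real implementation would read these paths through the Iceberg library rather than take them as arguments.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: plain collections stand in for Iceberg's metadata objects.
final class SnapshotPathsSketch {
  private SnapshotPathsSketch() {}

  /**
   * Collects all physical paths for a snapshot: the table metadata file, the manifest list,
   * each manifest file, and every file path listed by each manifest.
   */
  static Set<String> collectAllPaths(String metadataPath, String manifestListPath,
      Map<String, List<String>> manifestToListedPaths) {
    Set<String> paths = new LinkedHashSet<>(); // keeps insertion order, drops duplicates
    paths.add(metadataPath);
    paths.add(manifestListPath);
    for (Map.Entry<String, List<String>> entry : manifestToListedPaths.entrySet()) {
      paths.add(entry.getKey());
      paths.addAll(entry.getValue());
    }
    return paths;
  }

  /** Returns the paths in {@code current} that are absent from {@code previouslyCopied}. */
  static Set<String> deltaAgainst(Set<String> current, Set<String> previouslyCopied) {
    Set<String> delta = new LinkedHashSet<>(current);
    delta.removeAll(previouslyCopied);
    return delta;
  }
}
```

The first method corresponds to the "full contents" behavior described above; the second is one simple way the later delta could be computed, assuming the previously copied paths are tracked as a set.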
Tests
I tested manually, and I am now working to create a test Iceberg table, containing no PII, that I can check into `src/test/resources`. That would be a fast follow, along with the test, if I'm unable to generate it before committing this (to share code and unblock collaborators).
Commits