[GOBBLIN-1709] Create Iceberg Datasets Finder, Iceberg Dataset and FileSet to generate Copy Entities to support Distcp for Iceberg #3560

Merged: 12 commits, Sep 22, 2022

@meethngala meethngala commented Sep 13, 2022

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots (if applicable):
  • This PR focuses on Hive Catalog-based Iceberg tables.
  • It finds the Iceberg datasets that map to an Iceberg table and creates copy entities from all the files that need to be duplicated as part of the Distcp process.
  • Each copy entity is then submitted to the copy source for its work unit generation, which in turn publishes the data at the destination.
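
The flow described above (find datasets, enumerate the files each one references, emit one copy entity per file) can be sketched roughly as follows. This is only an illustration under stated assumptions: `IcebergCopySketch`, `TableDataset`, `FileCopy`, `findDatasets` and `generateCopyEntities` are hypothetical names invented for this sketch, the catalog is modelled as a plain in-memory map, and none of it is the Gobblin or Iceberg API used in this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; every type here is hypothetical, not a Gobblin class. */
public class IcebergCopySketch {

  /** Hypothetical stand-in for a copy entity: one file to copy from source to destination. */
  static final class FileCopy {
    final String sourcePath;
    final String targetPath;

    FileCopy(String sourcePath, String targetPath) {
      this.sourcePath = sourcePath;
      this.targetPath = targetPath;
    }
  }

  /** Hypothetical stand-in for a dataset: a table name plus the files that table references. */
  static final class TableDataset {
    final String tableName;
    final List<String> filePaths;

    TableDataset(String tableName, List<String> filePaths) {
      this.tableName = tableName;
      this.filePaths = new ArrayList<>(filePaths);
    }

    /** Builds one copy entity per referenced file, re-rooting each path under targetRoot. */
    List<FileCopy> generateCopyEntities(String sourceRoot, String targetRoot) {
      List<FileCopy> copies = new ArrayList<>();
      for (String path : filePaths) {
        // Naive prefix check; a real implementation would compare filesystem paths properly.
        if (!path.startsWith(sourceRoot)) {
          throw new IllegalArgumentException("File outside source root: " + path);
        }
        copies.add(new FileCopy(path, targetRoot + path.substring(sourceRoot.length())));
      }
      return copies;
    }
  }

  /** Hypothetical finder: turns each (table name, files) catalog entry into a dataset. */
  static List<TableDataset> findDatasets(Map<String, List<String>> catalog) {
    List<TableDataset> datasets = new ArrayList<>();
    for (Map.Entry<String, List<String>> entry : catalog.entrySet()) {
      datasets.add(new TableDataset(entry.getKey(), entry.getValue()));
    }
    return datasets;
  }

  public static void main(String[] args) {
    // Assumed sample data: one table with a metadata file and a data file.
    Map<String, List<String>> catalog = new LinkedHashMap<>();
    catalog.put("db.events", Arrays.asList(
        "/src/db/events/metadata/v1.metadata.json",
        "/src/db/events/data/part-0.orc"));

    for (TableDataset dataset : findDatasets(catalog)) {
      for (FileCopy copy : dataset.generateCopyEntities("/src", "/dest")) {
        System.out.println(copy.sourcePath + " -> " + copy.targetPath);
      }
    }
  }
}
```

In this sketch the copy entities are plain source/target pairs; turning each one into a work unit and publishing the copied data is left out, since that part is handled by the existing copy source rather than by the code in this PR.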

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
  • I have added unit tests for finding the relevant Iceberg datasets and for generating copy entities from the exact files to be copied.

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

@codecov-commenter
codecov-commenter commented Sep 13, 2022

Codecov Report

Merging #3560 (93097ce) into master (5ae4ec7) will increase coverage by 0.03%.
The diff coverage is 64.13%.

@@             Coverage Diff              @@
##             master    #3560      +/-   ##
============================================
+ Coverage     46.77%   46.80%   +0.03%     
- Complexity    10512    10544      +32     
============================================
  Files          2099     2105       +6     
  Lines         81988    82136     +148     
  Branches       9132     9144      +12     
============================================
+ Hits          38349    38444      +95     
- Misses        40096    40148      +52     
- Partials       3543     3544       +1     
| Impacted Files | Coverage Δ |
|---|---|
| ...a/org/apache/gobblin/dataset/DatasetConstants.java | 0.00% <ø> (ø) |
| .../management/copy/iceberg/IcebergDatasetFinder.java | 0.00% <0.00%> (ø) |
| ...a/management/copy/iceberg/IcebergTableFileSet.java | 0.00% <0.00%> (ø) |
| ...n/data/management/copy/iceberg/IcebergDataset.java | 84.28% <84.28%> (ø) |
| ...che/gobblin/runtime/api/InstrumentedSpecStore.java | 79.54% <0.00%> (-3.79%) ⬇️ |
| ...a/org/apache/gobblin/cluster/GobblinHelixTask.java | 64.51% <0.00%> (-2.16%) ⬇️ |
| .../modules/scheduler/GobblinServiceJobScheduler.java | 64.11% <0.00%> (-1.44%) ⬇️ |
| ...lin/elasticsearch/writer/FutureCallbackHolder.java | 61.42% <0.00%> (-1.43%) ⬇️ |
| ...ache/gobblin/runtime/spec_catalog/FlowCatalog.java | 46.54% <0.00%> (-0.56%) ⬇️ |
| ...obblin/service/FlowConfigResourceLocalHandler.java | 15.68% <0.00%> (-0.16%) ⬇️ |
| ... and 10 more | |

@phet phet left a comment

good start here, meeth.

do perhaps consider retitling the commit since this is about IcebergDataset, IcebergFileSet and CopyEntities

@phet phet left a comment

nice revision... very close now.

again, I still suggest retitling. maybe "Add IcebergDatasetsFinder to generate CopyEntities for Iceberg Distcp"?

@meethngala meethngala force-pushed the create-workunits-for-iceberg-distcp branch from c77f229 to a792af1 on September 15, 2022 23:26
@phet phet left a comment

almost looks like some changes you recently made have been backed out... I'll let you guide me before I continue re-reading

@phet phet left a comment

the test is looking much better! now mostly just some tips on late-breaking logging you added

@meethngala meethngala changed the title from "[GOBBLIN-1709] Create workunits for Hive Catalog based Iceberg datasets to support Distcp for Iceberg" to "[GOBBLIN-1709] Create Iceberg Datasets Finder, Iceberg Dataset and FileSet to generate Copy Entities to support Distcp for Iceberg" on Sep 19, 2022
@meethngala meethngala requested a review from Will-Lo September 19, 2022 06:37
@phet phet left a comment

great meeth--nice work!

@meethngala meethngala requested a review from Will-Lo September 20, 2022 16:26
@phet phet left a comment

looks great--I believe we're finally there, nice work!

@Will-Lo Will-Lo merged commit 18ec55b into apache:master Sep 22, 2022
phet pushed a commit to phet/gobblin that referenced this pull request Sep 22, 2022
…leSet to generate Copy Entities to support Distcp for Iceberg (apache#3560)

* initial commit for iceberg distcp.

* adding copy entity helper and icerbeg distcp template and test case.

* Adding unit tests and refactoring method definitions for an Iceberg dataset.

* resolve conflicts after cleaning history

* update iceberg dataset and finder to include javadoc

* addressed comments on PR and aligned code check style

* renamed vars, added logging and updated javadoc

* update dataset descriptor with ternary operation and rename fs to sourceFs

* added source and target fs and update iceberg dataset finder constructor

* Update source and dest dataset methods as protected and add req args constructor

* change the order of attributes for iceberg dataset finder ctor

* update iceberg dataset methods with correct source and target fs

Co-authored-by: Meeth Gala <[email protected]>
ZihanLi58 pushed a commit to ZihanLi58/incubator-gobblin that referenced this pull request Sep 23, 2022
…leSet to generate Copy Entities to support Distcp for Iceberg (apache#3560)

phet added a commit to phet/gobblin that referenced this pull request Sep 23, 2022
…can_icebergs_incrementally

* upstream/master:
  [GOBBLIN-1704] Purge offline helix instances during startup (apache#3561)
  [GOBBLIN-1708] Improve TimeAwareRecursiveCopyableDataset to lookback only into datefolders that match range (apache#3563)
  [GOBBLIN-1707] Add `IcebergTableTest` unit test (apache#3564)
  [GOBBLIN-1709] Create Iceberg Datasets Finder, Iceberg Dataset and FileSet to generate Copy Entities to support Distcp for Iceberg (apache#3560)
  [GOBBLIN-1697]Have a separate resource handler to rely on CDC stream to do message forwarding (apache#3549)
  [GOBBLIN-1711] Replace Jcenter with maven central (apache#3566)
ZihanLi58 added a commit that referenced this pull request Sep 23, 2022
…one flow execution (#3558)

* address comments

* use connectionmanager when httpclient is not cloesable

* [GOBBLIN-1706]Add DagActionStore to store the action to kill/resume one flow execution

* add new flow execution handler which use DagactionStore to persist dag actions and let other host get the info

* Make dag manager integrate with the dag action store

* address comments

* address comments

* fix typo and add comments

* [GOBBLIN-1699] Log progress of reducer task for visibility with slow compaction jobs #3552

* before starting reduce
* after first record is reduced
* after reducing every 1000 records

Co-authored-by: Urmi Mustafi <[email protected]>

* [GOBBLIN-1673][GOBBLIN-1683] Skeleton code for handling messages between task runner / application master for Dynamic work unit allocation (#3539)

* [GOBBLIN-1673] Schema for dynamic work unit message

* [GOBBLIN-1683] Dynamic Work Unit messaging abstractions

* [GOBBLIN-1698] Fast fail during work unit generation based on config. (#3542)

* fast fail during work unit generation based on config.

* [GOBBLIN-1690] Added logging to ORC writer

Closes #3543 from rdsr/master

* [GOBBLIN-1678] Refactor git flowgraph component to be extensible (#3536)

* Refactor git flowgraph component to be extensible

* Move files to appropriate modules

* Cleanup and add javadocs

* Cleanup, add missing javadocs

* Address review and import order

* Fix findbugs

* Use java sort instead of collections

* Add GMCE topic explicitly to hive commit event (#3547)

* [GOBBLIN-1689] Decouple compiler from scheduler in warm standby mode (#3544)

* address comments

* use connectionmanager when httpclient is not cloesable

* [GOBBLIN-1689] Decouple compiler from scheduler in warm standby mode

* add orchestor as listener before service start

* fix code style

* address comments

* fix test case to test orchestor as one listener of flow spec

* remove unintentional change

* remove unused import

* address comments

* fix typo

Co-authored-by: Zihan Li <[email protected]>

* fast fail during work unit generation based on config.

Co-authored-by: Meeth Gala <[email protected]>
Co-authored-by: Ratandeep <[email protected]>
Co-authored-by: William Lo <[email protected]>
Co-authored-by: Jack Moseley <[email protected]>
Co-authored-by: Zihan Li <[email protected]>
Co-authored-by: Zihan Li <[email protected]>

* Define basics for collecting Iceberg metadata for the current snapshot (#3559)

* [GOBBLIN-1701] Replace jcenter with either maven central or gradle plugin portal (#3554)

* remove jcentral
* Use gradle plugin portal for shadow
* Use maven central in all other cases

* [GOBBLIN-1695] Fix: Failure to add spec executors doesn't block deployment (#3551)

* Allow first time failure to authenticate with Azkaban to fail silently

* Fix findbugs report

* Refactor azkaban authentication into function. Call on init and if session_id is null when adding a flow

* Add handling for fetchSession throwing an exception

* Add logging when fails on constructor and initialization, but continue to local deploy

* Revert changes for azkabanSpecProducer, but quiet log instead of throw in constructor

* Fixed vars

* Revert changes on azkabanSpecProducer

* clean up error throwing

* revert function checking changes

* Reformat file

* Clean up function

* Format file for try/catch

* Allow first time failure to authenticate with Azkaban to fail silently

* Fix findbugs report

* Refactor azkaban authentication into function. Call on init and if session_id is null when adding a flow

* Fixed rebase

* Fixed rebase

* Revert changes for azkabanSpecProducer, but quiet log instead of throw in constructor

* Add whitespace back

* fix helix job wait completion bug when job goes to STOPPING state (#3556)

address comments

update stoppingStateEndTime with currentTime

update test cases

* [GOBBLIN-1699] Log progress of reducer task for visibility with slow compaction jobs #3552

* before starting reduce
* after first record is reduced
* after reducing every 1000 records

Co-authored-by: Urmi Mustafi <[email protected]>

* Define basics for collecting Iceberg metadata for the current snapshot

* [GOBBLIN-1673][GOBBLIN-1683] Skeleton code for handling messages between task runner / application master for Dynamic work unit allocation (#3539)

* [GOBBLIN-1673] Schema for dynamic work unit message

* [GOBBLIN-1683] Dynamic Work Unit messaging abstractions

* Address review comments

* Correct import order

Co-authored-by: Matthew Ho <[email protected]>
Co-authored-by: Andy Jiang <[email protected]>
Co-authored-by: Hanghang Nate Liu <[email protected]>
Co-authored-by: umustafi <[email protected]>
Co-authored-by: Urmi Mustafi <[email protected]>
Co-authored-by: William Lo <[email protected]>

* [GOBBLIN-1710]  Codecov should be optional in CI and not fail Github Actions (#3562)

* [GOBBLIN-1711] Replace Jcenter with maven central (#3566)

* [GOBBLIN-1697]Have a separate resource handler to rely on CDC stream to do message forwarding (#3549)

* address comments

* use connectionmanager when httpclient is not cloesable

* fix test case to test orchestor as one listener of flow spec

* remove unintentional change

* [GOBBLIN-1697]Have a separate resource handler to rely on CDC stream to do message forwarding

* fix compilation error

* address comments

* address comments

* address comments

* update outdated javadoc

Co-authored-by: Zihan Li <[email protected]>

* [GOBBLIN-1709] Create Iceberg Datasets Finder, Iceberg Dataset and FileSet to generate Copy Entities to support Distcp for Iceberg (#3560)

* initial commit for iceberg distcp.

* adding copy entity helper and icerbeg distcp template and test case.

* Adding unit tests and refactoring method definitions for an Iceberg dataset.

* resolve conflicts after cleaning history

* update iceberg dataset and finder to include javadoc

* addressed comments on PR and aligned code check style

* renamed vars, added logging and updated javadoc

* update dataset descriptor with ternary operation and rename fs to sourceFs

* added source and target fs and update iceberg dataset finder constructor

* Update source and dest dataset methods as protected and add req args constructor

* change the order of attributes for iceberg dataset finder ctor

* update iceberg dataset methods with correct source and target fs

Co-authored-by: Meeth Gala <[email protected]>

* [GOBBLIN-1707] Add `IcebergTableTest` unit test (#3564)

* Add `IcebergTableTest` unit test

* Fixup comment and indentation

* Minor correction of `Long` => `Integer`

* Correct comment

* [GOBBLIN-1711] Replace Jcenter with maven central (#3566)

* Minor rename of local var

Co-authored-by: Matthew Ho <[email protected]>

* [GOBBLIN-1708] Improve TimeAwareRecursiveCopyableDataset to lookback only into datefolders that match range (#3563)

* Check datetime range validity prior to recursing

* Remove unused packages

* Remove extra line

* Reformat function

* Check string prior to parsing

* removed unused import

* Change checkpathdatetimevalidity to use available localdatetime library parsing functions

* Change to isempty

* Modify check path to be flexible

* Update javadoc

* Add unit tests and refactor

* change bind class as GOBBLIN-1697 get merged

Co-authored-by: Zihan Li <[email protected]>
Co-authored-by: umustafi <[email protected]>
Co-authored-by: Urmi Mustafi <[email protected]>
Co-authored-by: Matthew Ho <[email protected]>
Co-authored-by: meethngala <[email protected]>
Co-authored-by: Meeth Gala <[email protected]>
Co-authored-by: Ratandeep <[email protected]>
Co-authored-by: William Lo <[email protected]>
Co-authored-by: Jack Moseley <[email protected]>
Co-authored-by: Kip Kohn <[email protected]>
Co-authored-by: Andy Jiang <[email protected]>
Co-authored-by: Hanghang Nate Liu <[email protected]>
ZihanLi58 added a commit that referenced this pull request Sep 27, 2022
…not only the current one (#3569)

* Add `IcebergTableTest` unit test

* Fixup comment and indentation

* Minor correction of `Long` => `Integer`

* Correct comment

* [GOBBLIN-1711] Replace Jcenter with maven central (#3566)

* Minor rename of local var

* Extend `IcebergTable` to collect Iceberg metadata across all snapshots

* [GOBBLIN-1697]Have a separate resource handler to rely on CDC stream to do message forwarding (#3549)

* address comments

* use connectionmanager when httpclient is not cloesable

* fix test case to test orchestor as one listener of flow spec

* remove unintentional change

* [GOBBLIN-1697]Have a separate resource handler to rely on CDC stream to do message forwarding

* fix compilation error

* address comments

* address comments

* address comments

* update outdated javadoc

Co-authored-by: Zihan Li <[email protected]>

* [GOBBLIN-1709] Create Iceberg Datasets Finder, Iceberg Dataset and FileSet to generate Copy Entities to support Distcp for Iceberg (#3560)

* initial commit for iceberg distcp.

* adding copy entity helper and icerbeg distcp template and test case.

* Adding unit tests and refactoring method definitions for an Iceberg dataset.

* resolve conflicts after cleaning history

* update iceberg dataset and finder to include javadoc

* addressed comments on PR and aligned code check style

* renamed vars, added logging and updated javadoc

* update dataset descriptor with ternary operation and rename fs to sourceFs

* added source and target fs and update iceberg dataset finder constructor

* Update source and dest dataset methods as protected and add req args constructor

* change the order of attributes for iceberg dataset finder ctor

* update iceberg dataset methods with correct source and target fs

Co-authored-by: Meeth Gala <[email protected]>

* Update `IcebergDataset` to use `IcebergTable.getIncrementalSnapshotInfosIterator` rather than `.getCurrentSnapshotInfo`

* Augment `IcebergDatasetTest` unit test to exercise mult-snapshot icebergs

* Minor javadoc Update

* Throw `IcebergTable.TableNotFoundException` when no such table found

Co-authored-by: Matthew Ho <[email protected]>
Co-authored-by: Zihan Li <[email protected]>
Co-authored-by: Zihan Li <[email protected]>
Co-authored-by: meethngala <[email protected]>
Co-authored-by: Meeth Gala <[email protected]>
arjun4084346 pushed a commit to arjun4084346/gobblin that referenced this pull request Sep 28, 2022
…not only the current one (apache#3569)

phet pushed a commit to phet/gobblin that referenced this pull request Sep 29, 2022
…leSet to generate Copy Entities to support Distcp for Iceberg (apache#3560)

phet added a commit to phet/gobblin that referenced this pull request Sep 29, 2022
…one flow execution (apache#3558)

phet added a commit to phet/gobblin that referenced this pull request Sep 29, 2022
…not only the current one (apache#3569)

4 participants