Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GOBBLIN-1732] Search for dummy file in writer directory #3589

Merged
merged 1 commit into from
Oct 21, 2022

Conversation

jack-moseley
Copy link
Contributor

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots (if applicable):

Before, when sending change_property events, the GMCE publisher searches in the top level dataset directory for an example file. This could find files with other schemas/paths if there are different directories in the dataset directory. This PR changes it to search under the TimeBasedWriterPartitioner.WRITER_PARTITION_PREFIX folder if that config exists to avoid this.

Also, there was no reason why there was a list/for loop here before, because ConfigurationKeys.DATA_PUBLISHER_DATASET_DIR is a single path, so the for loop has been removed.

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    Updated unit test

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

@jack-moseley jack-moseley changed the title [GOBBLIN-1732]Search for dummy file in writer directory [GOBBLIN-1732] Search for dummy file in writer directory Oct 21, 2022
Copy link
Contributor

@vikrambohra vikrambohra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the JIRA ticket with details on the issue.

String baseDatasetString = state.getProp(ConfigurationKeys.DATA_PUBLISHER_DATASET_DIR);
Path searchPath = new Path(baseDatasetString);
if (state.contains(TimeBasedWriterPartitioner.WRITER_PARTITION_PREFIX)) {
searchPath = new Path(searchPath, state.getProp(TimeBasedWriterPartitioner.WRITER_PARTITION_PREFIX));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will the WRITER_PARTITION_PREFIX always be hourly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For streaming yes this property is set like writer.partition.prefix=hourly

@codecov-commenter
Copy link

codecov-commenter commented Oct 21, 2022

Codecov Report

Merging #3589 (cfe26ac) into master (635e81f) will increase coverage by 3.66%.
The diff coverage is 68.75%.

@@             Coverage Diff              @@
##             master    #3589      +/-   ##
============================================
+ Coverage     43.22%   46.89%   +3.66%     
- Complexity     4866    10675    +5809     
============================================
  Files          1058     2119    +1061     
  Lines         42697    83027   +40330     
  Branches       4727     9245    +4518     
============================================
+ Hits          18457    38934   +20477     
- Misses        22441    40532   +18091     
- Partials       1799     3561    +1762     
Impacted Files Coverage Δ
...gobblin/iceberg/publisher/GobblinMCEPublisher.java 66.29% <68.75%> (+0.77%) ⬆️
...a/org/apache/gobblin/cluster/GobblinHelixTask.java 64.51% <0.00%> (-15.06%) ⬇️
.../gobblin/cluster/GobblinHelixTaskStateTracker.java 62.50% <0.00%> (-6.25%) ⬇️
.../modules/scheduler/GobblinServiceJobScheduler.java 63.20% <0.00%> (-1.42%) ⬇️
...he/gobblin/runtime/util/ExceptionCleanupUtils.java 60.00% <0.00%> (ø)
...che/gobblin/writer/http/RestJsonWriterBuilder.java 0.00% <0.00%> (ø)
...nstrumented/writer/InstrumentedDataWriterBase.java 46.15% <0.00%> (ø)
...a/org/apache/gobblin/hive/HiveTableComparator.java 0.00% <0.00%> (ø)
.../java/org/apache/gobblin/metrics/InnerCounter.java 100.00% <0.00%> (ø)
...l/io/orc/TypeDescriptionToObjectInspectorUtil.java 0.00% <0.00%> (ø)
... and 1059 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@vikrambohra
Copy link
Contributor

LGTM. Thanks for the quick fix.

@Will-Lo Will-Lo merged commit 13ed3d4 into apache:master Oct 21, 2022
phet pushed a commit to phet/gobblin that referenced this pull request Dec 3, 2022
phet added a commit to phet/gobblin that referenced this pull request Dec 5, 2022
* upstream/master:
  move dataset handler code before cleaning up staging data (apache#3594)
  [GOBBLIN-1730] Include flow execution id when try to cancel/submit job using SimpleKafkaSpecProducer (apache#3588)
  [GOBBLIN-1734] make DestinationDatasetHandler work on streaming sources (apache#3592)
  give option to cancel helix workflow through Delete API (apache#3580)
  [GOBBLIN-1728] Fix YarnService incorrect container allocation behavior (apache#3586)
  Support multiple node types in shared flowgraph, fix logs (apache#3590)
  Search for dummy file in writer directory (apache#3589)
  Use root cause for checking if exception is transient (apache#3585)
  [GOBBLIN-1724] Support a shared flowgraph layout in GaaS (apache#3583)
  [GOBBLIN-1731] Enable HiveMetadataWriter to override table schema lit… (apache#3587)
  [GOBBLIN-1726] Avro 1.9 upgrade of Gobblin OSS (apache#3581)
  [GOBBLIN-1725] Fix bugs in gaas warm standby mode (apache#3582)
  [GOBBLIN-1718] Define DagActionStoreMonitor to listen for kill/resume… (apache#3572)
  Add log line for committing/retrieving watermarks in streaming (apache#3578)
  [GOBBLIN-1707] Enhance `IcebergDataset` to detect when files already at dest then proceed with only delta (apache#3575)
  Ignore AlreadyExistsException in hive writer (apache#3579)
  Fail GMIP container for known transient exceptions to avoid data loss (apache#3576)
  GOBBLIN-1715: Support vectorized row batch pooling (apache#3574)
  [GOBBLIN-1696] Implement file based flowgraph that detects changes to the underlying… (apache#3548)
  GOBBLIN-1719 Replace moveToTrash with moveToAppropriateTrash for hadoop trash (apache#3573)
  [GOBBLIN-1703] avoid double quota increase for adhoc flows (apache#3550)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants