[GOBBLIN-1732] Search for dummy file in writer directory #3589
Conversation
Please update the JIRA ticket with details on the issue.
String baseDatasetString = state.getProp(ConfigurationKeys.DATA_PUBLISHER_DATASET_DIR);
Path searchPath = new Path(baseDatasetString);
if (state.contains(TimeBasedWriterPartitioner.WRITER_PARTITION_PREFIX)) {
  searchPath = new Path(searchPath, state.getProp(TimeBasedWriterPartitioner.WRITER_PARTITION_PREFIX));
}
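The path construction in the diff above can be sketched in a self-contained form. This is a minimal illustration only: it uses `java.nio.file.Path` in place of Hadoop's `org.apache.hadoop.fs.Path`, a plain `Map` in place of Gobblin's `State`, and hypothetical key-name constants standing in for `ConfigurationKeys.DATA_PUBLISHER_DATASET_DIR` and `TimeBasedWriterPartitioner.WRITER_PARTITION_PREFIX`.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;

public class SearchPathSketch {
    // Hypothetical stand-ins for the Gobblin config keys referenced in the diff.
    static final String DATASET_DIR_KEY = "data.publisher.dataset.dir";
    static final String PARTITION_PREFIX_KEY = "writer.partition.prefix";

    // Build the directory to search for a sample file: start from the dataset
    // directory, and narrow to the partition prefix (e.g. "hourly") when that
    // config is present, so files from unrelated subdirectories are not picked up.
    static Path searchPath(Map<String, String> state) {
        Path path = Paths.get(state.get(DATASET_DIR_KEY));
        String prefix = state.get(PARTITION_PREFIX_KEY);
        if (prefix != null) {
            path = path.resolve(prefix);
        }
        return path;
    }

    public static void main(String[] args) {
        Map<String, String> withPrefix = Map.of(
            DATASET_DIR_KEY, "/data/tracking/myDataset",
            PARTITION_PREFIX_KEY, "hourly");
        // prints /data/tracking/myDataset/hourly
        System.out.println(searchPath(withPrefix));

        Map<String, String> noPrefix = Map.of(DATASET_DIR_KEY, "/data/tracking/myDataset");
        // prints /data/tracking/myDataset
        System.out.println(searchPath(noPrefix));
    }
}
```

When the prefix is absent, the search falls back to the top-level dataset directory, matching the previous behavior.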
Will the WRITER_PARTITION_PREFIX always be hourly?
For streaming, yes: this property is set, e.g. writer.partition.prefix=hourly
Codecov Report
@@ Coverage Diff @@
## master #3589 +/- ##
============================================
+ Coverage 43.22% 46.89% +3.66%
- Complexity 4866 10675 +5809
============================================
Files 1058 2119 +1061
Lines 42697 83027 +40330
Branches 4727 9245 +4518
============================================
+ Hits 18457 38934 +20477
- Misses 22441 40532 +18091
- Partials 1799 3561 +1762
LGTM. Thanks for the quick fix.
* upstream/master:
  - move dataset handler code before cleaning up staging data (apache#3594)
  - [GOBBLIN-1730] Include flow execution id when try to cancel/submit job using SimpleKafkaSpecProducer (apache#3588)
  - [GOBBLIN-1734] make DestinationDatasetHandler work on streaming sources (apache#3592)
  - give option to cancel helix workflow through Delete API (apache#3580)
  - [GOBBLIN-1728] Fix YarnService incorrect container allocation behavior (apache#3586)
  - Support multiple node types in shared flowgraph, fix logs (apache#3590)
  - Search for dummy file in writer directory (apache#3589)
  - Use root cause for checking if exception is transient (apache#3585)
  - [GOBBLIN-1724] Support a shared flowgraph layout in GaaS (apache#3583)
  - [GOBBLIN-1731] Enable HiveMetadataWriter to override table schema lit… (apache#3587)
  - [GOBBLIN-1726] Avro 1.9 upgrade of Gobblin OSS (apache#3581)
  - [GOBBLIN-1725] Fix bugs in gaas warm standby mode (apache#3582)
  - [GOBBLIN-1718] Define DagActionStoreMonitor to listen for kill/resume… (apache#3572)
  - Add log line for committing/retrieving watermarks in streaming (apache#3578)
  - [GOBBLIN-1707] Enhance `IcebergDataset` to detect when files already at dest then proceed with only delta (apache#3575)
  - Ignore AlreadyExistsException in hive writer (apache#3579)
  - Fail GMIP container for known transient exceptions to avoid data loss (apache#3576)
  - GOBBLIN-1715: Support vectorized row batch pooling (apache#3574)
  - [GOBBLIN-1696] Implement file based flowgraph that detects changes to the underlying… (apache#3548)
  - GOBBLIN-1719 Replace moveToTrash with moveToAppropriateTrash for hadoop trash (apache#3573)
  - [GOBBLIN-1703] avoid double quota increase for adhoc flows (apache#3550)
Dear Gobblin maintainers,
Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
JIRA
Description
Previously, when sending change_property events, the GMCE publisher searched the top-level dataset directory for an example file. If the dataset directory contained other subdirectories, this could pick up files with different schemas or paths. This PR changes the search to look under the TimeBasedWriterPartitioner.WRITER_PARTITION_PREFIX folder when that config is set, avoiding the issue. Additionally, the list/for loop previously used here was unnecessary, since ConfigurationKeys.DATA_PUBLISHER_DATASET_DIR is a single path, so it has been removed.
Tests
Updated unit test
Commits