Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GOBBLIN-1715: Support for Vectorized row batch pooling #3574

Merged
merged 1 commit into from
Oct 11, 2022

Conversation

rdsr
Copy link
Contributor

@rdsr rdsr commented Sep 30, 2022

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots (if applicable):
    As mentioned in the ticket the existing array resizing is buggy for ORC's vectorized row batch. In almost all the cases we should fall back on the 'enlargeFactor' option. In some cases where throughput may be a big concern. We should enable the proposed approach below. We've seen some benefits of using the current approach.
  1. Major GC pauses have reduced
    BEFORE Before
    AFTER After

  2. Memory is only allocated when needed which when compared to the old approach allocated a lot more memory even when it was not required. [For details please see the ticket]
    BEFORE: before2
    AFTER: Screen Shot 2022-09-30 at 3 41 05 PM

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • [x ] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

@codecov-commenter
Copy link

codecov-commenter commented Sep 30, 2022

Codecov Report

Merging #3574 (02cf437) into master (a246e23) will decrease coverage by 0.00%.
The diff coverage is 85.96%.

@@             Coverage Diff              @@
##             master    #3574      +/-   ##
============================================
- Coverage     46.87%   46.87%   -0.01%     
- Complexity    10630    10640      +10     
============================================
  Files          2112     2113       +1     
  Lines         82744    82798      +54     
  Branches       9215     9220       +5     
============================================
+ Hits          38784    38808      +24     
- Misses        40404    40429      +25     
- Partials       3556     3561       +5     
Impacted Files Coverage Δ
...rg/apache/gobblin/writer/GobblinBaseOrcWriter.java 65.45% <60.00%> (-1.54%) ⬇️
...n/java/org/apache/gobblin/writer/RowBatchPool.java 91.48% <91.48%> (ø)
...a/org/apache/gobblin/cluster/GobblinHelixTask.java 62.36% <0.00%> (-21.51%) ⬇️
.../gobblin/cluster/GobblinHelixTaskStateTracker.java 62.50% <0.00%> (-6.25%) ⬇️
...t/version/TimestampedDatasetStateStoreVersion.java 27.77% <0.00%> (-5.56%) ⬇️
.../apache/gobblin/runtime/api/JobExecutionState.java 79.43% <0.00%> (-0.94%) ⬇️
.../java/org/apache/gobblin/cluster/SleepingTask.java 45.45% <0.00%> (+6.06%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Contributor

@homatthew homatthew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nits but nothing major. Great work!


static final String ROW_BATCH_EXPIRY_PERIOD = "orc.row.batch.expiry.period.secs";
static final int ROW_BATCH_EXPIRY_PERIOD_DEFAULT = 1;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really minor nits about style. Typically we follow a pattern of having a shared prefix variable for orc.row.batch.expiry. or even just orc.row.batch as the prefix. And then we start the default values with the name DEFAULT_

See

public static final String TOKEN_RENEW_INTERVAL_IN_MINUTES = GOBBLIN_YARN_PREFIX + "token.renew.interval.minutes";
public static final Long DEFAULT_TOKEN_RENEW_INTERVAL_IN_MINUTES = Long.MAX_VALUE;
for an example.

Also, maybe add the word pool to these settings since these are specific to the batch pool and not for regular row batch?

@@ -269,6 +280,7 @@ public void commit()
@Override
public void write(D record)
throws IOException {
Preconditions.checkState(!closed, "Writer already closed");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When did you see this edge case happen? And does this cause the fork to immediately terminate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given there's so much usage and abstract classes which this extends I wanted to be sure that this condition is preserved

synchronized (rowBatches) {
LinkedList<RowBatchHolder> vals = rowBatches.get(schema);
VectorizedRowBatch rowBatch;
if (vals == null || vals.size() == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dumb question but do we have a preference between apache commons CollectionUtils.isEmpty and a basic null check like this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not keen on adding more dependencies here if this is a small change

@umustafi
Copy link
Contributor

umustafi commented Oct 3, 2022

btw service tests are failing checkstyle because you need the apache docstring header at the top of new files

Copy link
Contributor

@ZihanLi58 ZihanLi58 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@ZihanLi58 ZihanLi58 merged commit fafd40b into apache:master Oct 11, 2022
phet pushed a commit to phet/gobblin that referenced this pull request Dec 3, 2022
phet added a commit to phet/gobblin that referenced this pull request Dec 5, 2022
* upstream/master:
  move dataset handler code before cleaning up staging data (apache#3594)
  [GOBBLIN-1730] Include flow execution id when try to cancel/submit job using SimpleKafkaSpecProducer (apache#3588)
  [GOBBLIN-1734] make DestinationDatasetHandler work on streaming sources (apache#3592)
  give option to cancel helix workflow through Delete API (apache#3580)
  [GOBBLIN-1728] Fix YarnService incorrect container allocation behavior (apache#3586)
  Support multiple node types in shared flowgraph, fix logs (apache#3590)
  Search for dummy file in writer directory (apache#3589)
  Use root cause for checking if exception is transient (apache#3585)
  [GOBBLIN-1724] Support a shared flowgraph layout in GaaS (apache#3583)
  [GOBBLIN-1731] Enable HiveMetadataWriter to override table schema lit… (apache#3587)
  [GOBBLIN-1726] Avro 1.9 upgrade of Gobblin OSS (apache#3581)
  [GOBBLIN-1725] Fix bugs in gaas warm standby mode (apache#3582)
  [GOBBLIN-1718] Define DagActionStoreMonitor to listen for kill/resume… (apache#3572)
  Add log line for committing/retrieving watermarks in streaming (apache#3578)
  [GOBBLIN-1707] Enhance `IcebergDataset` to detect when files already at dest then proceed with only delta (apache#3575)
  Ignore AlreadyExistsException in hive writer (apache#3579)
  Fail GMIP container for known transient exceptions to avoid data loss (apache#3576)
  GOBBLIN-1715: Support vectorized row batch pooling (apache#3574)
  [GOBBLIN-1696] Implement file based flowgraph that detects changes to the underlying… (apache#3548)
  GOBBLIN-1719 Replace moveToTrash with moveToAppropriateTrash for hadoop trash (apache#3573)
  [GOBBLIN-1703] avoid double quota increase for adhoc flows (apache#3550)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants