Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add argument checks and tests for BQ StorageAPI sinks. #27213

Closed
wants to merge 5 commits into from
Closed

Add argument checks and tests for BQ StorageAPI sinks. #27213

wants to merge 5 commits into from

Conversation

ghost
Copy link

@ghost ghost commented Jun 21, 2023

Some argument checks are added to ensure the options specified by users are valid.

  • triggeringFrequency is only relevant when using STORAGE_WRITE_API (or FILE_LOADS) on an unbounded data source.
  • numFileShards is only relevant when using FILE_LOADS on an unbounded data source. It will be ignored if autoSharding is enabled.
  • numStorageWriteApiStreams is only relevant when using STORAGE_WRITE_API on an unbounded data source. It will be ignored if autoSharding is enabled.
  • autoSharding is only relevant on unbounded data source.

In addition, integration tests on a small-scale unbounded data source have been added. In these tests, we do fuzz testing on the schema order by shuffling fields in the BigQuery write schema. We also trigger an update to the table schema while data streaming is ongoing, to simulate a potential data-loss problem due to BigQuery schema change.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI.

@ghost ghost marked this pull request as draft June 21, 2023 20:00
@ghost ghost marked this pull request as ready for review June 22, 2023 12:46
@github-actions
Copy link
Contributor

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @lostluck added as fallback since no labels match configuration

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

shunping added 2 commits June 22, 2023 11:01
Internally, we will decide whether to call withSchema() with a schema
of shuffled fields based on this option.
* Fix a few typos on the method name STORAGE_WRITE_API
* Change the warning message when both numStorageWriteApiStreams and autoSharding are set. In this case, autoSharding takes priority.
* Add an argument check for using both numFileShards and autoSharding via FILE_LOADS.
@ghost
Copy link
Author

ghost commented Jun 22, 2023

Run Java_GCP_IO_Direct PreCommit

@johnjcasey johnjcasey requested a review from reuvenlax June 23, 2023 14:33
@github-actions
Copy link
Contributor

github-actions bot commented Jul 1, 2023

Reminder, please take a look at this pr: @lostluck

@github-actions
Copy link
Contributor

github-actions bot commented Jul 5, 2023

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @lostluck added as fallback since no labels match configuration

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

Copy link
Contributor

@ahmedabu98 ahmedabu98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks shunping! i left some comments
one general nit: Some of the ...but the collection was {} and the method was {} warnings are a bit verbose and give too much information, which may be confusing when debugging. I think some could be more concise and include only the relevant mismatch.

.execute();
LOG.info("Successfully updated the schema of table: " + tableId);
} catch (Exception e) {
LOG.debug("Exceptions caught when updating table schema: " + e.getMessage());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Severity should maybe be raised to info/warning for visibility?

Comment on lines +347 to +351
Collections.shuffle(fieldNamesShuffled, new Random(RANDOM_SEED));

// The updated schema includes all fields in the original schema plus a random new field
List<String> fieldNamesWithExtra = new ArrayList<String>(fieldNamesOrigin);
Random r = new Random(RANDOM_SEED);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RANDOM_SEED is constant, wouldn't this result in the same "randomly" generated numbers for each run?

Comment on lines +3218 to +3225
} else if (method != Method.STREAMING_INSERTS) {
LOG.warn(
"The setting of auto-sharding is ignored. It is only supported when writing an"
+ " unbounded PCollection via FILE_LOADS, STREAMING_INSERTS or"
+ " STORAGE_WRITE_API, but the collection was {} and the method was {}.",
input.isBounded(),
method);
}
Copy link
Contributor

@ahmedabu98 ahmedabu98 Jul 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think at this point it makes more sense to check method == STORAGE_API_AT_LEAST_ONCE? and have a warning message specific for that method. also needs a check that autosharding is enabled here

Comment on lines +3156 to +3158
"The setting of storageApiTriggeringFrequency in BigQueryOptions is ignored."
+ " It is only supported when writing an unbounded PCollection via"
+ " STORAGE_WRITE_API, but the collection was {} and the method was {}.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Could be more concise; using STORAGE_WRITE_API isn't the problem here

Copy link
Contributor

@ahmedabu98 ahmedabu98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left another comment. Also on another look, I think we need a few more tests to:

  • verify that unknown values are ignored
  • verify that beam responds well when schema field types are also updated (e.g. field going from "required" to "nullable")

try {
rowCount = this.getRowCount(this.projectId + "." + this.datasetId + "." + this.tableId);
} catch (Exception e) {
LOG.error(e.toString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P.S. maybe we should rethrow the error here instead of just logging it? if getRowCount starts failing for whatever reason, rowCount will stay at 0 and the schema will never be updated, making the tests obsolete (since the purpose is to test the schema update feature).

@lostluck
Copy link
Contributor

It looks like this is already getting a review, so I'm going to bow out (I've been on vacation for most of June, hence no actions.)

But if you need a 2nd look or a committer for merging, please feel free to ping me again.

@lostluck
Copy link
Contributor

R: @ahmedabu98

@github-actions
Copy link
Contributor

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

@github-actions
Copy link
Contributor

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Sep 10, 2023
@github-actions
Copy link
Contributor

This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Sep 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants