Add integration tests for streaming Storage Write API (includes schema update feature) #27740
Conversation
Internally, we will decide whether to call withSchema() with a schema of shuffled fields based on this option.
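Roughly, that decision could look like the following plain-Java sketch. This is an illustration only, not the actual test code: `maybeShuffle` and the field list are hypothetical, and the real tests build a BigQuery `TableSchema` rather than a list of names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class ShuffledSchemaSketch {
    // Hypothetical helper: when the shuffle option is set, return the schema's
    // field names in a shuffled order; otherwise return them unchanged.
    // A seeded Random keeps the example deterministic.
    static List<String> maybeShuffle(List<String> fields, boolean shuffleSchema) {
        if (!shuffleSchema) {
            return fields;
        }
        List<String> shuffled = new ArrayList<>(fields);
        Collections.shuffle(shuffled, new Random(42));
        return shuffled;
    }

    public static void main(String[] args) {
        List<String> fields = List.of("id", "name", "timestamp");
        System.out.println(maybeShuffle(fields, false)); // original order
        System.out.println(maybeShuffle(fields, true));  // same fields, shuffled order
    }
}
```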
* Fix a few typos in the method name STORAGE_WRITE_API.
* Change the warning message when both numStorageWriteApiStreams and autoSharding are set; in this case, autoSharding takes priority.
* Add an argument check for using both numFileShards and autoSharding via FILE_LOADS.
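A minimal sketch of the checks described above. The method names and messages here are illustrative, not the actual BigQueryIO code; the point is that the streaming case only warns (so existing pipelines keep working) while the FILE_LOADS case rejects the configuration outright.

```java
import java.util.logging.Logger;

public class ShardingOptionCheckSketch {
    private static final Logger LOG =
            Logger.getLogger(ShardingOptionCheckSketch.class.getName());

    // Streaming Storage Write API case: when both options are set,
    // log a warning and let autoSharding take priority.
    static boolean resolveStorageWriteSharding(int numStorageWriteApiStreams,
                                               boolean autoSharding) {
        if (numStorageWriteApiStreams > 0 && autoSharding) {
            LOG.warning("Both numStorageWriteApiStreams and autoSharding are set; "
                    + "autoSharding takes priority and numStorageWriteApiStreams is ignored.");
        }
        return autoSharding;
    }

    // FILE_LOADS case: using both options together is an argument error.
    static void checkFileLoadsSharding(int numFileShards, boolean autoSharding) {
        if (numFileShards > 0 && autoSharding) {
            throw new IllegalArgumentException(
                    "numFileShards cannot be used together with autoSharding for FILE_LOADS.");
        }
    }

    public static void main(String[] args) {
        System.out.println(resolveStorageWriteSharding(4, true)); // true, after a warning
        checkFileLoadsSharding(4, false); // valid: only one option set
    }
}
```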
Assigning reviewers. R: @Abacn for label java.
Run Java_GCP_IO_Direct PreCommit
LGTM.
R: @reuvenlax
It appears the added tests do not work on Dataflow runner v1: https://ci-beam.apache.org/view/PostCommit/job/beam_PostCommit_Java_DataflowV1/lastCompletedBuild/testReport/ Should all these tests run on Dataflow anyway?
Oh - it looks like these tests require Streaming Engine.
This is a continuation of @shunping-google's work in #27213
These are integration tests that write to real BigQuery tables. The test pipeline writes records with a deliberately short pause between records so that the Storage API stream has a chance to recognize the schema update. This PR also adds warnings when invalid configurations are used (warnings instead of thrown exceptions, so as not to break existing workflows; however, if we ever do a refactor of this IO, we should turn these warnings into exceptions).
I've opted not to write tests for batch writes that use the auto schema update feature, because that use case doesn't make much sense. These tests include both STORAGE_WRITE_API and STORAGE_API_AT_LEAST_ONCE, which use StorageApiWritesShardedRecords and StorageApiWriteUnshardedRecords, respectively. Batch writes also use the StorageApiWriteUnshardedRecords transform, which manages stream appends and schema updates. This is to say that even though we don't have explicit batch-mode tests, that code path is broadly covered by these tests.

UPDATE:
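The method-to-transform mapping described above can be summarized in a small sketch. The enum and helper are hypothetical; only the transform names themselves are the Beam classes mentioned in this thread.

```java
public class StorageApiTransformSketch {
    enum Method { STORAGE_WRITE_API, STORAGE_API_AT_LEAST_ONCE }

    // Illustrative mapping from write method and pipeline mode to the
    // transform that manages stream appends (not actual BigQueryIO code).
    static String transformFor(Method method, boolean streaming) {
        if (method == Method.STORAGE_API_AT_LEAST_ONCE) {
            return "StorageApiWriteUnshardedRecords";
        }
        // STORAGE_WRITE_API: sharded in streaming mode, unsharded in batch,
        // so batch writes still exercise the unsharded code path.
        return streaming ? "StorageApiWritesShardedRecords"
                         : "StorageApiWriteUnshardedRecords";
    }

    public static void main(String[] args) {
        System.out.println(transformFor(Method.STORAGE_WRITE_API, true));
        // StorageApiWritesShardedRecords
        System.out.println(transformFor(Method.STORAGE_WRITE_API, false));
        // StorageApiWriteUnshardedRecords
    }
}
```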
Had to replace TestStream with PeriodicImpulse in a follow-up: #27998. This is to allow the tests to run on TestDataflowRunner.
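Conceptually, PeriodicImpulse emits one element per fixed interval of event time, which is what gives the stream a chance to pick up the schema update between records. The plain-Java sketch below only illustrates the emission schedule such a source produces; the real transform is Beam's PeriodicImpulse, and `schedule` is a hypothetical helper, not Beam code.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public class PeriodicEmitSketch {
    // Compute the timestamps at which records would be emitted: one every
    // `interval`, starting at `start`, for `count` records.
    static List<Instant> schedule(Instant start, Duration interval, int count) {
        List<Instant> out = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            out.add(start.plus(interval.multipliedBy(i)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Instant> ts = schedule(Instant.EPOCH, Duration.ofSeconds(1), 3);
        // [1970-01-01T00:00:00Z, 1970-01-01T00:00:01Z, 1970-01-01T00:00:02Z]
        System.out.println(ts);
    }
}
```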