Pass original message down through conversion for storage write api #31106
Conversation
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment.
Assigning reviewers. If you would like to opt out of this review, comment R: @robertwb for label java. Available commands:
The PR bot will only process comments in the main thread (not review comments).
@Abacn @ahmedabu98 could you take a look at this?
Thanks, the change lgtm. Have one thing to confirm (cc'd below)
In sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StorageApiLoads.java:

```diff
@@ -52,16 +52,18 @@
 /** This {@link PTransform} manages loads into BigQuery using the Storage API. */
 public class StorageApiLoads<DestinationT, ElementT>
     extends PTransform<PCollection<KV<DestinationT, ElementT>>, WriteResult> {
-  final TupleTag<KV<DestinationT, StorageApiWritePayload>> successfulConvertedRowsTag =
-      new TupleTag<>("successfulRows");
+  final TupleTag<KV<DestinationT, KV<ElementT, StorageApiWritePayload>>>
```
Here the PTransform output element type is changed. Do we need some change in BigQueryTranslation to preserve upgrade compatibility? cc: @chamikaramj
Or is there a plan to set up a precommit test for BigQuery pipeline upgrades, so tests can auto-detect this (like the Kafka upgrade project)?
In theory, it's within the overall BQ transform, so it might work?
I think changing the output element type and the coder here could break streaming update compatibility in general. Have you considered using the updateCompatibilityVersion option?
beam/sdks/java/core/src/main/java/org/apache/beam/sdk/options/StreamingOptions.java, line 45 in c531f89: `String getUpdateCompatibilityVersion();`
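For context, a minimal sketch of how a pipeline author could set that option (the version string is illustrative, and a transform only changes behavior if it actually honors the flag):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;

public class UpdateCompatibleLaunch {
  public static void main(String[] args) {
    StreamingOptions options =
        PipelineOptionsFactory.fromArgs(args).as(StreamingOptions.class);
    // Pin behavior to the named Beam version so a streaming update against a
    // pipeline launched on that version can stay compatible ("2.55.0" is an
    // illustrative value, not a recommendation).
    options.setUpdateCompatibilityVersion("2.55.0");
    Pipeline pipeline = Pipeline.create(options);
    // ... build the pipeline here ...
    pipeline.run();
  }
}
```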
That would require us to maintain two implementations of Streaming Inserts, one with this change and one without, right? I think that would be prohibitive in general for Beam IOs.
This should be called out in CHANGES.md if we have to make these breaking changes. But I recommend updateCompatibilityVersion here.
Aside from update compatibility issues, doesn't this increase the data shuffled, since we are now shuffling both the write payload and the original elements? If so, it seems that we might want the previous behavior not just for older SDKs but also in cases where an error function requiring the original element is not configured.
Do we need to change the output type for successful writes? It seems like the original element is just being added for error handling path.
If I understand correctly the graph is not changing here, just the encoding of the elements and we're going from StorageApiWritePayload to KV<E, StorageApiWritePayload>. Would it be possible to have a special coder for KV<E, StorageApiWritePayload> such that it handles decoding previously coded StorageApiWritePayload as KV<null, StorageApiWritePayload>?
It seems like that could be done via a Schema in some way, since StorageApiWritePayload uses an AutoValue schema coder. I think the Dataflow backend would note that the schema is compatible in that case and allow the update to proceed.
Or a simpler route, could the element just be added as a nullable field to StorageApiWritePayload instead of changing to KV<E, StorageApiWritePayload>?
To share code, could we just switch to the new type throughout and have the element be null if not needed, or missing because it was encoded before this change? Since the new type is a superset of the old, it seems like compatibility with previous SDKs could be kept to the boundaries of the impl (if the above doesn't let you share completely).
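A rough sketch of the nullable-field route described above (all names here are hypothetical stand-ins, since StorageApiWritePayload's real fields aren't shown in this thread):

```java
import com.google.auto.value.AutoValue;
import javax.annotation.Nullable;
import org.apache.beam.sdk.schemas.AutoValueSchema;
import org.apache.beam.sdk.schemas.annotations.DefaultSchema;

// Hypothetical payload: because it is encoded with an AutoValue schema coder,
// appending a nullable field is a schema-compatible change, so bytes written
// before the change decode with the new field absent (null).
@DefaultSchema(AutoValueSchema.class)
@AutoValue
abstract class PayloadSketch {
  abstract byte[] getPayload(); // stand-in for the existing payload bytes

  @Nullable
  abstract byte[] getOriginalElementBytes(); // new: serialized original element, null for old rows

  static PayloadSketch of(byte[] payload, @Nullable byte[] originalElementBytes) {
    return new AutoValue_PayloadSketch(payload, originalElementBytes);
  }
}
```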
The idea of a coder that handles both decodings is very interesting to me. We could use the compat flag to indicate using the old coder instead of the new coder, which would preserve backwards compatibility much better. I'll try that, as I think that's a good way to do this.
If they use the new feature, we use one coder. If they don't, we use the old coder
Actually, I think this wouldn't even require the update compat flag.
As a side comment, this is another motivation to use schema coders more ubiquitously--adding another field is update compatible.
On another note, anything that involves shuffling more data in the main data path should be looked at carefully from a perf standpoint. We've gone to a lot of effort (e.g. with dynamic destinations) to ensure shuffling metadata doesn't become a perf impediment.
Also going to run some load tests to see if it has performance implications. Update: identical to 2.55.0 (51205) and 2.56.0 (47579).
LGTM as well. Just one suggestion for performance.
P.S. I see @Abacn's load test results, feel free to ignore
If I understand this correctly, we are now propagating both ElementT and StorageApiWritePayload, correct? Doesn't this double the amount of data being processed?
Assigning new set of reviewers because the PR has gone too long without review. If you would like to opt out of this review, comment R: @damondouglas for label java. Available commands:
Looks like this was approved but has conflicts that need to be resolved.
There was an unresolved discussion about maintaining update compatibility without duplicating a lot of code: #31106 (comment)
I'm also wanting to know whether there was something motivating this change, i.e. is there a Beam user that currently needs this? In addition to being careful about perf, this PR adds quite a bit of complexity to code that is already fairly complex.
Reminder, please take a look at this PR: @damondouglas @chamikaramj
Assigning new set of reviewers because the PR has gone too long without review. If you would like to opt out of this review, comment R: @kennknowles for label java. Available commands:
Reminder, please take a look at this PR: @kennknowles @ahmedabu98
Assigning new set of reviewers because the PR has gone too long without review. If you would like to opt out of this review, comment R: @damondouglas for label java. Available commands:
Reminder, please take a look at this PR: @damondouglas @Abacn
I see a few unreplied questions/comments in this PR.
Assigning new set of reviewers because the PR has gone too long without review. If you would like to opt out of this review, comment R: @kennknowles for label java. Available commands:
Enable users to specify an alternate way to generate the table row for the error output of BQIO's Storage Write API.
The user passes in a function of ElementT -> TableRow, and we maintain an index of the original elements passed in to BQIO. If the function is provided, we use it to generate the error row, instead of the default behavior of emitting the failure directly.
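For illustration, usage might look roughly like this (the withFormatRecordOnFailureFunction setter name and the MyEvent type are assumptions in this sketch, not confirmed API):

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.values.PCollection;

// MyEvent and the failure-function setter name are hypothetical stand-ins.
PCollection<MyEvent> events = /* upstream source */;
events.apply(
    BigQueryIO.<MyEvent>write()
        .to("my-project:my_dataset.my_table")
        .withFormatFunction(e -> new TableRow().set("id", e.getId()))
        // ElementT -> TableRow function used only for rows that fail to
        // write, so the error output carries a row built from the original
        // element rather than from the converted payload.
        .withFormatRecordOnFailureFunction(
            e -> new TableRow().set("id", e.getId()).set("raw", e.toString()))
        .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API));
```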
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
- Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
- Update CHANGES.md with noteworthy changes.

See the Contributor Guide for more tips on how to make the review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.