
Pass original message down through conversion for storage write api #31106

Conversation

@johnjcasey (Contributor) commented Apr 25, 2024

Enable users to specify an alternate way to generate the table row for the error output of BQIO's Storage Write API.

The user passes in a function of ElementT -> TableRow, and we maintain an index of the original elements passed into BQIO. If the function is provided, we use it to generate the error row instead of the default behavior of emitting the failure directly.
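The ElementT -> TableRow hook described above can be sketched in plain Java. This is a hypothetical, simplified model, not Beam's actual API: TableRow here is a stand-in for com.google.api.services.bigquery.model.TableRow, and Purchase, ERROR_FORMAT, and toErrorRow are illustrative names.

```java
import java.util.HashMap;
import java.util.function.Function;

// Hypothetical sketch of the proposed error-row hook; all names here are
// illustrative stand-ins, not Beam's real classes or methods.
public class ErrorRowSketch {
  // Minimal TableRow stand-in: a map with a chainable setter.
  public static class TableRow extends HashMap<String, Object> {
    public TableRow set(String key, Object value) {
      put(key, value);
      return this;
    }
  }

  // An example element type written through BQIO.
  public record Purchase(String id, long cents) {}

  // The user-supplied ElementT -> TableRow function, applied only on failure.
  public static final Function<Purchase, TableRow> ERROR_FORMAT =
      p -> new TableRow().set("id", p.id()).set("cents", p.cents());

  // With a function configured, the error row is built from the original
  // element; with none, the failed payload would be emitted directly.
  public static TableRow toErrorRow(
      Purchase original, Function<Purchase, TableRow> errorFormat) {
    return errorFormat.apply(original);
  }

  public static void main(String[] args) {
    TableRow row = toErrorRow(new Purchase("p-1", 499L), ERROR_FORMAT);
    System.out.println(row);
  }
}
```

The point of the sketch is only the shape of the contract: the original element, not the converted payload, is what the user's function receives.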




Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers


Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @robertwb for label java.
R: @Abacn for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@johnjcasey johnjcasey marked this pull request as draft April 29, 2024 15:50
@johnjcasey johnjcasey marked this pull request as ready for review May 1, 2024 18:49
@johnjcasey (Author):

@Abacn @ahmedabu98 could you take a look at this?

@Abacn (Contributor) left a comment


Thanks, the change lgtm. Have one thing to confirm (cc'd below)

@@ -52,16 +52,18 @@
/** This {@link PTransform} manages loads into BigQuery using the Storage API. */
public class StorageApiLoads<DestinationT, ElementT>
extends PTransform<PCollection<KV<DestinationT, ElementT>>, WriteResult> {
final TupleTag<KV<DestinationT, StorageApiWritePayload>> successfulConvertedRowsTag =
new TupleTag<>("successfulRows");
final TupleTag<KV<DestinationT, KV<ElementT, StorageApiWritePayload>>>

@Abacn commented on the diff:

This changes the PTransform's output element type. Do we need a change in BigQueryTranslation to preserve upgrade compatibility? cc: @chamikaramj

Or is there a plan to set up a precommit test for BigQuery pipeline upgrade, so tests can auto-detect this (like the Kafka upgrade project)?

@johnjcasey (Author) replied:

In theory, it's within the overall BQ transform, so it might work?

A contributor replied:

I think changing output element type and the coder here could break streaming update compatibility in general.

Have you considered using the updateCompatibilityVersion option?

@johnjcasey (Author) replied:

That would require us to maintain two implementations of Streaming Inserts, one with this change and one without, right? I think that would be prohibitive in general for Beam IOs.

A collaborator replied:

This should be called out in CHANGES.md if we have to make these breaking changes. But I recommend updateCompatibilityVersion here.

A contributor replied:

Aside from update compatibility issues, doesn't this increase the data shuffled, since we are now shuffling both the write payload and the original elements? If so, it seems we might want the previous behavior not just for older SDKs but also in cases where an error function requiring the original element is not configured.

Do we need to change the output type for successful writes? It seems like the original element is just being added for error handling path.

@reuvenlax

If I understand correctly the graph is not changing here, just the encoding of the elements and we're going from StorageApiWritePayload to KV<E, StorageApiWritePayload>. Would it be possible to have a special coder for KV<E, StorageApiWritePayload> such that it handles decoding previously coded StorageApiWritePayload as KV<null, StorageApiWritePayload>?
It seems like that could be done via a Schema in some way since StorageApiWritePayload uses autovalue schema coder. I think the dataflow backend would note that the schema is compatible in that case and allow the update to proceed.

Or a simpler route, could the element just be added as a nullable field to StorageApiWritePayload instead of changing to KV<E, StorageApiWritePayload>?

To share code, could we just switch to the new type throughout and have the element be null when not needed, or missing because it was previously encoded? Since the new type is a superset of the old, it seems the compatibility shims for previous SDKs could be kept to the boundaries of the implementation (if the above doesn't let you share completely).
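The dual-decoding idea above can be illustrated outside Beam with a toy byte format. This is only a sketch under an assumed invariant: that the old payload-only encoding never begins with the magic byte the new format writes, which a real Beam coder would have to guarantee some other way (e.g. via the schema-compatibility route mentioned above). CompatCoderSketch and its methods are hypothetical, not Beam coder APIs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Toy coder whose decode() accepts both the new (element, payload) format
// and raw old-format payload bytes, yielding (null, payload) for the latter.
public class CompatCoderSketch {
  // Assumed never to be the first byte of an old-format payload.
  static final byte NEW_FORMAT = (byte) 0xFE;

  static byte[] encodeNew(String element, byte[] payload) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bos);
    out.writeByte(NEW_FORMAT);
    out.writeBoolean(element != null);
    if (element != null) out.writeUTF(element);
    out.writeInt(payload.length);
    out.write(payload);
    return bos.toByteArray();
  }

  // Returns {element, payload}; element is null for old-format bytes.
  static Object[] decode(byte[] bytes) throws IOException {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
    if (in.read() != (NEW_FORMAT & 0xFF)) {
      // Old format: the whole buffer is the payload, element absent.
      return new Object[] {null, bytes};
    }
    String element = in.readBoolean() ? in.readUTF() : null;
    byte[] payload = new byte[in.readInt()];
    in.readFully(payload);
    return new Object[] {element, payload};
  }

  public static void main(String[] args) throws IOException {
    byte[] oldBytes = new byte[] {1, 2, 3};          // bytes from the "old" coder
    byte[] newBytes = encodeNew("row-7", oldBytes);  // same payload, new format
    System.out.println(decode(oldBytes)[0] + " / " + decode(newBytes)[0]);
  }
}
```

The approaches discussed in the thread (updateCompatibilityVersion, schema coders, a nullable field on StorageApiWritePayload) avoid needing a magic-byte invariant like this one.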

@johnjcasey (Author) replied:

The idea of a coder that handles both decodings is very interesting to me. We could use the compat flag to indicate using the old coder instead of the new one, which would preserve backwards compatibility much better. I'll try that, as I think that's a good way to do this.

@johnjcasey (Author) added:

If they use the new feature, we use one coder. If they don't, we use the old coder

@johnjcasey (Author) added:

This wouldn't even require the update compat flag, I think.

A contributor replied:

As a side comment, this is another motivation to use schema coders more ubiquitously: adding another field is update-compatible.

On another note, anything that involves shuffling more data in the main data path should be looked at carefully from a perf standpoint. We've gone to a lot of effort (e.g. with dynamic destinations) to ensure shuffling metadata doesn't become a perf impediment.

@Abacn (Contributor) commented May 2, 2024

Also going to run some load tests to see if this has performance implications.

Update:

"AvgInputThroughputElementsPerSec": 51674.9203125

in line with 2.55.0 (51205) and 2.56.0 (47579)

@ahmedabu98 (Contributor) left a comment


LGTM as well. Just one suggestion for performance.

P.S. I see @Abacn's load test results, feel free to ignore

@reuvenlax (Contributor):

If I understand this correctly, we are now propagating both ElementT and StorageApiWritePayload - correct? Doesn't this double the amount of data being processed?


github-actions bot commented Jun 1, 2024

Reminder, please take a look at this PR: @robertwb @Abacn


github-actions bot commented Jun 5, 2024

Assigning a new set of reviewers because this PR has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @damondouglas for label java.
R: @chamikaramj for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

@robertwb (Contributor):

Looks like this was approved but has conflicts that need to be resolved.

@chamikaramj (Contributor):

There was an unresolved discussion about maintaining update compatibility without duplicating a lot of code: #31106 (comment)

@reuvenlax (Contributor) commented Jun 10, 2024 via email


Reminder, please take a look at this PR: @damondouglas @chamikaramj


Assigning a new set of reviewers because this PR has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @kennknowles for label java.
R: @ahmedabu98 for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)


Reminder, please take a look at this PR: @kennknowles @ahmedabu98


github-actions bot commented Jul 3, 2024

Assigning a new set of reviewers because this PR has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @damondouglas for label java.
R: @Abacn for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)


Reminder, please take a look at this PR: @damondouglas @Abacn

@damondouglas (Contributor) left a comment


I see a few unreplied questions/comments in this PR.


Assigning a new set of reviewers because this PR has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @kennknowles for label java.
R: @ahmedabu98 for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)
