Support writing to Pubsub with ordering key; Add PubsubMessage SchemaCoder #31608
Confirmed that the ordering key is preserved with both the direct runner and the Dataflow runner.
Thanks @ahmedabu98! At first glance, this approach seems massively preferable to the set of bespoke coders that already exist, and those future ones that might need to exist later. I’d be happy to take a closer look next week!
Assigning reviewers. If you would like to opt out of this review, comment R: @damondouglas for label java. Available commands:
The PR bot will only process comments in the main thread (not review comments). |
    if (getNeedsOrderingKey()) {
      pubsubMessages.setCoder(PubsubMessageSchemaCoder.getSchemaCoder());
    } else {
      pubsubMessages.setCoder(new PubsubMessageWithTopicCoder());
    }
This fork is required to not break update compatibility
@iht and I have been looking into this today for a Dataflow customer and we came across a few details that seem to be missing on this PR:
The issue we're working on is time-sensitive so we're trying to wrap up our patches today.
A nice-to-have would be enabling users to customize the output sharding range based on ordering keys. Given that throughput per ordering key is capped at 1 MBps (docs), I'd almost be inclined to say the ordering key should replace the output shard entirely. @ahmedabu98 I'm happy to share our changes in a bit and I'll set up a PR against the source branch of this PR.
.../java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.java
@sjvanrossum thank you for these insights, I'd be happy to take a look at your PR. I'm not familiar with the internal implementation and how it relates to this one, but it looks like we'd need changes there too.
@scwhittle or @reuvenlax may be able to shed light on Dataflow's implementation and the complexity of changes needed to accommodate this feature. Context:
The DataflowRunner overrides the pubsub write transform using org.apache.beam.runners.dataflow.DataflowRunner.StreamingPubsubIOWrite so org.apache.beam.runners.dataflow.worker.PubsubSink is used. It would be nice to prevent using the ordering key for now with the DataflowRunner unless the experiment to use the beam implementation is present. To add support for it to Dataflow, it appears that if PUBSUB_SERIALIZED_ATTRIBUTES_FN is set, that maps bytes to PubsubMessage which already includes the ordering key. But for the ordering key to be respected for publishing, additional changes would be needed in the dataflow service backend. Currently it looks like it would just be dropped but if it was respected the service would also need to be updated to ensure batching doesn't occur across ordering keys.
Are you considering producing to a single ordering key from multiple distinct grouped-by keys in parallel? Doesn't that defeat the purpose of the ordering provided? I'm also not sure it would increase the throughput beyond the 1 MBps per ordering key limit. An alternative would be grouping by a partitioning of the ordering keys (via deterministic hash buckets, for example) and then batching just within a bundle.
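To make the hash-bucket suggestion concrete, here is a minimal sketch of mapping ordering keys to a fixed number of buckets. The class name `OrderingKeyBuckets`, the method `bucketFor`, and the choice of CRC32 as the hash are assumptions for illustration only; none of them come from Beam or from this PR.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Hypothetical helper for illustration; not part of Beam or of this PR. */
final class OrderingKeyBuckets {
  private OrderingKeyBuckets() {}

  /**
   * Maps an ordering key to a stable bucket in [0, numBuckets).
   * The same key always lands in the same bucket, so all messages for one
   * ordering key would be grouped (and could be batched) together.
   */
  static int bucketFor(String orderingKey, int numBuckets) {
    if (numBuckets <= 0) {
      throw new IllegalArgumentException("numBuckets must be positive");
    }
    // CRC32 is an arbitrary choice here; any hash that is stable across
    // workers and JVM restarts would serve the same purpose.
    CRC32 crc = new CRC32();
    crc.update(orderingKey.getBytes(StandardCharsets.UTF_8));
    // getValue() is a non-negative long below 2^32, so the remainder fits in an int.
    return (int) (crc.getValue() % numBuckets);
  }
}
```

Under this scheme the bucket index, rather than a random shard number, would serve as the grouping key, so a single ordering key is never produced to from two groups in parallel.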
Agreed, I'll throw an exception when the
Agreed, I'll create a new bug for this to continue this discussion internally.
The initial patch I wrote concatenated topic and ordering key and left output shards unchanged.
Reminder, please take a look at this pr: @damondouglas @shunping |
Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment R: @robertwb for label java. Available commands:
Highlighting this here as well, while trying to retrofit ordering keys onto the existing sinks I thought of rewriting the sink using While writing that sink I stumbled on some issues regarding message size validation as documented in #31800.
My thoughts on fixing the validation issue is to introduce a Coincidentally, the revised batching mechanism I had imagined turns out to be very close to the implementation found in Google Cloud Pub/Sub Client for Java (https://github.com/googleapis/java-pubsub/blob/main/google-cloud-pubsub/src/main/java/com/google/cloud/pubsub/v1/Publisher.java) and would live in @ahmedabu98 the fixes to the batching mechanism should address the comments you had raised on ahmedabu98#427 about my use of variable assignments in the condition of an if statement so I'll get those commits added to that PR. |
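As a rough illustration of batching that never mixes ordering keys, the sketch below keeps one open batch per ordering key and closes it when a message-count or byte-size threshold would be exceeded. It is not the code from ahmedabu98#427 or from the Pub/Sub client's Publisher.java; the class `OrderingKeyBatcher`, its methods, and the thresholds are hypothetical, and it counts payload bytes only, which is a simplification of real message size accounting.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not the implementation discussed in this PR. */
final class OrderingKeyBatcher {
  private final int maxMessages;
  private final long maxBytes;
  // One open batch and one running byte count per ordering key.
  private final Map<String, List<byte[]>> open = new HashMap<>();
  private final Map<String, Long> openBytes = new HashMap<>();
  // Closed batches, in the order they were closed. A real sink would publish these.
  private final List<List<byte[]>> flushed = new ArrayList<>();

  OrderingKeyBatcher(int maxMessages, long maxBytes) {
    this.maxMessages = maxMessages;
    this.maxBytes = maxBytes;
  }

  void add(String orderingKey, byte[] payload) {
    long size = payload.length; // assumption: payload bytes only, no attributes
    if (size > maxBytes) {
      throw new IllegalArgumentException("payload exceeds maxBytes");
    }
    List<byte[]> batch = open.computeIfAbsent(orderingKey, k -> new ArrayList<>());
    long bytes = openBytes.getOrDefault(orderingKey, 0L);
    // Close the current batch first if this message would overflow it.
    if (!batch.isEmpty() && (batch.size() >= maxMessages || bytes + size > maxBytes)) {
      flush(orderingKey);
      batch = open.computeIfAbsent(orderingKey, k -> new ArrayList<>());
      bytes = 0L;
    }
    batch.add(payload);
    openBytes.put(orderingKey, bytes + size);
  }

  void flush(String orderingKey) {
    List<byte[]> batch = open.remove(orderingKey);
    openBytes.remove(orderingKey);
    if (batch != null && !batch.isEmpty()) {
      flushed.add(batch);
    }
  }

  void flushAll() {
    // Copy the key set because flush() removes entries from the map.
    for (String key : new ArrayList<>(open.keySet())) {
      flush(key);
    }
  }

  List<List<byte[]>> flushedBatches() {
    return flushed;
  }
}
```

Because each batch belongs to exactly one ordering key, a publish request built from a batch cannot interleave messages from different keys.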
I saw @scwhittle @egalpin already entered some ideas. Do you plan to finish the review in the near future? If not available I can do a first pass. I see this new feature is guarded by a flag so won't affect existing uses if the flag is not set. So the current change looks fairly safe to get in. |
I'll have the batching fix added to ahmedabu98#427 before US business hours start tomorrow and I'll defer the rest to separate PRs. 👍 |
Reminder, please take a look at this pr: @damondouglas @damondouglas |
I think using the pubsub provided client would be good to do and was wondering why we weren't using it. I'm guessing it might not have been available when the original Beam implementation was done. Perhaps this could go in with some minimal backoff sleeping to retry such errors and that can be done separately? |
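For reference, "minimal backoff sleeping" could look roughly like the sketch below. The helper `BackoffRetry.callWithBackoff` and its parameters are hypothetical and not taken from Beam or this PR. It retries on every exception for brevity; a real sink would presumably retry only errors it knows to be transient.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper for illustration; not part of Beam or of this PR. */
final class BackoffRetry {
  private BackoffRetry() {}

  /**
   * Runs the call, sleeping between failed attempts and doubling the delay
   * each time. Rethrows the last failure once maxAttempts is reached.
   */
  static <T> T callWithBackoff(Callable<T> call, int maxAttempts, long initialDelayMillis)
      throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    long delay = initialDelayMillis;
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return call.call();
      } catch (Exception e) {
        // Simplification: every exception is treated as retryable.
        last = e;
        if (attempt == maxAttempts) {
          break;
        }
        Thread.sleep(delay);
        delay *= 2;
      }
    }
    throw last;
  }
}
```

A publish call would be wrapped as `BackoffRetry.callWithBackoff(() -> publish(batch), 5, 100L)`, where `publish` stands in for whatever the sink uses to send a batch.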
Fixes #21162
I wasn't able to use the existing PubsubMessageWithAttributesAndMessageIdCoder because it doesn't encode/decode the message's topic, which is needed for dynamic destinations. There are already a number of existing coders (6) developed over the years. Every time a new feature/parameter is added to PubsubMessage, we need to make a new coder and fork the code to maintain update compatibility.
To mitigate this for the future, this PR introduces a SchemaCoder for PubsubMessage. SchemaCoder allows us to evolve the schema over time, so hopefully new features can be added in the future without breaking update compatibility.
Note that PubsubMessage's default coder is PubsubMessageWithAttributesCoder, which can't be updated without breaking backwards compatibility (see #23525). Wherever PubsubMessages are created in a pipeline, we would have to manually override the coder to PubsubMessageSchemaCoder.getSchemaCoder() or the ordering key will get lost.