
[Tracking]: Performance issue of PubSubIO on non-Dataflow runner #31510

Open · 2 of 16 tasks

Abacn opened this issue Jun 5, 2024 · 1 comment
Comments

@Abacn (Contributor) commented Jun 5, 2024

What happened?

This issue tracks a performance problem in Beam's PubSubIO.

Per https://cloud.google.com/dataflow/docs/concepts/streaming-with-cloud-pubsub, the Dataflow runner uses an internal implementation of PubSubIO, while Beam's open-source PubsubIO is used by the Direct runner, Flink runner, and others.

It is observed that Beam's implementation is less performant than Dataflow's, which is expected. However, the general expectation is that performance should still be reasonable on other runners and that the SDK is production grade, which does not appear to be the case currently.

For example, there are reports of high ack and sent message counts when reading from Pub/Sub. This is not a high-throughput use case: only 1 message per second is published. Yet the number of "acks" is 14 times the number of published messages, and the number of "sent" deliveries is 6 times the number of published messages.
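
To make the reported amplification concrete, here is a small worked example. The absolute counts below are made up for illustration; only the 14x and 6x ratios come from the report above.

```python
# Illustrative numbers: ~1 message per second published for one minute.
published_per_min = 60
acks_per_min = 14 * published_per_min   # observed ack count (reported as 14x)
sent_per_min = 6 * published_per_min    # observed "sent" delivery count (reported as 6x)

ack_amplification = acks_per_min / published_per_min       # 14.0
delivery_amplification = sent_per_min / published_per_min  # 6.0

# sent > published means each message is delivered to the pipeline ~6 times;
# acks > sent means even individual deliveries are being acked more than once.
print(ack_amplification, delivery_amplification)
```

In other words, the subscriber-side traffic is an order of magnitude larger than the actual publish volume, which is what makes this look like a client-side (connector) problem rather than a throughput problem.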

As a first step, we should investigate why messages are acked and delivered multiple times on, for example, the Flink runner.

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@xzhang2sc commented:
I think the issue is how the watermark is estimated in PubsubUnboundedSource. The assumption that Pub/Sub delivers the oldest message at least once a minute is quite fragile; an improperly advanced watermark will cause old messages to be delivered repeatedly.
Dataflow's watermark estimation is more sophisticated, and it uses the "oldest unacked publish timestamp" metric: https://stackoverflow.com/questions/42169004/what-is-the-watermark-heuristic-for-pubsubio-running-on-gcd
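
To illustrate why that assumption is fragile, here is a simplified sketch (not Beam's actual code) of a min-timestamp-over-a-window watermark heuristic like the one described above. All names here are hypothetical; the point is that the estimate is only safe if the oldest backlog message is actually observed within the sampling window.

```python
class NaiveWatermarkEstimator:
    """Sketch of a fragile watermark heuristic: assume the backlog's oldest
    message is (re)delivered at least once per minute, so the minimum event
    timestamp observed in the last 60 seconds is taken as the watermark."""

    SAMPLE_PERIOD_S = 60.0

    def __init__(self):
        # (arrival_time, event_timestamp) pairs for recently seen messages.
        self._recent = []

    def observe(self, arrival_time, event_timestamp):
        self._recent.append((arrival_time, event_timestamp))

    def watermark(self, now):
        # Drop samples older than the sampling window.
        self._recent = [(a, t) for a, t in self._recent
                        if now - a <= self.SAMPLE_PERIOD_S]
        if not self._recent:
            return None  # no recent data: the watermark cannot safely advance
        # Fragile step: if Pub/Sub happens NOT to redeliver the oldest unacked
        # message within the window, this minimum jumps ahead and the
        # watermark overshoots the true backlog.
        return min(t for _, t in self._recent)
```

If the watermark overshoots, messages behind it are treated as late or are reprocessed when Pub/Sub eventually redelivers them, which is consistent with the repeated-delivery symptom described in this issue.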
