
[Managed Iceberg] unbounded source #33504

Open · wants to merge 38 commits into base: master

Conversation

ahmedabu98 (Contributor) commented Jan 6, 2025

Unbounded (streaming) source for Managed Iceberg.

See design doc for high level overview: https://s.apache.org/beam-iceberg-incremental-source

Fixes #33092
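For context, a Managed CDC read might be configured roughly like the following. This is an illustrative sketch only: the key names (table, catalog_properties, streaming, poll_interval_seconds) are assumptions based on the general direction of the design doc, not the authoritative configuration schema.

```yaml
# Hypothetical configuration map for a Managed ICEBERG_CDC read.
# Key names are illustrative assumptions, not the authoritative schema.
table: "namespace.my_table"
catalog_properties:
  type: "rest"
  uri: "https://catalog.example.com"
streaming: true            # enable the unbounded incremental-scan source
poll_interval_seconds: 60  # how often to poll the table for new snapshots
```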

@ahmedabu98 ahmedabu98 marked this pull request as draft January 6, 2025 18:16
@ahmedabu98 ahmedabu98 marked this pull request as ready for review January 30, 2025 21:09
github-actions bot: Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment `assign set of reviewers`.

ahmedabu98 (Author) commented:
R: @kennknowles
R: @regadas

Can y'all take a look? I still have to write some tests, but it's at a good spot for a first round of review. I ran a bunch of pipelines (with the legacy DataflowRunner) at different scales, and the throughput/scalability looks good.

github-actions bot commented Feb 3, 2025

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment `assign set of reviewers`.

kennknowles (Member) left a comment:

Overall, I think all the pieces are in the right place. Just a question about why the SDF is the way it is, and a couple of code-level comments.

This seems like something you want to test a lot of different ways before it gets into a release. Maybe get another set of eyes like @chamikaramj or @Abacn too. But I'm approving and leaving to your judgment.

kennknowles (Member) left a comment:

Wait, actually, I forgot: I want to have the discussion about the high-level toggle between the incremental scan source and the bounded source.

@github-actions github-actions bot added the build label Feb 13, 2025
…rk progress; convert GiB output iterable to list because of RunnerV2 bug
@github-actions github-actions bot added the model label Mar 3, 2025
chamikaramj (Contributor) left a comment:

Thanks!

ahmedabu98 (Author):
@chamikaramj this is ready for another review.

.discardingFiredPanes())
.apply(
GroupIntoBatches.<ReadTaskDescriptor, ReadTask>ofByteSize(
MAX_FILES_BATCH_BYTE_SIZE, ReadTask::getByteSize)
A reviewer (Contributor) commented:

We don't really want these batches; we just want the read tasks distributed to workers without causing worker OOMs. Otherwise we're just adding the poll latency without really benefiting from the batch.

Ideally we could change Redistribute to autoshard, but since autosharding is currently tied to GroupIntoBatches, what about just doing GroupIntoBatches.ofSize(1).withShardedKey()?
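To make the latency argument concrete, here is a rough sketch of what byte-size batching does per key. This is not Beam's implementation, just an illustrative model: elements sit in a buffer until the byte budget fills (or, in the real transform, a timer fires), which is exactly the added latency the comment refers to.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToLongFunction;

// Conceptual model (not Beam's code) of per-key byte-size batching, as in
// GroupIntoBatches.ofByteSize: buffer elements until the byte budget is
// reached, then emit the whole batch at once. Elements wait in the buffer,
// which is the source of the extra latency discussed above.
class ByteSizeBatcher<T> {
  private final long maxBytes;
  private final ToLongFunction<T> sizer;
  private final List<T> buffer = new ArrayList<>();
  private long bufferedBytes = 0;

  ByteSizeBatcher(long maxBytes, ToLongFunction<T> sizer) {
    this.maxBytes = maxBytes;
    this.sizer = sizer;
  }

  /** Adds an element; returns a completed batch once the byte budget is hit, else null. */
  List<T> add(T element) {
    buffer.add(element);
    bufferedBytes += sizer.applyAsLong(element);
    if (bufferedBytes >= maxBytes) {
      List<T> batch = new ArrayList<>(buffer);
      buffer.clear();
      bufferedBytes = 0;
      return batch;
    }
    return null;
  }
}
```

With ofSize(1), the buffer never holds more than one element, so no batching latency is added, which is why the reviewer suggests it here.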

ahmedabu98 (Author) replied Mar 7, 2025:

I initially figured that GroupIntoBatches.ofSize(1).withShardedKey() would give us too many concurrent shards, but after running it I found it actually produces only one shard, and everything is processed sequentially. Same thing when I tried .ofByteSize(1).

ahmedabu98 (Author) replied Mar 10, 2025:

I also tried mimicking the GiB behavior by:

  • Associating a key with each read task; the key is incremented after reaching 4 GB.
  • Adding a Redistribute.byKey() after CreateReadTasksDoFn to distribute read tasks at roughly 4 GB per streaming key.

Read-and-drop did fine, but the throughput was pretty spiky. Read + write took much longer than the current approach and didn't scale well (job: 2025-03-07_08_57_21-15760437490773458424).
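The key-rotation experiment described in the bullets above could look roughly like the following. This is an illustrative sketch of the idea, not the code from the PR; the class and method names are hypothetical.

```java
// Illustrative sketch (not the PR's code) of the keying experiment:
// each read task is tagged with a monotonically increasing key, and the
// key rotates once ~4 GiB of task bytes have accumulated under it, so a
// downstream Redistribute.byKey() would spread roughly 4 GiB of work per key.
class RotatingKeyAssigner {
  private static final long MAX_BYTES_PER_KEY = 4L << 30; // 4 GiB budget per key

  private long currentKey = 0;
  private long bytesForCurrentKey = 0;

  /** Returns the key for a task of the given size, rotating when the budget would overflow. */
  long assign(long taskBytes) {
    if (bytesForCurrentKey > 0 && bytesForCurrentKey + taskBytes > MAX_BYTES_PER_KEY) {
      currentKey++;
      bytesForCurrentKey = 0;
    }
    bytesForCurrentKey += taskBytes;
    return currentKey;
  }
}
```

One plausible reason this scaled poorly, consistent with the numbers reported above: keys are assigned sequentially, so at any moment only a small number of "open" keys exist, which limits parallelism compared to letting the runner autoshard.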


return isComplete
? PollResult.complete(timestampedSnapshots) // stop at specified snapshot
: PollResult.incomplete(timestampedSnapshots); // continue forever
A reviewer (Contributor) commented:

I think we want to generate a correct watermark here using PollResult.withWatermark.

ahmedabu98 (Author) replied:

Added the watermark to the TimestampedValue above, as well as to individual read tasks output by CreateReadTasksDoFn.
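For a rough intuition of what a correct watermark could be here: since Iceberg snapshots are committed in order and each poll observes everything up to the table's current snapshot, the newest observed commit timestamp is a plausible watermark candidate. The sketch below is illustrative only (hypothetical class and method names), not the logic that landed in the PR.

```java
import java.time.Instant;
import java.util.List;

// Illustrative sketch (not the PR's code): derive a watermark for a poll
// round as the newest snapshot commit timestamp observed, falling back to
// the epoch when no snapshots have been seen yet.
class SnapshotWatermark {
  /** Returns the max commit timestamp, or the epoch if the list is empty. */
  static Instant watermarkFor(List<Instant> snapshotCommitTimes) {
    return snapshotCommitTimes.stream().max(Instant::compareTo).orElse(Instant.EPOCH);
  }
}
```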

chamikaramj (Contributor) left a comment:

Thanks. LGTM.

* <td> {@code operation} </td>
* <td> {@code string} </td>
* <td>
* The snapshot <a href="https://iceberg.apache.org/javadoc/0.11.0/org/apache/iceberg/DataOperations">operation</a> associated with this record. For now, only "append" is supported.
A reviewer (Contributor) commented:

Maybe change it to "APPEND" to be consistent with Iceberg.

ahmedabu98 (Author) replied:

The value is actually lowercase (see ref)

*
* <p><b>Note</b>: This reads <b>append-only</b> snapshots. Full CDC is not supported yet.
*
* <p>The CDC <b>streaming</b> source (enabled with {@code streaming=true}) continuously polls the
A reviewer (Contributor) commented:

We should validate (and fail) somewhere if the "streaming" flag is set here but the streaming PipelineOption [1] is not set.

[1]

ahmedabu98 (Author) replied:

Done (although note that this is automatically true if there's an unbounded PCollection)

ahmedabu98 (Author) replied Mar 10, 2025:

Actually, the flag isn't automatically applied with the DirectRunner. I'd prefer to remove this check because it makes the source less usable: it forces users to do something like pipeline.getOptions().as(StreamingOptions.class).setStreaming(true);. We also don't have any precedent for enforcing this on our current unbounded sources.
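For illustration, the validation under discussion boils down to a fail-fast check at pipeline construction time. The sketch below is hypothetical (class and method names invented here), showing only the shape of the check, not the PR's actual code or Beam's options API.

```java
// Illustrative sketch of the fail-fast check being debated above: reject a
// streaming-configured CDC read when the runner-level streaming flag is off.
// Names are hypothetical; the real check would consult StreamingOptions.
class StreamingFlagCheck {
  static void validate(boolean streamingReadConfigured, boolean streamingOptionSet) {
    if (streamingReadConfigured && !streamingOptionSet) {
      throw new IllegalArgumentException(
          "Iceberg CDC source is configured with streaming=true, but the pipeline's "
              + "--streaming option is not set. Set --streaming=true to run this source.");
    }
  }
}
```

The author's counterpoint above is that runners generally flip this flag automatically for unbounded sources, so the check mostly adds friction on the DirectRunner.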

@@ -108,6 +110,7 @@ public class Managed {
*
* <ul>
* <li>{@link Managed#ICEBERG} : Read from Apache Iceberg tables
* <li>{@link Managed#ICEBERG_CDC} : CDC Read from Apache Iceberg tables
A reviewer (Contributor) commented:

We should link to the locations where users can find additional Javadoc for each of these options (also for write).

ahmedabu98 (Author) replied:

Added links to the connectors.

Unfortunately, only IcebergIO has fleshed-out documentation for its Managed configuration parameters. Perhaps we should create a central location on the Beam website that lists these configuration options (similar to Dataflow's pages for Iceberg, Kafka, and BigQuery).

…remove window step; add --streaming=true validation; add IO links to Managed Javadoc

Successfully merging this pull request may close these issues.

[Feature Request]: {Managed IO Iceberg} - Allow users to run streaming reads
4 participants