
[Managed Iceberg] unbounded source #33504

Open · wants to merge 38 commits into base: master

Conversation

ahmedabu98 (Contributor) commented Jan 6, 2025

Unbounded (streaming) source for Managed Iceberg.

See design doc for high level overview: https://s.apache.org/beam-iceberg-incremental-source

Fixes #33092
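For context, a Managed CDC read might be configured roughly like the following. This is an illustrative sketch only: the key names (table, catalog_properties, streaming, poll_interval_seconds) are assumptions based on the general direction of the design doc, not the authoritative configuration schema.

```yaml
# Hypothetical configuration map for a Managed ICEBERG_CDC read.
# Key names are illustrative assumptions, not the authoritative schema.
table: "namespace.my_table"
catalog_properties:
  type: "rest"
  uri: "https://catalog.example.com"
streaming: true            # enable the unbounded incremental-scan source
poll_interval_seconds: 60  # how often to poll the table for new snapshots
```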

@ahmedabu98 ahmedabu98 marked this pull request as draft January 6, 2025 18:16
@ahmedabu98 ahmedabu98 marked this pull request as ready for review January 30, 2025 21:09
github-actions bot: Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment `assign set of reviewers`.

ahmedabu98 (Author) commented:
R: @kennknowles
R: @regadas

Can y'all take a look? I still have to write some tests, but it's at a good spot for a first round of review. I ran a bunch of pipelines (with the legacy DataflowRunner) at different scales, and the throughput/scalability looks good.

github-actions bot commented Feb 3, 2025

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment `assign set of reviewers`.

kennknowles (Member) left a comment:

Overall, I think all the pieces are in the right place. Just a question about why the SDF is the way it is, and a couple of code-level comments.

This seems like something you want to test a lot of different ways before it gets into a release. Maybe get another set of eyes like @chamikaramj or @Abacn too. But I'm approving and leaving to your judgment.

kennknowles (Member) left a comment:

Wait, actually, I forgot: I want to have the discussion about the high-level toggle between the incremental scan source and the bounded source.

@github-actions github-actions bot added the build label Feb 13, 2025
…rk progress; convert GiB output iterable to list because of RunnerV2 bug
@github-actions github-actions bot added the model label Mar 3, 2025
chamikaramj (Contributor) left a comment:

Thanks!

ahmedabu98 (Author):
@chamikaramj this is ready for another review.

.discardingFiredPanes())
.apply(
GroupIntoBatches.<ReadTaskDescriptor, ReadTask>ofByteSize(
MAX_FILES_BATCH_BYTE_SIZE, ReadTask::getByteSize)
A reviewer (Contributor) commented:

We don't really want these batches; we just want the read tasks distributed to workers without causing worker OOMs. Otherwise we're just adding the poll latency without really benefiting from the batch.

Ideally we could change Redistribute to autoshard, but since autosharding is currently tied to GroupIntoBatches, what about just doing GroupIntoBatches.ofSize(1).withShardedKey()?
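To make the latency argument concrete, here is a rough sketch of what byte-size batching does per key. This is not Beam's implementation, just an illustrative model: elements sit in a buffer until the byte budget fills (or, in the real transform, a timer fires), which is exactly the added latency the comment refers to.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToLongFunction;

// Conceptual model (not Beam's code) of per-key byte-size batching, as in
// GroupIntoBatches.ofByteSize: buffer elements until the byte budget is
// reached, then emit the whole batch at once. Elements wait in the buffer,
// which is the source of the extra latency discussed above.
class ByteSizeBatcher<T> {
  private final long maxBytes;
  private final ToLongFunction<T> sizer;
  private final List<T> buffer = new ArrayList<>();
  private long bufferedBytes = 0;

  ByteSizeBatcher(long maxBytes, ToLongFunction<T> sizer) {
    this.maxBytes = maxBytes;
    this.sizer = sizer;
  }

  /** Adds an element; returns a completed batch once the byte budget is hit, else null. */
  List<T> add(T element) {
    buffer.add(element);
    bufferedBytes += sizer.applyAsLong(element);
    if (bufferedBytes >= maxBytes) {
      List<T> batch = new ArrayList<>(buffer);
      buffer.clear();
      bufferedBytes = 0;
      return batch;
    }
    return null;
  }
}
```

With ofSize(1), the buffer never holds more than one element, so no batching latency is added, which is why the reviewer suggests it here.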

ahmedabu98 (Author) replied Mar 7, 2025:

I initially figured that GroupIntoBatches.ofSize(1).withShardedKey() would give us too many concurrent shards, but after running it I found it actually produces only one shard, and everything is processed sequentially. Same thing when I tried .ofByteSize(1).

ahmedabu98 (Author) replied Mar 10, 2025:

I also tried mimicking the GiB behavior by:

  • Associating a key with each read task; the key is incremented after reaching 4 GB.
  • Adding a Redistribute.byKey() after CreateReadTasksDoFn to distribute read tasks at roughly 4 GB per streaming key.

Read-and-drop did fine, but the throughput was pretty spiky. Read + write took much longer than the current approach and didn't scale well (job: 2025-03-07_08_57_21-15760437490773458424).
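The key-rotation experiment described in the bullets above could look roughly like the following. This is an illustrative sketch of the idea, not the code from the PR; the class and method names are hypothetical.

```java
// Illustrative sketch (not the PR's code) of the keying experiment:
// each read task is tagged with a monotonically increasing key, and the
// key rotates once ~4 GiB of task bytes have accumulated under it, so a
// downstream Redistribute.byKey() would spread roughly 4 GiB of work per key.
class RotatingKeyAssigner {
  private static final long MAX_BYTES_PER_KEY = 4L << 30; // 4 GiB budget per key

  private long currentKey = 0;
  private long bytesForCurrentKey = 0;

  /** Returns the key for a task of the given size, rotating when the budget would overflow. */
  long assign(long taskBytes) {
    if (bytesForCurrentKey > 0 && bytesForCurrentKey + taskBytes > MAX_BYTES_PER_KEY) {
      currentKey++;
      bytesForCurrentKey = 0;
    }
    bytesForCurrentKey += taskBytes;
    return currentKey;
  }
}
```

One plausible reason this scaled poorly, consistent with the numbers reported above: keys are assigned sequentially, so at any moment only a small number of "open" keys exist, which limits parallelism compared to letting the runner autoshard.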


return isComplete
? PollResult.complete(timestampedSnapshots) // stop at specified snapshot
: PollResult.incomplete(timestampedSnapshots); // continue forever
A reviewer (Contributor) commented:

I think we want to generate a correct watermark here using PollResult.withWatermark.

ahmedabu98 (Author) replied:

Added the watermark to the TimestampedValue above, as well as to individual read tasks output by CreateReadTasksDoFn.
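For a rough intuition of what a correct watermark could be here: since Iceberg snapshots are committed in order and each poll observes everything up to the table's current snapshot, the newest observed commit timestamp is a plausible watermark candidate. The sketch below is illustrative only (hypothetical class and method names), not the logic that landed in the PR.

```java
import java.time.Instant;
import java.util.List;

// Illustrative sketch (not the PR's code): derive a watermark for a poll
// round as the newest snapshot commit timestamp observed, falling back to
// the epoch when no snapshots have been seen yet.
class SnapshotWatermark {
  /** Returns the max commit timestamp, or the epoch if the list is empty. */
  static Instant watermarkFor(List<Instant> snapshotCommitTimes) {
    return snapshotCommitTimes.stream().max(Instant::compareTo).orElse(Instant.EPOCH);
  }
}
```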

chamikaramj (Contributor) left a comment:

Thanks. LGTM.

* <td> {@code operation} </td>
* <td> {@code string} </td>
* <td>
* The snapshot <a href="https://iceberg.apache.org/javadoc/0.11.0/org/apache/iceberg/DataOperations">operation</a> associated with this record. For now, only "append" is supported.
A reviewer (Contributor) commented:

Maybe change it to "APPEND" to be consistent with Iceberg.

ahmedabu98 (Author) replied:

The value is actually lowercase (see ref)

*
* <p><b>Note</b>: This reads <b>append-only</b> snapshots. Full CDC is not supported yet.
*
* <p>The CDC <b>streaming</b> source (enabled with {@code streaming=true}) continuously polls the
A reviewer (Contributor) commented:

We should validate (and fail) somewhere if the "streaming" flag is set here but the streaming PipelineOption [1] is not set.

[1]

ahmedabu98 (Author) replied:

Done (although note that this is automatically true if there's an unbounded PCollection)

ahmedabu98 (Author) replied Mar 10, 2025:

Actually, the flag isn't automatically applied with the DirectRunner. I'd prefer to remove this check because it makes the source less usable: it forces users to do something like pipeline.getOptions().as(StreamingOptions.class).setStreaming(true);. We also don't have any precedent for enforcing this on our current unbounded sources.
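For illustration, the validation under discussion boils down to a fail-fast check at pipeline construction time. The sketch below is hypothetical (class and method names invented here), showing only the shape of the check, not the PR's actual code or Beam's options API.

```java
// Illustrative sketch of the fail-fast check being debated above: reject a
// streaming-configured CDC read when the runner-level streaming flag is off.
// Names are hypothetical; the real check would consult StreamingOptions.
class StreamingFlagCheck {
  static void validate(boolean streamingReadConfigured, boolean streamingOptionSet) {
    if (streamingReadConfigured && !streamingOptionSet) {
      throw new IllegalArgumentException(
          "Iceberg CDC source is configured with streaming=true, but the pipeline's "
              + "--streaming option is not set. Set --streaming=true to run this source.");
    }
  }
}
```

The author's counterpoint above is that runners generally flip this flag automatically for unbounded sources, so the check mostly adds friction on the DirectRunner.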

@@ -108,6 +110,7 @@ public class Managed {
*
* <ul>
* <li>{@link Managed#ICEBERG} : Read from Apache Iceberg tables
* <li>{@link Managed#ICEBERG_CDC} : CDC Read from Apache Iceberg tables
A reviewer (Contributor) commented:

We should link to the locations where users can find additional Javadoc for each of these options (also for write).

ahmedabu98 (Author) replied:

Added links to the connectors.

Unfortunately, only IcebergIO has fleshed-out documentation for its Managed configuration parameters. Perhaps we should create a central location on the Beam website that lists these configuration options (similar to Dataflow's pages for Iceberg, Kafka, and BigQuery).

…remove window step; add --streaming=true validation; add IO links to Managed Javadoc

Successfully merging this pull request may close these issues.

[Feature Request]: {Managed IO Iceberg} - Allow users to run streaming reads
4 participants