[Managed Iceberg] Support partitioning time types (year, month, day, hour) #32939

ahmedabu98 · 2024-10-24T20:00:55Z

ahmedabu98 · 2024-10-24T20:02:11Z

@DanielMorales9 can you take a look at this one too?

github-actions · 2024-10-24T21:06:50Z

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

DanielMorales9 · 2024-10-25T10:29:40Z

I think a more scalable approach here would be to encapsulate the writing logic within a Parquet writer class. This would be similar to how Spark or Flink handle Parquet writes (e.g., SparkParquetWriter, FlinkParquetWriter), allowing you to manage the type conversions and partitioning requirements specific to Iceberg in a centralized and reusable way.

ahmedabu98 · 2024-10-25T13:01:22Z

We have a relatively thin RecordWriter wrapper that uses Parquet and Avro writers. A RecordWriter is blind to its data file's partition key and spec.

There's one RecordWriter for each destination-partition pair, and RecordWriterManager takes care of routing records to the correct destination and partition. If it helps, we can certainly move the new partition logic in this PR to RecordWriterManager. I can see that it belongs there more than in utils.
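To make the routing described above concrete, here is a minimal sketch of the idea: one writer per destination-partition pair, created lazily by a manager. `RoutingSketch`, `SimpleWriter`, and `SimpleManager` are hypothetical stand-ins written for illustration; they are not Beam's `RecordWriter` or `RecordWriterManager`, and rows and partition keys are reduced to plain strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: these classes only mimic the roles described in the comment above.
final class RoutingSketch {

  /** Stand-in for a RecordWriter: it appends rows and knows nothing about partitions. */
  static final class SimpleWriter {
    final List<String> rows = new ArrayList<>();

    void write(String row) {
      rows.add(row);
    }
  }

  /** Stand-in for RecordWriterManager: keeps one writer per (destination, partition) pair. */
  static final class SimpleManager {
    private final Map<List<String>, SimpleWriter> writers = new HashMap<>();

    void write(String destination, String partitionKey, String row) {
      // The pair is the map key, so the writer itself never sees the partition.
      List<String> key = List.of(destination, partitionKey);
      writers.computeIfAbsent(key, k -> new SimpleWriter()).write(row);
    }

    int openWriters() {
      return writers.size();
    }

    List<String> rowsFor(String destination, String partitionKey) {
      SimpleWriter writer = writers.get(List.of(destination, partitionKey));
      return writer == null ? List.of() : writer.rows;
    }
  }
}
```

Under this layout, any logic that derives the partition key from a record naturally sits in the manager, which is the relocation proposed above.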

ahmedabu98 · 2024-11-05T11:43:04Z

hey @DanielMorales9, would you like to take a look? the next Beam release is getting cut next week if we wanna get this in before then.

DanielMorales9 · 2024-11-05T12:50:08Z

Hey @ahmedabu98, I am a bit concerned with the additional overhead you are introducing by recreating all records just to fit the types expected by the Iceberg partitioning logic. Imo, it should be done in a single place rather than scattered everywhere (such as in the timestamp issue we discussed some time ago).

ahmedabu98 · 2024-11-05T15:20:18Z

Agreed it's not the most ideal. I expected all these conversions to be taken care of behind the scenes by Iceberg's partitioning logic, but it looks like for time types we need to do it on our side instead.

> recreating all records just to fit the types expected by the Iceberg partitioning logic

We're not really recreating the records (i.e. not doing a full copy). We're just creating an empty record and filling in the fields that we're partitioning on (ref). I found this to be the minimal implementation needed to make it work. In the average case, we should expect only a few fields to be populated, not the full record.

I'm open to suggestions though! hope I'm missing something
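The "empty record, fill in only the partition fields" idea can be sketched roughly as follows. This is not the PR's code: `PartitionRecordSketch` and its methods are hypothetical, rows are plain maps instead of Beam or Iceberg records, and the conversions assume that the partitioning code wants epoch-based integers for time types (the Iceberg spec stores dates as days and timestamps as microseconds from the Unix epoch).

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: plain maps stand in for the real record types.
final class PartitionRecordSketch {

  /** Converts time values to epoch-based numbers; every other value passes through unchanged. */
  static Object toPartitionInput(Object value) {
    if (value instanceof Instant) {
      // Assumption: timestamps are handed over as microseconds since the Unix epoch.
      return ChronoUnit.MICROS.between(Instant.EPOCH, (Instant) value);
    }
    if (value instanceof LocalDate) {
      // Assumption: dates are handed over as days since the Unix epoch.
      return (int) ((LocalDate) value).toEpochDay();
    }
    return value;
  }

  /** Starts from an empty record and copies only the fields that the partition spec reads. */
  static Map<String, Object> partitionOnlyRecord(
      Map<String, Object> row, List<String> partitionSourceFields) {
    Map<String, Object> partial = new HashMap<>();
    for (String field : partitionSourceFields) {
      partial.put(field, toPartitionInput(row.get(field)));
    }
    return partial;
  }
}
```

The cost of this approach grows with the number of partition source fields rather than with the width of the row, which matches the point made above.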

ahmedabu98 · 2024-12-08T15:42:00Z

@DanielMorales9 others are beginning to ask for this feature as well. I'll try to find other reviewers in case you don't have time to look at this

ahmedabu98 · 2024-12-08T15:42:10Z

assign set of reviewers

github-actions · 2024-12-08T15:44:12Z

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @kennknowles for label java.
R: @Abacn for label build.
R: @shunping for label io.

Available commands:

stop reviewer notifications - opt out of the automated review tooling
remind me after tests pass - tag the comment author after tests pass
waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

kennknowles

This more generally adds the partition fields functionality, right? It isn't just adding date/time support. Or am I misunderstanding?

ahmedabu98 · 2024-12-16T17:33:55Z

> This more generally adds the partition fields functionality, right? It isn't just adding date/time support. Or am I misunderstanding?

Partitioning support was added in #32102 and already works with most fields (e.g. integer, float, string, etc.), but further handling is required for some time types. I decided to make most of this handling uniform for all types.
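For readers unfamiliar with the four time transforms named in the PR title, the sketch below shows what each one computes according to the Iceberg table spec: years, months, days, and hours counted from 1970-01-01 (00:00:00 for hours). `TimeTransformSketch` is a hypothetical helper written for illustration; it is neither Iceberg's nor Beam's implementation, and it assumes the input timestamp is already in UTC.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;

// Hypothetical sketch of the year/month/day/hour partition transforms.
final class TimeTransformSketch {

  /** Years from 1970. */
  static int year(LocalDateTime ts) {
    return ts.getYear() - 1970;
  }

  /** Months from 1970-01. */
  static int month(LocalDateTime ts) {
    return (ts.getYear() - 1970) * 12 + (ts.getMonthValue() - 1);
  }

  /** Days from 1970-01-01. */
  static int day(LocalDateTime ts) {
    return (int) ts.toLocalDate().toEpochDay();
  }

  /** Hours from 1970-01-01 00:00:00; floorDiv keeps pre-epoch values rounding down. */
  static int hour(LocalDateTime ts) {
    return (int) Math.floorDiv(ts.toEpochSecond(ZoneOffset.UTC), 3600L);
  }
}
```

For example, a timestamp of 2024-10-24T20:00:55 falls in year partition 54 and month partition 657 under these definitions.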

svetakvsundhar

LGTM from my end

support partitioning time types

83b255f

github-actions bot added java build io labels Oct 24, 2024

add to changes

95258a3

spotless

5bacc00

relocate logic to RecordWriterManager

f11053a

ahmedabu98 added 2 commits November 5, 2024 13:18

Merge branch 'master' of https://github.com/ahmedabu98/beam into iceb…

b8d3343

…erg_time_partitioning

fix test

cbae083

svetakvsundhar reviewed Dec 5, 2024

CHANGES.md (outdated, resolved)

github-actions bot added the Next Action: Reviewers label Dec 8, 2024

Merge branch 'master' of https://github.com/ahmedabu98/beam into iceb…

fd03fb7

…erg_time_partitioning

kennknowles approved these changes Dec 13, 2024

sdks/java/io/iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/RecordWriterManager.java (outdated, resolved)

Abacn reviewed Dec 13, 2024

CHANGES.md (outdated, resolved)

ahmedabu98 and others added 2 commits December 16, 2024 12:42

address comments

fffb8da

Merge branch 'master' into iceberg_time_partitioning

3f1611d

ahmedabu98 added this to the 2.62.0 Release milestone Dec 16, 2024

svetakvsundhar approved these changes Dec 16, 2024

ahmedabu98 merged commit 286e29c into apache:master Dec 17, 2024
20 checks passed
