Improve BQ typed support #5529

Open
wants to merge 7 commits into
base: main

Conversation

RustedBones
Contributor

Leverages upstream changes available for BQ:
The Avro translation for annotated BQ types now uses logical types so that reads and writes are symmetric. This fixes the integration test failure introduced in #5523.

Base automatically changed from beam-2.61 to main November 27, 2024 14:55
@RustedBones RustedBones marked this pull request as ready for review November 27, 2024 14:56
 case t if t =:= typeOf[LocalTime] =>
-  q"_root_.com.spotify.scio.bigquery.Time($tree)"
+  q"_root_.com.spotify.scio.bigquery.Time.micros($tree)"
Contributor

So it was returning a String before?

Contributor Author

Indeed, by changing this, I created a behavior change.

When extracting to Avro, we disabled logical types, so those types are returned as STRING (mentioned here). This is very annoying because the same record can't be written back to the table.

Setting useAvroLogicalTypes on the IO enables logical types. However, this is a breaking change for all pipelines that read BQ DATE, TIME and DATETIME columns and expect strings: they will get logical types instead.

The string behavior is also out of sync with the Storage API, which uses logical types for such columns.
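
To make the difference concrete, here is a minimal sketch of the two encodings discussed above, using only `java.time`. It is not code from this PR: the object and method names are hypothetical, and the string pattern is an assumption about how a TIME value is rendered. The numeric forms follow the Avro specification, where `time-micros` is a long counting microseconds after midnight and `date` is an int counting days since the Unix epoch.

```scala
import java.time.{LocalDate, LocalTime}
import java.time.format.DateTimeFormatter

// Hypothetical helper, for illustration only (not part of scio or Beam).
object LogicalTypeSketch {
  // Assumed string rendering of a TIME value with microsecond precision.
  private val timeFormatter = DateTimeFormatter.ofPattern("HH:mm:ss.SSSSSS")

  // String form: what a reader sees when logical types are disabled.
  def timeAsString(t: LocalTime): String = timeFormatter.format(t)

  // Avro `time-micros` form: microseconds after midnight, stored as a long.
  def timeAsMicros(t: LocalTime): Long = t.toNanoOfDay / 1000L

  // Inverse of timeAsMicros, so a value read can be written back unchanged.
  def timeFromMicros(micros: Long): LocalTime = LocalTime.ofNanoOfDay(micros * 1000L)

  // Avro `date` form: days since 1970-01-01, stored as an int.
  def dateAsDays(d: LocalDate): Int = d.toEpochDay.toInt

  // Inverse of dateAsDays.
  def dateFromDays(days: Int): LocalDate = LocalDate.ofEpochDay(days.toLong)
}
```

With the numeric forms, a value read from a table round-trips through the same representation on write, which is the symmetry this PR is after. A pipeline that pattern-matches on the string form would need to change, which is why switching the default is a breaking change.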

Contributor Author

Updated the PR to keep the same behavior.
Introduced a new Format.GenericRecordWithLogicalTypes to enable BQ Avro reads with the desired setup.
Changed BigQueryTyped.Table to always use logical types.

@RustedBones RustedBones force-pushed the fix-typed-bq-avro branch 2 times, most recently from 35a1b7e to 158c9bf Compare November 28, 2024 09:53
@@ -203,7 +203,7 @@ final class SCollectionGenericRecordOps[T <: GenericRecord](private val self: SC
 self
   .covary[GenericRecord]
   .write(
-    BigQueryTypedTable(table, Format.GenericRecord)(
+    BigQueryTypedTable(table, Format.GenericRecordWithLogicalTypes)(
Contributor Author

On write, logical types were already enabled by default.

@@ -334,7 +334,7 @@ private[types] object TypeProvider {
   q"override def schema: ${p(c, GModel)}.TableSchema = ${p(c, SUtil)}.parseSchema(${schema.toString})"
 }
 val defAvroSchema =
-  q"override def avroSchema: org.apache.avro.Schema = ${p(c, BigQueryUtils)}.toGenericAvroSchema(${cName.toString}, this.schema.getFields)"
+  q"override def avroSchema: org.apache.avro.Schema = ${p(c, BigQueryUtils)}.toGenericAvroSchema(this.schema, true)"
Contributor Author

Use logical types by default for typed BQ Avro.

Comment on lines +755 to +757
val schemaFactory = Functions.serializableFn[TableSchema, org.apache.avro.Schema] { _ =>
  BigQueryType[T].avroSchema
}
Contributor Author

Use the strict schema from the annotated class instead of relying on the table conversion.

codecov bot commented Nov 29, 2024

Codecov Report

Attention: Patch coverage is 54.28571% with 16 lines in your changes missing coverage. Please review.

Project coverage is 61.38%. Comparing base (a1fce09) to head (6da6ac2).
Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...n/scala/com/spotify/scio/bigquery/BigQueryIO.scala | 36.36% | 14 Missing ⚠️ |
| ...la/com/spotify/scio/bigquery/client/TableOps.scala | 0.00% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5529      +/-   ##
==========================================
- Coverage   61.42%   61.38%   -0.04%     
==========================================
  Files         312      312              
  Lines       11104    11116      +12     
  Branches      757      770      +13     
==========================================
+ Hits         6821     6824       +3     
- Misses       4283     4292       +9     

