[Bug]: BigQuery BatchLoad incompatible table schema error #25355

Closed

Abacn opened this issue Feb 7, 2023 · 10 comments · Fixed by #25410

Comments

Abacn commented Feb 7, 2023

What happened?

This bug is triggered when all of these conditions are met:

  1. Dynamic destinations are set.
  2. The number of GCS files written is greater than 10,000, so that MultiPartitionsWriteTables is invoked.
  3. The final destination table already exists and the write uses createDisposition CREATE_NEVER.

When these conditions hold, the temp table and the final table can end up with incompatible schemas, regardless of whether the schema is explicitly set.
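
For context, a minimal configuration sketch that meets these three conditions might look like the following. This is not the failing pipeline: the table spec, the "shard" routing field, and the surrounding method are illustrative assumptions, and the >10,000-files threshold is purely a function of input size (no extra setting is involved).

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Arrays;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.DynamicDestinations;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.ValueInSingleWindow;

class ReproSketch {

  // rows: TableRows produced earlier in the pipeline.
  static void applyWrite(PCollection<TableRow> rows) {
    rows.apply(
        "WriteToBigQuery",
        BigQueryIO.writeTableRows()
            // Condition 1: dynamic destinations, one table per routing key.
            .to(
                new DynamicDestinations<TableRow, String>() {
                  @Override
                  public String getDestination(ValueInSingleWindow<TableRow> element) {
                    return (String) element.getValue().get("shard"); // hypothetical routing field
                  }

                  @Override
                  public TableDestination getTable(String destination) {
                    // Condition 3: these tables already exist.
                    return new TableDestination(
                        "my-project:my_dataset.table_" + destination, null);
                  }

                  @Override
                  public TableSchema getSchema(String destination) {
                    // Same explicit schema for every destination (fields as in this issue).
                    return new TableSchema()
                        .setFields(
                            Arrays.asList(
                                new TableFieldSchema().setName("id_even").setType("STRING"),
                                new TableFieldSchema().setName("ev_time").setType("DATETIME")));
                  }
                })
            // Condition 2 is met once the job writes more than 10,000 load files,
            // which sends BatchLoads down the MultiPartitionsWriteTables path
            // (temp tables followed by copy jobs).
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
  }
}
```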

Error message:

Error message from worker: java.lang.RuntimeException: Failed to create job with prefix beam_bq_job_COPY_***_00000,
reached max retries: 3, last failed job: { "configuration" : { "copy" : { "createDisposition" : "CREATE_NEVER",
"destinationTable" : { "datasetId" : "***", "projectId" : "***", "tableId" : "***" }, ... "reason" : "invalid" } ], "state" : "DONE" },

org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJob.runJob(BigQueryHelpers.java:200) 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJobManager.waitForDone(BigQueryHelpers.java:153) 
org.apache.beam.sdk.io.gcp.bigquery.WriteRename.finishBundle(WriteRename.java:171)

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
Abacn commented Feb 7, 2023

This has the same root cause as #22372; confirmed that the issue did not occur in Beam 2.39.0. While most of the use cases were fixed there, this bug remains as of 2.45.0.

Abacn commented Feb 7, 2023

I think I have reproduced the error: https://ci-beam.apache.org/job/beam_PostCommit_Java_DataflowV2_PR/151/

Run on branch: 2504882

Example jobId: 2023-02-07_11_45_58-12188672903944898798 in the apache-beam-testing GCP project

Abacn commented Feb 7, 2023

UpdateSchemaDestination, created in #17365, has no comments or doc strings. This task should also add the necessary comments to that class.
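
As a starting point, and based only on the behaviour described in this thread (the zero load job that patches the schema before the temp-table copy jobs run), a class-level comment could look roughly like the sketch below; the exact wording should come from whoever owns the transform.

```java
/**
 * Prepares destination tables for the copy step of batch loads when temp tables are used.
 *
 * <p>(Sketch only.) For each destination, issues a zero-row load job carrying the configured
 * schema update options, so that the destination table's schema is compatible with the temp
 * tables before the copy jobs run (e.g. with CREATE_NEVER / WRITE_APPEND).
 */
```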

ahmedabu98 commented:

.take-issue

Abacn commented Feb 8, 2023

As @ahmedabu98 pointed out, the original working example had a typo. Initiated another job, 2023-02-08_11_47_55-389031392081500435, on branch f446e5c.

The problem is that the condition (comparing the schema returned by the DynamicDestinations object with the one from destinationTable.getSchema()) is never true. The schema returned by the DynamicDestinations object is:

GenericData{classInfo=[fields], {fields=[GenericData{classInfo=[categories, collation, defaultValueExpression, description,
fields, maxLength, mode, name, policyTags, precision, scale, type], {name=id_even, type=STRING}}, GenericData{classInfo=
[categories, collation, defaultValueExpression, description, fields, maxLength, mode, name, policyTags, precision, scale,
type], {name=ev_time, type=DATETIME}}]}}

The schema from destinationTable.getSchema() is:

{"fields":[{"mode":"REQUIRED","name":"id_even","type":"STRING"},
{"mode":"REQUIRED","name":"ev_time","type":"DATETIME"}]}

Though they are effectively equivalent, and the generated temp table has the same schema in the BigQuery UI, their GSON string representations are not the same.
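
To make the two comparisons concrete, here is a small standalone sketch that rebuilds the schemas above by hand (field names and types are taken from the dumps in this comment; the class name and everything else are illustrative) and contrasts object equality with comparing serialized string forms, which is the distinction picked up in the follow-up comments below:

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Arrays;

public class SchemaCompareSketch {
  public static void main(String[] args) {
    // Schema as returned by the DynamicDestinations object (no mode set).
    TableSchema fromDynamicDestinations =
        new TableSchema()
            .setFields(
                Arrays.asList(
                    new TableFieldSchema().setName("id_even").setType("STRING"),
                    new TableFieldSchema().setName("ev_time").setType("DATETIME")));

    // Schema as returned by destinationTable.getSchema() (mode REQUIRED populated).
    TableSchema fromDestinationTable =
        new TableSchema()
            .setFields(
                Arrays.asList(
                    new TableFieldSchema().setName("id_even").setType("STRING").setMode("REQUIRED"),
                    new TableFieldSchema().setName("ev_time").setType("DATETIME").setMode("REQUIRED")));

    // Object equality compares the underlying field data maps...
    System.out.println("equals:      " + fromDynamicDestinations.equals(fromDestinationTable));
    // ...while comparing serialized strings is additionally sensitive to formatting
    // and to which optional attributes (e.g. mode) happen to be populated.
    System.out.println("toString eq: "
        + fromDynamicDestinations.toString().equals(fromDestinationTable.toString()));
  }
}
```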

Abacn commented Feb 8, 2023

I think I found the cause of the original issue (the one in the issue description):

The processElement here does not consider the case of dynamic destinations. It simply takes the first destination from the incoming list of elements to set up zeroJob, and assumes all outputs have the same destination.

Abacn commented Feb 8, 2023

The implementations of Java's UpdateSchemaDestination and Python's UpdateDestinationSchema are not the same, and Python does not have this issue: in the Python implementation, both the zero load job and the copy job take the same PCollection as their main input.

We should either change the Java implementation to match Python's, or make the input of UpdateSchemaDestination a KV<DestinationT, Iterable<WriteTables.Result>> so that each processElement call deals with a single destination.
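
A rough sketch of that second option, with a placeholder Result class standing in for the internal WriteTables.Result (and the actual schema-update work elided), might look like this:

```java
import java.io.Serializable;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Placeholder standing in for the internal WriteTables.Result type.
class Result implements Serializable {}

// Sketch: key the temp-table results by destination so each processElement call
// sees exactly one destination, instead of taking the first destination in the
// bundle and assuming every element shares it.
class UpdateSchemaDestinationSketch<DestinationT>
    extends DoFn<KV<DestinationT, Iterable<Result>>, Result> {

  @ProcessElement
  public void processElement(
      @Element KV<DestinationT, Iterable<Result>> element, OutputReceiver<Result> out) {
    DestinationT destination = element.getKey(); // exactly one destination per element
    // Run the zero load job that updates the schema for this destination only,
    // then pass the temp-table results through to the copy/rename step.
    for (Result result : element.getValue()) {
      out.output(result);
    }
  }
}
```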

ahmedabu98 commented:

Regarding the above comment about the following line: just tested it, and the equality check works fine even though they have different String representations.

Abacn commented Feb 11, 2023

> Regarding the above comment about the following line: just tested it, and the equality check works fine even though they have different String representations.

Ah I see, thanks for the clarification. So passing either the wrapped or unwrapped dynamic destination to UpdateSchemaDestination is fine.

ahmedabu98 commented:

> So passing either the wrapped or unwrapped dynamic destination to UpdateSchemaDestination is fine.

We would still want to wrap with the match-table DynamicDestinations, because that is what we do when creating the temp tables. For a given temp table, we want to pull the same schema consistently for both operations.
