Add explicit schema support to JdbcIO read and xlang transform. #34128

Merged: 4 commits, Mar 3, 2025

claudevdm

@claudevdm claudevdm commented Feb 28, 2025

  • Added withSchema() to JdbcIO.ReadRows and JdbcIO.ReadWithPartitions in Java
  • Wired schema through JdbcSchemaIOProvider
  • Added schema support to xlang jdbc.py ReadFromJdbc
  • Refactored xlang tests to initialize containers once at the class level instead of once per test
  • Added tests for custom SQL statements in Python (there was a TODO in the code)
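
For illustration, a minimal sketch of how the Python side might be used with an explicit schema. It assumes the new keyword is named `schema` and accepts a `NamedTuple` row type; `ExampleRow`, `build_read`, the table name, URL, and credentials are hypothetical placeholders, not taken from this PR.

```python
import typing


class ExampleRow(typing.NamedTuple):
    """Hypothetical row type describing the columns the query returns."""
    id: int
    name: str


def build_read():
    """Builds a JDBC read transform that carries an explicit schema.

    apache_beam is imported lazily so the row type above can be defined
    and inspected without the SDK installed.
    """
    from apache_beam.io.jdbc import ReadFromJdbc

    return ReadFromJdbc(
        table_name='example_table',  # hypothetical
        driver_class_name='org.postgresql.Driver',
        jdbc_url='jdbc:postgresql://db-host:5432/example',  # hypothetical
        username='user',  # hypothetical
        password='secret',  # hypothetical
        query='SELECT id, name FROM example_table',
        # Assumption: the keyword added by this PR is `schema`.
        schema=ExampleRow,
    )
```

The intent is that the launcher no longer has to reach the database to learn the column types, since they are supplied up front.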

Postcommit xlang test run https://github.com/apache/beam/actions/runs/13635975481/job/38114752052?pr=34128

codecov bot commented Mar 2, 2025

Codecov Report

Attention: Patch coverage is 0% with 4 lines in your changes missing coverage. Please review.

Project coverage is 59.30%. Comparing base (f5ed586) to head (6832b61).
Report is 58 commits behind head on master.

Files with missing lines Patch % Lines
sdks/python/apache_beam/io/jdbc.py 0.00% 4 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #34128      +/-   ##
============================================
+ Coverage     59.25%   59.30%   +0.04%     
  Complexity     3272     3272              
============================================
  Files          1164     1166       +2     
  Lines        178325   178633     +308     
  Branches       3413     3393      -20     
============================================
+ Hits         105675   105936     +261     
- Misses        69250    69302      +52     
+ Partials       3400     3395       -5     
Flag Coverage Δ
python 81.27% <0.00%> (+0.01%) ⬆️

@claudevdm claudevdm force-pushed the jdbc-shcema branch 4 times, most recently from 9d8ccdc to 5afc467 Compare March 3, 2025 16:46
@claudevdm claudevdm changed the title Jdbc shcema Add explicit schema support to JdbcIO read and xlang transform. Mar 3, 2025
@github-actions github-actions bot removed the build label Mar 3, 2025
@claudevdm claudevdm requested a review from Abacn March 3, 2025 17:57
@claudevdm claudevdm marked this pull request as ready for review March 3, 2025 17:57
@claudevdm

addresses #23029

@claudevdm claudevdm requested a review from damccorm March 3, 2025 17:59

github-actions bot commented Mar 3, 2025

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @damccorm for label python.
R: @Abacn for label java.
R: @shunping for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@Abacn Abacn left a comment

Thanks!

Just to confirm, have you been able to verify it on Dataflow runner? e.g. submit a jdbc read pipeline locally where it does not have access to database; and Dataflow workers have access and the pipeline runs successfully

@@ -84,7 +84,7 @@ public Schema configurationSchema() {
    */
   @Override
   public JdbcSchemaIO from(String location, Row configuration, @Nullable Schema dataSchema) {
-    return new JdbcSchemaIO(location, configuration);
+    return new JdbcSchemaIO(location, configuration, dataSchema);
Contributor

I see, previously the dataSchema parameter wasn't passed down stream

Collaborator Author

Yeah, and in JdbcIO#readRows and readWithPartitions the schema case wasn't handled.

I also need to add the ability to explicitly pass the partition lower and upper bounds through xlang, otherwise that part will also happen at pipeline construction time. Will do in a followup later this week.
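
To make the one-line fix above concrete, here is a small Python model of it, not the Java code itself. `SchemaIO`, `from_config_before`, and `from_config_after` are hypothetical names; the sketch only shows how accepting an optional schema but not forwarding it silently falls back to inference.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# (column name, type name) pairs; a stand-in for a real schema object.
Schema = Tuple[Tuple[str, str], ...]


@dataclass
class SchemaIO:
    location: str
    configuration: dict
    data_schema: Optional[Schema] = None

    def needs_inference(self) -> bool:
        # Without a caller-supplied schema, the reader has to infer one.
        return self.data_schema is None


def from_config_before(location, configuration, data_schema=None):
    # Old behaviour: the schema argument is accepted but dropped.
    return SchemaIO(location, configuration)


def from_config_after(location, configuration, data_schema=None):
    # New behaviour: the schema argument is forwarded.
    return SchemaIO(location, configuration, data_schema)
```

With the same arguments, only the second factory produces an object that can skip inference.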

claudevdm commented Mar 3, 2025

> Thanks!
>
> Just to confirm, have you been able to verify it on Dataflow runner? e.g. submit a jdbc read pipeline locally where it does not have access to database; and Dataflow workers have access and the pipeline runs successfully

I confirmed that schema inference doesn't happen during pipeline construction when the schema parameter is passed :D .

I didn't try running on DataflowRunner because I'm not sure how I'm supposed to use the modified Java SDK expansion service there. For example, I usually pass something like sdk_harness_container_image_overrides=.*java.*,gcr.io/cloud-dataflow/v1beta3/beam_java11_sdk:2.62.0. Does that mean I need to build the Java SDK container, etc.?

If you look at the custom query/write statement tests, schema inference there would fail if we didn't pass a custom schema. So they indirectly verify the solution.
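
The scenario discussed above (launcher without database access, workers with access) can be modelled with the standard-library `sqlite3` module. This is an analogy, not Beam or JdbcIO code: `JdbcLikeRead`, `infer_schema`, `connect_from_launcher`, and `connect_from_worker` are hypothetical names, and the "schema" is reduced to a list of column names.

```python
import sqlite3


def connect_from_launcher():
    # Stand-in for a machine that cannot reach the database.
    raise ConnectionError('database not reachable from the launcher')


def connect_from_worker():
    # Stand-in for a worker that can reach a populated database.
    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE example (id INTEGER, name TEXT)')
    conn.execute("INSERT INTO example VALUES (1, 'a')")
    return conn


def infer_schema(connect, query):
    # Inference needs a live connection: run the query, read column names.
    conn = connect()
    try:
        cursor = conn.execute(query)
        return [column[0] for column in cursor.description]
    finally:
        conn.close()


class JdbcLikeRead:
    def __init__(self, query, connect, schema=None):
        self.query = query
        # Only touch the database at construction time if no schema is given.
        self.schema = schema if schema is not None else infer_schema(
            connect, query)

    def run(self, connect):
        # Execution happens later, from wherever the database is reachable.
        conn = connect()
        try:
            return conn.execute(self.query).fetchall()
        finally:
            conn.close()
```

Constructing `JdbcLikeRead` with `connect_from_launcher` and an explicit schema succeeds and can later be run with `connect_from_worker`; constructing it without a schema raises `ConnectionError` at construction time.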

@Abacn Abacn left a comment

SGTM

@Abacn Abacn merged commit a994291 into apache:master Mar 3, 2025
101 of 102 checks passed
claudevdm pushed a commit to claudevdm/beam that referenced this pull request Mar 4, 2025