[BEAM-10856] Support for NestedValueProvider for Python SDK #12779
Thanks @epicfaace for implementing these! I will not be able to review them for a couple of weeks. I can review once I'm back, or I can help find a new reviewer. What do you prefer? |
Thanks for the heads up, @pabloem. @tvalentyn, would you be able to review? |
By the way, how do you normally lint your code according to the Beam guidelines / PR checks? I couldn't find any guidance / a quick reference in the Beam development guide on how to do that. |
https://cwiki.apache.org/confluence/display/BEAM/Python+Tips has some autoformatter tips that can help. |
@epicfaace Curious, what use-case do you have in mind for this change? |
I'm creating a Dataflow template in which I need to write to multiple different output tables, "[prefix]_a", "[prefix]_b", "[prefix]_c", etc. I'd like to be able to specify a single runtime parameter, "prefix", and derive the output table names from this parameter, rather than having to redundantly specify each table name as an input runtime parameter. I noticed that this page said that the Apache Beam SDK for Python does not support NestedValueProvider, so I just implemented it in Python in a similar way to how the Java SDK does it. |
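For illustration, the use case above can be sketched in plain Python without Beam itself. The classes below are hypothetical stand-ins, not the code from this PR or the Beam SDK. They only show the idea: a nested provider wraps another provider and applies a translator function when the value is finally read. The helper `derive_tables` and the method `set_value` are names assumed for this sketch.

```python
class RuntimeValueProvider:
    """Hypothetical stand-in for a parameter that is only known at run time."""

    def __init__(self, name):
        self.name = name
        self._value = None
        self._is_set = False

    def set_value(self, value):
        # In a real template the runner would supply this; here we set it by hand.
        self._value = value
        self._is_set = True

    def is_accessible(self):
        return self._is_set

    def get(self):
        if not self._is_set:
            raise RuntimeError(f"{self.name} is not available until run time")
        return self._value


class NestedValueProvider:
    """Hypothetical stand-in: derives a value from another provider lazily."""

    def __init__(self, value, translator):
        self.value = value            # the wrapped provider
        self.translator = translator  # function applied when get() is called

    def is_accessible(self):
        return self.value.is_accessible()

    def get(self):
        return self.translator(self.value.get())


def derive_tables(prefix_provider, suffixes):
    """Build one nested provider per suffix, all sharing a single prefix."""
    return {
        # s=suffix binds the current suffix to each lambda.
        suffix: NestedValueProvider(
            prefix_provider, lambda p, s=suffix: f"{p}_{s}")
        for suffix in suffixes
    }
```

With this shape, a single runtime parameter such as `prefix` is enough: each table name is computed from it only when `get()` is called, so nothing needs to be known at template construction time.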
Ok. We actually consider ValueProviders an anti-pattern now, prefer not to extend their scope, and recommend using Flex Templates instead, see: https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates. Could you give that a try? I do appreciate your effort to improve the SDK and documentation though! CC: @azurezyq who is an expert on Flex Templates. |
Hmm, I did try using Flex Templates initially, but it took way too long -- maybe 7-8 minutes -- for the job to initialize, and it appears that the bottleneck was installing apache-beam and other pip dependencies. I switched to using ValueProviders because I needed a quick turnaround time for the job (the job itself takes around 10 minutes, so a 7-8 minute additional startup time is not worth it for me). |
I see. I'd like @arvindram03 and @azurezyq to chime in here with their recommendation. I think there are ongoing efforts to reduce the startup time, or perhaps there are some tips to reduce it? |
Hi @epicfaace, we agree that the startup time of Flex Template Python jobs is an issue. We are working towards reducing the overall startup time to 3-4 mins by the end of this year. We recommend using Flex Templates instead of classic ones purely for flexibility, and you can also leverage more feature expansions in the near future. One of the major objectives of Flex Templates is to avoid using ValueProviders, so further overloading them with new features is not aligned with the roadmap. |
@arvindram03 , thanks for the update. To clarify, this PR only aims for feature parity with the Java SDK, which already supports NestedValueProvider -- I'm not "overloading them with new features". Additionally, Flex Templates is still a Pre-GA Offering, and as you mentioned, requires some work for it to be as efficient as regular templates. For these reasons, I don't think they're really a viable alternative to traditional templates at the moment (of course, with additional improvements by the end of the year, it could be!) |
A user has written a feature that they would find useful, and that will not change the experience for other users (if anything, it should improve it). The feature looks correct, and similar to what we do in Java. If we reject the PR, we may push the user to run on a fork. Can we let this in? @tvalentyn |
I agree with this assessment, feel free to merge once tests & linter pass. |
Codecov Report
@@ Coverage Diff @@
## master #12779 +/- ##
==========================================
- Coverage 82.48% 82.28% -0.21%
==========================================
Files 455 451 -4
Lines 54876 53738 -1138
==========================================
- Hits 45266 44217 -1049
+ Misses 9610 9521 -89
Continue to review full report at Codecov.
|
@epicfaace can you ensure PythonDocs and PythonLint checks pass? |
@epicfaace LMK if you can take a look at the lint issues |
What is the next step on this PR? |
@epicfaace there are a number of failing tests (Python Docs and Lint). Can you fix those? |
@epicfaace @pabloem |
@tvalentyn |
@epicfaace Fixed in yet another pr: epicfaace#2 |
@pabloem, Python Docs and Lint tests now succeed. |
@pabloem I'm not quite sure why CI is still failing, it gives this error:
Any ideas? Or is the CI just flaky? |
This one looks like a test on the asserts. The actual error (which is likely a flake) is: logs: https://github.com/apache/beam/pull/12779/checks?check_run_id=1433935136
Maybe clean should catch the error and not fail? Or try again? @kevingg - Could you check this error? @pabloem @epicfaace - I believe you can ignore this error for the purposes of this PR. |
The problem is Windows-specific. I tried on Mac OS, the potential breaking point is in the streaming cache where the This probably does not happen that often, we can revisit this if it happens frequently and ignore this by specifying |
Maybe file a JIRA and we can track how often this happens? I do not think people triage all failures so it is hard to tell how frequently this happens in pre commit tests. |
Filed BEAM-11339 for this. |
thank you @epicfaace ! |
Support for NestedValueProvider for Python SDK. I've also added more comments to the pydoc for value_provider classes. R: @pabloem