-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BEAM-12093] Overhaul ElasticsearchIO.Write #14347
[BEAM-12093] Overhaul ElasticsearchIO.Write #14347
Conversation
As this is my first time contributing, I'm not sure exactly who to select as a reviewer. I see @echauchot in the git blame a lot for ElasticsearchIO.java, so I'll start there? 😂 |
Sure, I'll be happy to review. Thanks for your contribution ! |
I have the local dev env setup by using |
e4f2f52
to
ec6b1c9
Compare
@egalpin thanks for your contribution. I'm sorry I lack time a lot lately. In the meantime can you:
|
ed73351
to
ff97fd9
Compare
@echauchot I've made a jira ticket and linked it. |
9a6cd5d
to
15b128b
Compare
@echauchot FYI build is passing now 🙂 |
@echauchot let me know if you can take a look at this, and if not I can help you find more reviewers : ) |
@pabloem fyi I messaged echauchot on Apache beam slack previously and we had arranged to chat about tests together tomorrow. I wanted to let them know that I’d gotten past my blockers with respect to tests, but I definitely did not intend to apply pressure or name+shame or anything like that. |
oh I also didn't want to pressure anyone : ) I'm just checking. thanks for the update! |
Haha thanks! I appreciate it. I wanted to be sure I was respecting others' busy schedules ❤️ |
@egalpin starting first round of review sorry for the delay |
@timrobertson100 you might be interested. @ludovic-boutros you were thinking about changing the overall arhictecture of the IO and introduce a testContainer based test framework here: https://lists.apache.org/thread.html/reb68f37c435995a64ded19100e09dfc31c5cf6227feae16494226100%40%3Cdev.beam.apache.org%3E |
@pabloem @egalpin I'll do an overall review but I'd need another reviewer to do the in-depth review because:
|
Thanks @echauchot for sharing that thread. I really like a lot of what @ludovic-boutros proposed and have many shared goals; in particular, implementing the pattern where successful and failed writes can be returned via At the same time, I also see the argument that in many ways using the low-level client results in "reinventing the wheel" for a number of features (with good justification, IMO, of enabling cross-version support). I'd be very willing to contribute to brainstorming (and implementation once we reach that point) if others are open to that. |
Run Java PreCommit |
@egalpin @echauchot I made a quick review, but, well, sadly I don't have enough time these months to go deeper on the subject. |
=> Reading at the overall design goals, it looks very promising and a good analysis of the missing properties of the curent architecture ! Thanks !
=> True that very small batches can exist for example Flink being a streaming oriented platform, Flink runner tends to create very small Beam bundles. So, when the bundle is finished processing (finishBundle is called), the ES bulk request is sent leading to small ES bulk. Leveraging GroupIntoBatches that creates trans-bundle groups and still respect Beam semantics (windowing, bundle retries etc...) is a very good idea. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good work Evan ! Thanks.
I like the analysis of the missing features and all the improvements you gave !
I did a pretty in depth review after all but I no more have time. @pabloem as you offered help, could you please do the other rounds of review and merge when ok ?
Besides, Evan, as you know ES very well, and you seem to be interested in contributing. Would you be interested in putting yourself in ES Owners file and jira ES label ?
...ava/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java
Show resolved
Hide resolved
...ava/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java
Show resolved
Hide resolved
...ava/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java
Show resolved
Hide resolved
...ava/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java
Outdated
Show resolved
Hide resolved
...ava/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java
Outdated
Show resolved
Hide resolved
...ava/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java
Show resolved
Hide resolved
...sts-common/src/test/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIOTestCommon.java
Show resolved
Hide resolved
...icsearch-tests-2/src/test/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIOTest.java
Outdated
Show resolved
Hide resolved
...sts-common/src/test/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIOTestCommon.java
Show resolved
Hide resolved
...sts-common/src/test/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIOTestCommon.java
Outdated
Show resolved
Hide resolved
@echauchot Thanks for the review, I'll work my way through your comments and suggestions.
I'd be very happy to 👍 I've added myself to the ES owners file now, happy to lend a hand reviewing! Thanks 🙂 With respect to Jira, could you please add appropriate permissions for me to either assign myself to the ES label, or assign me to the label yourself if that is the preferred workflow. I have an account on issues.apache.org/jira but only with permission to create tickets I believe. |
@echauchot added coverage for the methods you mentioned. Anything else outstanding? 🙂 |
...sts-common/src/test/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIOTestCommon.java
Outdated
Show resolved
Hide resolved
...sts-common/src/test/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIOTestCommon.java
Show resolved
Hide resolved
...sts-common/src/test/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIOTestCommon.java
Show resolved
Hide resolved
...sts-common/src/test/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIOTestCommon.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only minor changes left. I think when you address them, I could hit the merge button provided the tests pass.
...sts-common/src/test/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIOTestCommon.java
Outdated
Show resolved
Hide resolved
...sts-common/src/test/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIOTestCommon.java
Outdated
Show resolved
Hide resolved
...sts-common/src/test/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIOTestCommon.java
Outdated
Show resolved
Hide resolved
...sts-common/src/test/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIOTestCommon.java
Outdated
Show resolved
Hide resolved
...ava/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java
Outdated
Show resolved
Hide resolved
Run Java PreCommit |
2 similar comments
Run Java PreCommit |
Run Java PreCommit |
Ready to roll! 🙂 |
@egalpin seems good ! thanks ! I just triggered the build. it's like fort knox now on resources consumption 😄 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@egalpin ready to merge. only the 2 javadoc (assert numdocs/num scientits) fixes and the not needed gbk in parallel test (+consequent simplification of assertion function) and we're good to merge.
...sts-common/src/test/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIOTestCommon.java
Outdated
Show resolved
Hide resolved
@echauchot all set for a final look-over 🙂 |
Run Java PreCommit |
2 similar comments
Run Java PreCommit |
Run Java PreCommit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM,
Thanks for you great and hard work Evan ! And also thanks for your patience.
As stated in the guidelines, I'll squash the review commits into the first commit and merge.
Thanks Etienne for all of your reviewing efforts, and your warm welcome to Beam! 😄 |
Came across this while lurking. Where would one find docs on how to use ElasticsearchIO, including more advanced features like this? I saw some examples on the Beam site, but nothing specific to each source/sink. |
@mattwelke The javadoc[1] that is generated has some description and examples, though admittedly not as fully descriptive as it could be. I would welcome any additional examples or documentation added by others, and will keep in mind to add more examples when time permits! Do you have any specific questions or a use case you would like help determining how to best use this IO? I might suggest that we move the conversation to the user mailing list or slack[2] so that others could more easily benefit from our conversation 🙂 |
Those are a good start, thanks. For any more discussion, I'll use the mailing list or Slack. |
This change set represents a rather large (and backward compatible) change to the way ElastichsearchIO.Write operates. Presently, the Write transform has 2 responsibilities:
DocToBulk
BulkIO
This PR aims to separate these 2 responsibilities into discrete PTransforms to allow for greater flexibility while also maintaining the convenience of the Write transform to perform both document conversion and IO serially. Examples of how the flexibility of separate transforms could be used:
Expanding on point (6), this PR also introduces a new (optional) way to batch entities for bulk requests: Stateful Processing. Presently, Bulk request size is limited by the lesser of Runner bundle size and
maxBatchSize
user setting. In my experience, bundle sizes are often very small, and can be a small as 1 or 2. When that’s the case, it means Bulk requests contain only 1 or 2 documents, and it’s effectively the same as not using the Bulk API at all.BulkIOStatefulFn
is made to be compatible withGroupIntoBatches
which will use entity count and (optionally) elapsed time to create batches much closer to themaxBatchSize
setting to improve throughput.Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username
).[BEAM-XXX] Fixes bug in ApproximateQuantiles
, where you replaceBEAM-XXX
with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.CHANGES.md
with noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
Post-Commit Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.