[BEAM-12093] Overhaul ElasticsearchIO.Write #14347

egalpin · 2021-03-26T14:24:12Z

This change set represents a rather large (and backward compatible) change to the way ElasticsearchIO.Write operates. Presently, the Write transform has 2 responsibilities:

1. Convert input documents into Bulk API entities, serializing based on user settings (partial update, delete, upsert, etc) -> DocToBulk
2. Batch the converted Bulk API entities together and interface with the target ES cluster -> BulkIO

This PR aims to separate these 2 responsibilities into discrete PTransforms to allow for greater flexibility while also maintaining the convenience of the Write transform to perform both document conversion and IO serially. Examples of how the flexibility of separate transforms could be used:

1. Unit testing. It becomes trivial for pipeline developers to ensure that output Bulk API entities for a given set of inputs will produce an expected result, without the need for an available Elasticsearch cluster.
2. Flexible options for data backup. Serialized Bulk API entities can be forked and sent to both Elasticsearch and a data lake.
3. Mirroring data to multiple clusters. Presently, mirroring data to multiple clusters would require duplicate computation.
4. Better batching with input streams in one job. A job may produce multiple "shapes" of Bulk API entities based on multiple input types, and then "fan-in" all serialized Bulk entities into a single BulkIO transform to improve batching semantics.
5. Decoupled jobs. Corollary to (4) above. Job(s) could be made to produce Bulk entities and then publish them to a message bus. A distinct job could consume from that message bus and solely be responsible for IO with the target cluster(s).
6. Easier support for multiple BulkIO semantics.
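
To make the unit-testing point concrete, here is a minimal plain-Java sketch of what a document-to-Bulk-entity step might look like once it is separated from the IO step. It is not the ElasticsearchIO code: the class `DocToBulkSketch`, the `Mode` enum, and both method names are hypothetical, and the sketch assumes document IDs need no JSON escaping. Only the newline-delimited action/payload layout comes from the Elasticsearch Bulk API itself.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a document-to-Bulk-entity step; not the ElasticsearchIO API. */
public class DocToBulkSketch {

  /** Write modes loosely mirroring the user settings listed above (assumed names). */
  public enum Mode { INDEX, PARTIAL_UPDATE, UPSERT, DELETE }

  /**
   * Serializes one JSON document into a newline-delimited Bulk API entity:
   * an action line, followed by a payload line for every mode except DELETE.
   * Assumes the id contains no characters that need JSON escaping.
   */
  public static String toBulkEntity(String id, String jsonDoc, Mode mode) {
    switch (mode) {
      case INDEX:
        return "{\"index\":{\"_id\":\"" + id + "\"}}\n" + jsonDoc + "\n";
      case PARTIAL_UPDATE:
        return "{\"update\":{\"_id\":\"" + id + "\"}}\n{\"doc\":" + jsonDoc + "}\n";
      case UPSERT:
        return "{\"update\":{\"_id\":\"" + id + "\"}}\n"
            + "{\"doc\":" + jsonDoc + ",\"doc_as_upsert\":true}\n";
      case DELETE:
        return "{\"delete\":{\"_id\":\"" + id + "\"}}\n";
      default:
        throw new IllegalArgumentException("Unknown mode: " + mode);
    }
  }

  /** Stand-in for the IO step: here it only concatenates entities into one request body. */
  public static String toRequestBody(List<String> entities) {
    return String.join("", entities);
  }

  public static void main(String[] args) {
    List<String> entities = new ArrayList<>();
    entities.add(toBulkEntity("1", "{\"name\":\"Einstein\"}", Mode.INDEX));
    entities.add(toBulkEntity("2", "{\"name\":\"Curie\"}", Mode.UPSERT));
    entities.add(toBulkEntity("3", "{}", Mode.DELETE));
    // The same serialized entities could be sent to a cluster, a data lake, or both.
    System.out.print(toRequestBody(entities));
  }
}
```

Because the conversion is a pure function of its inputs, its output can be asserted on directly, with no Elasticsearch cluster involved, and the same serialized entities can be forked to several sinks without being recomputed.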

Expanding on point (6), this PR also introduces a new (optional) way to batch entities for bulk requests: Stateful Processing. Presently, Bulk request size is limited by the lesser of the Runner bundle size and the maxBatchSize user setting. In my experience, bundle sizes are often very small, and can be as small as 1 or 2. When that’s the case, it means Bulk requests contain only 1 or 2 documents, and it’s effectively the same as not using the Bulk API at all. BulkIOStatefulFn is made to be compatible with GroupIntoBatches, which will use entity count and (optionally) elapsed time to create batches much closer to the maxBatchSize setting to improve throughput.
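
The effect of bundle size on Bulk request size can be illustrated with a small plain-Java simulation. This is only a sketch under simplifying assumptions: it models entity counts alone, ignores keys, windows, and the elapsed-time trigger, and the class and method names (`BatchingSketch`, `perBundleRequestSizes`, `bufferedRequestSizes`) are hypothetical rather than taken from the PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical simulation of Bulk request sizes; not the ElasticsearchIO implementation. */
public class BatchingSketch {

  /** Per-bundle flushing: each bundle is cut into requests of at most maxBatchSize. */
  public static List<Integer> perBundleRequestSizes(List<Integer> bundleSizes, int maxBatchSize) {
    List<Integer> requests = new ArrayList<>();
    for (int bundle : bundleSizes) {
      int remaining = bundle;
      while (remaining > 0) {
        int size = Math.min(remaining, maxBatchSize);
        requests.add(size);
        remaining -= size;
      }
    }
    return requests;
  }

  /** Cross-bundle buffering: entities accumulate across bundles until maxBatchSize is reached. */
  public static List<Integer> bufferedRequestSizes(List<Integer> bundleSizes, int maxBatchSize) {
    List<Integer> requests = new ArrayList<>();
    int buffered = 0;
    for (int bundle : bundleSizes) {
      buffered += bundle;
      while (buffered >= maxBatchSize) {
        requests.add(maxBatchSize);
        buffered -= maxBatchSize;
      }
    }
    if (buffered > 0) {
      requests.add(buffered); // final flush of whatever is left over
    }
    return requests;
  }

  public static void main(String[] args) {
    List<Integer> bundles = Arrays.asList(2, 1, 2, 1, 2, 2); // ten entities in tiny bundles
    System.out.println(perBundleRequestSizes(bundles, 5)); // prints [2, 1, 2, 1, 2, 2]
    System.out.println(bufferedRequestSizes(bundles, 5));  // prints [5, 5]
  }
}
```

With the same ten entities, per-bundle flushing issues six requests of one or two documents each, while buffering across bundles issues two requests at the configured maximum.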

egalpin · 2021-03-26T14:28:32Z

As this is my first time contributing, I'm not sure exactly who to select as a reviewer. I see @echauchot in the git blame a lot for ElasticsearchIO.java, so I'll start there? 😂

echauchot · 2021-03-26T16:49:07Z

Sure, I'll be happy to review. Thanks for your contribution !

egalpin · 2021-03-26T17:06:15Z

I have the local dev env set up using start-build-env.sh from this repo, but I'm still working towards running elasticsearch-tests. Any pointers would be appreciated if anyone has time 🙂

echauchot · 2021-04-02T08:28:52Z

@egalpin thanks for your contribution. I'm sorry, I've been very short on time lately. In the meantime, can you:

make the build pass (the precommit is failing)
as it is a rather large contribution, please open a ticket with details and rename the PR.

egalpin · 2021-04-05T15:07:57Z

@echauchot I've made a jira ticket and linked it. I'm working on getting the build to pass but struggling a bit with trying to determine the cause of the new errors in the Java PreCommit build. All the warnings have to do with a single Kotlin example which seems far removed from the changes here. I'll keep poking away at it though. I've found the full output from the build and can see some compilation errors. Working on those.

egalpin · 2021-04-08T17:10:50Z

@echauchot FYI build is passing now 🙂

pabloem · 2021-04-08T17:53:09Z

@echauchot let me know if you can take a look at this, and if not I can help you find more reviewers : )

egalpin · 2021-04-08T18:05:15Z

@pabloem fyi I messaged echauchot on Apache beam slack previously and we had arranged to chat about tests together tomorrow. I wanted to let them know that I’d gotten past my blockers with respect to tests, but I definitely did not intend to apply pressure or name+shame or anything like that.

pabloem · 2021-04-08T18:12:19Z

oh I also didn't want to pressure anyone : ) I'm just checking. thanks for the update!

egalpin · 2021-04-08T18:15:05Z

Haha thanks! I appreciate it. I wanted to be sure I was respecting others' busy schedules ❤️

echauchot · 2021-04-12T10:10:29Z

@egalpin starting the first round of review, sorry for the delay.

echauchot · 2021-04-12T10:16:25Z

@timrobertson100 you might be interested. @ludovic-boutros you were thinking about changing the overall architecture of the IO and introducing a testContainer-based test framework here: https://lists.apache.org/thread.html/reb68f37c435995a64ded19100e09dfc31c5cf6227feae16494226100%40%3Cdev.beam.apache.org%3E
Any comments?

echauchot · 2021-04-12T11:11:56Z

@echauchot let me know if you can take a look at this, and if not I can help you find more reviewers : )

@pabloem @egalpin I'll do an overall review but I'd need another reviewer to do the in-depth review because:

I'm busy on many other things lately
I coded support for ES 2 and ES 5, reviewed ES 6, and coded ES 7. I'd like to pass the torch :)

egalpin · 2021-04-12T14:07:11Z

Thanks @echauchot for sharing that thread. I really like a lot of what @ludovic-boutros proposed and have many shared goals; in particular, implementing the pattern where successful and failed writes can be returned via MultiOutputReceiver. I believe the multi-output pattern could fit within the current IO with some additional effort (since order of request and response entities is guaranteed[1]), and I had planned to do that as a follow-up so as to not introduce even more changes in one PR.

At the same time, I also see the argument that in many ways using the low-level client results in "reinventing the wheel" for a number of features (with good justification, IMO, of enabling cross-version support).

I'd be very willing to contribute to brainstorming (and implementation once we reach that point) if others are open to that.

[1] https://discuss.elastic.co/t/order-of-actions-in-bulk-api-via-http-between-request-and-response-is-guaranteed/122499/2
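
As a rough illustration of why the ordering guarantee matters for a multi-output pattern, the following plain-Java sketch pairs each request entity with the response item at the same position and splits the entities into succeeded and failed groups. It is an assumption-laden sketch, not code from this PR: `BulkResultSplitSketch`, `Split`, and `split` are hypothetical names, response items are reduced to their HTTP-style status codes, and any 2xx status is treated as success.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of splitting Bulk results by position; not part of ElasticsearchIO. */
public class BulkResultSplitSketch {

  /** Holds the two outputs: entities that were written and entities that failed. */
  public static final class Split {
    public final List<String> succeeded = new ArrayList<>();
    public final List<String> failed = new ArrayList<>();
  }

  /**
   * Pairs the i-th request entity with the i-th response status. This relies on the
   * Bulk API returning response items in the same order as the request actions.
   */
  public static Split split(List<String> requestEntities, List<Integer> responseStatuses) {
    if (requestEntities.size() != responseStatuses.size()) {
      throw new IllegalArgumentException("Request and response item counts differ");
    }
    Split result = new Split();
    for (int i = 0; i < requestEntities.size(); i++) {
      int status = responseStatuses.get(i);
      // Assumption for this sketch: any 2xx status counts as a successful write.
      if (status >= 200 && status < 300) {
        result.succeeded.add(requestEntities.get(i));
      } else {
        result.failed.add(requestEntities.get(i));
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> entities = Arrays.asList("doc-a", "doc-b", "doc-c");
    List<Integer> statuses = Arrays.asList(201, 409, 200);
    Split result = split(entities, statuses);
    System.out.println(result.succeeded); // prints [doc-a, doc-c]
    System.out.println(result.failed);    // prints [doc-b]
  }
}
```

In a pipeline, the two lists would correspond to two tagged outputs, so that failed entities could be retried or routed elsewhere without failing the whole batch.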

egalpin · 2021-04-13T13:54:14Z

Run Java PreCommit

ludovic-boutros · 2021-04-13T16:13:49Z

@egalpin @echauchot I did a quick review, but sadly I don't have enough time these days to go deeper on the subject.
We have been using the module I shared in production for months without any issue.
Development of a new multi-module component, with one module per Elasticsearch version and a top-level abstraction, is currently on standby.
It's not finished yet, and with the Covid situation I had to refocus on other projects. That means I won't have time to take it further any time soon. I'm available to discuss this with you (on Slack?).
I think a political decision should be made before taking the direction I proposed, but this is not the place for that discussion.

echauchot · 2021-04-19T08:51:43Z

This change set represents a rather large (and backward compatible) change to the way ElasticsearchIO.Write operates. Presently, the Write transform has 2 responsibilities:

1. Convert input documents into Bulk API entities, serializing based on user settings (partial update, delete, upsert, etc) -> DocToBulk

2. Batch the converted Bulk API entities together and interface with the target ES cluster -> BulkIO

This PR aims to separate these 2 responsibilities into discrete PTransforms to allow for greater flexibility while also maintaining the convenience of the Write transform to perform both document conversion and IO serially. Examples of how the flexibility of separate transforms could be used:

1. Unit testing. It becomes trivial for pipeline developers to ensure that output Bulk API entities for a given set of inputs will produce an expected result, without the need for an available Elasticsearch cluster.

2. Flexible options for data backup. Serialized Bulk API entities can be forked and sent to both Elasticsearch and a data lake.

3. Mirroring data to multiple clusters. Presently, mirroring data to multiple clusters would require duplicate computation.

4. Better batching with input streams in one job. A job may produce multiple "shapes" of Bulk API entities based on multiple input types, and then "fan-in" all serialized Bulk entities into a single BulkIO transform to improve batching semantics.

5. Decoupled jobs. Corollary to (4) above. Job(s) could be made to produce Bulk entities and then publish them to a message bus. A distinct job could consume from that message bus and solely be responsible for IO with the target cluster(s).

6. Easier support for multiple BulkIO semantics.

=> Reading the overall design goals, this looks very promising and is a good analysis of the missing properties of the current architecture! Thanks!

Expanding on point (6), this PR also introduces a new (optional) way to batch entities for bulk requests: Stateful Processing. Presently, Bulk request size is limited by the lesser of the Runner bundle size and the maxBatchSize user setting. In my experience, bundle sizes are often very small, and can be as small as 1 or 2. When that’s the case, it means Bulk requests contain only 1 or 2 documents, and it’s effectively the same as not using the Bulk API at all. BulkIOStatefulFn is made to be compatible with GroupIntoBatches, which will use entity count and (optionally) elapsed time to create batches much closer to the maxBatchSize setting to improve throughput.

=> It's true that very small batches can exist. For example, because Flink is a streaming-oriented platform, the Flink runner tends to create very small Beam bundles. So when a bundle finishes processing (finishBundle is called), the ES bulk request is sent, leading to small ES bulk requests. Leveraging GroupIntoBatches, which creates groups that span bundles while still respecting Beam semantics (windowing, bundle retries, etc.), is a very good idea.

echauchot

Very good work Evan ! Thanks.

I like the analysis of the missing features and all the improvements you made!
I did a pretty in-depth review after all, but I have no more time. @pabloem as you offered help, could you please do the other rounds of review and merge when OK?

Besides, Evan, as you know ES very well and seem to be interested in contributing, would you be interested in adding yourself to the ES Owners file and the Jira ES label?

...ava/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java

...sts-common/src/test/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIOTestCommon.java

...icsearch-tests-2/src/test/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIOTest.java

...sts-common/src/test/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIOTestCommon.java

egalpin · 2021-04-20T19:03:18Z

@echauchot Thanks for the review, I'll work my way through your comments and suggestions.

Besides, Evan, as you know ES very well and seem to be interested in contributing, would you be interested in adding yourself to the ES Owners file and the Jira ES label?

I'd be very happy to 👍 I've added myself to the ES owners file now, happy to lend a hand reviewing! Thanks 🙂

With respect to Jira, could you please add appropriate permissions for me to assign myself to the ES label, or assign me to the label yourself if that is the preferred workflow? I have an account on issues.apache.org/jira, but I believe it only has permission to create tickets.

egalpin · 2021-05-06T10:49:52Z

@echauchot added coverage for the methods you mentioned. Anything else outstanding? 🙂

...sts-common/src/test/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIOTestCommon.java

echauchot

Only minor changes left. I think once you address them, I can hit the merge button, provided the tests pass.

...sts-common/src/test/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIOTestCommon.java

...ava/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java

egalpin · 2021-05-12T00:16:10Z

Run Java PreCommit

egalpin · 2021-05-12T02:11:33Z

Run Java PreCommit

egalpin · 2021-05-12T13:26:24Z

Run Java PreCommit

egalpin · 2021-05-19T13:12:13Z

Ready to roll! 🙂

echauchot · 2021-05-21T14:54:57Z

@egalpin seems good! Thanks! I just triggered the build. It's like Fort Knox now on resource consumption 😄

echauchot

@egalpin ready to merge. Only the 2 javadoc fixes (assert numDocs/num scientists) and the unneeded GBK in the parallel test (plus the consequent simplification of the assertion function) remain, and then we're good to merge.

...sts-common/src/test/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIOTestCommon.java

egalpin · 2021-05-25T13:11:47Z

@echauchot all set for a final look-over 🙂

egalpin · 2021-05-25T13:49:27Z

Run Java PreCommit

egalpin · 2021-05-25T14:13:03Z

Run Java PreCommit

egalpin · 2021-05-25T14:51:39Z

Run Java PreCommit

echauchot

LGTM,
Thanks for your great and hard work Evan! And also thanks for your patience.
As stated in the guidelines, I'll squash the review commits into the first commit and merge.

egalpin · 2021-05-27T13:34:04Z

Thanks Etienne for all of your reviewing efforts, and your warm welcome to Beam! 😄

mattwelke · 2022-01-22T04:37:38Z

Came across this while lurking. Where would one find docs on how to use ElasticsearchIO, including more advanced features like this? I saw some examples on the Beam site, but nothing specific to each source/sink.

egalpin · 2022-01-22T13:27:56Z

@mattwelke The javadoc[1] that is generated has some description and examples, though admittedly not as fully descriptive as it could be. I would welcome any additional examples or documentation added by others, and will keep in mind to add more examples when time permits!

Do you have any specific questions or a use case you would like help determining how to best use this IO? I might suggest that we move the conversation to the user mailing list or slack[2] so that others could more easily benefit from our conversation 🙂

[1] https://beam.apache.org/releases/javadoc/2.35.0/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.html

[2] https://beam.apache.org/community/join-beam/

timrobertson100 · 2022-01-22T13:38:55Z

The tests might also be worth looking at @mattwelke

E.g.
https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch-tests/elasticsearch-tests-common/src/test/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIOTestCommon.java#L770

mattwelke · 2022-01-24T03:04:01Z

Those are a good start, thanks. For any more discussion, I'll use the mailing list or Slack.

egalpin force-pushed the elasticsearchio-support-stateful branch from e4f2f52 to ec6b1c9 Compare March 30, 2021 14:14

pabloem requested a review from echauchot April 1, 2021 21:13

egalpin changed the title ~~Overhaul ElasticsearchIO.Write~~ [BEAM-12093] Overhaul ElasticsearchIO.Write Apr 5, 2021

egalpin force-pushed the elasticsearchio-support-stateful branch from ed73351 to ff97fd9 Compare April 5, 2021 13:48

egalpin force-pushed the elasticsearchio-support-stateful branch 7 times, most recently from 9a6cd5d to 15b128b Compare April 8, 2021 15:13

echauchot requested changes Apr 20, 2021

egalpin requested a review from echauchot May 5, 2021 04:57

echauchot reviewed May 10, 2021

egalpin added 4 commits May 10, 2021 14:11

Fixup warning in docstring for withBackendVersion

54f2fbc

Fixup test docstring and add test function for dupe code

7c1ed24

Fixes typo in withMaxParallelRequestsPerWindow

6d0997e

Updates elasticsearchIO main javadoc to only mention required PTransform args

b82f3ff

echauchot requested changes May 11, 2021

egalpin added 2 commits May 11, 2021 16:02

Makes batch private

ff6b177

Test fixes

9c1c75c

egalpin requested a review from echauchot May 17, 2021 12:19

echauchot requested changes May 24, 2021

...sts-common/src/test/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIOTestCommon.java

Removes unnecessary GBK in ES IO parallel window test

a559b03

echauchot approved these changes May 27, 2021

echauchot merged commit 4f1f1c1 into apache:master May 27, 2021

egalpin mentioned this pull request Jun 7, 2022

Change batch handling in ElasticsearchIO to avoid necessity for GroupIntoBatches #19444

Closed
