[FEATURE] Reimplement BulkProcessor #181

Open
markmccallion28 opened this issue Jul 20, 2022 · 10 comments
Labels
enhancement (New feature or request), performance (Make it fast!)

Comments

@markmccallion28

Are there any plans to add an equivalent of the BulkProcessor API, which was available in the High Level REST Client, so that Index and Delete requests can be batched?

@dblock
Member

dblock commented Jul 20, 2022

Would love it if someone (you?) could contribute this! I suppose that code can be just ported here? I don't see any obvious cons.

@dblock dblock added the enhancement New feature or request label Jul 20, 2022
@dblock dblock changed the title from "Is there a replacement for the BulkProcessor?" to "[FEATURE] Reimplement BulkProcessor" Jul 20, 2022
@ginkel

ginkel commented Dec 21, 2022

FYI: We are currently looking into porting the BulkProcessor from the OpenSearch code base and noticed one missing feature in the Java client API:

While the RHLC (REST High Level Client) allows estimating the size of a bulk action (now called BulkOperation), this feature seems to be missing from this client, as the JSON is no longer rendered when the payload is added to the bulk, but lazily when the request is eventually dispatched.

Can you think of an efficient way to perform such a size estimation or would you rather drop the bulkSize configuration option from the BulkProcessor for now?
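
One possible direction, sketched here under assumptions rather than taken from either client: if the caller already holds (or is willing to render) the JSON for each operation, the size can be approximated by counting the UTF-8 bytes of the newline-delimited lines that the Bulk API body consists of. `BulkSizeEstimator` and its methods are hypothetical names, and the sketch takes pre-rendered JSON strings as input, so it does not solve the cost of serializing each document twice.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of opensearch-java. */
final class BulkSizeEstimator {
    private BulkSizeEstimator() {}

    /** Bytes of one NDJSON line: the UTF-8 encoded JSON plus the trailing newline. */
    static long lineBytes(String json) {
        return json.getBytes(StandardCharsets.UTF_8).length + 1L;
    }

    /** An index operation contributes an action metadata line and a document source line. */
    static long indexOperationBytes(String actionJson, String sourceJson) {
        return lineBytes(actionJson) + lineBytes(sourceJson);
    }

    /** A delete operation contributes only the action metadata line. */
    static long deleteOperationBytes(String actionJson) {
        return lineBytes(actionJson);
    }

    public static void main(String[] args) {
        // Assumed example payloads; index name and document are made up.
        String action = "{\"index\":{\"_index\":\"logs\",\"_id\":\"1\"}}";
        String source = "{\"message\":\"hello\"}";
        System.out.println("estimated bytes: " + indexOperationBytes(action, source));
    }
}
```

Counting encoded bytes rather than `String.length()` matters once documents contain non-ASCII text, since such characters take more than one byte in UTF-8.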

@reta
Collaborator

reta commented Dec 21, 2022

> Can you think of an efficient way to perform such a size estimation or would you rather drop the bulkSize configuration option from the BulkProcessor for now?

I think that would significantly degrade the BulkProcessor's usefulness: for example, Apache Flink uses bulkSize as one of its flush triggers. I am pretty sure that many other projects rely on it as well.
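
To illustrate what a size-based flush trigger means in practice, here is a minimal sketch of a batcher that flushes when either an operation count or an estimated byte total is reached. `SizeBoundedBatcher`, its parameters, and the caller-supplied byte estimate are all hypothetical; this is not the RHLC BulkProcessor, Flink's sink, or any existing client class, and it omits time-based flushing, concurrency limits, and retries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical batcher for illustration; names are not from any client library. */
final class SizeBoundedBatcher<T> {
    private final int maxOperations;
    private final long maxBytes;
    private final Consumer<List<T>> flushHandler;
    private final List<T> pending = new ArrayList<>();
    private long pendingBytes = 0L;

    SizeBoundedBatcher(int maxOperations, long maxBytes, Consumer<List<T>> flushHandler) {
        this.maxOperations = maxOperations;
        this.maxBytes = maxBytes;
        this.flushHandler = flushHandler;
    }

    /** Buffers one operation and flushes once either threshold is reached. */
    synchronized void add(T operation, long estimatedBytes) {
        pending.add(operation);
        pendingBytes += estimatedBytes;
        if (pending.size() >= maxOperations || pendingBytes >= maxBytes) {
            flush();
        }
    }

    /** Hands the buffered operations to the handler; does nothing if the buffer is empty. */
    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        List<T> batch = new ArrayList<>(pending);
        pending.clear();
        pendingBytes = 0L;
        // Simplification: the handler runs while the lock is held.
        flushHandler.accept(batch);
    }

    public static void main(String[] args) {
        // Assumed limits: 1000 operations or 5 MiB, whichever comes first.
        SizeBoundedBatcher<String> batcher = new SizeBoundedBatcher<>(
            1000, 5L * 1024 * 1024,
            batch -> System.out.println("flushing " + batch.size() + " operations"));
        batcher.add("{\"delete\":{\"_index\":\"logs\",\"_id\":\"1\"}}", 40L);
        batcher.flush(); // flush the remainder, for example on shutdown
    }
}
```

Without a byte estimate, only the count trigger remains, so a batch of a few very large documents could grow well past what the server accepts in one request.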

@ginkel

ginkel commented Jan 4, 2023

Elastic has just merged BulkIngester into their elasticsearch-java client; it covers the old BulkProcessor's features except for retry handling. It is licensed under the Apache License, Version 2.0.

We did a quick (preliminary) port to the opensearch-java client, which went pretty smoothly (tests and performance tests are still pending).

Would such a port be something that you'd consider worthwhile and acceptable as a contribution?

@reta
Collaborator

reta commented Jan 4, 2023

@ginkel this is very tricky (taking into account the numerous precedents with Elastic as a company). Yes, the license seems to be ASFv2, and we may ask the contributor if they are open to submitting the BulkIngester pull request to OpenSearch as well, but I would be very cautious about cherry-picking anything from the Elastic organization. @dblock @nknize thoughts, guys?

@dblock
Member

dblock commented Jan 4, 2023

We will gladly accept code under the APLv2 license. The contributor needs to make sure that they have not looked at or copied any non-APLv2 code while reimplementing a feature in OpenSearch. If the client is indeed APLv2, it's all good.

@ginkel

ginkel commented Jan 4, 2023

All ported code carries the following license header:

/*
 * Licensed to Elasticsearch B.V. under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch B.V. licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied.  See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */

@dblock
Member

dblock commented Jan 4, 2023

@ginkel Yes. Keep those headers please and add the OpenSearch ones.

@tballison

Any updates on this? This is a blocker on https://issues.apache.org/jira/browse/NUTCH-2994. Let me know if I can help.

@wbeckler

wbeckler commented Jun 9, 2023 via email

@dblock dblock added the performance Make it fast! label Dec 5, 2023

No branches or pull requests

6 participants