Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize parameter parsing in text chunking processor #733

Merged
merged 10 commits into from
May 21, 2024

Conversation

yuye-aws
Copy link
Member

@yuye-aws yuye-aws commented May 6, 2024

Description

There PR optimizes the parameter parsing step in chunking algorithms. For both algorithms, all runtime parameters for text chunking processor is always available. Therefore, default value is not needed.

Issues Resolved

Optimize parameter parsing in text chunking processor.

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@yuye-aws
Copy link
Member Author

yuye-aws commented May 6, 2024

@zane-neo This PR resolves your comment in #717:

In fact, with field_map and document, we can always calculate a chunk_string_count, we don't need a default value here. Suggesting optimize this code in next PR

Please help review thanks.

@yuye-aws
Copy link
Member Author

yuye-aws commented May 6, 2024

We need to rerun the CI checks to check if there are any flakey tests.

@yuye-aws
Copy link
Member Author

yuye-aws commented May 6, 2024

The rolling upgrade BWC test failure is the same in #734

@yuye-aws
Copy link
Member Author

Hi maintainer, please attach backport-2.x label to this PR.

@yuye-aws
Copy link
Member Author

The rolling upgrade BWC test is failing due to cluster health status issue. Can we rerun these tests?

* Throw IllegalArgumentException if parameter is missing or is not an integer.
*/
public static int parseIntegerParameter(final Map<String, Object> parameters, final String fieldName) {
if (!parameters.containsKey(fieldName)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (!parameters.containsKey(fieldName)) {
return parseIntegerParameter(parameters, fieldName, null);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this design doesn't follow conventions in other place in Java API.
If you calling parse(argument) than it should execute parsing logic
If you calling parse(argument, defaultValue) than it should try to execute parsing logic, and return default value in case that parsing logic return empty value

In your code everywhere where this new method called with null you can call your new method without default value.
And in that new method implementation you can just call this method with defaultValue = null, see my previous comment with suggestion

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I don't think the function is necessary. I have removed it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you calling parse(argument, defaultValue) than it should try to execute parsing logic, and return default value in case that parsing logic return empty value

Not very sure what you mean here. We should throw illegal argument exception if necessary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In your code everywhere where this new method called with null you can call your new method without default value.

Updated by Zan's comment. You can review now.

Signed-off-by: yuye-aws <[email protected]>
// all integer parameters are optional
return defaultValue;
// return the default value when parameter is not existing
if (defaultValue == null) {
Copy link
Collaborator

@zane-neo zane-neo May 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest splitting this method to three new methods:

  1. parseInteger: Simply parse integer with catching number format exception.
  2. parseIntegerWithDefault: if field not shown, return default value, else return parseInteger.
  3. parseIntegerWithException: if field not shown, throw IllegalStateException, else return parseInteger.

This won't add repeated code also can make sure the method 2 have concise semantic by making the default value type to int, currently default value accepts null which is not in a correct semantic. For runtime parameters, there isn't default value, so throwing exception is the correct semantic for them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since all integer parameters are optional. I will not keep the parseIntegerWithException method.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, runtime parameters are always available while non-runtime parameters have default values.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@martin-gaievski martin-gaievski added the backport 2.x Label will add auto workflow to backport PR to 2.x branch label May 17, 2024
this.maxChunkLimit = parseIntegerParameter(chunkerParameters, MAX_CHUNK_LIMIT_FIELD, DEFAULT_MAX_CHUNK_LIMIT);
if (maxChunkLimit < 0 && maxChunkLimit != DISABLED_MAX_CHUNK_LIMIT) {
this.maxChunkLimit = parseIntegerWithDefault(chunkerParameters, MAX_CHUNK_LIMIT_FIELD, DEFAULT_MAX_CHUNK_LIMIT);
if (maxChunkLimit <= 0 && maxChunkLimit != DISABLED_MAX_CHUNK_LIMIT) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am making this change to keep it consistent with the error message:
Parameter [%s] must be positive or %s to disable this parameter"

Copy link
Member Author

@yuye-aws yuye-aws May 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think maxChunkLimit can be 0?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be, it's fine to add = 0 here.

Copy link
Member

@martin-gaievski martin-gaievski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall code looks good to me. One comment - can you add a unit test for ChunkerParameterParser class? As we're adding more methods to it having specialized test class will give more confidence

@yuye-aws
Copy link
Member Author

Overall code looks good to me. One comment - can you add a unit test for ChunkerParameterParser class? As we're adding more methods to it having specialized test class will give more confidence

Sure. I will add unit tests today.

yuye-aws added 3 commits May 20, 2024 11:04
Signed-off-by: yuye-aws <[email protected]>
Signed-off-by: yuye-aws <[email protected]>
@yuye-aws
Copy link
Member Author

HI @zane-neo and @martin-gaievski ! I have implemented the unit tests for chunker parameter parser. Can you help review again?

@@ -24,7 +24,6 @@
import java.util.LinkedList;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ExecutorService;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is due to spotless apply

Copy link
Member

@martin-gaievski martin-gaievski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me, thank you

@yuye-aws yuye-aws requested a review from martin-gaievski May 21, 2024 02:11
@yuye-aws
Copy link
Member Author

@martin-gaievski Can you merge the PR?

@yuye-aws
Copy link
Member Author

The last approver is expected to merge the PR.

@zane-neo zane-neo merged commit 038b1ec into opensearch-project:main May 21, 2024
67 of 73 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request May 21, 2024
* Optimize parameter parsing in text chunking processor

Signed-off-by: yuye-aws <[email protected]>

* add change log

Signed-off-by: yuye-aws <[email protected]>

* fix unit tests in delimiter chunker

Signed-off-by: yuye-aws <[email protected]>

* fix unit tests in fixed token length chunker

Signed-off-by: yuye-aws <[email protected]>

* remove redundant

Signed-off-by: yuye-aws <[email protected]>

* refactor chunker parameter parser

Signed-off-by: yuye-aws <[email protected]>

* unit tests for chunker parameter parser

Signed-off-by: yuye-aws <[email protected]>

* fix comment

Signed-off-by: yuye-aws <[email protected]>

* spotless apply

Signed-off-by: yuye-aws <[email protected]>

---------

Signed-off-by: yuye-aws <[email protected]>
(cherry picked from commit 038b1ec)
martin-gaievski pushed a commit that referenced this pull request May 21, 2024
Signed-off-by: yuye-aws <[email protected]>
(cherry picked from commit 038b1ec)

Co-authored-by: yuye-aws <[email protected]>
VijayanB pushed a commit that referenced this pull request May 29, 2024
* Optimize parameter parsing in text chunking processor

Signed-off-by: yuye-aws <[email protected]>

* add change log

Signed-off-by: yuye-aws <[email protected]>

* fix unit tests in delimiter chunker

Signed-off-by: yuye-aws <[email protected]>

* fix unit tests in fixed token length chunker

Signed-off-by: yuye-aws <[email protected]>

* remove redundant

Signed-off-by: yuye-aws <[email protected]>

* refactor chunker parameter parser

Signed-off-by: yuye-aws <[email protected]>

* unit tests for chunker parameter parser

Signed-off-by: yuye-aws <[email protected]>

* fix comment

Signed-off-by: yuye-aws <[email protected]>

* spotless apply

Signed-off-by: yuye-aws <[email protected]>

---------

Signed-off-by: yuye-aws <[email protected]>
@yuye-aws yuye-aws deleted the Fix/ChunkingParameterParse branch September 10, 2024 09:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Label will add auto workflow to backport PR to 2.x branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants