Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize parameter parsing in text chunking processor #733

Merged
merged 10 commits into from
May 21, 2024
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
### Features
### Enhancements
- Pass empty doc collector instead of top docs collector to improve hybrid query latencies by 20% ([#731](https://github.com/opensearch-project/neural-search/pull/731))
- Optimize parameter parsing in text chunking processor ([#733](https://github.com/opensearch-project/neural-search/pull/733))
### Bug Fixes
### Infrastructure
### Documentation
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -39,14 +39,36 @@ public static String parseStringParameter(final Map<String, Object> parameters,
return fieldValue.toString();
}

/**
* Parse integer type parameter without default value
* Throw IllegalArgumentException if parameter is missing or is not an integer.
*/
public static int parseIntegerParameter(final Map<String, Object> parameters, final String fieldName) {
if (!parameters.containsKey(fieldName)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (!parameters.containsKey(fieldName)) {
return parseIntegerParameter(parameters, fieldName, null);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this design doesn't follow conventions in other place in Java API.
If you calling parse(argument) than it should execute parsing logic
If you calling parse(argument, defaultValue) than it should try to execute parsing logic, and return default value in case that parsing logic return empty value

In your code everywhere where this new method called with null you can call your new method without default value.
And in that new method implementation you can just call this method with defaultValue = null, see my previous comment with suggestion

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I don't think the function is necessary. I have removed it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you calling parse(argument, defaultValue) than it should try to execute parsing logic, and return default value in case that parsing logic return empty value

Not very sure what you mean here. We should throw illegal argument exception if necessary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In your code everywhere where this new method called with null you can call your new method without default value.

Updated by Zan's comment. You can review now.

throw new IllegalArgumentException(String.format(Locale.ROOT, "Parameter [%s] is missing", fieldName));
}
String fieldValueString = parameters.get(fieldName).toString();
try {
return NumberUtils.createInteger(fieldValueString);
} catch (NumberFormatException e) {
throw new IllegalArgumentException(
String.format(Locale.ROOT, "Parameter [%s] must be of %s type", fieldName, Integer.class.getName())
);
}
}

/**
* Parse integer type parameter.
* Throw IllegalArgumentException if parameter is not an integer.
* Throw IllegalArgumentException if both parameter and default value is missing or parameter is not an integer.
*/
public static int parseIntegerParameter(final Map<String, Object> parameters, final String fieldName, final int defaultValue) {
public static int parseIntegerParameter(final Map<String, Object> parameters, final String fieldName, final Integer defaultValue) {
if (!parameters.containsKey(fieldName)) {
// all integer parameters are optional
return defaultValue;
// return the default value when parameter is not existing
if (defaultValue == null) {
Copy link
Collaborator

@zane-neo zane-neo May 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest splitting this method to three new methods:

  1. parseInteger: Simply parse integer with catching number format exception.
  2. parseIntegerWithDefault: if field not shown, return default value, else return parseInteger.
  3. parseIntegerWithException: if field not shown, throw IllegalStateException, else return parseInteger.

This won't add repeated code also can make sure the method 2 have concise semantic by making the default value type to int, currently default value accepts null which is not in a correct semantic. For runtime parameters, there isn't default value, so throwing exception is the correct semantic for them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since all integer parameters are optional. I will not keep the parseIntegerWithException method.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, runtime parameters are always available while non-runtime parameters have default values.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

throw new IllegalArgumentException(String.format(Locale.ROOT, "Parameter [%s] is missing", fieldName));
} else {
return defaultValue;
}
}
String fieldValueString = parameters.get(fieldName).toString();
try {
Expand All @@ -60,9 +82,13 @@ public static int parseIntegerParameter(final Map<String, Object> parameters, fi

/**
* Parse integer type parameter with positive value.
* Throw IllegalArgumentException if parameter is not a positive integer.
* Throw IllegalArgumentException if both parameter and default value is missing or parameter is not a positive integer.
*/
public static int parsePositiveIntegerParameter(final Map<String, Object> parameters, final String fieldName, final int defaultValue) {
public static int parsePositiveIntegerParameter(
final Map<String, Object> parameters,
final String fieldName,
final Integer defaultValue
) {
int fieldValueInt = parseIntegerParameter(parameters, fieldName, defaultValue);
if (fieldValueInt <= 0) {
throw new IllegalArgumentException(String.format(Locale.ROOT, "Parameter [%s] must be positive.", fieldName));
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,6 @@ public final class DelimiterChunker implements Chunker {
public static final String DEFAULT_DELIMITER = "\n\n";

private String delimiter;
private int maxChunkLimit;

public DelimiterChunker(final Map<String, Object> parameters) {
parseParameters(parameters);
Expand All @@ -40,7 +39,6 @@ public DelimiterChunker(final Map<String, Object> parameters) {
@Override
public void parseParameters(Map<String, Object> parameters) {
this.delimiter = parseStringParameter(parameters, DELIMITER_FIELD, DEFAULT_DELIMITER);
this.maxChunkLimit = parseIntegerParameter(parameters, MAX_CHUNK_LIMIT_FIELD, DEFAULT_MAX_CHUNK_LIMIT);
}

/**
Expand All @@ -53,8 +51,8 @@ public void parseParameters(Map<String, Object> parameters) {
*/
@Override
public List<String> chunk(final String content, final Map<String, Object> runtimeParameters) {
int runtimeMaxChunkLimit = parseIntegerParameter(runtimeParameters, MAX_CHUNK_LIMIT_FIELD, maxChunkLimit);
int chunkStringCount = parseIntegerParameter(runtimeParameters, CHUNK_STRING_COUNT_FIELD, 1);
int runtimeMaxChunkLimit = parseIntegerParameter(runtimeParameters, MAX_CHUNK_LIMIT_FIELD, null);
int chunkStringCount = parseIntegerParameter(runtimeParameters, CHUNK_STRING_COUNT_FIELD, null);

List<String> chunkResult = new ArrayList<>();
int start = 0, end;
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -33,10 +33,9 @@ public final class FixedTokenLengthChunker implements Chunker {
public static final String MAX_TOKEN_COUNT_FIELD = "max_token_count";
public static final String TOKENIZER_FIELD = "tokenizer";

// default values for each parameter
// default values for each non-runtime parameter
private static final int DEFAULT_TOKEN_LIMIT = 384;
private static final double DEFAULT_OVERLAP_RATE = 0.0;
private static final int DEFAULT_MAX_TOKEN_COUNT = 10000;
private static final String DEFAULT_TOKENIZER = "standard";

// parameter restrictions
Expand All @@ -54,7 +53,6 @@ public final class FixedTokenLengthChunker implements Chunker {

// parameter value
private int tokenLimit;
private int maxChunkLimit;
private String tokenizer;
private double overlapRate;
private final AnalysisRegistry analysisRegistry;
Expand Down Expand Up @@ -84,7 +82,6 @@ public void parseParameters(Map<String, Object> parameters) {
this.tokenLimit = parsePositiveIntegerParameter(parameters, TOKEN_LIMIT_FIELD, DEFAULT_TOKEN_LIMIT);
this.overlapRate = parseDoubleParameter(parameters, OVERLAP_RATE_FIELD, DEFAULT_OVERLAP_RATE);
this.tokenizer = parseStringParameter(parameters, TOKENIZER_FIELD, DEFAULT_TOKENIZER);
this.maxChunkLimit = parseIntegerParameter(parameters, MAX_CHUNK_LIMIT_FIELD, DEFAULT_MAX_CHUNK_LIMIT);
if (overlapRate < OVERLAP_RATE_LOWER_BOUND || overlapRate > OVERLAP_RATE_UPPER_BOUND) {
throw new IllegalArgumentException(
String.format(
Expand Down Expand Up @@ -121,9 +118,9 @@ public void parseParameters(Map<String, Object> parameters) {
*/
@Override
public List<String> chunk(final String content, final Map<String, Object> runtimeParameters) {
int maxTokenCount = parsePositiveIntegerParameter(runtimeParameters, MAX_TOKEN_COUNT_FIELD, DEFAULT_MAX_TOKEN_COUNT);
int runtimeMaxChunkLimit = parseIntegerParameter(runtimeParameters, MAX_CHUNK_LIMIT_FIELD, this.maxChunkLimit);
int chunkStringCount = parseIntegerParameter(runtimeParameters, CHUNK_STRING_COUNT_FIELD, 1);
int maxTokenCount = parsePositiveIntegerParameter(runtimeParameters, MAX_TOKEN_COUNT_FIELD, null);
int runtimeMaxChunkLimit = parseIntegerParameter(runtimeParameters, MAX_CHUNK_LIMIT_FIELD, null);
int chunkStringCount = parseIntegerParameter(runtimeParameters, CHUNK_STRING_COUNT_FIELD, null);

List<AnalyzeToken> tokens = tokenize(content, tokenizer, maxTokenCount);
List<String> chunkResult = new ArrayList<>();
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,8 @@

public class DelimiterChunkerTests extends OpenSearchTestCase {

private final Map<String, Object> runtimeParameters = Map.of(MAX_CHUNK_LIMIT_FIELD, 100, CHUNK_STRING_COUNT_FIELD, 1);

public void testCreate_withDelimiterFieldInvalidType_thenFail() {
Exception exception = assertThrows(
IllegalArgumentException.class,
Expand All @@ -37,80 +39,67 @@ public void testCreate_withDelimiterFieldEmptyString_thenFail() {
public void testChunk_withNewlineDelimiter_thenSucceed() {
DelimiterChunker chunker = new DelimiterChunker(Map.of(DELIMITER_FIELD, "\n"));
String content = "a\nb\nc\nd";
List<String> chunkResult = chunker.chunk(content, Map.of());
List<String> chunkResult = chunker.chunk(content, runtimeParameters);
assertEquals(List.of("a\n", "b\n", "c\n", "d"), chunkResult);
}

public void testChunk_withDefaultDelimiter_thenSucceed() {
// default delimiter is \n\n
DelimiterChunker chunker = new DelimiterChunker(Map.of());
String content = "a.b\n\nc.d";
List<String> chunkResult = chunker.chunk(content, Map.of());
List<String> chunkResult = chunker.chunk(content, runtimeParameters);
assertEquals(List.of("a.b\n\n", "c.d"), chunkResult);
}

public void testChunk_withOnlyDelimiterContent_thenSucceed() {
DelimiterChunker chunker = new DelimiterChunker(Map.of(DELIMITER_FIELD, "\n"));
String content = "\n";
List<String> chunkResult = chunker.chunk(content, Map.of());
List<String> chunkResult = chunker.chunk(content, runtimeParameters);
assertEquals(List.of("\n"), chunkResult);
}

public void testChunk_WithAllDelimiterContent_thenSucceed() {
DelimiterChunker chunker = new DelimiterChunker(Map.of(DELIMITER_FIELD, "\n"));
String content = "\n\n\n";
List<String> chunkResult = chunker.chunk(content, Map.of());
List<String> chunkResult = chunker.chunk(content, runtimeParameters);
assertEquals(List.of("\n", "\n", "\n"), chunkResult);
}

public void testChunk_WithPeriodDelimiters_thenSucceed() {
DelimiterChunker chunker = new DelimiterChunker(Map.of(DELIMITER_FIELD, "."));
String content = "a.b.cc.d.";
List<String> chunkResult = chunker.chunk(content, Map.of());
List<String> chunkResult = chunker.chunk(content, runtimeParameters);
assertEquals(List.of("a.", "b.", "cc.", "d."), chunkResult);
}

public void testChunk_withDoubleNewlineDelimiter_thenSucceed() {
DelimiterChunker chunker = new DelimiterChunker(Map.of(DELIMITER_FIELD, "\n\n"));
String content = "\n\na\n\n\n";
List<String> chunkResult = chunker.chunk(content, Map.of());
List<String> chunkResult = chunker.chunk(content, runtimeParameters);
assertEquals(List.of("\n\n", "a\n\n", "\n"), chunkResult);
}

public void testChunk_whenExceedMaxChunkLimit_thenLastPassageGetConcatenated() {
int maxChunkLimit = 2;
DelimiterChunker chunker = new DelimiterChunker(Map.of(DELIMITER_FIELD, "\n\n", MAX_CHUNK_LIMIT_FIELD, maxChunkLimit));
String content = "\n\na\n\n\n";
List<String> passages = chunker.chunk(content, Map.of());
List<String> expectedPassages = new ArrayList<>();
expectedPassages.add("\n\n");
expectedPassages.add("a\n\n\n");
assertEquals(expectedPassages, passages);
}

public void testChunk_whenWithinMaxChunkLimit_thenSucceed() {
int maxChunkLimit = 3;
DelimiterChunker chunker = new DelimiterChunker(Map.of(DELIMITER_FIELD, "\n\n", MAX_CHUNK_LIMIT_FIELD, maxChunkLimit));
DelimiterChunker chunker = new DelimiterChunker(Map.of(DELIMITER_FIELD, "\n\n"));
String content = "\n\na\n\n\n";
List<String> chunkResult = chunker.chunk(content, Map.of());
int runtimeMaxChunkLimit = 3;
List<String> chunkResult = chunker.chunk(content, Map.of(CHUNK_STRING_COUNT_FIELD, 1, MAX_CHUNK_LIMIT_FIELD, runtimeMaxChunkLimit));
assertEquals(List.of("\n\n", "a\n\n", "\n"), chunkResult);
}

public void testChunk_whenExceedRuntimeMaxChunkLimit_thenLastPassageGetConcatenated() {
int maxChunkLimit = 3;
DelimiterChunker chunker = new DelimiterChunker(Map.of(DELIMITER_FIELD, "\n\n", MAX_CHUNK_LIMIT_FIELD, maxChunkLimit));
public void testChunk_whenExceedMaxChunkLimit_thenLastPassageGetConcatenated() {
DelimiterChunker chunker = new DelimiterChunker(Map.of(DELIMITER_FIELD, "\n\n"));
String content = "\n\na\n\n\n";
int runtimeMaxChunkLimit = 2;
List<String> passages = chunker.chunk(content, Map.of(MAX_CHUNK_LIMIT_FIELD, runtimeMaxChunkLimit));
List<String> passages = chunker.chunk(content, Map.of(CHUNK_STRING_COUNT_FIELD, 1, MAX_CHUNK_LIMIT_FIELD, runtimeMaxChunkLimit));
List<String> expectedPassages = new ArrayList<>();
expectedPassages.add("\n\n");
expectedPassages.add("a\n\n\n");
assertEquals(expectedPassages, passages);
}

public void testChunk_whenExceedRuntimeMaxChunkLimit_withTwoStringsTobeChunked_thenLastPassageGetConcatenated() {
int maxChunkLimit = 3;
DelimiterChunker chunker = new DelimiterChunker(Map.of(DELIMITER_FIELD, "\n\n", MAX_CHUNK_LIMIT_FIELD, maxChunkLimit));
public void testChunk_whenExceedMaxChunkLimit_withTwoStringsTobeChunked_thenLastPassageGetConcatenated() {
DelimiterChunker chunker = new DelimiterChunker(Map.of(DELIMITER_FIELD, "\n\n"));
String content = "\n\na\n\n\n";
int runtimeMaxChunkLimit = 2, chunkStringCount = 2;
List<String> passages = chunker.chunk(
Expand Down
Loading
Loading