Add text embedding processor to neural search #18

zane-neo · 2022-10-12T02:34:36Z

Signed-off-by: Zan Niu [email protected]

Description

[Describe what this change achieves]

Issues Resolved

[List any issues this PR will resolve]

Check List

- New functionality includes testing.
  - All tests pass
- New functionality has been documented.
  - New functionality has javadoc added
- Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

navneet1v · 2022-10-12T18:28:38Z

src/main/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessor.java

+        } else {
+            this.fieldMap = fieldMap;
+        }
+        this.mlCommonsClientAccessor = new MLCommonsClientAccessor(new MachineLearningNodeClient(client));


You don't need to create this; you can use the @Inject annotation to inject the MLCommonsClientAccessor. We already create it via the createComponents function in the Plugin class.

Creating it with new is indeed not a best practice, but processors are created by a factory rather than by injection. The NeuralSearch plugin also needs to implement IngestPlugin; see the code below in NeuralSearch:

@Override
public Map<String, Processor.Factory> getProcessors(Processor.Parameters parameters) {
    return Collections.singletonMap(TextEmbeddingProcessor.TYPE, new TextEmbeddingProcessor.Factory(parameters.client));
}

We need to return a factory map, and instance creation happens outside our code when factory.create is invoked automatically, so an @Inject field won't be initialized correctly.

Even with the getProcessors override shown above, you can use the same instance that we created in createComponents. Rather than passing the client in new TextEmbeddingProcessor.Factory(parameters.client), I would pass the MLCommonsClientAccessor.

navneet1v · 2022-10-12T18:34:25Z

src/main/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessor.java

+        ActionListener<List<List<Float>>> internalListener = ActionListener.wrap(
+            responseConsumer(ingestDocument, knnMap),
+            exceptionConsumer()
+        );
+
+        mlCommonsClientAccessor.inferenceSentences(this.modelId, buildMLInput(knnMap), internalListener);
+        return ingestDocument;


I am a little confused here. If we return the ingestDocument right after calling mlCommonsClientAccessor.inferenceSentences, how does the document get updated, given that the call is async?

Shouldn't we be doing something like internalListener.onResponse?

Yes, please check responseConsumer. It is a CheckedConsumer that is passed to the ActionListener, so the onResponse invocation will invoke the function created in responseConsumer.

The question was whether execution of this function is blocked until we get the inference response.
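The concern in this thread can be illustrated with a small self-contained sketch. It uses only JDK classes; `AsyncInferenceSketch`, `inferAsync`, and the `text_knn` key are hypothetical stand-ins and are not the ml-commons or OpenSearch API. It contrasts returning the document right after scheduling an asynchronous call with waiting for the callback before returning.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical stand-ins only; none of these names come from OpenSearch or ml-commons.
class AsyncInferenceSketch {

    // Simulates a non-blocking inference call: the listener fires later on another thread.
    static void inferAsync(String text, Consumer<List<Float>> onResponse) {
        CompletableFuture.runAsync(() -> {
            sleep(100); // pretend the model takes a while
            onResponse.accept(List.of(1.0f, 2.0f));
        });
    }

    // Returns right after scheduling the call, so the caller may see the document
    // before the callback has added the embedding.
    static Map<String, Object> processWithoutWaiting(String text) {
        Map<String, Object> doc = new ConcurrentHashMap<>();
        doc.put("text", text);
        inferAsync(text, vector -> doc.put("text_knn", vector));
        return doc;
    }

    // Waits for the callback (bounded by a timeout) before returning the document.
    static Map<String, Object> processAndWait(String text) {
        Map<String, Object> doc = new ConcurrentHashMap<>();
        doc.put("text", text);
        CountDownLatch done = new CountDownLatch(1);
        inferAsync(text, vector -> {
            doc.put("text_knn", vector);
            done.countDown();
        });
        try {
            if (!done.await(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("Timed out waiting for the embedding of [" + text + "]");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for the embedding", e);
        }
        return doc;
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("without waiting: " + processWithoutWaiting("hello").containsKey("text_knn"));
        System.out.println("with waiting:    " + processAndWait("hello").containsKey("text_knn"));
    }
}
```

In this sketch `processWithoutWaiting` will typically return before the embedding is present, which is the race the reviewer is asking about. Blocking on a latch is only one option; a callback-style processor method that completes the document inside the listener avoids blocking a thread.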

navneet1v · 2022-10-12T18:36:08Z

src/main/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessor.java

+    Consumer<Exception> exceptionConsumer() {
+        return exception -> log.error(exception.getMessage(), exception);
+    }


We are only logging the exception; shouldn't we fail the ingestion request for that document?

Users have the flexibility to decide what to do when an error arises; please refer to this: https://www.elastic.co/guide/en/elasticsearch/reference/master/ingest.html#handling-pipeline-failures. A processor by itself doesn't have the capability to prevent document ingestion.

The user can set ignore_failure to proceed. But we at least need to surface the failure if they choose not to set ignore_failure.

Even if we want to surface the failures, we should write a proper message instead of just passing exception.getMessage().
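As a rough illustration of surfacing a failure with context instead of only logging it, here is a self-contained sketch. All names (`FailurePropagationSketch`, `inferAsync`, `known-model`, `text_knn`) are hypothetical, and the callback runs synchronously for simplicity. It is loosely modeled on a handler-style callback that receives either the document or an exception; it is not the processor code from this PR.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical sketch; the names and the "known-model" rule are assumptions for illustration.
class FailurePropagationSketch {

    // Fake inference: succeeds only for one model id so that both paths can be exercised.
    static void inferAsync(String modelId, String text, BiConsumer<List<Float>, Exception> listener) {
        if ("known-model".equals(modelId)) {
            listener.accept(List.of((float) text.length()), null);
        } else {
            listener.accept(null, new RuntimeException("model not loaded"));
        }
    }

    // Hands failures to the caller through the handler, wrapped with context,
    // instead of swallowing them in a log statement.
    static void execute(String modelId, Map<String, Object> doc, BiConsumer<Map<String, Object>, Exception> handler) {
        inferAsync(modelId, String.valueOf(doc.get("text")), (vector, error) -> {
            if (error != null) {
                handler.accept(null, new IllegalStateException(
                    "Text embedding failed for model [" + modelId + "]: " + error.getMessage(), error));
                return;
            }
            doc.put("text_knn", vector);
            handler.accept(doc, null);
        });
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("text", "hello");
        execute("missing-model", doc, (result, error) ->
            System.out.println(error != null ? "failed: " + error.getMessage() : "ok: " + result));
    }
}
```

With this shape the pipeline, rather than the processor, decides whether a failure is ignored or fails the document, and the message names the model involved.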

navneet1v · 2022-10-12T18:37:12Z

src/main/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessor.java

+    }
+
+    @SuppressWarnings({ "unchecked" })
+    private List<String> buildMLInput(Map<String, Object> knnMap) {


Rename this function to reflect what it does; I can see that it builds the list of sentences to run inference on.

navneet1v · 2022-10-12T18:38:44Z

src/main/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessor.java

+    @VisibleForTesting
+    CheckedConsumer<List<List<Float>>, Exception> responseConsumer(IngestDocument ingestDocument, Map<String, Object> knnMap) {
+        return res -> {
+            Objects.requireNonNull(res, "embedding failed!");


Not enough context in the message.

This checks the inference return value; null without an exception is a rare case. Added a little more information to make the error clearer.

navneet1v · 2022-10-12T18:39:25Z

src/main/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessor.java

+        return TYPE;
+    }
+
+    public static final class Factory implements Processor.Factory {


We can move this class out to a different file and rename it to TextEmbeddingProcessorFactory.

It seems this is a standard style for processor factories. Here is an example: https://github.com/opensearch-project/OpenSearch/blob/d15795a7aca488c1fadb04b3c8d9f1a3b02e4056/modules/ingest-common/src/main/java/org/opensearch/ingest/common/SetProcessor.java#L115

This is the OpenSearch style, because the factory code is small and the processor creation logic is closely related to the ingestion logic.

Just a quick comment: it's not an OpenSearch style. Given the size of the Processor class you provided as an example, it might be fine to keep the factory in the same file, but the current TextEmbeddingProcessor class is already big enough to be difficult to read. So, as a best practice, let's move the Factory class out of this file.

You can put the class under src/main/java/org/opensearch/neuralsearch/processor/factory.

navneet1v · 2022-10-12T19:52:41Z

src/main/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessor.java

+
+    @VisibleForTesting
+    Consumer<Exception> exceptionConsumer() {
+        return exception -> log.error(exception.getMessage(), exception);


we should enhance the log message

navneet1v · 2022-10-12T19:53:23Z

Update the description and issue link.

navneet1v · 2022-10-13T05:52:36Z

@zane-neo please fix this

navneet1v · 2022-10-13T06:06:08Z

src/main/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessor.java

+
+@Log4j2
+public class TextEmbeddingProcessor extends AbstractProcessor {


Add Javadoc to all public functions and public classes in main/java.

navneet1v · 2022-10-13T06:08:25Z

src/main/java/org/opensearch/neuralsearch/plugin/NeuralSearch.java

+
+    @Override
+    public Map<String, Processor.Factory> getProcessors(Processor.Parameters parameters) {
+        return Collections.singletonMap(TextEmbeddingProcessor.TYPE, new TextEmbeddingProcessor.Factory(parameters.client));


rather than passing client here, pass the MLCommonsClientAccessor which we are creating via createComponents. You can achieve this by creating a class variable for MLCommonsClientAccessor and then passing the same variable while creating the TextEmbeddingProcessor

navneet1v · 2022-10-13T06:10:46Z

src/main/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessor.java

+
+    public TextEmbeddingProcessor(String tag, String description, String modelId, Map<String, Object> fieldMap, Client client) {
+        super(tag, description);
+        this.modelId = Objects.requireNonNull(modelId, "model_id is null, can not process it");


we should validate modelId for null and empty string.
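A minimal sketch of such a check, using only the JDK: `ModelIdValidationSketch` and `requireNonBlank` are hypothetical names, and the message wording is an assumption rather than the text that was eventually merged.

```java
// Hypothetical helper; not part of the PR under review.
class ModelIdValidationSketch {

    // Rejects null, empty, and whitespace-only values with a message that names the field.
    static String requireNonBlank(String value, String fieldName) {
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException(
                "Unable to create the processor as [" + fieldName + "] is null or empty.");
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(requireNonBlank("my-model-id", "model_id"));
    }
}
```

Objects.requireNonNull alone, as in the diff above, accepts an empty string, which is the gap this comment points out.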

navneet1v · 2022-10-13T06:12:26Z

src/main/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessor.java

+        super(tag, description);
+        this.modelId = Objects.requireNonNull(modelId, "model_id is null, can not process it");
+        if (fieldMap == null || fieldMap.size() == 0 || checkEmbeddingConfigNotValid(fieldMap)) {
+            throw new IllegalArgumentException("filed_map is null, can not process it");


Should we say something like this in the error message:

Unable to create the TextEmbedding processor as field_map is null or empty.

I would recommend putting more thought into every error message that gets generated, as those messages are read directly by users.

Changed to debug level.

navneet1v · 2022-10-13T06:25:28Z

src/test/java/org/opensearch/neuralsearch/utils/TestHelper.java

+        return state(new ClusterName("test"), indexName, mapping, clusterManagerNode, clusterManagerNode, allNodes);
+    }
+
+    public static ClusterState setupTestClusterState() {


move this to a base class, as suggested for upload and loadModel functions.

navneet1v · 2022-10-13T06:26:42Z

src/test/java/org/opensearch/neuralsearch/utils/TestHelper.java

+    public static RestRequest getStatsRestRequest() {
+        RestRequest request = new FakeRestRequest.Builder(getXContentRegistry()).build();
+        return request;
+    }
+
+    public static RestRequest getStatsRestRequest(String nodeId, String stat) {
+        RestRequest request = new FakeRestRequest.Builder(getXContentRegistry()).withParams(ImmutableMap.of("nodeId", nodeId, "stat", stat))
+            .build();
+        return request;
+    }


not able to see the usage of these 2 functions

navneet1v · 2022-10-13T06:28:18Z

src/test/java/org/opensearch/neuralsearch/utils/TestHelper.java

+        return makeRequest(client, method, endpoint, params, entity, headers, false);
+    }
+
+    public static Response makeRequest(


There are already functions for this in the OpenSearchRestTestCase class; try looking into them. If you still can't find a suitable one, move these functions to the base class, as suggested for the upload and loadModel functions, so that all ITs can benefit from them.

I didn't find a function similar to this makeRequest, so I moved upload, loadModel, and this function to BaseNeuralSearchIT.class.

navneet1v · 2022-10-13T06:29:06Z

src/test/java/org/opensearch/neuralsearch/utils/TestHelper.java

+    public static RestStatus restStatus(Response response) {
+        return RestStatus.fromCode(response.getStatusLine().getStatusCode());
+    }


not able to see usage of this function.

navneet1v · 2022-10-13T06:30:03Z

src/test/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessorIT.java

+        Response response = TestHelper.makeRequest(
+            client(),
+            "POST",
+            indexName + "/_doc",
+            null,
+            TestHelper.toHttpEntity(
+                FileUtils.readFileToString(new File(classLoader.getResource("processor/IngestDocument.json").getFile()), "utf-8")
+            ),
+            ImmutableList.of(new BasicHeader(HttpHeaders.USER_AGENT, "Kibana"))
+        );
+        JsonNode node = objectMapper.readTree(EntityUtils.toString(response.getEntity()));


check other Integration tests on how to read the responses.

Changed to XContent approach.

jmazanec15 · 2022-10-13T22:27:57Z

src/main/java/org/opensearch/neuralsearch/ml/MLCommonsClientAccessor.java

+     * @throws ExecutionException If the underlying task failed, this exception will be thrown in the future.get().
+     * @throws InterruptedException If the thread is interrupted, this will be thrown.
+     */
+    public List<List<Float>> blockingInferenceSentences(@NonNull final String modelId, @NonNull final List<String> inputText)


Why explicitly call this "blockingInferenceSentences"? From what I have seen in OpenSearch, blocking versus non-blocking distinction is made by whether a listener is passed as an argument.

Checkout OpenSearchClient

Change the naming.

jmazanec15 · 2022-10-13T22:38:20Z

src/main/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessor.java

+        this.mlCommonsClientAccessor = clientAccessor;
+    }
+
+    private boolean checkEmbeddingConfigNotValid(Map<String, Object> fieldMap) {


nit: isEmbeddingConfigValid (negate output) -> related post https://softwareengineering.stackexchange.com/questions/196830/boolean-method-naming-affirmative-vs-negative

jmazanec15 · 2022-10-13T22:46:34Z

src/test/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessorIT.java

+
+    private static final Locale locale = Locale.getDefault();
+
+    public void test_text_embedding_processor() throws Exception {


Camelcase for each component in test name (suggestion from Google style guide) -> testTextEmbeddingProcessor

If there are multiple tests, then we can separate components with _. For example: testTextEmbeddingProcessor_whenInputInvalid_thenThrowException

Changed tests method names to google style.

jmazanec15 · 2022-10-13T22:47:09Z

src/test/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessorIT.java

+        return uploadModel(requestBody);
+    }
+
+    private void createPipelineProcessor(String modelId) throws Exception {


Should this be moved to BaseClass so other ITs can use it?

jmazanec15 · 2022-10-13T22:47:47Z

src/test/java/org/opensearch/neuralsearch/utils/TestHelper.java

+package org.opensearch.neuralsearch.utils;
+
+public class TestHelper {
+


Whats this for?

jmazanec15 · 2022-10-13T22:49:09Z

src/test/resources/processor/UploadModelRequestBody.json

+    "framework_type": "sentence_transformers",
+    "all_config": "{\"architectures\":[\"BertModel\"],\"max_position_embeddings\":512,\"model_type\":\"bert\",\"num_attention_heads\":12,\"num_hidden_layers\":6}"
+  },
+  "url": "https://api.quip-amazon.com/2/blob/MdZ9AAsfqat/y-6nBQpg6Ma_UEE3pYt2NQ?name=all-MiniLM-L6-v2.zip&oauth_token=TUhMOU1BV1gwWUE%3D%7C1695445530%7CNN8X0Y0SQ0NfJvMxNZCnumpJaurxCDaT%2FdK70Al%2Bgh0%3D&s=YkW8AVqTosiF"


Is this link available outside of Amazon? If not, we cannot embed it here.

@ylwu-amzn, please chime in and offer help on this.

This is not a public link. I suggest putting the model inside your test resources folder and then using a local file URL.

ylwu-amzn · 2022-10-15T01:05:47Z

How about adding more details in description? Suggest adding some examples there so people know what feature this PR is building.

navneet1v · 2022-10-15T02:10:24Z

src/test/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessorIT.java

+    private static final String indexName = "text_embedding_index";
+
+    private static final ObjectMapper objectMapper = new ObjectMapper();


make static final variable UPPER_CASE, that is a general convention for all static final variables

navneet1v · 2022-10-15T02:10:39Z

src/test/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessorIT.java

+
+    private static final ObjectMapper objectMapper = new ObjectMapper();
+
+    private static final Locale locale = Locale.getDefault();


This variable is not used.

navneet1v · 2022-10-15T02:11:03Z

src/test/java/org/opensearch/neuralsearch/ml/MLCommonsClientAccessorTests.java

@@ -95,6 +103,25 @@ public void testInferenceSentences_whenExceptionFromMLClient_thenFailure() {
        Mockito.verifyNoMoreInteractions(resultListener);
    }

+    public void test_blockingInferenceSentences() throws ExecutionException, InterruptedException {


use @SneakyThrows.

navneet1v · 2022-10-15T02:18:26Z

src/main/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessor.java

+        return numbers;
+    }
+
+    private static void validateEmbeddingFieldsType(IngestDocument ingestDocument, Map<String, Object> embeddingFields) {


Any reason for making this and other functions static? If not, please make them non-static.

A static function means the method does not depend on any stateful data, so other methods can use it without concern about the state changing. Please let me know your thoughts on making it non-static. Thanks.

It is correct that this function does not depend on any state, but it is always used in the context of a stateful object, and only in this class.

So, it seems that under your assumption a function that does not depend on state should always be static. That seems to be overkill. A static function lives in the JVM heap and is generally created to keep data in the heap and make it run faster. We don't have any such case here.

If this function were in another class, then we might want to make it static, as we don't want to create objects at runtime because object creation is expensive. But that is not the case here.

The benefit we get from a static method is reduced run time, rather than avoiding unnecessary object creation; please refer to: https://stackoverflow.com/a/135038. Running faster is itself a benefit for us.

navneet1v · 2022-10-15T02:27:58Z

src/main/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessor.java

+        knnMap.entrySet().stream().filter(knnMapEntry -> knnMapEntry.getValue() != null).forEach(knnMapEntry -> {
+            Object sourceValue = knnMapEntry.getValue();
+            if (sourceValue instanceof List) {
+                ((List<String>) sourceValue).stream().filter(StringUtils::isNotBlank).forEach(texts::add);


Can we add a comment here explaining that, when we build the output vector list for this list, we need to add an empty vector list to indicate that the string was empty and hence has no text embedding?

The final decision is: a null or empty string value in a list will cause an exception.

navneet1v · 2022-10-15T02:33:35Z

src/main/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessor.java

+            if (StringUtils.isNotBlank(strSourceValue)) {
+                numbers.add(ImmutableMap.of(LIST_TYPE_NESTED_MAP_KEY, modelTensorList.get(indexWrapper.index++)));
+            }


If we remove blank strings here, then the order of the vector lists will be wrong. Example:

Input: ["This is"," ","Hello world"]
Output vectors in document: [{"vector":[1.0,2.0]},{"vector":[4.0,3.0]}]

This seems to be wrong. I also think that if we add an empty vector list here, it will cause an exception. Can we go with a vector list of 0.0 values?
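The misalignment described in the example above can be reproduced with a toy sketch. `BlankInputAlignmentSketch` and `fakeEmbed` are hypothetical; the fake "model" just returns the sentence length as a one-element vector, so this says nothing about real model behavior. It shows that filtering blanks before inference yields fewer vectors than inputs, and that rejecting blank entries up front is one way to keep a one-to-one mapping.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only; fakeEmbed is a stand-in for a real embedding model.
class BlankInputAlignmentSketch {

    // One single-element vector per sentence, holding the sentence length.
    static List<List<Float>> fakeEmbed(List<String> sentences) {
        List<List<Float>> vectors = new ArrayList<>();
        for (String sentence : sentences) {
            vectors.add(List.of((float) sentence.length()));
        }
        return vectors;
    }

    // Drops blank entries before inference, so output positions no longer match input positions.
    static List<List<Float>> embedSkippingBlanks(List<String> sentences) {
        List<String> nonBlank = new ArrayList<>();
        for (String sentence : sentences) {
            if (sentence != null && !sentence.trim().isEmpty()) {
                nonBlank.add(sentence);
            }
        }
        return fakeEmbed(nonBlank);
    }

    // Rejects null or blank entries, so every accepted input has exactly one vector at the same index.
    static List<List<Float>> embedStrict(List<String> sentences) {
        for (int i = 0; i < sentences.size(); i++) {
            String sentence = sentences.get(i);
            if (sentence == null || sentence.trim().isEmpty()) {
                throw new IllegalArgumentException("List element at index " + i + " is null or blank, cannot embed it");
            }
        }
        return fakeEmbed(sentences);
    }

    public static void main(String[] args) {
        List<String> input = List.of("This is", " ", "Hello world");
        System.out.println(input.size() + " inputs, " + embedSkippingBlanks(input).size() + " vectors");
    }
}
```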

@jmazanec15 do we see any issue with 0.0 values? Will it cause any impact on k-NN?

@ylwu-amzn if we pass a blank string or a string with only spaces to the ML algorithms, what will happen? Will it fail? I think the answer is no, but I want to confirm.

Right, I think this method should not skip strings; instead, it should handle whatever was passed to it. k-NN may have issues with an all-0.0 vector for the cosine similarity type, but I think it should work. Remember:

Cos(x, y) = (x . y) / (||x|| * ||y||)
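To make the concern about an all-0.0 vector concrete, here is the formula above written out in plain Java. `CosineZeroVectorSketch` is a hypothetical name, and this only demonstrates IEEE 754 floating-point arithmetic in the JVM; it makes no claim about how the k-NN plugin itself treats zero vectors.

```java
// Plain-Java rendering of Cos(x, y) = (x . y) / (||x|| * ||y||); not k-NN plugin code.
class CosineZeroVectorSketch {

    static double cosine(float[] x, float[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0.0;
        double normX = 0.0;
        double normY = 0.0;
        for (int i = 0; i < x.length; i++) {
            dot += (double) x[i] * y[i];
            normX += (double) x[i] * x[i];
            normY += (double) y[i] * y[i];
        }
        // With an all-zero vector both the numerator and the denominator are 0.0,
        // and 0.0 / 0.0 evaluates to NaN in Java.
        return dot / (Math.sqrt(normX) * Math.sqrt(normY));
    }

    public static void main(String[] args) {
        System.out.println(cosine(new float[] { 1f, 0f }, new float[] { 1f, 0f }));
        System.out.println(Double.isNaN(cosine(new float[] { 0f, 0f }, new float[] { 1f, 2f })));
    }
}
```

A NaN score is why an all-zero placeholder vector deserves care when the index uses cosine similarity.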

navneet1v · 2022-10-15T03:10:40Z

src/test/java/org/opensearch/neuralsearch/common/BaseNeuralSearchIT.java

+        while (!isComplete) {
+            taskQueryResult = getTaskQueryResponse(taskId);
+            isComplete = checkComplete(taskQueryResult);


This is a very brute-force way of checking whether the model is uploaded. It is very resource intensive, and there is no timeout, so it can run forever and just increase the build time.

Suggestion:

1. We should move this check of the model upload to another thread.

2. Add a timeout (3 times the actual time, or let's say 1 min) after which we fail the test saying that the model is not uploaded, give a proper explanation of why the test failed, and suggest remedies such as increasing the timeout.

3. To avoid the resource-intensive work, we should also add some sleep time in the thread that checks the model upload if we apply suggestion "1".

Add max retry time and sleep time in the thread.
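A bounded polling loop along the lines suggested above could look like the following sketch. `BoundedPollingSketch`, `waitUntil`, and its parameters are hypothetical names, not the helper that ended up in BaseNeuralSearchIT, and the sketch polls on the calling thread rather than a separate one.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper illustrating retries with a cap and a sleep between attempts.
class BoundedPollingSketch {

    // Polls the condition up to maxAttempts times, sleeping between attempts.
    // Returns true as soon as the condition holds, false if attempts run out or the thread is interrupted.
    static boolean waitUntil(BooleanSupplier isComplete, int maxAttempts, long sleepMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (isComplete.getAsBoolean()) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        long deadline = System.currentTimeMillis() + 50;
        boolean done = waitUntil(() -> System.currentTimeMillis() >= deadline, 20, 10L);
        System.out.println("completed: " + done);
    }
}
```

A test would then fail with an explicit message when `waitUntil` returns false, for example stating that the model upload did not finish within maxAttempts * sleepMillis milliseconds.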

navneet1v · 2022-10-15T04:28:07Z

src/test/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessorIT.java

+                builder.startObject()
+                    .startObject("index")
+                    .field("knn", true)
+                    .field("knn.algo_param.ef_search", 100)
+                    .field("refresh_interval", "30s")
+                    .field("default_pipeline", pipelineName)
+                    .endObject()
+                    .field("number_of_shards", 1)
+                    .field("number_of_replicas", 0)
+                    .endObject()
+                    .endObject()


I tried running this test and it failed at line 61. I see that we are adding one extra endObject, which is not required.

Command to rerun the test:

./gradlew ':integTest' --tests "org.opensearch.neuralsearch.processor.TextEmbeddingProcessorIT.test_text_embedding_processor" -Dtests.seed=173A58A0D4C0E3A0 -Dtests.security.manager=false -Dtests.locale=es-SV -Dtests.timezone=Asia/Seoul

navneet1v · 2022-10-15T04:29:41Z

src/test/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessorIT.java

+            createIndex(
+                indexName,
+                Settings.builder().loadFromSource(settings, XContentType.JSON).build(),
+                Files.readString(Path.of(classLoader.getResource("processor/IndexMappings.json").toURI()))


As we are reading a JSON object from a file, you might want to remove the "{" and "}" from the start and end while reading the object. I tried removing them and it worked for me.

Example reference: https://github.com/opensearch-project/k-NN/blob/48d2303ad6964d386709ab5ae5fdbb0965420cb8/src/testFixtures/java/org/opensearch/knn/KNNRestTestCase.java#L638-L639

The way this method is encapsulated is really not friendly to use; I would prefer to create our own method.

zane-neo · 2022-10-17T00:51:16Z

src/main/java/org/opensearch/neuralsearch/plugin/NeuralSearch.java

+    @Override
+    public Map<String, Processor.Factory> getProcessors(Processor.Parameters parameters) {
+        final MachineLearningNodeClient machineLearningNodeClient = new MachineLearningNodeClient(parameters.client);
+        mlCommonsClientAccessor = new MLCommonsClientAccessor(machineLearningNodeClient);


We need to create the MLCommonsClientAccessor in the getProcessors method since this method executes before createComponents, check this line: https://github.com/opensearch-project/OpenSearch/blob/e44158d4d10d4f8905895ffa50bf9398b8550667/server/src/main/java/org/opensearch/node/Node.java#L515, and this line: https://github.com/opensearch-project/OpenSearch/blob/e44158d4d10d4f8905895ffa50bf9398b8550667/server/src/main/java/org/opensearch/node/Node.java#L711.

jmazanec15 · 2022-10-17T21:02:58Z

src/main/java/org/opensearch/neuralsearch/processor/factory/TextEmbeddingProcessorFactory.java

+        Map<String, Object> config
+    ) throws Exception {
+        String modelId = readStringProperty(TYPE, processorTag, config, MODEL_ID_FIELD);
+        Map<String, Object> filedMap = readOptionalMap(TYPE, processorTag, config, FIELD_MAP_FIELD);


Why optional? If this doesnt exist, what is the purpose of the processor?

Changed to non optional method, this will throw a configuration exception when this map is missing.

jmazanec15 · 2022-10-17T21:06:07Z

src/main/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessor.java

+        this.mlCommonsClientAccessor = clientAccessor;
+    }
+
+    private boolean isEmbeddingConfigValid(Map<String, Object> fieldMap) {


It should return true if it is valid. False otherwise.

Also, why not make this a validation method (i.e. validateEmbeddingConfig or checkEmbeddingConfig) and have it throw an IllegalArgumentException, similar to the ones at the end of this class (checkListElementsType, validateEmbeddingFieldsType)

We can add the following to the method

fieldMap == null || fieldMap.size() == 0
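The validation-method shape suggested here might look like the sketch below. The method name follows the reviewer's suggestion, but the body is an assumption: the thread does not show what the original check inspects beyond null and empty, so the per-entry rule (no null or blank keys or values) is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical body for a validate-style method; the per-entry rule is an assumption.
class FieldMapValidationSketch {

    // Throws instead of returning a boolean, so callers cannot forget to act on the result.
    static void validateEmbeddingConfiguration(Map<String, Object> fieldMap) {
        if (fieldMap == null || fieldMap.isEmpty()) {
            throw new IllegalArgumentException(
                "Unable to create the TextEmbedding processor as field_map is null or empty.");
        }
        for (Map.Entry<String, Object> entry : fieldMap.entrySet()) {
            String key = entry.getKey();
            Object target = entry.getValue();
            boolean blankKey = key == null || key.trim().isEmpty();
            boolean blankTarget = target == null
                || (target instanceof String && ((String) target).trim().isEmpty());
            if (blankKey || blankTarget) {
                throw new IllegalArgumentException(
                    "Unable to create the TextEmbedding processor as field_map has a null or blank key or value: [" + key + "]");
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Object> fieldMap = new HashMap<>();
        fieldMap.put("title", "title_knn");
        validateEmbeddingConfiguration(fieldMap);
        System.out.println("field_map accepted");
    }
}
```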

jmazanec15 · 2022-10-19T17:01:07Z

src/main/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessor.java

+    }
+
+    private void validateEmbeddingConfiguration(Map<String, Object> fieldMap) {
+        if (fieldMap == null || fieldMap.size() == 0 || fieldMap.entrySet()


I am wondering if a user can create an arbitrarily large fieldMap? If so, should this be limited, from a security perspective?

jmazanec15 · 2022-10-19T17:03:43Z

src/main/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessor.java

+                String sourceKey = embeddingFieldsEntry.getKey();
+                Class<?> sourceValueClass = sourceValue.getClass();
+                if (List.class.isAssignableFrom(sourceValueClass) || Map.class.isAssignableFrom(sourceValueClass)) {
+                    validateNestedTypeValue(sourceKey, sourceValue);


Related to the above comment: are there any limits on the depth of nested parameters that can be passed?

For now, there are no limits. Nested types will not be explicitly mentioned in the documentation, so users will use only raw string or list types. If we receive feedback asking for nested type support, we can then tell users how.
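A later commit in this PR is titled "Add field max depth limit to prevent malicious attack"; its implementation is not shown in this thread. The following is only a sketch of what a recursive depth check could look like. `NestedDepthSketch`, `validateDepth`, and the limit of 3 are all hypothetical.

```java
import java.util.List;
import java.util.Map;

// Hypothetical depth guard; the class name, method name, and MAX_DEPTH value are assumptions.
class NestedDepthSketch {

    static final int MAX_DEPTH = 3;

    // Walks maps and lists recursively and fails once a value sits deeper than MAX_DEPTH.
    static void validateDepth(String key, Object value, int depth) {
        if (depth > MAX_DEPTH) {
            throw new IllegalArgumentException(
                "Field [" + key + "] is nested deeper than the allowed " + MAX_DEPTH + " levels");
        }
        if (value instanceof Map) {
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
                validateDepth(String.valueOf(entry.getKey()), entry.getValue(), depth + 1);
            }
        } else if (value instanceof List) {
            for (Object item : (List<?>) value) {
                validateDepth(key, item, depth + 1);
            }
        }
    }

    public static void main(String[] args) {
        validateDepth("title", Map.of("en", "Hello world"), 1);
        System.out.println("within depth limit");
    }
}
```

Because the recursion stops as soon as the limit is exceeded, a deeply nested document cannot drive the check into unbounded recursion.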

jmazanec15 · 2022-10-19T17:04:31Z

src/main/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessor.java

+    }
+
+    @VisibleForTesting
+    Map<String, Object> buildMapWithKnnKeyAndOriginalValue(IngestDocument ingestDocument, Map<String, Object> fieldMap) {


private method: do we need to pass fieldMap if its a member?

Removed this from parameter.

# This is the 1st commit message:
Add text embedding processor to neural search
Signed-off-by: Zan Niu <[email protected]>

# The commit message #2 will be skipped:
# Code format
# Signed-off-by: Zan Niu <[email protected]>

# The commit message #3 will be skipped:
# Address review comments
# Signed-off-by: Zan Niu <[email protected]>

# The commit message #4 will be skipped:
# Add blocking text embedding method for pipeline processor
# Signed-off-by: Zan Niu <[email protected]>

# The commit message #5 will be skipped:
# Add BaseNeuralSearchIT and address other review comments
# Signed-off-by: Zan Niu <[email protected]>

# The commit message #6 will be skipped:
# Add BaseNeuralSearchIT and address other review comments
# Signed-off-by: Zan Niu <[email protected]>

# The commit message #7 will be skipped:
# Add BaseNeuralSearchIT and address other review comments
# Signed-off-by: Zan Niu <[email protected]>

# The commit message #8 will be skipped:
# Fix naming convention and IT function move to base
# Signed-off-by: Zan Niu <[email protected]>

# The commit message #9 will be skipped:
# Fix naming convention and IT function move to base
# Signed-off-by: Zan Niu <[email protected]>

# The commit message #10 will be skipped:
# Update src/main/java/org/opensearch/neuralsearch/ml/MLCommonsClientAccessor.java
# Co-authored-by: Navneet Verma <[email protected]>

# The commit message #11 will be skipped:
# Update src/main/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessor.java
# Co-authored-by: Navneet Verma <[email protected]>

# The commit message #12 will be skipped:
# Fix code review comments
# Signed-off-by: Zan Niu <[email protected]>

# The commit message #13 will be skipped:
# Fix text embedding processor NPE
# Signed-off-by: Zan Niu <[email protected]>

# The commit message #14 will be skipped:
# Remove jackson dependencies and fix tests with XCoontent
# Signed-off-by: Zan Niu <[email protected]>

Signed-off-by: Zan Niu <[email protected]>

model-collapse

LGTM

zane-neo requested review from a team, navneet1v and ylwu-amzn October 12, 2022 02:34

navneet1v reviewed Oct 12, 2022

zane-neo requested a review from model-collapse October 13, 2022 01:54

navneet1v reviewed Oct 13, 2022

jmazanec15 reviewed Oct 13, 2022

navneet1v reviewed Oct 15, 2022

navneet1v reviewed Oct 15, 2022

zane-neo commented Oct 17, 2022

jmazanec15 reviewed Oct 17, 2022

jmazanec15 reviewed Oct 19, 2022

zane-neo added 2 commits October 20, 2022 09:22

Add text embedding processor to neural search

e3cca7b

Signed-off-by: Zan Niu <[email protected]>

zane-neo force-pushed the text-embedding-processor branch from 117249c to e3cca7b Compare October 20, 2022 01:26

zane-neo added 3 commits October 20, 2022 09:35

Remove unnecessary parameters in TextEmbeddingProcessor method

df266ce

Signed-off-by: Zan Niu <[email protected]>

Remove unnecessary empty string checks

62702bf

Signed-off-by: Zan Niu <[email protected]>

Add field max depth limit to prevent malicious attack

9fadcb8

Signed-off-by: Zan Niu <[email protected]>

jmazanec15 approved these changes Oct 20, 2022

model-collapse approved these changes Oct 20, 2022

zane-neo merged commit 799c402 into opensearch-project:main Oct 20, 2022

jmazanec15 added the Features Introduces a new unit of functionality that satisfies a requirement label Nov 3, 2022

