[FEATURE] Adding default pre/post process function for neural search text embedding model #1304

zane-neo · 2023-09-08T06:59:14Z

Is your feature request related to a problem?
ml-commons has two default pre/post-process functions, for OpenAI and Cohere, both written in Painless script. There is no default pre/post-process function for the neural search plugin's text embedding case. If a user wants to use neural search with a remote model to embed text, they have to write a complex (and required) pre-process Painless script when creating the connector, like this:

"\n    StringBuilder builder = new StringBuilder();\n    builder.append(\"\\\"\");\n    builder.append(params.text_docs[0]);\n    builder.append(\"\\\"\");\n    def parameters = \"{\" +\"\\\"inputs\\\":\" + builder + \"}\";\n    return  \"{\" +\"\\\"parameters\\\":\" + parameters + \"}\";"

This Painless script builds a parameter map that is used to substitute placeholders in the connector request body. The post-process function is not required, but to match the code that extracts tensors in the neural search plugin, the user has to write a post-process script like this:

"\n      def name = \"sentence_embedding\";\n      def dataType = \"FLOAT32\";\n      if (params.vectors == null || params.vectors.length == 0) {\n          return null;\n      }\n      def shape = [params.vectors.length];\n      def json = \"{\" +\n                 \"\\\"name\\\":\\\"\" + name + \"\\\",\" +\n                 \"\\\"data_type\\\":\\\"\" + dataType + \"\\\",\" +\n                 \"\\\"shape\\\":\" + shape + \",\" +\n                 \"\\\"data\\\":\" + params.vectors +\n                 \"}\";\n      return json;\n    "

As the examples above show, writing either the pre-process or the post-process function is not easy.
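For readers who find the escaped Painless hard to follow, here is a rough Python rendering of what the two scripts above appear to do. It is only an illustration: the function names `pre_process` and `post_process` are hypothetical, `params` is assumed to be a plain dict, and the post-process step uses `json.dumps` where the Painless script concatenates strings.

```python
import json


def pre_process(params):
    """Mirror of the pre-process script: wraps the FIRST text doc only."""
    text = params["text_docs"][0]
    # Same string concatenation as the Painless script; note that the text
    # is not JSON-escaped, so a quote inside the text would break the output.
    inputs = '"' + text + '"'
    parameters = '{"inputs":' + inputs + '}'
    return '{"parameters":' + parameters + '}'


def post_process(params):
    """Mirror of the post-process script: wraps the vector in a tensor-like JSON object."""
    vectors = params.get("vectors")
    if not vectors:  # covers both a missing and an empty vector list
        return None
    tensor = {
        "name": "sentence_embedding",
        "data_type": "FLOAT32",
        "shape": [len(vectors)],
        "data": vectors,
    }
    return json.dumps(tensor)
```

Even in this form, the user still has to know the exact field names expected by the tensor-extraction code, which is the burden a default function would remove.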

What solution would you like?
If the user follows a suggested format on the model serving side, covering both the model input and output data structures, it's possible to provide default pre/post-process functions.

Suggested format
The suggested model input format is a list of strings, for example:

["hello", "world"]

The suggested model output format is a two-dimensional array in which each inner array is the embedding of one input text, e.g.:

[
  [
    1.0,
    2.0
  ],
  [
    3.0,
    4.0
  ]
]

The suggested request body template is: "request_body": "${parameters.input}".
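To make the suggested format concrete, the sketch below shows one way a list of strings could be substituted into the suggested template, plus a check that a model response has the suggested two-dimensional shape. The helper names `render_request_body` and `is_suggested_output` are hypothetical, and the simple string replacement is an assumption; the actual placeholder substitution in ml-commons may work differently.

```python
import json

# Template taken from the suggestion above.
REQUEST_BODY_TEMPLATE = "${parameters.input}"


def render_request_body(texts):
    """Substitute a JSON-encoded list of strings for the input placeholder."""
    return REQUEST_BODY_TEMPLATE.replace("${parameters.input}", json.dumps(texts))


def is_suggested_output(output):
    """Return True if output is a list of lists of numbers (one row per input text)."""
    return isinstance(output, list) and all(
        isinstance(row, list) and all(isinstance(x, (int, float)) for x in row)
        for row in output
    )
```

With this template, the request body sent to the model is just the JSON list itself, so the model server needs no wrapper object around the inputs.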

Default process functions
With these premises in place, the user can use the default process functions connector.pre_process.neural_search.text_embedding and connector.post_process.neural_search.text_embedding instead of Painless scripts. The default pre-process function parses the neural search input text docs into the model input, and the default post-process function parses the model response into a ModelTensorOutput.
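As a rough illustration of the intended behavior (not the actual ml-commons implementation), the sketch below models the two default functions in Python and shows how a connector action might reference them by name. Everything other than the two function names is an assumption: the helper names, the URL, the action field names (`pre_process_function`, `post_process_function`, modeled on how connector actions reference the existing OpenAI and Cohere defaults), and the tensor fields, which are borrowed from the hand-written post-process script earlier in this issue.

```python
import json


def default_pre_process(text_docs):
    """Assumed behavior: pass the whole list of text docs through as the model input."""
    return {"parameters": {"input": json.dumps(text_docs)}}


def default_post_process(model_output):
    """Assumed behavior: turn each inner embedding into one tensor-like dict."""
    return [
        {
            "name": "sentence_embedding",  # borrowed from the script above
            "data_type": "FLOAT32",
            "shape": [len(embedding)],
            "data": embedding,
        }
        for embedding in model_output
    ]


# Hypothetical connector action that names the default functions
# instead of embedding Painless scripts. The URL is a placeholder.
connector_action = {
    "action_type": "predict",
    "method": "POST",
    "url": "https://example.com/embed",
    "request_body": "${parameters.input}",
    "pre_process_function": "connector.pre_process.neural_search.text_embedding",
    "post_process_function": "connector.post_process.neural_search.text_embedding",
}
```

Unlike the hand-written pre-process script, which only forwards the first text doc, this sketch forwards the full list, matching the suggested list-of-strings input format.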

What alternatives have you considered?
NA

Do you have any additional context?


hdhalter · 2023-09-21T20:02:57Z

Hi all, Please create a doc issue or PR ASAP if this has doc implications for 2.11. Thanks.

zane-neo · 2023-09-25T06:53:31Z

Hi @hdhalter , I've created this doc issue: opensearch-project/documentation-website#5081.

hdhalter · 2023-10-05T21:15:36Z

@zane-neo - Can you please confirm that this feature has been moved to 2.12? Thanks.

zane-neo · 2023-10-06T00:30:50Z

@hdhalter To my knowledge, this is still a 2.11 feature.

ylwu-amzn · 2023-10-18T21:18:43Z

This feature was released in 2.11.

hdhalter · 2023-11-06T22:12:37Z

Can you please update the release train? It is showing up in the 2.12 roadmap. Thanks!

zane-neo added enhancement New feature or request untriaged labels Sep 8, 2023

zane-neo self-assigned this Sep 8, 2023

zane-neo removed the untriaged label Sep 11, 2023

zane-neo mentioned this issue Sep 11, 2023

Add neural search default processor for non OpenAI/Cohere scenario #1274

Merged

5 tasks

ylwu-amzn added this to ml-commons projects Sep 15, 2023

ylwu-amzn moved this to In Progress in ml-commons projects Sep 15, 2023

zane-neo mentioned this issue Sep 25, 2023

[DOC] Add default text_docs input pre/post processor docs in ml-commons opensearch-project/documentation-website#5081

Closed

4 tasks

model-collapse moved this from In Progress to Done in ml-commons projects Oct 7, 2023

ylwu-amzn closed this as completed Oct 18, 2023
