[FEATURE] Adding default pre/post process function for neural search text embedding model #1304

Closed
zane-neo opened this issue Sep 8, 2023 · 6 comments
Labels
enhancement New feature or request

@zane-neo
Collaborator

zane-neo commented Sep 8, 2023

Is your feature request related to a problem?
ml-commons has two default pre/post-process functions, for OpenAI and Cohere, both written in Painless script. There is no default pre/post-process function for the neural search plugin's text embedding case. If a user wants to use neural search with a remote model to embed texts, then when creating the connector the user has to write a complex pre-process (required) Painless script like this:

"\n    StringBuilder builder = new StringBuilder();\n    builder.append(\"\\\"\");\n    builder.append(params.text_docs[0]);\n    builder.append(\"\\\"\");\n    def parameters = \"{\" +\"\\\"inputs\\\":\" + builder + \"}\";\n    return  \"{\" +\"\\\"parameters\\\":\" + parameters + \"}\";"

The pre-process script above builds a parameter map that is used to substitute placeholders in the connector request body. A post-process function is not required, but to adapt to the code in the neural search plugin that extracts the tensors, the user has to write a post-process script like this:

"\n      def name = \"sentence_embedding\";\n      def dataType = \"FLOAT32\";\n      if (params.vectors == null || params.vectors.length == 0) {\n          return null;\n      }\n      def shape = [params.vectors.length];\n      def json = \"{\" +\n                 \"\\\"name\\\":\\\"\" + name + \"\\\",\" +\n                 \"\\\"data_type\\\":\\\"\" + dataType + \"\\\",\" +\n                 \"\\\"shape\\\":\" + shape + \",\" +\n                 \"\\\"data\\\":\" + params.vectors +\n                 \"}\";\n      return json;\n    "

As the examples above show, writing either the pre-process or the post-process function is not an easy task.
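
For readers who find the escaped strings hard to follow, the logic of the two Painless scripts above can be sketched in Python. This only illustrates what the scripts compute and is not code that runs inside ml-commons. The function names `pre_process` and `post_process` are hypothetical, and `json.dumps` stands in for the manual string concatenation (it also escapes quotes in the text, which the Painless version does not).

```python
import json


def pre_process(text_docs):
    """Mirror of the pre-process script: wrap the first text doc in a parameter map."""
    # Like the Painless script, only text_docs[0] is used.
    return json.dumps({"parameters": {"inputs": text_docs[0]}}, separators=(",", ":"))


def post_process(vectors):
    """Mirror of the post-process script: describe one embedding as a tensor JSON string."""
    if not vectors:  # covers both None and an empty list, as in the script
        return None
    tensor = {
        "name": "sentence_embedding",
        "data_type": "FLOAT32",
        "shape": [len(vectors)],
        "data": vectors,
    }
    return json.dumps(tensor, separators=(",", ":"))


print(pre_process(["hello", "world"]))
print(post_process([1.0, 2.0]))
```

Even in this compact form, the user still has to know the parameter name the request body expects and the tensor fields the neural search plugin reads.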

What solution would you like?
If the user follows the suggested format on the model serving side, including the model input and output data structures, it is possible to provide default pre/post-process functions for the user.

Suggested format
The suggested model input format is a list of strings, for example:

["hello", "world"]

The suggested model output format is a two-dimensional array in which each inner element is the embedding result for one input text, e.g.:

[
  [
    1.0,
    2.0
  ],
  [
    3.0,
    4.0
  ]
]

The suggested request body template is: "request_body": "${parameters.input}".
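
To illustrate how such a template is filled, here is a minimal Python sketch of placeholder substitution. The helper `render_request_body` is hypothetical and only mimics the substitution idea described in this issue; the actual ml-commons implementation may differ.

```python
import json


def render_request_body(template, parameters):
    """Replace each ${parameters.<key>} placeholder with its string value."""
    body = template
    for key, value in parameters.items():
        body = body.replace("${parameters." + key + "}", value)
    return body


texts = ["hello", "world"]
# The "input" parameter carries the whole list serialized as JSON text.
body = render_request_body("${parameters.input}", {"input": json.dumps(texts)})
print(body)
```

With this template, the request body sent to the model is the JSON list itself, which matches the suggested input format.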

Default process functions
With these premises, the user can use the default process functions connector.pre_process.neural_search.text_embedding and connector.post_process.neural_search.text_embedding instead of Painless scripts. The default pre-process function parses the neural search input text docs into the model input, and the default post-process function parses the model response into a ModelTensorOutput.
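
As a rough illustration of the proposed behavior, the Python sketch below shows what the two default functions would compute under the suggested formats, and where their names would go in a connector action. Everything here is an assumption made for illustration: `default_pre_process` and `default_post_process` are hypothetical stand-ins rather than the ml-commons implementation, the tensor fields are borrowed from the post-process script earlier in this issue, and the `url` and the other fields of the `action` dict are placeholders.

```python
import json


def default_pre_process(text_docs):
    """Hypothetical stand-in: turn the neural search text docs into the 'input' parameter."""
    return {"parameters": {"input": json.dumps(text_docs)}}


def default_post_process(model_response):
    """Hypothetical stand-in: wrap each embedding of a 2D response in a tensor-like dict."""
    return [
        {
            "name": "sentence_embedding",
            "data_type": "FLOAT32",
            "shape": [len(vector)],
            "data": vector,
        }
        for vector in model_response
    ]


# Assumed shape of a connector action that references the default functions by name.
action = {
    "action_type": "predict",
    "method": "POST",
    "url": "https://example.com/embed",  # placeholder endpoint, not from the issue
    "request_body": "${parameters.input}",
    "pre_process_function": "connector.pre_process.neural_search.text_embedding",
    "post_process_function": "connector.post_process.neural_search.text_embedding",
}

print(default_pre_process(["hello", "world"]))
print(default_post_process([[1.0, 2.0], [3.0, 4.0]]))
```

The point of the defaults is that the user only supplies the two function names, and no escaped Painless script is needed.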

What alternatives have you considered?
NA


@hdhalter

Hi all, please create a doc issue or PR ASAP if this has doc implications for 2.11. Thanks.

@zane-neo
Collaborator Author

Hi @hdhalter , I've created this doc issue: opensearch-project/documentation-website#5081.

@hdhalter

hdhalter commented Oct 5, 2023

@zane-neo - Can you please confirm that this feature has been moved to 2.12? Thanks.

@zane-neo
Collaborator Author

zane-neo commented Oct 6, 2023

@hdhalter From my knowledge, this is still a 2.11 feature.

model-collapse moved this from In Progress to Done in ml-commons projects on Oct 7, 2023

@ylwu-amzn
Collaborator

This feature was released in 2.11.

@hdhalter

hdhalter commented Nov 6, 2023

Can you please update the release train? It is showing up in the 2.12 roadmap. Thanks!
