add more parameters for text embedding model #640

Merged

ylwu-amzn merged 3 commits into opensearch-project:2.x from ylwu-amzn:2.5_fixbug

Dec 20, 2022

Collaborator

ylwu-amzn commented Dec 19, 2022

Signed-off-by: Yaliang Wu [email protected]

Description

Add more parameters for text embedding model

model max length: the maximum number of tokens the model supports
pooling method: the 2.4 release supports only mean pooling; this PR adds CLS pooling support
normalize result: boolean; the result is normalized when this is true
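
For readers unfamiliar with the terms, here is a minimal sketch of what the two pooling methods and the normalize flag compute on a matrix of token embeddings. It is not the code from this PR: the class name `PoolingSketch`, the method names, and the plain `float[][]` layout are assumptions made for illustration, and it assumes the CLS token sits at position 0.

```java
final class PoolingSketch {
    /** Mean pooling: average the embeddings of non-padding tokens (mask == 1). */
    static float[] meanPool(float[][] tokenEmbeddings, int[] attentionMask) {
        int dim = tokenEmbeddings[0].length;
        float[] pooled = new float[dim];
        int kept = 0;
        for (int t = 0; t < tokenEmbeddings.length; t++) {
            if (attentionMask[t] == 0) {
                continue; // padding token, ignored
            }
            for (int d = 0; d < dim; d++) {
                pooled[d] += tokenEmbeddings[t][d];
            }
            kept++;
        }
        for (int d = 0; d < dim; d++) {
            pooled[d] /= Math.max(kept, 1); // guard against an all-padding input
        }
        return pooled;
    }

    /** CLS pooling: use the embedding of the first token (assumed to be [CLS]). */
    static float[] clsPool(float[][] tokenEmbeddings) {
        return tokenEmbeddings[0].clone();
    }

    /** L2 normalization: scale the vector to unit length; a zero vector is returned unchanged. */
    static float[] l2Normalize(float[] vector) {
        double squaredSum = 0.0;
        for (float x : vector) {
            squaredSum += (double) x * x;
        }
        double norm = Math.sqrt(squaredSum);
        float[] out = vector.clone();
        if (norm == 0.0) {
            return out;
        }
        for (int i = 0; i < out.length; i++) {
            out[i] = (float) (out[i] / norm);
        }
        return out;
    }

    public static void main(String[] args) {
        float[][] tokens = {{1f, 2f}, {3f, 4f}, {9f, 9f}};
        int[] mask = {1, 1, 0}; // the last token is padding
        System.out.println(java.util.Arrays.toString(meanPool(tokens, mask)));
        System.out.println(java.util.Arrays.toString(clsPool(tokens)));
        System.out.println(java.util.Arrays.toString(l2Normalize(new float[] {3f, 4f})));
    }
}
```

Normalizing to unit length makes the dot product of two embeddings equal to their cosine similarity, which is one common reason to expose such a flag.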

Issues Resolved

[List any issues this PR will resolve]

Check List

New functionality includes testing.
- All tests pass
New functionality has been documented.
- New functionality has javadoc added
Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.


          add more parameters for text embedding model

b3a3da4

Signed-off-by: Yaliang Wu <[email protected]>

ylwu-amzn requested a review from a team

December 19, 2022 22:49


          upgrade junit version to 4.13.2

4ac7c26

Signed-off-by: Yaliang Wu <[email protected]>

ylwu-amzn added the enhancement label

rbhavna previously approved these changes


jngz-es reviewed


common/src/main/java/org/opensearch/ml/common/model/TextEmbeddingModelConfig.java Outdated

                               case ALL_CONFIG_FIELD:
                                   allConfig = parser.text();
                                   break;
+                              case POOLING_METHOD_FIELD:
+                                  poolingMethod = PoolingMethod.from(parser.text().toUpperCase());

Collaborator

jngz-es Dec 20, 2022

Could you add Locale.ROOT, the same as for FRAMEWORK_TYPE_FIELD?

Collaborator Author

ylwu-amzn Dec 20, 2022

sure, will add this
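
As background for this request: `String.toUpperCase()` without arguments uses the JVM's default locale, so the result can differ between machines. The sketch below shows the difference under a Turkish locale. The class `LocaleRootSketch`, the helper `parsePooling`, and the two-value enum are hypothetical stand-ins, not the PR's code.

```java
import java.util.Locale;

final class LocaleRootSketch {
    // Hypothetical stand-in for TextEmbeddingModelConfig.PoolingMethod.
    enum PoolingMethod { MEAN, CLS }

    /** Upper-cases with Locale.ROOT so the result does not depend on the JVM default locale. */
    static PoolingMethod parsePooling(String raw) {
        return PoolingMethod.valueOf(raw.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        // Under Turkish casing rules, "i" upper-cases to U+0130 (dotted capital I), not "I".
        System.out.println("i".toUpperCase(turkish).equals("\u0130")); // true
        System.out.println("i".toUpperCase(Locale.ROOT).equals("I"));  // true
        System.out.println(parsePooling("cls"));                        // CLS
    }
}
```

Any enum constant containing the letter I would fail to match under such a default locale, which is why passing `Locale.ROOT` is the usual convention for parsing identifiers.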

common/src/main/java/org/opensearch/ml/common/model/TextEmbeddingModelConfig.java

                   }
                   public static TextEmbeddingModelConfig parse(XContentParser parser) throws IOException {
                       String modelType = null;
                       Integer embeddingDimension = null;
                       FrameworkType frameworkType = null;
                       String allConfig = null;
+                      PoolingMethod poolingMethod = null;
+                      boolean normalizeResult = false;
+                      Integer modelMaxLength = null;

Collaborator

jngz-es Dec 20, 2022

Why don't we have a default value for modelMaxLength?

Collaborator Author

ylwu-amzn Dec 20, 2022

We depend on the DJL engine to set the default value.

common/src/main/java/org/opensearch/ml/common/model/TextEmbeddingModelConfig.java Outdated

                   }
                   public static TextEmbeddingModelConfig parse(XContentParser parser) throws IOException {
                       String modelType = null;
                       Integer embeddingDimension = null;
                       FrameworkType frameworkType = null;
                       String allConfig = null;
+                      PoolingMethod poolingMethod = null;

Collaborator

jngz-es Dec 20, 2022

Set the default value PoolingMethod.MEAN here?

Collaborator Author

ylwu-amzn Dec 20, 2022

Actually, the constructor sets the default value. I will also set the default value here to make it clearer.

common/src/main/java/org/opensearch/ml/common/model/TextEmbeddingModelConfig.java

Comment on lines +60 to +62

+                      } else {
+                          this.poolingMethod = PoolingMethod.MEAN;
+                      }

Collaborator

jngz-es Dec 20, 2022

If we set the default value below, we don't need this else branch.

Collaborator Author

ylwu-amzn Dec 20, 2022

I guess you mean that if we set the default value at line 72 (PoolingMethod poolingMethod = null;), we don't need line 61 (this.poolingMethod = PoolingMethod.MEAN;)?

I think we still need this. This is the constructor, and a user can create a new object directly without calling the parse method.
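
The point about the constructor can be illustrated with a reduced sketch. `ConfigSketch` is a hypothetical, simplified stand-in for `TextEmbeddingModelConfig`, and its `parse(String)` only mimics the real `parse(XContentParser)`. The default lives in the constructor, so both construction paths end up with `MEAN`.

```java
import java.util.Locale;

final class ConfigSketch {
    enum PoolingMethod { MEAN, CLS }

    private final PoolingMethod poolingMethod;

    ConfigSketch(PoolingMethod poolingMethod) {
        // The default is applied here, so callers that bypass parse(...) are covered too.
        this.poolingMethod = poolingMethod != null ? poolingMethod : PoolingMethod.MEAN;
    }

    PoolingMethod getPoolingMethod() {
        return poolingMethod;
    }

    /** Stand-in for parse(XContentParser): a missing field leaves the local variable null. */
    static ConfigSketch parse(String poolingField) {
        PoolingMethod parsed = null;
        if (poolingField != null) {
            parsed = PoolingMethod.valueOf(poolingField.toUpperCase(Locale.ROOT));
        }
        return new ConfigSketch(parsed);
    }

    public static void main(String[] args) {
        System.out.println(new ConfigSketch(null).getPoolingMethod()); // direct construction
        System.out.println(ConfigSketch.parse(null).getPoolingMethod()); // field absent
        System.out.println(ConfigSketch.parse("cls").getPoolingMethod()); // field present
    }
}
```

A default set only on the local variable in `parse` would not protect the first call in `main`, which is the case the author describes.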

common/src/main/java/org/opensearch/ml/common/model/TextEmbeddingModelConfig.java Outdated

+                          try {
+                              return PoolingMethod.valueOf(value);
+                          } catch (Exception e) {
+                              throw new IllegalArgumentException("Wrong framework type");

Collaborator

jngz-es Dec 20, 2022

Copy error?

Collaborator Author

ylwu-amzn Dec 20, 2022

Good catch, will fix
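
The copy error is the exception message, which still names the framework type. One possible shape for a corrected lookup is sketched below. The final wording in the merged commit is not shown on this page, so the message text, the class name `PoolingMethodSketch`, and the narrower catch clause are all assumptions.

```java
import java.util.Arrays;

final class PoolingMethodSketch {
    enum PoolingMethod {
        MEAN, CLS;

        /** Looks up a constant by exact name and reports the offending value on failure. */
        static PoolingMethod from(String value) {
            try {
                return PoolingMethod.valueOf(value);
            } catch (IllegalArgumentException | NullPointerException e) {
                // valueOf throws IllegalArgumentException for unknown names
                // and NullPointerException for a null name.
                throw new IllegalArgumentException(
                    "Wrong pooling method: " + value + ", expected one of " + Arrays.toString(values()));
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(PoolingMethod.from("CLS"));
        try {
            PoolingMethod.from("max");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Including the rejected value and the accepted constants in the message makes a misconfigured model easier to diagnose than a fixed string.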

...pensearch/ml/engine/algorithms/text_embedding/HuggingfaceTextEmbeddingTranslatorFactory.java

                       SUPPORTED_TYPES.add(new Pair<>(Input.class, Output.class));
                   }
+                  private final TextEmbeddingModelConfig.PoolingMethod poolingMethod;
+                  private boolean normalizeResult;

Collaborator

jngz-es Dec 20, 2022

final?

...hms/src/main/java/org/opensearch/ml/engine/algorithms/text_embedding/TextEmbeddingModel.java

Comment on lines 196 to 208

                                       if (ONNX_ENGINE.equals(engine)) { //ONNX
-                                          criteriaBuilder.optTranslator(new ONNXSentenceTransformerTextEmbeddingTranslator());
+                                          criteriaBuilder.optTranslator(new ONNXSentenceTransformerTextEmbeddingTranslator(poolingMethod, normalizeResult, modelType));
                                       } else { // pytorch
                                           if (transformersType == SENTENCE_TRANSFORMERS) {
                                               criteriaBuilder.optTranslator(new SentenceTransformerTextEmbeddingTranslator());
                                           } else {
-                                              criteriaBuilder.optTranslatorFactory(new HuggingfaceTextEmbeddingTranslatorFactory());
+                                              boolean neuron = false;
+                                              if (transformersType.name().endsWith("_NEURON")) {
+                                                  neuron = true;
+                                              }
+                                              criteriaBuilder.optTranslatorFactory(new HuggingfaceTextEmbeddingTranslatorFactory(poolingMethod, normalizeResult, modelType, neuron));
                                           }
                                       }

Collaborator

jngz-es Dec 20, 2022

We could refactor this part to support more engines better in the future.

Collaborator Author

ylwu-amzn Dec 20, 2022

Yes, we can refactor this when we support new engines. I'll add a TODO for now.

...hms/src/main/java/org/opensearch/ml/engine/algorithms/text_embedding/TextEmbeddingModel.java Outdated

Comment on lines 217 to 221

+                                          StringBuilder builder = new StringBuilder();
+                                          for (int j=0;j<modelMaxLength;j++) {
+                                              builder.append("sentence ");
+                                          }
+                                          input.add(builder.toString());

Collaborator

jngz-es Dec 20, 2022

How about just replacing this part with one line as below?
input.add("sentence ".repeat(modelMaxLength));

Collaborator Author

ylwu-amzn Dec 20, 2022

sure, good point
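
A quick check that the suggested one-liner matches the loop it replaces. `String.repeat` is available from Java 11 onward. The class `RepeatSketch` and its method names are hypothetical, and the sketch assumes the count is a non-null, non-negative integer (a null `Integer` would fail on unboxing in either form, and `repeat` rejects negative counts).

```java
final class RepeatSketch {
    /** Original form from the diff: append the word n times with a StringBuilder. */
    static String withLoop(int n) {
        StringBuilder builder = new StringBuilder();
        for (int j = 0; j < n; j++) {
            builder.append("sentence ");
        }
        return builder.toString();
    }

    /** Suggested form: String.repeat (Java 11+). */
    static String withRepeat(int n) {
        return "sentence ".repeat(n);
    }

    public static void main(String[] args) {
        System.out.println(withLoop(3).equals(withRepeat(3))); // true
    }
}
```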


          address comments

cd8d30a

Signed-off-by: Yaliang Wu <[email protected]>

ylwu-amzn dismissed rbhavna’s stale review via

cd8d30a

December 20, 2022 02:31

codecov-commenter commented Dec 20, 2022

Codecov Report

Merging #640 (cd8d30a) into 2.x (d92e229) will decrease coverage by 0.11%.
The diff coverage is n/a.

@@             Coverage Diff              @@
##                2.x     #640      +/-   ##
============================================
- Coverage     84.68%   84.57%   -0.12%     
+ Complexity      984      982       -2     
============================================
  Files            92       92              
  Lines          3540     3540              
  Branches        326      326              
============================================
- Hits           2998     2994       -4     
- Misses          407      410       +3     
- Partials        135      136       +1

Flag	Coverage Δ
ml-commons	`84.57% <ø> (-0.12%)`	⬇️


Impacted Files	Coverage Δ
.../cluster/MLCommonsClusterManagerEventListener.java	`65.62% <0.00%> (-12.50%)`	⬇️


jngz-es approved these changes


rbhavna approved these changes


ylwu-amzn merged commit b3ae98d into opensearch-project:2.x

ylwu-amzn added a commit to ylwu-amzn/ml-commons that referenced this pull request


          add more parameters for text embedding model (opensearch-project#640)

6f20220

* add more parameters for text embedding model

Signed-off-by: Yaliang Wu <[email protected]>

* upgrade junit version to 4.13.2

Signed-off-by: Yaliang Wu <[email protected]>

* address comments

Signed-off-by: Yaliang Wu <[email protected]>

Signed-off-by: Yaliang Wu <[email protected]>

ylwu-amzn added a commit to ylwu-amzn/ml-commons that referenced this pull request


          add more parameters for text embedding model (opensearch-project#640)

bef97c8

* add more parameters for text embedding model

Signed-off-by: Yaliang Wu <[email protected]>

* upgrade junit version to 4.13.2

Signed-off-by: Yaliang Wu <[email protected]>

* address comments

Signed-off-by: Yaliang Wu <[email protected]>

Signed-off-by: Yaliang Wu <[email protected]>

ylwu-amzn mentioned this pull request

[Backport to main] add more parameters for text embedding model (#640) #756

Merged

5 tasks

ylwu-amzn added a commit that referenced this pull request


          add more parameters for text embedding model (#640) (#756)

393f12b

* add more parameters for text embedding model



* upgrade junit version to 4.13.2



* address comments

Signed-off-by: Yaliang Wu <[email protected]>
