
Read function Name from pretrained model #1529

Merged
merged 11 commits into from
Nov 15, 2023

Conversation

Collaborator

@xinyual xinyual commented Oct 18, 2023

Description

Currently, if we register a pretrained model without a URL, the function name defaults to text_embedding, since text embedding used to be the only kind of pretrained model. Now that we also have sparse encoding, we need to read the function name from the pretrained model config.
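The fallback behavior discussed here can be sketched as follows. This is a minimal illustration, not the PR's actual code: FunctionName is a stand-in for ml-commons' org.opensearch.ml.common.FunctionName (reduced to two members), and resolve is a hypothetical helper showing the idea of reading model_task_type from the pretrained config and defaulting to TEXT_EMBEDDING when the field is absent.

```java
import java.util.Locale;
import java.util.Map;

// Stand-in for ml-commons' FunctionName enum; the real enum has more members.
enum FunctionName {
    TEXT_EMBEDDING, SPARSE_ENCODING;

    static FunctionName from(String value) {
        return valueOf(value.toUpperCase(Locale.ROOT));
    }
}

public class FunctionNameFromConfig {
    // Hypothetical helper: read the function name from the pretrained model's
    // config map, falling back to TEXT_EMBEDDING (the pre-PR default) when
    // model_task_type is missing.
    static FunctionName resolve(Map<String, Object> config) {
        Object taskType = config.get("model_task_type");
        return taskType == null
            ? FunctionName.TEXT_EMBEDDING
            : FunctionName.from((String) taskType);
    }

    public static void main(String[] args) {
        System.out.println(resolve(Map.of("model_task_type", "SPARSE_ENCODING")));
        System.out.println(resolve(Map.of())); // field absent -> default
    }
}
```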

Issues Resolved

[List any issues this PR will resolve]

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@xinyual xinyual temporarily deployed to ml-commons-cicd-env October 18, 2023 04:53 — with GitHub Actions Inactive
@xinyual xinyual had a problem deploying to ml-commons-cicd-env October 18, 2023 04:53 — with GitHub Actions Failure
@codecov

codecov bot commented Oct 18, 2023

Codecov Report

Attention: 98 lines in your changes are missing coverage. Please review.

Comparison is base (568bc7e) 79.42% compared to head (3c7216b) 79.54%.
Report is 2 commits behind head on main.

Files Patch % Lines
...rithms/metrics_correlation/MetricsCorrelation.java 0.00% 46 Missing ⚠️
...pensearch/ml/engine/algorithms/DLModelExecute.java 0.00% 10 Missing ⚠️
.../ml/engine/algorithms/clustering/RCFSummarize.java 78.94% 7 Missing and 1 partial ⚠️
...ain/java/org/opensearch/ml/engine/ModelHelper.java 78.26% 4 Missing and 1 partial ⚠️
..._embedding/HuggingfaceTextEmbeddingTranslator.java 33.33% 4 Missing ⚠️
...gine/algorithms/tokenize/SparseTokenizerModel.java 78.57% 2 Missing and 1 partial ⚠️
...l/engine/algorithms/ad/AnomalyDetectionLibSVM.java 87.50% 0 Missing and 2 partials ⚠️
...thms/anomalylocalization/AnomalyLocalizerImpl.java 96.66% 0 Missing and 2 partials ⚠️
...rics_correlation/MetricsCorrelationTranslator.java 0.00% 2 Missing ⚠️
...ine/algorithms/rcf/FixedInTimeRandomCutForest.java 93.33% 1 Missing and 1 partial ⚠️
... and 12 more
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1529      +/-   ##
============================================
+ Coverage     79.42%   79.54%   +0.11%     
- Complexity     3982     3987       +5     
============================================
  Files           390      390              
  Lines         16215    16277      +62     
  Branches       1751     1751              
============================================
+ Hits          12879    12947      +68     
+ Misses         2661     2655       -6     
  Partials        675      675              
Flag Coverage Δ
ml-commons 79.54% <80.51%> (+0.11%) ⬆️

Flags with carried forward coverage won't be shown.


@@ -87,7 +87,8 @@ public void downloadPrebuiltModelConfig(String taskId, MLRegisterModelInput regi
.url(modelZipFileUrl)
.deployModel(deployModel)
.modelNodeIds(modelNodeIds)
.modelGroupId(modelGroupId);
.modelGroupId(modelGroupId)
.functionName(FunctionName.from((String) config.get("model_task_type")));
Collaborator

Can't we get the function name from registerModelInput, the way we get the other inputs on lines 57-62?

Collaborator Author

@xinyual xinyual Oct 24, 2023

The registerModelInput comes from the request body JSON, so we could ask the customer to provide it. But if we want to keep the request convention of containing only "name, version, model_format" for our pretrained models, we can only read it from the pretrained config.

Collaborator

What if config.get("model_task_type") is null?

Collaborator Author

I think for each pretrained model, we should have that field.

Collaborator

I feel like maybe we should address this section.

Currently we default to text embedding, which doesn't seem right. What happens when we start adding different pretrained models, like the splade model we just added?

Collaborator

I feel like maybe we should address this section.

Currently we default to text embedding, which doesn't seem right. What happens when we start adding different pretrained models, like the splade model we just added?

Can we make the function name mandatory and, when it's null, throw an exception instead of falling back to a default value?

Collaborator

One solution could be: add function_name to our model listing and then read the function name from there.

Collaborator Author

I feel like maybe we should address this section.
Currently we default to text embedding, which doesn't seem right. What happens when we start adding different pretrained models, like the splade model we just added?

Can we make the function name mandatory and, when it's null, throw an exception instead of falling back to a default value?

No. When we register a pretrained model we don't provide a function name, so in some scenarios it would be null. We need a default value.

Collaborator Author

@xinyual xinyual Nov 13, 2023

One solution could be: add function_name to our model listing and then read the function name from there.

I think both work. The model listing and the pretrained config are both files we maintain, so it's just a choice of which file in our S3 bucket to read from.

Collaborator

I see what you mean now. We get this model_task_type from config.json like this. Yeah this should work.
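For reference, the pretrained model's config.json carries the task type along the following lines. This is an illustrative fragment assembled from the fields mentioned in this thread (the request body example and the model_task_type key), not a copy of the actual file.

```json
{
  "name": "amazon/neural-sparse/opensearch-neural-sparse-encoding-v1",
  "version": "1.0.0",
  "model_format": "TORCH_SCRIPT",
  "model_task_type": "SPARSE_ENCODING"
}
```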

@@ -609,7 +609,7 @@ private void uploadModel(MLRegisterModelInput registerModelInput, MLTask mlTask,

private void registerModelFromUrl(MLRegisterModelInput registerModelInput, MLTask mlTask, String modelVersion) {
String taskId = mlTask.getTaskId();
FunctionName functionName = mlTask.getFunctionName();
FunctionName functionName = registerModelInput.getFunctionName();
Collaborator

Any reason to change to registerModelInput? Will it break BWC?

Collaborator Author

If we don't provide a URL, the function name in the ML task is read from the request body, while the modelInput is generated from the config. If we kept using the function name from the ML task, the name would still be null and the default would be "text_embedding". I have tested locally with request bodies both with and without a URL; both worked for me.

Collaborator

MLTask must track the correct function name. Can you check whether the function name in MLTask is correct?

Collaborator Author

No. I have tested it again. If the request body is like:
{
"name": "amazon/neural-sparse/opensearch-neural-sparse-encoding-v1",
"version": "1.0.0",
"model_format": "TORCH_SCRIPT"
}

The function name inside the ml task would be text_embedding.

Collaborator

@ylwu-amzn ylwu-amzn Nov 10, 2023

I mean, after changing to registerModelInput.getFunctionName(), is the function name in MLTask now correct for both text embedding and sparse models?

Collaborator Author

Already done: I set the function name inside the MLTask and rewrite it to the ML index.

@dhrubo-os
Collaborator

There's a merge conflict.

Signed-off-by: xinyual <[email protected]>
@xinyual
Collaborator Author

xinyual commented Nov 14, 2023

There's a merge conflict.

I guess we can do it now. I have merged from the main branch.

Signed-off-by: xinyual <[email protected]>
@xinyual xinyual temporarily deployed to ml-commons-cicd-env November 14, 2023 23:49 — with GitHub Actions Inactive
@xinyual xinyual had a problem deploying to ml-commons-cicd-env November 14, 2023 23:49 — with GitHub Actions Failure
@xinyual xinyual had a problem deploying to ml-commons-cicd-env November 15, 2023 00:14 — with GitHub Actions Failure
@zane-neo zane-neo merged commit 4d53db5 into opensearch-project:main Nov 15, 2023
10 of 14 checks passed
@opensearch-trigger-bot
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-1529-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 4d53db5d987b1940102c0f1eba12295a2f1bd5ca
# Push it to GitHub
git push --set-upstream origin backport/backport-1529-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-1529-to-2.x.

xinyual added a commit to xinyual/ml-commons that referenced this pull request Nov 15, 2023
zane-neo pushed a commit that referenced this pull request Nov 15, 2023
austintlee pushed a commit to austintlee/ml-commons that referenced this pull request Feb 29, 2024
* read Function name from pretrained config

Signed-off-by: xinyual <[email protected]>

* rewrite mltask

Signed-off-by: xinyual <[email protected]>

* optimize import

Signed-off-by: xinyual <[email protected]>

* apply spotless

Signed-off-by: xinyual <[email protected]>

* add test for function name

Signed-off-by: xinyual <[email protected]>

* apply spotless

Signed-off-by: xinyual <[email protected]>

* maintain single import

Signed-off-by: xinyual <[email protected]>

* add more test

Signed-off-by: xinyual <[email protected]>

* apply spot less

Signed-off-by: xinyual <[email protected]>

* apply spot less

Signed-off-by: xinyual <[email protected]>

---------

Signed-off-by: xinyual <[email protected]>
4 participants