[Backport to main]Add tokenizer and sparse encoding (#1301) #1394
Closed
Squashed commit message: one entry per commit in the commit list below, each signed off with Signed-off-by: xinyual <[email protected]>
zane-neo requested review from b4sjoo, dhrubo-os, jngz-es, model-collapse, rbhavna, wujunshen, ylwu-amzn and Zhangxunmt as code owners September 27, 2023 01:37
zane-neo had a problem deploying to ml-commons-cicd-env September 27, 2023 01:37 with GitHub Actions: Failure
zane-neo had a problem deploying to ml-commons-cicd-env September 27, 2023 01:37 with GitHub Actions: Failure
zane-neo had a problem deploying to ml-commons-cicd-env September 27, 2023 01:37 with GitHub Actions: Error
zane-neo had a problem deploying to ml-commons-cicd-env September 27, 2023 01:37 with GitHub Actions: Error
zane-neo had a problem deploying to ml-commons-cicd-env September 27, 2023 01:37 with GitHub Actions: Error
zane-neo had a problem deploying to ml-commons-cicd-env September 27, 2023 01:37 with GitHub Actions: Error
zane-neo changed the title from "Add tokenizer and sparse encoding (#1301)" to "[Backport to main]Add tokenizer and sparse encoding (#1301)" Sep 27, 2023
dhrubo-os approved these changes Sep 27, 2023
Closing this since it's already backported.
add tokenizer and sparse encoding
add tokenizer and sparse encoding
add tokenizer and sparse encoding
add tokenizer and sparse encoding
add tokenizer and sparse encoding
remove special token
add filter
try empty model
remove warm up
try empty model
add block
add log
add log
add log
remove log
remove pt file detect
add log
add functionName pipeline
remove verify log
skip special token in sparse encoding
skip omit tokenize config
skip omit tokenize config-change warm up logic
reArch
deduplicate
omit ml config in sparse encoding
add null config in warm up
fix original test
add tokenize ut half
fix sparse encoding bug
add UT for sparse encoding and tokenize
remove useless framwork type
common/src/test/java/org/opensearch/ml/common/input/MLInputTest.java
change key for tokenize
reArch DLModel
reArch DLModel again
response format
tokenize only one output
clean sparse output
clean sparse output
change UT number
remove useless predict code
remove useless part
change tokenize way
reArch add textEmbedding model
add tokenize logic
add abstract
clear code
fix it class
fix it class
add IT file
reformulate
reformulate remote inference
reformulate remote inference
reformulate remote inference json and array
verify
undo string utils
skip dummy model
skip dummy model
skip dummy model
skip dummy model
skip dummy model
skip dummy model
add inner load Model
rename variable
add default for idf
add ut for sparse encoding and tokenizer
add close model
change mock class
remove buffer for sparse encoding output
change tokenize model ready logic
rewrite input functionName
deduplicate
change UT usage
fix downloadAndSplit test
fix Helper test
remove meaningless change
remove complie change
rename
fix typo error and simplify wrap code
add comment
using gson and remove useless close logic
update comment and import problem
add static idf name
fix format problem
extract an abstract model for sparse and dense sentence transformer translator
fix typo error
remove duplicate tokenizer file, fix import problem and add comment for tokenizer model
Description
Backport 31a4e25 from #1301
Issues Resolved
[List any issues this PR will resolve]
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.