[Backport to main]Add tokenizer and sparse encoding (#1301) #1394

zane-neo · 2023-09-27T01:37:11Z

add tokenizer and sparse encoding
add tokenizer and sparse encoding
add tokenizer and sparse encoding
add tokenizer and sparse encoding
add tokenizer and sparse encoding
remove special token
add filter
try empty model
remove warm up
try empty model
add block
add log
add log
add log
remove log
remove pt file detect
add log
add functionName pipeline
remove verify log
skip special token in sparse encoding
skip omit tokenize config
skip omit tokenize config-change warm up logic
reArch
deduplicate
omit ml config in sparse encoding
add null config in warm up
fix original test
add tokenize ut half
fix sparse encoding bug
add UT for sparse encoding and tokenize
remove useless framwork type
common/src/test/java/org/opensearch/ml/common/input/MLInputTest.java
change key for tokenize
reArch DLModel
reArch DLModel again
response format
tokenize only one output
clean sparse output
clean sparse output
change UT number
remove useless predict code
remove useless part
change tokenize way
reArch add textEmbedding model
add tokenize logic
add abstract
clear code
fix it class
fix it class
add IT file
reformulate
reformulate remote inference
reformulate remote inference
reformulate remote inference json and array
verify
undo string utils
skip dummy model
skip dummy model
skip dummy model
skip dummy model
skip dummy model
skip dummy model
add inner load Model
rename variable
add default for idf
add ut for sparse encoding and tokenizer
add close model
change mock class
remove buffer for sparse encoding output
change tokenize model ready logic
rewrite input functionName
deduplicate
change UT usage
fix downloadAndSplit test
fix Helper test
remove meaningless change
remove complie change
rename
fix typo error and simplify wrap code
add comment
using gson and remove useless close logic
update comment and import problem
add static idf name
fix format problem
extract an abstract model for sparse and dense sentence transformer translator
fix typo error
remove duplicate tokenizer file, fix import problem and add comment for tokenizer model

Description

Backport 31a4e25 from #1301

Issues Resolved

[List any issues this PR will resolve]

Check List

- New functionality includes testing.
  - All tests pass
- New functionality has been documented.
  - New functionality has javadoc added
- Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

* add tokenizer and sparse encoding Signed-off-by: xinyual <[email protected]>
* add tokenizer and sparse encoding Signed-off-by: xinyual <[email protected]>
* add tokenizer and sparse encoding Signed-off-by: xinyual <[email protected]>
* add tokenizer and sparse encoding Signed-off-by: xinyual <[email protected]>
* add tokenizer and sparse encoding Signed-off-by: xinyual <[email protected]>
* remove special token Signed-off-by: xinyual <[email protected]>
* add filter Signed-off-by: xinyual <[email protected]>
* try empty model Signed-off-by: xinyual <[email protected]>
* remove warm up Signed-off-by: xinyual <[email protected]>
* try empty model Signed-off-by: xinyual <[email protected]>
* add block Signed-off-by: xinyual <[email protected]>
* add log Signed-off-by: xinyual <[email protected]>
* add log Signed-off-by: xinyual <[email protected]>
* add log Signed-off-by: xinyual <[email protected]>
* remove log Signed-off-by: xinyual <[email protected]>
* remove pt file detect Signed-off-by: xinyual <[email protected]>
* add log Signed-off-by: xinyual <[email protected]>
* add functionName pipeline Signed-off-by: xinyual <[email protected]>
* remove verify log Signed-off-by: xinyual <[email protected]>
* skip special token in sparse encoding Signed-off-by: xinyual <[email protected]>
* skip omit tokenize config Signed-off-by: xinyual <[email protected]>
* skip omit tokenize config-change warm up logic Signed-off-by: xinyual <[email protected]>
* reArch Signed-off-by: xinyual <[email protected]>
* deduplicate Signed-off-by: xinyual <[email protected]>
* omit ml config in sparse encoding Signed-off-by: xinyual <[email protected]>
* add null config in warm up Signed-off-by: xinyual <[email protected]>
* fix original test Signed-off-by: xinyual <[email protected]>
* add tokenize ut half Signed-off-by: xinyual <[email protected]>
* fix sparse encoding bug Signed-off-by: xinyual <[email protected]>
* add UT for sparse encoding and tokenize Signed-off-by: xinyual <[email protected]>
* remove useless framwork type Signed-off-by: xinyual <[email protected]>
* common/src/test/java/org/opensearch/ml/common/input/MLInputTest.java Signed-off-by: xinyual <[email protected]>
* change key for tokenize Signed-off-by: xinyual <[email protected]>
* reArch DLModel Signed-off-by: xinyual <[email protected]>
* reArch DLModel again Signed-off-by: xinyual <[email protected]>
* response format Signed-off-by: xinyual <[email protected]>
* tokenize only one output Signed-off-by: xinyual <[email protected]>
* clean sparse output Signed-off-by: xinyual <[email protected]>
* clean sparse output Signed-off-by: xinyual <[email protected]>
* change UT number Signed-off-by: xinyual <[email protected]>
* remove useless predict code Signed-off-by: xinyual <[email protected]>
* remove useless part Signed-off-by: xinyual <[email protected]>
* change tokenize way Signed-off-by: xinyual <[email protected]>
* reArch add textEmbedding model Signed-off-by: xinyual <[email protected]>
* add tokenize logic Signed-off-by: xinyual <[email protected]>
* add abstract Signed-off-by: xinyual <[email protected]>
* clear code Signed-off-by: xinyual <[email protected]>
* fix it class Signed-off-by: xinyual <[email protected]>
* fix it class Signed-off-by: xinyual <[email protected]>
* add IT file Signed-off-by: xinyual <[email protected]>
* reformulate Signed-off-by: xinyual <[email protected]>
* reformulate remote inference Signed-off-by: xinyual <[email protected]>
* reformulate remote inference Signed-off-by: xinyual <[email protected]>
* reformulate remote inference json and array Signed-off-by: xinyual <[email protected]>
* verify Signed-off-by: xinyual <[email protected]>
* undo string utils Signed-off-by: xinyual <[email protected]>
* skip dummy model Signed-off-by: xinyual <[email protected]>
* skip dummy model Signed-off-by: xinyual <[email protected]>
* skip dummy model Signed-off-by: xinyual <[email protected]>
* skip dummy model Signed-off-by: xinyual <[email protected]>
* skip dummy model Signed-off-by: xinyual <[email protected]>
* skip dummy model Signed-off-by: xinyual <[email protected]>
* add inner load Model Signed-off-by: xinyual <[email protected]>
* rename variable Signed-off-by: xinyual <[email protected]>
* add default for idf Signed-off-by: xinyual <[email protected]>
* add ut for sparse encoding and tokenizer Signed-off-by: xinyual <[email protected]>
* add close model Signed-off-by: xinyual <[email protected]>
* change mock class Signed-off-by: xinyual <[email protected]>
* remove buffer for sparse encoding output Signed-off-by: xinyual <[email protected]>
* change tokenize model ready logic Signed-off-by: xinyual <[email protected]>
* rewrite input functionName Signed-off-by: xinyual <[email protected]>
* deduplicate Signed-off-by: xinyual <[email protected]>
* change UT usage Signed-off-by: xinyual <[email protected]>
* fix downloadAndSplit test Signed-off-by: xinyual <[email protected]>
* fix Helper test Signed-off-by: xinyual <[email protected]>
* remove meaningless change Signed-off-by: xinyual <[email protected]>
* remove complie change Signed-off-by: xinyual <[email protected]>
* rename Signed-off-by: xinyual <[email protected]>
* fix typo error and simplify wrap code Signed-off-by: xinyual <[email protected]>
* add comment Signed-off-by: xinyual <[email protected]>
* using gson and remove useless close logic Signed-off-by: xinyual <[email protected]>
* update comment and import problem Signed-off-by: xinyual <[email protected]>
* add static idf name Signed-off-by: xinyual <[email protected]>
* fix format problem Signed-off-by: xinyual <[email protected]>
* extract an abstract model for sparse and dense sentence transformer translator Signed-off-by: xinyual <[email protected]>
* fix typo error Signed-off-by: xinyual <[email protected]>
* remove duplicate tokenizer file, fix import problem and add comment for tokenizer model Signed-off-by: xinyual <[email protected]>

---------

Signed-off-by: xinyual <[email protected]>

zane-neo · 2023-09-27T02:36:25Z

Closing this since it's already backported

zane-neo requested review from b4sjoo, dhrubo-os, jngz-es, model-collapse, rbhavna, wujunshen, ylwu-amzn and Zhangxunmt as code owners September 27, 2023 01:37

zane-neo had a problem deploying to ml-commons-cicd-env September 27, 2023 01:37 — with GitHub Actions Failure

zane-neo had a problem deploying to ml-commons-cicd-env September 27, 2023 01:37 — with GitHub Actions Error

zane-neo changed the title ~~Add tokenizer and sparse encoding (#1301)~~ [Backport to main]Add tokenizer and sparse encoding (#1301) Sep 27, 2023

dhrubo-os approved these changes Sep 27, 2023

zane-neo closed this Sep 27, 2023
