[Backport to main]Add tokenizer and sparse encoding (#1301) #1394

Closed

zane-neo
Collaborator

  • add tokenizer and sparse encoding

  • add tokenizer and sparse encoding

  • add tokenizer and sparse encoding

  • add tokenizer and sparse encoding

  • add tokenizer and sparse encoding

  • remove special token

  • add filter

  • try empty model

  • remove warm up

  • try empty model

  • add block

  • add log

  • add log

  • add log

  • remove log

  • remove pt file detect

  • add log

  • add functionName pipeline

  • remove verify log

  • skip special token in sparse encoding

  • skip omit tokenize config

  • skip omit tokenize config-change warm up logic

  • reArch

  • deduplicate

  • omit ml config in sparse encoding

  • add null config in warm up

  • fix original test

  • add tokenize ut half

  • fix sparse encoding bug

  • add UT for sparse encoding and tokenize

  • remove useless framwork type

  • common/src/test/java/org/opensearch/ml/common/input/MLInputTest.java

  • change key for tokenize

  • reArch DLModel

  • reArch DLModel again

  • response format

  • tokenize only one output

  • clean sparse output

  • clean sparse output

  • change UT number

  • remove useless predict code

  • remove useless part

  • change tokenize way

  • reArch add textEmbedding model

  • add tokenize logic

  • add abstract

  • clear code

  • fix it class

  • fix it class

  • add IT file

  • reformulate

  • reformulate remote inference

  • reformulate remote inference

  • reformulate remote inference json and array

  • verify

  • undo string utils

  • skip dummy model

  • skip dummy model

  • skip dummy model

  • skip dummy model

  • skip dummy model

  • skip dummy model

  • add inner load Model

  • rename variable

  • add default for idf

  • add ut for sparse encoding and tokenizer

  • add close model

  • change mock class

  • remove buffer for sparse encoding output

  • change tokenize model ready logic

  • rewrite input functionName

  • deduplicate

  • change UT usage

  • fix downloadAndSplit test

  • fix Helper test

  • remove meaningless change

  • remove complie change

  • rename

  • fix typo error and simplify wrap code

  • add comment

  • using gson and remove useless close logic

  • update comment and import problem

  • add static idf name

  • fix format problem

  • extract an abstract model for sparse and dense sentence transformer translator

  • fix typo error

  • remove duplicate tokenizer file, fix import problem and add comment for tokenizer model


Description

Backport 31a4e25 from #1301

Issues Resolved

[List any issues this PR will resolve]

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

---------

Signed-off-by: xinyual <[email protected]>
@zane-neo zane-neo changed the title Add tokenizer and sparse encoding (#1301) [Backport to main]Add tokenizer and sparse encoding (#1301) Sep 27, 2023
@zane-neo
Collaborator Author

Closing this since it's already backported

@zane-neo zane-neo closed this Sep 27, 2023