[Feature] Add New Question Answering Model #349

Open
wants to merge 9 commits into
base: main

@faradawn faradawn commented Nov 30, 2023

Description

  • Added question_answering_model.py to trace the model into TorchScript or ONNX format.
  • Added a test file to compare the traced model's output with the original model's output.

Issues Resolved

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

Test cases

All tests passed.

test_cases = [
    {
        "question": "Who was Jim Henson?",
        "context": "Jim Henson was a nice puppet"
    },
    {
        "question": "Where do I live?",
        "context": "My name is Sarah and I live in London"
    },
    {
        "question": "What's my name?",
        "context": "My name is Clara and I live in Berkeley."
    },
    {
        "question": "Which name is also used to describe the Amazon rainforest in English?",
        "context": "The Amazon rainforest (Portuguese: Floresta Amazônica or Amazônia; Spanish: Selva Amazónica, Amazonía or usually Amazonia; French: Forêt amazonienne; Dutch: Amazoneregenwoud), also known in English as Amazonia or the Amazon Jungle, is a moist broadleaf forest that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 square kilometres (2,700,000 sq mi), of which 5,500,000 square kilometres (2,100,000 sq mi) are covered by the rainforest. This region includes territory belonging to nine nations. The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, Colombia with 10%, and with minor amounts in Venezuela, Ecuador, Bolivia, Guyana, Suriname and French Guiana. States or departments in four nations contain 'Amazonas' in their names. The Amazon represents over half of the planet's remaining rainforests, and comprises the largest and most biodiverse tract of tropical rainforest in the world, with an estimated 390 billion individual trees divided into 16,000 species."
    }
]
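
A minimal sketch of how test cases like these could drive a comparison between the original and the traced model. The PR's actual test file is not shown in this conversation, so everything here is an assumption: the two predict callables are hypothetical stand-ins that return start and end logits as lists of floats, and the tolerance is illustrative.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Logits = Tuple[Sequence[float], Sequence[float]]  # (start_logits, end_logits)
Predict = Callable[[str, str], Logits]


def outputs_match(original: Predict, traced: Predict,
                  cases: List[Dict[str, str]], abs_tol: float = 1e-4) -> bool:
    """Return True if both callables give near-identical logits on every case."""
    for case in cases:
        expected = original(case["question"], case["context"])
        actual = traced(case["question"], case["context"])
        for exp_vec, act_vec in zip(expected, actual):
            if len(exp_vec) != len(act_vec):
                return False
            if not all(math.isclose(e, a, abs_tol=abs_tol)
                       for e, a in zip(exp_vec, act_vec)):
                return False
    return True


# Hypothetical stand-ins so the sketch runs without downloading a model.
def fake_original(question: str, context: str) -> Logits:
    n = len(context.split())
    return [0.1 * i for i in range(n)], [0.2 * i for i in range(n)]


def fake_traced(question: str, context: str) -> Logits:
    # Simulates the tiny numeric drift a traced model may show.
    start, end = fake_original(question, context)
    return [s + 1e-6 for s in start], [e - 1e-6 for e in end]


sample_cases = [
    {"question": "Where do I live?",
     "context": "My name is Sarah and I live in London"},
]
result = outputs_match(fake_original, fake_traced, sample_cases)
```

With real models, the stand-ins would be replaced by calls into the original and traced models; comparing within a tolerance rather than for exact equality allows for small floating-point differences.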

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.


__all__ = ["SentenceTransformerModel", "MCorr"]
__all__ = ["SentenceTransformerModel", "SentenceTransformerModel", "MCorr"]
Contributor

@rawwar rawwar Nov 30, 2023

Is this supposed to be adding QuestionAnsweringModel?

import yaml
from accelerate import Accelerator, notebook_launcher
from mdutils.fileutils import MarkDownFile
# from sentence_transformers import SentenceTransformer
Contributor

Can we remove commented imports?


default_model_id = "distilbert-base-cased-distilled-squad"

def clean_test_folder(TEST_FOLDER):
Contributor

@rawwar rawwar Nov 30, 2023

Can you please take a look at this link? It talks about how to use temporary files and directories with pytest. Do you think using these fixtures would help us?

Also, a helpful Stack Overflow post: https://stackoverflow.com/questions/51593595/pytest-auto-delete-temporary-directory-created-with-tmpdir-factory
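
For reference, a minimal sketch of what the fixture-based approach could look like, using pytest's built-in tmp_path fixture. The save_dummy_artifact helper and the file name are hypothetical stand-ins for the model-saving call, not code from this PR.

```python
from pathlib import Path


def save_dummy_artifact(folder: Path) -> Path:
    """Hypothetical stand-in for a model-saving call; writes one file."""
    folder.mkdir(parents=True, exist_ok=True)
    artifact = folder / "model.pt"
    artifact.write_bytes(b"not a real model")
    return artifact


def test_artifact_is_written(tmp_path):
    # pytest injects a unique temporary directory (a pathlib.Path) as tmp_path,
    # so the test needs no manual clean-up helper
    artifact = save_dummy_artifact(tmp_path / "question-model-folder")
    assert artifact.exists()
    assert artifact.read_bytes() == b"not a real model"
```

Because each test receives its own directory and pytest manages its lifetime, a hand-written clean_test_folder helper would likely become unnecessary.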

# max_position_embeddings

# AutoTokenizer will save tokenizer.json in save_json_folder_name
# DistilBertTokenizer will save it in cache: /Users/faradawn/.cache/huggingface/hub/models/...
Contributor

These seem to be your notes?

else self.onnx_zip_file_path
)

# model_zip_file_path = '/Users/faradawn/CS/opensearch-py-ml/opensearch_py_ml/ml_models/question-model-folder/distilbert-base-cased-distilled-squad.zip'
Contributor

Removable comment?

Contributor

@rawwar rawwar left a comment

I have only reviewed the code and haven't actually tested how it works. I will provide more detailed feedback once I run it locally. Thanks!

@faradawn
Author

Hi Kalyan,

Thank you for the detailed feedback! I have removed the unnecessary comments and fixed __init__.py.

Regarding the use of fixtures in pytest, I was following sentence_transformer's pytest structure, which uses a raw "clean folder" function. I hoped to keep the test files similar, but I would love to learn about using fixtures if they would make a significant improvement.

Thanks,
Faradawn

@rawwar
Contributor

rawwar commented Nov 30, 2023

@dhrubo-os, can you please approve the tests to run?

codecov bot commented Dec 6, 2023

Codecov Report

Attention: Patch coverage is 94.77124%, with 8 lines in your changes missing coverage. Please review.

Project coverage is 91.64%. Comparing base (529ee34) to head (2e0b7c5).

Files                                                    Patch %   Lines
...search_py_ml/ml_models/question_answering_model.py    94.70%    8 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #349      +/-   ##
==========================================
+ Coverage   91.53%   91.64%   +0.10%     
==========================================
  Files          42       43       +1     
  Lines        4395     4547     +152     
==========================================
+ Hits         4023     4167     +144     
- Misses        372      380       +8     

@dhrubo-os
Collaborator

@faradawn I think overall this is a great start. Thanks for raising this PR.

  1. Lint is failing.
  2. Codecov is already showing some missing coverage; please add more unit tests.
  3. Address the PR comments.

We are extending the program for a few more weeks, so please continue on this PR. Thanks, happy coding.

@faradawn
Author

faradawn commented Dec 6, 2023 via email

opensearch_py_ml/ml_models/question_answering_model.py (outdated)
Download the model directly from huggingface, convert model to torch script format,
zip the model file and its tokenizer.json file to prepare to upload to the Open Search cluster

:param sentences:
Contributor

It looks like sentences is assigned a default value of ["today is sunny"], so is it still required?

Author

Changed to optional.

Required, for example sentences = ['today is sunny']
:type sentences: List of string [str]
:param model_id:
question answering model id to download model from question answerings.
Contributor

Is the model_id also optional? It seems to have a default model id assigned.

Author

Changed to optional.
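
As an illustration of the docstring change discussed in these threads, here is a minimal sketch of a function whose parameters have defaults and are documented as optional. The describe_export helper is hypothetical and is not the PR's method; only the default values are taken from this conversation.

```python
from typing import List, Optional

DEFAULT_MODEL_ID = "distilbert-base-cased-distilled-squad"


def describe_export(sentences: Optional[List[str]] = None,
                    model_id: str = DEFAULT_MODEL_ID) -> dict:
    """
    Hypothetical helper, not the PR's method; shows only the docstring style.

    :param sentences:
        Optional, example inputs used for tracing. Defaults to ["today is sunny"].
    :type sentences: List of string [str]
    :param model_id:
        Optional, Hugging Face model id to download.
        Defaults to "distilbert-base-cased-distilled-squad".
    :type model_id: string
    """
    if sentences is None:
        # Avoid a mutable default argument; build the default per call.
        sentences = ["today is sunny"]
    return {"sentences": sentences, "model_id": model_id}
```

Using None as the sentinel keeps the list default from being shared between calls, which is a common pitfall when a list literal is used directly as a default argument.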

Required, for example sentences = ['today is sunny']
:type sentences: List of string [str]
:param model_id:
question answering model id to download model from question answerings.
Contributor

Is this going to download from "question answerings" or from Hugging Face?

Author

It will download the model with the given model_id from Hugging Face.

zip the model file and its tokenizer.json file to prepare to upload to the Open Search cluster

:param model_id:
question answering model id to download model from question answerings.
Contributor

Is this going to download from "question answerings" or from Hugging Face?

Author

It will download from Hugging Face. Thanks!

Download question answering model directly from huggingface, convert model to onnx format,
zip the model file and its tokenizer.json file to prepare to upload to the Open Search cluster

:param model_id:
Contributor

Is the model_id also optional? It seems to have a default model id assigned.

Author

Changed to optional.
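
The docstrings quoted in the threads above describe zipping the model file together with its tokenizer.json. A minimal sketch of that step, using only the standard library, might look like the following; the zip_model function and its parameter names are assumptions for illustration and are not taken from the PR.

```python
import zipfile
from pathlib import Path


def zip_model(model_path: Path, tokenizer_path: Path, zip_path: Path) -> Path:
    """Bundle a model file and its tokenizer.json into one zip archive."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # arcname drops the directories so both files sit at the archive root
        archive.write(model_path, arcname=model_path.name)
        archive.write(tokenizer_path, arcname=tokenizer_path.name)
    return zip_path
```

Storing both files at the archive root is a design choice made for this sketch; the layout the cluster actually expects should be checked against the project's documentation.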

@faradawn
Author

Hi @mingshl,

Thank you for the careful review. I have made changes to the function descriptions, e.g. optional parameters, accordingly.

Hi @dhrubo-os,

I have checked CodeCov's result and added unit tests accordingly. The code is ready for a CodeCov test again.

Thanks,
Faradawn

@dhrubo-os
Collaborator

@faradawn Let's wrap up the PR. Can you please fix the conflicts?

@faradawn
Author

Got it, Dhrubo. I will fix the following lint issue.

nox > black --check --target-version=py38 setup.py noxfile.py opensearch_py_ml/ utils/ tests/
would reformat /home/runner/work/opensearch-py-ml/opensearch-py-ml/opensearch_py_ml/ml_models/question_answering_model.py
would reformat /home/runner/work/opensearch-py-ml/opensearch-py-ml/tests/ml_models/test_question_answering_pytest.py

@dhrubo-os
Collaborator

Let's add the changelog file.

@faradawn
Author

Hi @dhrubo-os,

Thanks for checking! I have added a missing package to requirement-dev.txt, added an entry to the CHANGELOG file, and fixed the formatting issues.

On my Mac, I only know how to run pytest and nox -rs test. If there is a more comprehensive test suite I can run, please let me know!

Thanks.

@faradawn
Author

Hi @dhrubo-os, the failing integration test is fixed. There are 4 workflows awaiting approval. If there is anything I can do, please let me know.

4 participants