refactor: improve support for dataclasses #3142

Merged
merged 31 commits into from
Sep 9, 2022

Conversation

danielbichuetti
Contributor

@danielbichuetti danielbichuetti commented Sep 2, 2022

Related Issues

Proposed Changes:

The latest pydantic update caused some tests to fail. The main cause is an initialization step that pydantic performs on dataclasses for its own internal logic. The extra fields it adds were being copied into Document.meta by Document's own logic. To avoid this error, I now filter out both the initialization variable used by current pydantic and the one used by older versions (<1.10). This makes the behavior more consistent.
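To illustrate the kind of filtering described above, here is a minimal, standard-library-only sketch. It is not the code from this PR: the `Doc` class, its `to_dict`/`from_dict` helpers, and the `__internal_flag__` attribute are hypothetical stand-ins (the real attribute names set by pydantic are not reproduced here). It assumes that unknown keys are folded into `meta` on deserialization, which is how an internal attribute could end up there.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Doc:
    """Hypothetical stand-in for a document dataclass with a free-form meta dict."""

    content: str
    meta: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        # Skip dunder-prefixed instance attributes so that library-internal
        # bookkeeping never reaches the serialized form.
        return {k: v for k, v in self.__dict__.items() if not k.startswith("__")}

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "Doc":
        # Assumption for this sketch: any key that is not a known field
        # is moved into meta.
        known = {"content", "meta"}
        meta = dict(data.get("meta") or {})
        for key, value in data.items():
            if key not in known:
                meta[key] = value
        return cls(content=data["content"], meta=meta)


doc = Doc(content="text1", meta={"name": "doc1"})
# Simulate a library attaching an internal attribute to the instance.
doc.__dict__["__internal_flag__"] = True

# Without filtering, the internal attribute leaks into meta.
leaky = Doc.from_dict(dict(doc.__dict__))

# With filtering in to_dict, the round trip leaves meta untouched.
round_tripped = Doc.from_dict(doc.to_dict())
```

In this sketch `leaky.meta` picks up the internal key, while `round_tripped.meta` keeps only the user-supplied entry.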

Furthermore, I've refactored class initialization to start using the dataclasses features. The custom logic has been moved to a private post_init method.

UPDATE: because of some differences in dataclass initialization between Python 3.7 and 3.9, the code had to be refactored, and the exclusive use of post_init has been dropped. Using a custom init and InitVar worked on the supported environments.
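For readers unfamiliar with the two dataclasses mechanisms mentioned above, the following is a rough sketch of how `InitVar` and `__post_init__` can work together. It is not the implementation from this PR: the `Doc` class, its field layout, and the MD5-over-joined-values hashing scheme are assumptions made only for illustration.

```python
import hashlib
from dataclasses import InitVar, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Doc:
    """Hypothetical document whose id is derived from selected attributes."""

    content: str
    meta: Dict[str, Any] = field(default_factory=dict)
    # InitVar: accepted by the generated __init__ and passed on to
    # __post_init__, but not treated as a dataclass field.
    id_hash_keys: InitVar[Optional[List[str]]] = None
    # Computed in __post_init__, so it is excluded from __init__.
    id: str = field(init=False, default="")

    def __post_init__(self, id_hash_keys: Optional[List[str]]) -> None:
        keys = id_hash_keys or ["content"]
        # Assumed hashing scheme: join the string form of each selected attribute.
        payload = ":".join(str(getattr(self, key)) for key in keys)
        self.id = hashlib.md5(payload.encode("utf-8")).hexdigest()


by_content_1 = Doc("text1", {"name": "doc1"}, id_hash_keys=["content"])
by_content_2 = Doc("text1", {"name": "doc2"}, id_hash_keys=["content"])

by_both_1 = Doc("text1", {"name": "doc1"}, id_hash_keys=["content", "meta"])
by_both_2 = Doc("text1", {"name": "doc2"}, id_hash_keys=["content", "meta"])
```

With this sketch, documents hashed by content alone share an id, while documents hashed by content and meta do not when their meta differs.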

How did you test it?

Currently running the complete test suite.

Notes for the reviewer

Checklist

@danielbichuetti danielbichuetti requested review from a team as code owners September 2, 2022 14:02
@danielbichuetti danielbichuetti requested review from ZanSara and removed request for a team September 2, 2022 14:02
@danielbichuetti
Contributor Author

The current code needs refactoring due to the Optional arg and mypy. I'll push a new commit.

@ZanSara
Contributor

ZanSara commented Sep 2, 2022

Hey @danielbichuetti, wow, thank you! 🤩 Can I get started with the review already, or would you prefer that I wait for your mypy and typing fixes?

@danielbichuetti
Contributor Author

danielbichuetti commented Sep 2, 2022

Hi @ZanSara. I will revert the initialization changes, so mypy results are ok with pydantic, and code can get checked by other tests.

I've started to work on some refactoring, looking for a better mypy integration.

As a side note, am I the only one getting issues with VS Code and pre-commits?

@danielbichuetti
Contributor Author

danielbichuetti commented Sep 2, 2022

Ok, beyond the mypy checking, there are differences when running tests on Python 3.9 and the CI which uses 3.7. I'm looking at what causes the errors in the different environments.

As a complement for further testing:

============================= test session starts ==============================
platform linux -- Python 3.9.13, pytest-7.1.2, pluggy-1.0.0
rootdir: /home/danielbichuetti/Dev/haystack, configfile: pyproject.toml
plugins: anyio-3.6.1, custom-exit-code-0.3.0, typeguard-2.13.3
collected 18 items

test/others/test_schema.py::test_no_answer_label PASSED                  [  5%]
test/others/test_schema.py::test_equal_label PASSED                      [ 11%]
test/others/test_schema.py::test_answer_to_json PASSED                   [ 16%]
test/others/test_schema.py::test_answer_to_dict PASSED                   [ 22%]
test/others/test_schema.py::test_label_to_json PASSED                    [ 27%]
test/others/test_schema.py::test_label_to_dict PASSED                    [ 33%]
test/others/test_schema.py::test_doc_to_json PASSED                      [ 38%]
test/others/test_schema.py::test_answer_postinit PASSED                  [ 44%]
test/others/test_schema.py::test_generate_doc_id_using_text PASSED       [ 50%]
test/others/test_schema.py::test_generate_doc_id_using_custom_list PASSED [ 55%]
test/others/test_schema.py::test_aggregate_labels_with_labels PASSED     [ 61%]
test/others/test_schema.py::test_multilabel_preserve_order PASSED        [ 66%]
test/others/test_schema.py::test_multilabel_preserve_order_w_duplicates PASSED [ 72%]
test/others/test_schema.py::test_multilabel_id PASSED                    [ 77%]
test/others/test_schema.py::test_serialize_speech_document PASSED        [ 83%]
test/others/test_schema.py::test_deserialize_speech_document PASSED      [ 88%]
test/others/test_schema.py::test_serialize_speech_answer PASSED          [ 94%]
test/others/test_schema.py::test_deserialize_speech_answer PASSED        [100%]

=============================== warnings summary ===============================
venv/lib/python3.9/site-packages/transformers/image_utils.py:222
  /home/danielbichuetti/Dev/haystack/venv/lib/python3.9/site-packages/transformers/image_utils.py:222: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
    def resize(self, image, size, resample=PIL.Image.BILINEAR, default_to_square=True, max_size=None):

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
-------------- generated xml file: /tmp/tmp-17508cX6nXSA55Gqu.xml --------------
======================== 18 passed, 1 warning in 2.37s =========================

And on CI, Python 3.7:

=================================== FAILURES ===================================
____________________ test_generate_doc_id_using_custom_list ____________________

    def test_generate_doc_id_using_custom_list():
        text1 = "text1"
        text2 = "text2"
    
        doc1_meta1_id_by_content = Document(content=text1, meta={"name": "doc1"}, id_hash_keys=["content"])
        doc1_meta2_id_by_content = Document(content=text1, meta={"name": "doc2"}, id_hash_keys=["content"])
        assert doc1_meta1_id_by_content.id == doc1_meta2_id_by_content.id
    
        doc1_meta1_id_by_content_and_meta = Document(content=text1, meta={"name": "doc1"}, id_hash_keys=["content", "meta"])
        doc1_meta2_id_by_content_and_meta = Document(content=text1, meta={"name": "doc2"}, id_hash_keys=["content", "meta"])
>       assert doc1_meta1_id_by_content_and_meta.id != doc1_meta2_id_by_content_and_meta.id
E       AssertionError: assert '4f7944b21e062f189b8c19e9293c7602' != '4f7944b21e062f189b8c19e9293c7602'
E        +  where '4f7944b21e062f189b8c19e9293c7602' = <Document: {'content': 'text1', 'content_type': 'text', 'score': None, 'meta': {'name': 'doc1'}, 'embedding': None, 'id': '4f7944b21e062f189b8c19e9293c7602'}>.id
E        +  and   '4f7944b21e062f189b8c19e9293c7602' = <Document: {'content': 'text1', 'content_type': 'text', 'score': None, 'meta': {'name': 'doc2'}, 'embedding': None, 'id': '4f7944b21e062f189b8c19e9293c7602'}>.id

test/others/test_schema.py:198: AssertionError

@ZanSara
Contributor

ZanSara commented Sep 2, 2022

As a side note, am I the only one getting issues with VS Code and pre-commits?

Which kind of issues? We've had some in the past, but I hoped they were mostly solved by now. Let's open an issue for this topic: even though it might be a simple misconfiguration, it's always helpful to leave an issue with the discussion for other users.

@ZanSara
Contributor

ZanSara commented Sep 2, 2022

Ok, beyond the mypy checking, there are differences when running tests on Python 3.9 and the CI which uses 3.7. I'm looking at what causes the errors in the different environments.

Alright! Given that there are still some fixes to be done, I'll set this one to draft until the CI looks good again.

@ZanSara ZanSara marked this pull request as draft September 2, 2022 14:53
@ZanSara ZanSara requested review from a team and vblagoje and removed request for a team and vblagoje September 2, 2022 14:53
@danielbichuetti
Contributor Author

danielbichuetti commented Sep 3, 2022

Tests are failing due to timeouts in the haystack CI. Meanwhile, I ran them under my own profile and they all passed: https://github.com/danielbichuetti/haystack/actions/runs/2982709078

I'll push one extra commit tomorrow to force a new run; maybe the CI will run correctly then.

@danielbichuetti danielbichuetti marked this pull request as ready for review September 3, 2022 19:15
@danielbichuetti
Contributor Author

Could someone check the CI, please? It has been consistently skipping some tests.

@masci
Contributor

masci commented Sep 4, 2022

The HF model caching fails quite often. Besides the timeouts in the download process, the problem is that the cache shouldn't expire so often; I wonder if we're hitting the 10 GB size limit. I opened #3146 to track that. In the meantime, re-running the job did the trick here ✅

Contributor

@ZanSara ZanSara left a comment

Hey @danielbichuetti! Thanks a ton for this PR! Sorry if I've been a bit overzealous with the review, but this is a delicate change, so it's better to be extra-careful. Good news is, there are no big issues with the changes, it's just many small improvements.

Let me know if you have any questions or if I got anything wrong 🙂

haystack/schema.py (resolved)
haystack/schema.py (outdated, resolved)
haystack/schema.py (outdated, resolved)
haystack/schema.py (resolved)
test/nodes/test_generator.py (outdated, resolved)
.gitignore (outdated, resolved)
haystack/schema.py (resolved)
haystack/schema.py (resolved)
haystack/schema.py (resolved)
@@ -50,7 +50,7 @@ dependencies = [
     "importlib-metadata; python_version < '3.8'",
     "torch>1.9,<1.13",
     "requests",
-    "pydantic==1.9.2",
+    "pydantic",
Contributor

Maybe worth pinning pydantic<2 to avoid having the same problems all over again when Pydantic 2.0 comes out? @masci

Contributor Author

@danielbichuetti danielbichuetti Sep 5, 2022

IMHO, I don't think it's necessary.

Here you can see what will be removed: https://pydantic-docs.helpmanual.io/blog/pydantic-v2/#removed-features-limitations.

Dataclasses will only change internally, not in their external interface: https://pydantic-docs.helpmanual.io/blog/pydantic-v2/#features-remaining

Furthermore, pydantic v2 will have a performance increase of at least 4x, so it's worth using.

Contributor

@ZanSara ZanSara left a comment

Looking good! Thanks again 😊

haystack/schema.py (resolved)
haystack/schema.py (resolved)
@danielbichuetti
Contributor Author

@ZanSara I think you approved and forgot to merge 😆

@ZanSara
Contributor

ZanSara commented Sep 9, 2022

Thank you for the ping!! 😄 I did forget to merge it 😄

@ZanSara ZanSara merged commit 621e1af into deepset-ai:main Sep 9, 2022
brandenchan pushed a commit that referenced this pull request Sep 21, 2022
* refactor: improve support for dataclasses

* refactor: refactor class init

* refactor: remove unused import

* refactor: testing 3.7 diffs

* refactor: checking meta where is Optional

* refactor: reverting some changes on 3.7

* refactor: remove unused imports

* build: manual pre-commit run

* doc: run doc pre-commit manually

* refactor: post initialization hack for 3.7-3.10 compat.

TODO: investigate another method to improve 3.7 compatibility.

* doc: force pre-commit

* refactor: refactored for both Python 3.7 and 3.9

* docs: manually run pre-commit hooks

* docs: run api docs manually

* docs: fix wrong comment

* refactor: change no type-checked test code

* docs: update primitives

* docs: api documentation

* docs: api documentation

* refactor: minor test refactoring

* refactor: remova unused enumeration on test

* refactor: remove unneeded dir in gitignore

* refactor: exclude all private fields and change meta def

* refactor: add pydantic comment

* refactor : fix for mypy on Python 3.7

* refactor: revert custom init

* docs: update docs to new pydoc-markdown style

* Update test/nodes/test_generator.py

Co-authored-by: Sara Zan <[email protected]>
@danielbichuetti danielbichuetti deleted the fix_pydantic_support branch September 26, 2022 23:35
Development

Successfully merging this pull request may close these issues.

Remove pin from pydantic
4 participants