fix: `PyPDFToDocument` correctly serializes custom converters, deprecate `DefaultConverter` #8430

shadeMe · 2024-10-01T11:01:32Z

Proposed Changes:

The `PyPDFToDocument` component was incorrectly serializing its default converter. This PR fixes that and deprecates `DefaultConverter`.

Utility methods were added to aid the serde of custom classes that implement `from_dict` and `to_dict` methods.
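As a rough illustration of what such serde helpers can look like, here is a minimal, self-contained sketch. It is not the code from this PR: the function names, the `type`/`data` payload layout, and the `UppercaseConverter` class are assumptions made for the example.

```python
import importlib
from typing import Any, Dict


def serialize_class_instance(obj: Any) -> Dict[str, Any]:
    """Wrap `obj.to_dict()` together with the import path of the object's class.

    Hypothetical helper: the payload layout is an assumption for this sketch.
    """
    if not hasattr(obj, "to_dict"):
        raise TypeError(f"'{type(obj).__name__}' does not implement 'to_dict'")
    cls = type(obj)
    return {"type": f"{cls.__module__}.{cls.__name__}", "data": obj.to_dict()}


def deserialize_class_instance(payload: Dict[str, Any]) -> Any:
    """Import the class named in the payload and rebuild it via `from_dict`."""
    if "type" not in payload or "data" not in payload:
        raise ValueError("Payload must contain the 'type' and 'data' keys")
    module_name, _, class_name = payload["type"].rpartition(".")
    module = importlib.import_module(module_name)
    cls = getattr(module, class_name)
    if not hasattr(cls, "from_dict"):
        raise TypeError(f"'{class_name}' does not implement 'from_dict'")
    return cls.from_dict(payload["data"])


class UppercaseConverter:
    """Hypothetical custom converter with one configuration option."""

    def __init__(self, strip: bool = False):
        self.strip = strip

    def to_dict(self) -> Dict[str, Any]:
        return {"strip": self.strip}

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "UppercaseConverter":
        return cls(**data)
```

A component could then store `serialize_class_instance(converter)` in its own `to_dict` output and call `deserialize_class_instance` on that entry in `from_dict`, provided the converter class lives in an importable module.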

How did you test it?

Unit tests

Notes for the reviewer

This is the follow-up PR to this one.
If you have better names for the utility classes, I'm all ears.

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
I documented my code
I ran pre-commit hooks and fixed any issue

coveralls · 2024-10-01T11:10:08Z

Pull Request Test Coverage Report for Build 11127151741

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

0 of 0 changed or added relevant lines in 0 files are covered.
11 unchanged lines in 1 file lost coverage.
Overall coverage decreased (-0.03%) to 90.248%

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| components/converters/pypdf.py | 11 | 83.58% |

Totals
Change from base Build 11122967174:	-0.03%
Covered Lines:	7413
Relevant Lines:	8214

💛 - Coveralls

wochinge · 2024-10-01T11:28:07Z

Thanks, @shadeMe! The more detailed errors will also be a great help to users!


julian-risch

Thanks for working on this so quickly and thoroughly! The changes look very good to me, and so does the naming of the util methods. The only suggestion I have is to also add tests for the new util methods `auto_serialize_class_instance` and `auto_deserialize_class_instance`. What do you think, @silvanocerza?

I would also suggest updating the class docstring to "If no converter is provided, uses a default text extraction ~~converter~~ implementation." or something like that here. That makes it clearer that there is no default converter anymore.

anakin87 · 2024-10-01T13:10:24Z

@shadeMe I'm probably missing some context.

I understand that there are some serialization issues.

But why did we decide to deprecate DefaultConverter?

shadeMe · 2024-10-01T13:16:58Z

> @shadeMe I'm probably missing some context.
>
> I understand that there are some serialization issues.
>
> But why did we decide to deprecate DefaultConverter?

Because it honestly didn't have a good reason to exist outside the component, which was primarily the reason why the serialization bug crept in.

silvanocerza · 2024-10-01T13:19:34Z

> Thanks for working on this so quickly and thoroughly! The changes look very good to me. The naming of the util methods too. Only suggestion I have is to add tests for the new util methods auto_serialize_class_instance and auto_deserialize_class_instance too. What do you think @silvanocerza ?

@julian-risch My only concern about the methods is the auto_ prefix. @shadeMe and I talked a bit about it and the main concern is that it would be too generic. Given that we have already other methods to handle serde it might get confusing. I would still argue to remove it though.

> But why did we decide to deprecate DefaultConverter?

@anakin87 Mainly the assumption that converters don't need state most of the time, so they can be simple functions. I'm unsure about that, though: I kinda remember that converters can be configured, and if we treat them as callables we'd lose that possibility.

I remember that you briefly worked on this to change the converter backend so I thought you'd know more if that's the case or not.
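To make the trade-off concrete, here is a minimal sketch contrasting the two shapes a converter could take. Everything in it is hypothetical: `Pages`, `join_pages`, `ConverterProtocol`, and `SeparatorConverter` are stand-ins invented for the example, not Haystack's or pypdf's API.

```python
from typing import Any, Dict, List, Protocol

# Hypothetical stand-in for the extracted text of each PDF page.
Pages = List[str]


def join_pages(pages: Pages) -> str:
    """Stateless converter: a plain function with nothing to configure or serialize."""
    return "\n".join(pages)


class ConverterProtocol(Protocol):
    """Hypothetical protocol for converters that carry configuration."""

    def convert(self, pages: Pages) -> str: ...

    def to_dict(self) -> Dict[str, Any]: ...


class SeparatorConverter:
    """Configurable converter: its state has to survive a serialization round trip."""

    def __init__(self, separator: str = "\f"):
        self.separator = separator

    def convert(self, pages: Pages) -> str:
        return self.separator.join(pages)

    def to_dict(self) -> Dict[str, Any]:
        return {"separator": self.separator}

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "SeparatorConverter":
        return cls(**data)
```

The function form is simpler but has no place to keep settings such as the separator; the class form keeps them, at the cost of needing `to_dict` and `from_dict`.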

anakin87 · 2024-10-01T13:41:47Z

Now I better understand the motivation and am OK with deprecating/removing DefaultConverter.
(The only reason it might be useful is as an example of implementing a PyPDFConverter, but this can also be inferred from the Protocol.)

In general, this component has always seemed a bit tricky to me from the UX point of view, and I would be happy if we improve it in the future.

julian-risch

LGTM! 👍

shadeMe · 2024-10-01T14:35:08Z

Test failure is unrelated; merging.


shadeMe requested review from silvanocerza and julian-risch October 1, 2024 11:01

shadeMe requested review from a team as code owners October 1, 2024 11:01

shadeMe requested review from dfokina and Amnah199 and removed request for a team October 1, 2024 11:01

github-actions bot added the type:documentation and topic:tests labels and removed the type:documentation label Oct 1, 2024

shadeMe removed the request for review from Amnah199 October 1, 2024 11:02

fix: `PyPDFToDocument` correctly serializes custom converters, deprecate `DefaultConverter`

0754486

shadeMe force-pushed the fix/pypdf-converter-serde branch from e16058d to 0754486 Compare October 1, 2024 11:30

github-actions bot added the type:documentation label Oct 1, 2024

shadeMe requested a review from anakin87 October 1, 2024 12:43

julian-risch requested changes Oct 1, 2024


Remove auto prefix from serde util function names, add unit tests

6a0434e

shadeMe requested a review from julian-risch October 1, 2024 14:11

julian-risch approved these changes Oct 1, 2024


shadeMe merged commit ee89f6a into deepset-ai:main Oct 1, 2024
17 of 18 checks passed

shadeMe deleted the fix/pypdf-converter-serde branch October 1, 2024 14:35

shadeMe mentioned this pull request Oct 8, 2024

fix: (Temporarily) Re-add suport for pre-2.6.0 YAMLs with PyPDFConverter #8443

Merged
