Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor!: DOCXToDocument converter - store DOCX metadata as a dict #8804

Merged
merged 2 commits into from
Feb 5, 2025

Conversation

anakin87
Copy link
Member

@anakin87 anakin87 commented Feb 4, 2025

Related Issues

In Documents produced by DOCXToDocument converter, DOCX metadata is stored in the docx key of meta as a DOCXMetadata dataclass.

While this approach is somewhat uncommon, it generally works because Documents are converted into dictionaries before being written to our Document Stores. The to_dict method handles this conversion, transforming also DOCXMetadata into a dictionary. When retrieving the Document from the Document Store, it won't have a DOCXMetadata dataclass but a dict, which is a bit inconsistent but not much problematic.

As you can see in the mentioned issue, Milvus Document Store (developed by the Milvus team) fails because it does not use to_dict and does not recognize the DOCXMetadata dataclass.

To improve transparency and avoid similar issues in the future (see also #8251), storing DOCX metadata directly as a dictionary makes more sense to me while keeping DOCXMetadata as an internal dataclass.

Proposed Changes:

  • store DOCX metadata as a dictionary
  • keep DOCXMetadata as a dataclass for internal use

How did you test it?

Notes for the reviewer

@sjrl I’d appreciate your review on this PR.

  • Does my approach make sense?
  • Do you see any potential negative impact on some of our users?
  • (Technically, this could be considered a breaking change, but it could also be seen as a bugfix or a consistency improvement. Given the small impact, I don’t think it is worth following the breaking change policy)

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

@anakin87 anakin87 changed the title refactor!: DOCXToDocument - store DOCX metadata as a dict refactor!: DOCXToDocument converter - store DOCX metadata as a dict Feb 4, 2025
@coveralls
Copy link
Collaborator

coveralls commented Feb 4, 2025

Pull Request Test Coverage Report for Build 13134055176

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 91.291%

Totals Coverage Status
Change from base Build 13132521145: 0.0%
Covered Lines: 8931
Relevant Lines: 9783

💛 - Coveralls

@anakin87 anakin87 marked this pull request as ready for review February 4, 2025 11:05
@anakin87 anakin87 requested review from a team as code owners February 4, 2025 11:05
@anakin87 anakin87 requested review from dfokina, mpangrazzi and sjrl and removed request for a team February 4, 2025 11:05
Copy link
Contributor

@sjrl sjrl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me! And yes I think this is small enough to not warrant following the breaking change policy. And as you say it feels more like a bug fix since it makes this component more consistent with our other converters.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cannot Upload docx document to Milvus Database because of DOCXMetadata
3 participants