refactor!: DOCXToDocument converter - store DOCX metadata as a dict #8804
Related Issues
In Documents produced by the DOCXToDocument converter, DOCX metadata is stored in the docx key of meta as a DOCXMetadata dataclass.

While this approach is somewhat uncommon, it generally works because Documents are converted into dictionaries before being written to our Document Stores. The to_dict method handles this conversion and also transforms DOCXMetadata into a dictionary. A Document retrieved from the Document Store therefore carries a plain dict instead of a DOCXMetadata dataclass, which is slightly inconsistent but not a serious problem.

As you can see in the mentioned issue, the Milvus Document Store (developed by the Milvus team) fails because it does not use to_dict and does not recognize the DOCXMetadata dataclass.

To improve transparency and avoid similar issues in the future (see also #8251), I think it makes more sense to store DOCX metadata directly as a dictionary while keeping DOCXMetadata as an internal dataclass.

Proposed Changes:
Keep DOCXMetadata as a dataclass for internal use

How did you test it?
DOCXToDocument.

Notes for the reviewer
@sjrl I’d appreciate your review on this PR.
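
For reviewers, here is a minimal sketch of the failure mode described under Related Issues. It is not the converter's actual code: DocxMetadataSketch and its two fields are hypothetical stand-ins for DOCXMetadata, and json.dumps stands in for any store that serializes meta directly instead of calling to_dict first.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class DocxMetadataSketch:
    """Hypothetical stand-in for DOCXMetadata; the field names are assumptions."""

    author: str
    title: Optional[str] = None


def meta_with_dataclass(docx_meta: DocxMetadataSketch) -> Dict[str, Any]:
    # Previous behaviour as described above: the dataclass instance sits in meta["docx"].
    return {"docx": docx_meta}


def meta_with_dict(docx_meta: DocxMetadataSketch) -> Dict[str, Any]:
    # Proposed behaviour: convert to a plain dict before putting it into meta.
    return {"docx": asdict(docx_meta)}


def is_json_serializable(meta: Dict[str, Any]) -> bool:
    # Stand-in for a store that serializes meta as-is, without a to_dict step.
    try:
        json.dumps(meta)
    except TypeError:
        return False
    return True


docx_meta = DocxMetadataSketch(author="Jane Doe", title="Report")
print(is_json_serializable(meta_with_dataclass(docx_meta)))
print(is_json_serializable(meta_with_dict(docx_meta)))
```

With a plain dict in meta, the value written to a store and the value read back have the same type, so consumers no longer depend on an intermediate conversion step.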
Checklist
fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.