feat: create a backend to parse USPTO patents into DoclingDocument #606

Merged: ceberam merged 6 commits into main from dev/xml-uspto on Dec 17, 2024

@ceberam (Contributor) commented Dec 16, 2024

Resolves #605

This PR implements the following changes:

  • Add a backend implementation to parse patent applications and grants from the United States Patent and Trademark Office (USPTO).
  • Refactor the module docling/datamodel/document.py to handle the case where several InputFormat members share the same MIME type. In particular, add a function that inspects part of the input document to guess which InputFormat to use for the conversion.
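
The content-based disambiguation described in the second point can be sketched as follows. This is only an illustration, not the code in this PR: the `InputFormat` members shown, the function name `guess_from_content`, the 512-byte window, and the DOCTYPE markers are all assumptions chosen for the example.

```python
from enum import Enum
from typing import Optional


class InputFormat(str, Enum):
    # Hypothetical subset of formats, for illustration only.
    XML_USPTO = "xml_uspto"
    HTML = "html"


def guess_from_content(content: bytes, mime: str) -> Optional[InputFormat]:
    """Pick an InputFormat by peeking at the start of the document.

    Returns None when the MIME type alone is ambiguous and the content
    does not carry a marker we recognize.
    """
    # Only the beginning of the file is decoded; the window size is an assumption.
    head = content[:512].decode("utf-8", errors="ignore").lower()

    if mime == "application/xml":
        # Assumed marker: a DOCTYPE declaration that mentions a patent DTD.
        if "<!doctype" in head and "patent" in head:
            return InputFormat.XML_USPTO
    elif mime == "text/html":
        return InputFormat.HTML

    return None
```

With these assumptions, an XML file whose DOCTYPE names a patent DTD maps to `InputFormat.XML_USPTO`, while a generic XML file yields `None` and can be rejected or passed to another check.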

This PR is intended to be merged after #557

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

Limitations:

The following points will need to be addressed in later PRs:

  • Slightly refactor the guess_format function in the docling/datamodel/document.py module once we support another XML InputFormat, since the application/xml MIME type will then be ambiguous.
  • Add an abstract static method in abstract_backend.py that examines part of a document's content and determines whether the backend implementation supports that type of document. This function could then be called from the docling/datamodel/document.py module to avoid duplicated code when disambiguating MIME types.
  • Add documentation and notebook examples.
  • Eventually create a default text/plain backend parser.
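
As a rough sketch of the second point above, an abstract static method could let each backend declare which content it supports. The class names, the method name `supports_content`, and the marker check are hypothetical and are not part of this PR or of the current docling API.

```python
from abc import ABC, abstractmethod
from typing import Optional, Sequence, Type


class AbstractBackend(ABC):
    """Hypothetical base class, standing in for the real abstract backend."""

    @staticmethod
    @abstractmethod
    def supports_content(head: str, mime: str) -> bool:
        """Return True if this backend can parse a document starting with `head`."""


class PatentUsptoBackend(AbstractBackend):
    @staticmethod
    def supports_content(head: str, mime: str) -> bool:
        # Assumed marker: a patent DTD name near the top of an XML file.
        return mime == "application/xml" and "us-patent" in head.lower()


def pick_backend(
    head: str, mime: str, candidates: Sequence[Type[AbstractBackend]]
) -> Optional[Type[AbstractBackend]]:
    """Return the first candidate backend that claims the content, if any."""
    for backend in candidates:
        if backend.supports_content(head, mime):
            return backend
    return None
```

Under this design, the format-guessing code only iterates over the registered backends, so the markers for each document type live next to the parser that understands them.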

mergify bot commented Dec 16, 2024

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
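
To check a title locally against the pattern above, a small sketch like the following can help. It assumes the `~=` operator behaves like a regular-expression search on the title; the helper name `is_conventional` is made up for this example.

```python
import re

# Pattern copied from the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None
```

For instance, `is_conventional("feat: create a backend to parse USPTO patents into DoclingDocument")` is true, while a title without a type prefix is rejected.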

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

ceberam and others added 4 commits December 17, 2024 14:37
Signed-off-by: Cesar Berrospi Ramis <[email protected]>
Add a backend implementation to parse patent applications and
grants from the United States Patent Office (USPTO).

Signed-off-by: Cesar Berrospi Ramis <[email protected]>
Change the name of the patent USPTO input format to show the typical format (XML).

Signed-off-by: Cesar Berrospi Ramis <[email protected]>
@ceberam ceberam merged commit 4e08750 into main Dec 17, 2024
9 checks passed
@ceberam ceberam deleted the dev/xml-uspto branch December 17, 2024 15:35
Development

Successfully merging this pull request may close these issues.

Create a backend to transform USPTO patents (XML and TXT) to DoclingDocument
3 participants