feat: New layout processing with nested forms and key-value areas #616
Closed
Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Panos Vagenas <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
Typo faq.md Signed-off-by: Álvaro Huertas <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Gaspard Petit <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
* chore: reuse DocumentStream from docling-core
  Signed-off-by: Panos Vagenas <[email protected]>
* update docling-core version
  Signed-off-by: Panos Vagenas <[email protected]>
* [skip ci] document import line
  Signed-off-by: Panos Vagenas <[email protected]>
* fix: use new resolve_source_to_x functions to avoid tempfile leftovers (#490)
  use new resolve_source_to_x functions
  Signed-off-by: Michele Dolfi <[email protected]>
---------
Signed-off-by: Panos Vagenas <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]>
Co-authored-by: Michele Dolfi <[email protected]>
Signed-off-by: Christoph Auer <[email protected]>
* docs: add styling to faq
  Signed-off-by: Michele Dolfi <[email protected]>
* remove torchaudio
  Signed-off-by: Michele Dolfi <[email protected]>
---------
Signed-off-by: Michele Dolfi <[email protected]>
Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: guglie <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]>
* fix: Fixes and tests for StopIteration on .convert()
  Signed-off-by: Christoph Auer <[email protected]>
* fix: Remove unnecessary case handling
  Signed-off-by: Christoph Auer <[email protected]>
* fix: Other test fixes
  Signed-off-by: Christoph Auer <[email protected]>
* improve handling of unsupported types
  - Introduced new explicit exception types instead of `RuntimeError`
  - Introduced new `ConversionStatus` value for unsupported formats
  - Tidied up converter member typing & removed asserts
  Signed-off-by: Panos Vagenas <[email protected]>
* robustify & simplify format option resolution
  Signed-off-by: Panos Vagenas <[email protected]>
* rename new status, populate ConversionResult errors
  Signed-off-by: Panos Vagenas <[email protected]>
---------
Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Panos Vagenas <[email protected]>
Co-authored-by: Panos Vagenas <[email protected]>
Signed-off-by: Christoph Auer <[email protected]>
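The commit above replaces generic `RuntimeError`s with explicit exception types and adds a `ConversionStatus` value for unsupported formats. A minimal sketch of that pattern follows. The exception classes, the `UNSUPPORTED` member, the simplified `ConversionResult`, and the `convert` helper are all hypothetical illustrations and are not taken from the docling code base.

```python
from dataclasses import dataclass, field
from enum import Enum
from pathlib import Path


class ConversionStatus(str, Enum):
    # Hypothetical members; the real enum may use different names.
    SUCCESS = "success"
    FAILURE = "failure"
    UNSUPPORTED = "unsupported"


class ConversionError(Exception):
    """Base class so callers can catch every conversion problem at once."""


class UnsupportedFormatError(ConversionError):
    """Raised when the input format has no registered backend."""


@dataclass
class ConversionResult:
    source: str
    status: ConversionStatus
    errors: list[str] = field(default_factory=list)


SUPPORTED_SUFFIXES = {".pdf", ".docx", ".md"}  # illustrative subset only


def convert(source: str, raises_on_error: bool = True) -> ConversionResult:
    """Either raise a specific exception or report the problem in the result."""
    suffix = Path(source).suffix.lower()
    if suffix not in SUPPORTED_SUFFIXES:
        message = f"Unsupported format: {suffix or '<none>'}"
        if raises_on_error:
            raise UnsupportedFormatError(message)
        return ConversionResult(source, ConversionStatus.UNSUPPORTED, [message])
    return ConversionResult(source, ConversionStatus.SUCCESS)
```

With a dedicated exception type, callers can handle unsupported inputs separately from genuine failures, and the non-raising path still records why a document was not converted.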
Signed-off-by: Christoph Auer <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
…-postprocessing Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
…evice from API, envvars, CLI.
- Introduce the AcceleratorOptions, AcceleratorDevice and use them to set the device where the models run.
- Introduce the accelerator_utils with function to decide the device and resolve the AUTO setting.
- Refactor the way how the docling-ibm-models are called to match the new init signature of models.
- Translate the accelerator options to the specific inputs for third-party models.
- Extend the docling CLI with parameters to set the num_threads and device.
- Add new unit tests.
- Write new example how to use the accelerator options.
Signed-off-by: Christoph Auer <[email protected]>
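Resolving the AUTO setting mentioned above could look roughly like the sketch below. It assumes a preference order of CUDA, then MPS, then CPU, and a fallback to CPU when an explicitly requested device is unavailable. The classes are simplified stand-ins, and `resolve_device` with its availability flags is a hypothetical helper; the actual logic in `accelerator_utils` may differ.

```python
from dataclasses import dataclass
from enum import Enum


class AcceleratorDevice(str, Enum):
    AUTO = "auto"
    CPU = "cpu"
    CUDA = "cuda"
    MPS = "mps"


@dataclass
class AcceleratorOptions:
    # Simplified stand-in; the default values here are assumptions.
    num_threads: int = 4
    device: AcceleratorDevice = AcceleratorDevice.AUTO


def resolve_device(
    options: AcceleratorOptions, cuda_available: bool, mps_available: bool
) -> str:
    """Map the requested device to one that is actually usable."""
    available = {
        AcceleratorDevice.CPU: True,
        AcceleratorDevice.CUDA: cuda_available,
        AcceleratorDevice.MPS: mps_available,
    }
    if options.device is AcceleratorDevice.AUTO:
        # Assumed preference order: fastest accelerator first, CPU last.
        for candidate in (
            AcceleratorDevice.CUDA,
            AcceleratorDevice.MPS,
            AcceleratorDevice.CPU,
        ):
            if available[candidate]:
                return candidate.value
    if available[options.device]:
        return options.device.value
    # Assumed fallback when the requested device is missing.
    return AcceleratorDevice.CPU.value
```

Passing the availability flags in as arguments keeps the sketch free of any deep-learning dependency; in practice they would come from the runtime's own device checks.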
* test: pin new docling-core changes and release pydantic pinning
  Signed-off-by: Michele Dolfi <[email protected]>
* pin docling-core release
  Signed-off-by: Michele Dolfi <[email protected]>
---------
Signed-off-by: Michele Dolfi <[email protected]>
Signed-off-by: Christoph Auer <[email protected]>
feat: Support hierarchical layout components, expose and group content in pictures, forms and key-value regions Signed-off-by: Christoph Auer <[email protected]>
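One way to picture hierarchical layout components is to attach each content cluster to the container region (picture, form, or key-value region) that encloses it. The sketch below is only an illustration under stated assumptions: the `Cluster` type, the label strings, the 0.8 containment threshold, and `nest_clusters` are hypothetical and do not describe the layout postprocessing implemented in this PR.

```python
from dataclasses import dataclass, field

BBox = tuple[float, float, float, float]  # (left, top, right, bottom)

# Assumed label names for regions that may hold child clusters.
CONTAINER_LABELS = {"picture", "form", "key_value_region"}


@dataclass
class Cluster:
    label: str
    bbox: BBox
    children: list["Cluster"] = field(default_factory=list)


def area(b: BBox) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def overlap_fraction(inner: BBox, outer: BBox) -> float:
    """Fraction of `inner`'s area that lies inside `outer`."""
    intersection = (
        max(inner[0], outer[0]),
        max(inner[1], outer[1]),
        min(inner[2], outer[2]),
        min(inner[3], outer[3]),
    )
    inner_area = area(inner)
    return area(intersection) / inner_area if inner_area > 0 else 0.0


def nest_clusters(clusters: list[Cluster], threshold: float = 0.8) -> list[Cluster]:
    """Attach non-container clusters to the smallest enclosing container.

    Containers themselves stay at the top level in this simplified version.
    """
    containers = [c for c in clusters if c.label in CONTAINER_LABELS]
    top_level: list[Cluster] = []
    for cluster in clusters:
        candidates = [
            c
            for c in containers
            if c is not cluster
            and overlap_fraction(cluster.bbox, c.bbox) >= threshold
        ]
        if cluster.label in CONTAINER_LABELS or not candidates:
            top_level.append(cluster)
            continue
        parent = min(candidates, key=lambda c: area(c.bbox))
        parent.children.append(cluster)
    return top_level
```

Choosing the smallest enclosing container keeps a text cell inside the innermost region when several containers overlap it.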
Signed-off-by: Christoph Auer <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
Fix for missing text in docx (t tag) when embedded in a table Signed-off-by: Maksym Lysak <[email protected]> Co-authored-by: Maksym Lysak <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
* updated README
  Signed-off-by: Peter Staar <[email protected]>
* removed duck in title
  Signed-off-by: Peter Staar <[email protected]>
* updated the index.md
  Signed-off-by: Peter Staar <[email protected]>
* updated the cli to export html
  Signed-off-by: Peter Staar <[email protected]>
* added html to cli
  Signed-off-by: Peter Staar <[email protected]>
* reformatted the code
  Signed-off-by: Peter Staar <[email protected]>
* removed the duck emoji, added the in the cli. Currently, the referenced seems broken
  Signed-off-by: Peter Staar <[email protected]>
* cleaning up the comments
  Signed-off-by: Peter Staar <[email protected]>
* reference is now working
  Signed-off-by: Peter Staar <[email protected]>
* Clean up styling and docs
  Signed-off-by: Christoph Auer <[email protected]>
* Pin docling-core>=2.7.1
  Signed-off-by: Christoph Auer <[email protected]>
---------
Signed-off-by: Peter Staar <[email protected]>
Signed-off-by: Christoph Auer <[email protected]>
Co-authored-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Nikos Livathinos <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
feat(Accelerator): Introduce AI runtime configuration scheme Signed-off-by: Christoph Auer <[email protected]>
feat: layout processing improvements and bugfixes Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Abhishek Kumar <[email protected]>

Testing:

(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=10 --verbose
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
WARNING:docling.pipeline.base_pipeline:Document processing time (24.555 seconds) exceeded the specified timeout of 10.000 seconds
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 36.29 sec.
WARNING:docling.cli.main:Document /var/folders/d7/dsfkllxs0xs8x2t4fcjknj4c0000gn/T/tmpl6p08u5i/2206.01062v1.pdf failed to convert.
INFO:docling.cli.main:Processed 1 docs, of which 1 failed
INFO:docling.cli.main:All documents were converted in 36.29 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=100 --verbose
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 58.36 sec.
INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 58.56 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --verbose
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 59.82 sec.
INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 59.88 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling
Usage: docling [OPTIONS] source

Arguments:
  * input_sources  source
      PDF files to convert. Can be local file / directory paths or URL. [default: None] [required]

Options:
  --from [docx|pptx|html|image|pdf|asciidoc|md|xlsx]
      Specify input formats to convert from. Defaults to all formats. [default: None]
  --to [md|json|html|text|doctags]
      Specify output formats. Defaults to Markdown. [default: None]
  --image-export-mode [placeholder|embedded|referenced]
      Image export mode for the document (only in case of JSON, Markdown or HTML). With `placeholder`, only the position of the image is marked in the output. In `embedded` mode, the image is embedded as base64 encoded string. In `referenced` mode, the image is exported in PNG format and referenced from the main exported document. [default: embedded]
  --ocr  --no-ocr
      If enabled, the bitmap content will be processed using OCR. [default: ocr]
  --force-ocr  --no-force-ocr
      Replace any existing text with OCR generated text over the full content. [default: no-force-ocr]
  --ocr-engine [easyocr|tesseract_cli|tesseract|ocrmac|rapidocr]
      The OCR engine to use. [default: easyocr]
  --ocr-lang TEXT
      Provide a comma-separated list of languages used by the OCR engine. Note that each OCR engine has different values for the language names. [default: None]
  --pdf-backend [pypdfium2|dlparse_v1|dlparse_v2]
      The PDF backend to use. [default: dlparse_v2]
  --table-mode [fast|accurate]
      The mode to use in the table structure model. [default: fast]
  --artifacts-path PATH
      If provided, the location of the model artifacts. [default: None]
  --abort-on-error  --no-abort-on-error
      If enabled, the bitmap content will be processed using OCR. [default: no-abort-on-error]
  --output PATH
      Output directory where results are saved. [default: .]
  --verbose  -v  INTEGER
      Set the verbosity level. -v for info logging, -vv for debug logging. [default: 0]
  --debug-visualize-cells  --no-debug-visualize-cells
      Enable debug output which visualizes the PDF cells [default: no-debug-visualize-cells]
  --debug-visualize-ocr  --no-debug-visualize-ocr
      Enable debug output which visualizes the OCR cells [default: no-debug-visualize-ocr]
  --debug-visualize-layout  --no-debug-visualize-layout
      Enable debug output which visualizes the layour clusters [default: no-debug-visualize-layout]
  --debug-visualize-tables  --no-debug-visualize-tables
      Enable debug output which visualizes the table cells [default: no-debug-visualize-tables]
  --version
      Show version information.
  --document-timeout FLOAT
      The timeout for processing each document, in seconds. [default: None]
  --help
      Show this message and exit.

Signed-off-by: Christoph Auer <[email protected]>
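In the test log above, the run with `--document-timeout=10` reports that processing exceeded the timeout and the document is counted as failed. A minimal sketch of that kind of budget check follows, assuming the elapsed time is checked after each page and that an overrun marks the document as failed. `process_document`, `process_page`, the status strings, and the injectable `clock` are hypothetical names for illustration, not the pipeline's actual code.

```python
import time
from typing import Callable, Iterable, Optional


def process_document(
    pages: Iterable[str],
    process_page: Callable[[str], str],
    timeout: Optional[float] = None,
    clock: Callable[[], float] = time.monotonic,
) -> tuple[str, list[str]]:
    """Process pages in order and stop once the time budget is exceeded.

    Returns a status string and the outputs produced so far.
    A timeout of None disables the check.
    """
    start = clock()
    outputs: list[str] = []
    for page in pages:
        outputs.append(process_page(page))
        elapsed = clock() - start
        if timeout is not None and elapsed > timeout:
            # The overrun is detected only after a page finishes, so the
            # measured time can be well above the configured limit.
            return "failure", outputs
    return "success", outputs
```

Checking between units of work rather than interrupting them would be consistent with the logged overrun of 24.555 seconds against a 10-second limit, although the log alone does not confirm where the real check happens.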
Signed-off-by: Christoph Auer <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
* Upgraded Layout Postprocessing, sending old code back to ERZ
  Signed-off-by: Christoph Auer <[email protected]>
* Implement hierarchical cluster layout processing
  Signed-off-by: Christoph Auer <[email protected]>
* Pass nested cluster processing through full pipeline
  Signed-off-by: Christoph Auer <[email protected]>
* Pass nested clusters through GLM as payload
  Signed-off-by: Christoph Auer <[email protected]>
* Move to_docling_document from ds-glm to this repo
  Signed-off-by: Christoph Auer <[email protected]>
* Clean up imports again
  Signed-off-by: Christoph Auer <[email protected]>
* feat(Accelerator): Introduce options to control the num_threads and device from API, envvars, CLI.
  - Introduce the AcceleratorOptions, AcceleratorDevice and use them to set the device where the models run.
  - Introduce the accelerator_utils with function to decide the device and resolve the AUTO setting.
  - Refactor the way how the docling-ibm-models are called to match the new init signature of models.
  - Translate the accelerator options to the specific inputs for third-party models.
  - Extend the docling CLI with parameters to set the num_threads and device.
  - Add new unit tests.
  - Write new example how to use the accelerator options.
* fix: Improve the pydantic objects in the pipeline_options and imports.
  Signed-off-by: Nikos Livathinos <[email protected]>
* fix: TableStructureModel: Refactor the artifacts path to use the new structure for fast/accurate model
  Signed-off-by: Nikos Livathinos <[email protected]>
* Updated test ground-truth
  Signed-off-by: Christoph Auer <[email protected]>
* Updated test ground-truth (again), bugfix for empty layout
  Signed-off-by: Christoph Auer <[email protected]>
* fix: Do proper check to set the device in EasyOCR, RapidOCR.
  Signed-off-by: Nikos Livathinos <[email protected]>
* Rollback changes from main
  Signed-off-by: Christoph Auer <[email protected]>
* Update test gt
  Signed-off-by: Christoph Auer <[email protected]>
* Remove unused debug settings
  Signed-off-by: Christoph Auer <[email protected]>
* Review fixes
  Signed-off-by: Christoph Auer <[email protected]>
* Nail the accelerator defaults for MPS
  Signed-off-by: Christoph Auer <[email protected]>
---------
Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Nikos Livathinos <[email protected]>
Co-authored-by: Christoph Auer <[email protected]>
Co-authored-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
docs: Fix the path to the run_with_accelerator.py example Signed-off-by: Nikos Livathinos <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
* Update easyocr_model.py
  Added this line of code to get recog_network of easyocr parameter: `recog_network = self.options.recog_network`
  Signed-off-by: itsainii <[email protected]>
* Update pipeline_options.py
  Added this line in EasyOcrOptions function: `recog_network: Optional[str] = 'standard'`
  Signed-off-by: itsainii <[email protected]>
* Add Easyocr recog_network parameter
  Signed-off-by: itsainii <[email protected]>
---------
Signed-off-by: itsainii <[email protected]>
Signed-off-by: Christoph Auer <[email protected]>
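The commit above adds a `recog_network` field with the default `'standard'` to `EasyOcrOptions` and reads it in `easyocr_model.py`. A rough sketch of how such an option might be forwarded to the OCR reader follows. The dataclass is a simplified stand-in for the real options class, the `lang` default is an assumption, and `reader_kwargs` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class EasyOcrOptions:
    # Simplified stand-in; only the fields needed for this sketch.
    lang: list[str] = field(default_factory=lambda: ["en"])
    recog_network: Optional[str] = "standard"


def reader_kwargs(options: EasyOcrOptions) -> dict[str, Any]:
    """Build keyword arguments for constructing the OCR reader."""
    kwargs: dict[str, Any] = {"lang_list": options.lang}
    if options.recog_network is not None:
        # Only pass the network name through when one is configured,
        # so the reader's own default applies otherwise.
        kwargs["recog_network"] = options.recog_network
    return kwargs
```

The resulting dictionary could then be unpacked into the reader constructor, for example `easyocr.Reader(**reader_kwargs(options))`, which accepts `lang_list` and `recog_network` parameters.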
Signed-off-by: Christoph Auer <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]>
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🔴 Require two reviewer for test updates
This rule is failing. When test data is updated, we require two reviewers
🟢 Enforce conventional commit
Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/