fix: Introduce Image format options in CLI. Silence the tqdm downloading messages. #544

nikos-livathinos · 2024-12-08T18:00:32Z

This is a fix to:

Introduce format options in the docling CLI for Image, using the same pipeline_options as for PDF, so that the CLI parameters are also applied to image inputs.
Add RapidOcrOptions to the Union of ocr_options for PdfPipelineOptions.
Silence the tqdm messages during the downloading of model files.
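
As a rough illustration of the first item, the idea of registering image inputs with the same pipeline options as PDF can be sketched as below. The classes `InputFormat`, `PipelineOptions` and `FormatOption` and the helper `build_format_options` are simplified stand-ins written for this example; they are not docling's actual classes, and the real CLI code may be structured differently.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class InputFormat(Enum):
    """Stand-in enum for the input formats discussed here (hypothetical)."""
    PDF = "pdf"
    IMAGE = "image"


@dataclass
class PipelineOptions:
    """Stand-in for the options assembled from CLI flags (hypothetical)."""
    do_ocr: bool = True
    do_table_structure: bool = True


@dataclass
class FormatOption:
    """Stand-in wrapper that binds pipeline options to one input format."""
    pipeline_options: PipelineOptions


def build_format_options(
    pipeline_options: PipelineOptions,
) -> Dict[InputFormat, FormatOption]:
    """Register the same options object for both PDF and image inputs."""
    return {
        InputFormat.PDF: FormatOption(pipeline_options),
        InputFormat.IMAGE: FormatOption(pipeline_options),
    }


cli_options = PipelineOptions(do_ocr=True, do_table_structure=False)
format_options = build_format_options(cli_options)
```

With a mapping like this, a flag that changes `cli_options` affects image inputs in the same way as PDF inputs, because both entries point at the same options object.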

Issues resolved by this Pull Request:
Resolves #505 #208

Checklist:

Documentation has been updated, if necessary.
Examples have been added, if necessary.
Tests have been added, if necessary.

mergify · 2024-12-08T18:01:04Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

dolfim-ibm · 2024-12-09T07:26:01Z

docling/pipeline/standard_pdf_pipeline.py

+        # Disable tqdm prints used by HF
+        from tqdm import tqdm
+
+        tqdm.__init__ = partialmethod(tqdm.__init__, disable=True)


We shouldn't monkey-patch tqdm; the HF library already has specific options for this:
https://huggingface.co/docs/huggingface_hub/v0.26.5/en/package_reference/file_download#huggingface_hub.snapshot_download.tqdm_class
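
For comparison with the monkey patch, here is a minimal sketch of one library-supported alternative. The link above points to the `tqdm_class` parameter of `snapshot_download`; the sketch instead uses the global switch `huggingface_hub.utils.disable_progress_bars()`. Whether the final commit uses this exact call is an assumption, and the wrapper name `silence_hf_progress_bars` is made up for this example.

```python
def silence_hf_progress_bars() -> bool:
    """Turn off huggingface_hub progress bars without patching tqdm.

    Returns True if the switch was applied, False if huggingface_hub
    is not installed in the current environment.
    """
    try:
        # Public helper exposed by huggingface_hub for this purpose.
        from huggingface_hub.utils import disable_progress_bars
    except ImportError:
        return False

    # Affects only progress bars created by huggingface_hub,
    # unlike overriding tqdm.__init__, which changes every tqdm user.
    disable_progress_bars()
    return True


applied = silence_hf_progress_bars()
```

The main design point is scope: the library switch leaves tqdm untouched for any other code in the process that relies on it.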

cau-git · 2024-12-09T14:06:40Z

One more check may be needed: if the input format is an image, OCR must be activated in its pipeline options, independent of the global OCR choice.
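
A minimal sketch of the check suggested here, assuming a simplified options object. `PipelineOptions` and `options_for_format` are hypothetical names used only for illustration; they are not docling's API, and this PR does not necessarily implement the check.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PipelineOptions:
    """Hypothetical stand-in carrying the global OCR choice."""
    do_ocr: bool = True


def options_for_format(input_format: str, base: PipelineOptions) -> PipelineOptions:
    """Return per-format options, forcing OCR on for image inputs."""
    if input_format == "image":
        # An image has no embedded text layer, so OCR is the only text source.
        return replace(base, do_ocr=True)
    return base


global_choice = PipelineOptions(do_ocr=False)
image_options = options_for_format("image", global_choice)
pdf_options = options_for_format("pdf", global_choice)
```

Returning a copy for images, rather than mutating the shared object, keeps the global OCR choice intact for PDF inputs.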

nikos-livathinos added 2 commits December 8, 2024 18:32

fix: main: Introduce format options for Image with the same pdf pipeline_options. Add RapidOcrOptions to the Union of ocr_options for PdfPipelineOptions

e125b9b

Signed-off-by: Nikos Livathinos <[email protected]>

fix: Silence the tqdm messages during the downloading of model files

64c7382

Signed-off-by: Nikos Livathinos <[email protected]>

nikos-livathinos requested review from cau-git, PeterStaar-IBM and dolfim-ibm December 8, 2024 18:00

nikos-livathinos marked this pull request as draft December 8, 2024 18:00

fix: Code styling

04977aa

Signed-off-by: Nikos Livathinos <[email protected]>

dolfim-ibm reviewed Dec 9, 2024

cau-git mentioned this pull request Dec 9, 2024

chore: change options pydantic schema to base options #497

Closed

3 tasks

nikos-livathinos added 2 commits December 9, 2024 14:25

fix: Use the HF API to disable the tqdm progress bars

bb83bc3

Signed-off-by: Nikos Livathinos <[email protected]>

Merge branch 'main' into nli/fix_ocr_options

cbf56ac

nikos-livathinos marked this pull request as ready for review December 9, 2024 13:48

cau-git approved these changes Dec 9, 2024

dolfim-ibm approved these changes Dec 9, 2024

cau-git merged commit 78f61a8 into main Dec 9, 2024
9 checks passed

cau-git deleted the nli/fix_ocr_options branch December 9, 2024 14:57
