Add timeout limit to document parsing job. #270

Closed
PeterStaar-IBM opened this issue Nov 7, 2024 · 6 comments
Labels: enhancement, priority:high

Comments

@PeterStaar-IBM
Contributor

Requested feature

We need a way to set a timeout when processing a document. In very rare cases, certain documents take very long to convert. In a batch processing job, a single such document can hold up the whole run.

Example use case:

temp.pdf

@PeterStaar-IBM PeterStaar-IBM added enhancement New feature or request priority:high labels Nov 7, 2024
@cau-git
Contributor

cau-git commented Nov 7, 2024

Having checked the attached PDF, it is no surprise that we see a very long conversion time. The document is fully scanned and has a lot of pages, which makes conversion very slow, at least on CPU.

Generally, there are multiple strategies to avoid such samples clogging a bulk conversion pipeline.

  1. Run over all docs with OCR off, then rerun only those docs whose conversion result is empty (i.e. docs that may need OCR). This is already possible with the current version.
  2. Extend docling to optionally stop converting a doc once a timeout is reached. The timeout can only be checked after each page batch (i.e. after every multiple of 4 pages with the defaults). A timed-out conversion would be reported with status PARTIAL_SUCCESS. User code could then either export the partial result or drop the document.
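
As a rough illustration of the first strategy, here is a minimal sketch of a two-pass batch run. It does not use the docling API: `two_pass_convert` and the `convert(path, use_ocr)` callable are hypothetical names introduced for this example, and "empty result" is assumed to mean text that is blank after stripping whitespace.

```python
from typing import Callable, Iterable


def two_pass_convert(
    paths: Iterable[str],
    convert: Callable[[str, bool], str],
) -> dict[str, str]:
    """Convert all docs with OCR off, then retry only the empty ones with OCR on.

    `convert(path, use_ocr)` is a hypothetical stand-in for the real
    conversion call; it is assumed to return the extracted text.
    """
    results: dict[str, str] = {}
    needs_ocr: list[str] = []

    # Pass 1: fast conversion without OCR for every document.
    for path in paths:
        text = convert(path, False)
        if text.strip():
            results[path] = text
        else:
            # Nothing was extracted, so the document is probably scanned.
            needs_ocr.append(path)

    # Pass 2: rerun only the documents that came back empty, with OCR on.
    for path in needs_ocr:
        results[path] = convert(path, True)

    return results
```

In a real pipeline the `convert` callable would wrap the actual converter, and the second pass could run in a separate, lower-priority job so that scanned documents do not block the rest of the batch.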

@ab-shrek
Contributor

I am interested in this issue. Can you please assign this to me? Thanks :)

@nikos-livathinos nikos-livathinos self-assigned this Nov 11, 2024
@ab-shrek
Contributor

Are you working on this, @nikos-livathinos?

@nikos-livathinos
Collaborator

@ab-shrek great to see you are interested in helping out on this issue. Please submit a PR for our review.
Here are some hints:

  1. Introduce a new parameter (e.g. pdf_document_timeout) in PdfPipelineOptions (class PdfPipelineOptions(PipelineOptions)).
  2. Implement the timeout logic in PaginatedPipeline._build_document() (def _build_document(self, conv_res: ConversionResult) -> ConversionResult).
    • The timeout should apply to the PDF pipeline and cover the time needed to convert the entire document.
    • We should check for a timeout after the conversion of each page chunk; the check is against the elapsed time for the whole document, not only for the current page chunk.
    • When a timeout happens, the loop exits and conv_res.status should be set to ConversionStatus.PARTIAL_SUCCESS.
  3. Extend the docling CLI (https://github.com/DS4SD/docling/blob/main/docling/cli/main.py) to expose a command-line argument (e.g. --document-timeout) that sets pdf_document_timeout inside the PdfPipelineOptions.
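
The timeout check described in hint 2 could look roughly like the sketch below. This is not the docling implementation: `Status`, `Result`, `build_document`, `process_page`, and the injectable `clock` parameter are all hypothetical names made up for this example, and the batch size of 4 is taken from the default mentioned earlier in the thread.

```python
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Optional, Sequence


class Status(Enum):
    SUCCESS = "success"
    PARTIAL_SUCCESS = "partial_success"


@dataclass
class Result:
    status: Status = Status.SUCCESS
    pages: list = field(default_factory=list)


def build_document(
    pages: Sequence[int],
    process_page: Callable[[int], str],
    document_timeout: Optional[float] = None,
    batch_size: int = 4,
    clock: Callable[[], float] = time.monotonic,
) -> Result:
    """Process pages in batches, stopping early once the document timeout is exceeded."""
    result = Result()
    start = clock()  # the timer covers the whole document, not a single batch

    for i in range(0, len(pages), batch_size):
        batch = pages[i : i + batch_size]
        result.pages.extend(process_page(p) for p in batch)

        # The check runs only after a batch has finished, so a slow batch
        # can overshoot the timeout before the loop notices.
        elapsed = clock() - start
        if document_timeout is not None and elapsed > document_timeout:
            result.status = Status.PARTIAL_SUCCESS
            break

    return result
```

Because the check happens after each batch, the timeout is a soft limit: the pages already converted are kept, and the caller decides whether to export or drop the partial result. Note that in this sketch a timeout detected after the final batch is also flagged as partial, even though every page was processed.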

@ab-shrek
Contributor

Great; thanks @nikos-livathinos. Let me get on this asap :)

ab-shrek pushed a commit to ab-shrek/docling that referenced this issue Nov 12, 2024
Testing:
(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=100
INFO:docling.document_converter:Going to convert document batch...
Fetching 9 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 87584.07it/s]
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 24.12 sec.
INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 24.13 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=5
INFO:docling.document_converter:Going to convert document batch...
Fetching 9 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 29037.49it/s]
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
WARNING:docling.pipeline.base_pipeline:Document processing time (6 s) exceeded the specified timeout of 5 s
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 10.82 sec.
WARNING:docling.cli.main:Document /var/folders/d7/dsfkllxs0xs8x2t4fcjknj4c0000gn/T/tmpzedg349h/2206.01062v1.pdf failed to convert.
INFO:docling.cli.main:Processed 1 docs, of which 1 failed
INFO:docling.cli.main:All documents were converted in 10.82 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062
INFO:docling.document_converter:Going to convert document batch...
Fetching 9 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 88197.98it/s]
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 22.59 sec.
INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 22.60 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling

 Usage: docling [OPTIONS] source

╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *    input_sources      source  PDF files to convert. Can be local file / directory paths or URL. [default: None] [required]                                                                             │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --from                                       [docx|pptx|html|image|pdf|asciidoc|md]  Specify input formats to convert from. Defaults to all formats. [default: None]                                     │
│ --to                                         [md|json|text|doctags]                  Specify output formats. Defaults to Markdown. [default: None]                                                       │
│ --ocr                 --no-ocr                                                       If enabled, the bitmap content will be processed using OCR. [default: ocr]                                          │
│ --force-ocr           --no-force-ocr                                                 Replace any existing text with OCR generated text over the full content. [default: no-force-ocr]                    │
│ --ocr-engine                                 [easyocr|tesseract_cli|tesseract]       The OCR engine to use. [default: easyocr]                                                                           │
│ --pdf-backend                                [pypdfium2|dlparse_v1|dlparse_v2]       The PDF backend to use. [default: dlparse_v1]                                                                       │
│ --table-mode                                 [fast|accurate]                         The mode to use in the table structure model. [default: fast]                                                       │
│ --artifacts-path                             PATH                                    If provided, the location of the model artifacts. [default: None]                                                   │
│ --abort-on-error      --no-abort-on-error                                            If enabled, the bitmap content will be processed using OCR. [default: no-abort-on-error]                            │
│ --output                                     PATH                                    Output directory where results are saved. [default: .]                                                              │
│ --version                                                                            Show version information.                                                                                           │
│ --document-timeout                           INTEGER                                 The timeout for processing each document, in seconds. [default: None]                                               │
│ --help                                                                               Show this message and exit.                                                                                         │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
ab-shrek pushed a commit to ab-shrek/docling that referenced this issue Nov 22, 2024
Testing:
(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=100.123
INFO:docling.document_converter:Going to convert document batch...
Fetching 9 files: 100%|█████████████████████████████████████████████| 9/9 [00:00<00:00, 27513.66it/s]
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 23.67 sec.
INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 23.68 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=5.4567
INFO:docling.document_converter:Going to convert document batch...
Fetching 9 files: 100%|█████████████████████████████████████████████| 9/9 [00:00<00:00, 50805.84it/s]
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
WARNING:docling.pipeline.base_pipeline:Document processing time (6.477 seconds) exceeded the specified timeout of 5.457 seconds
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 10.65 sec.
WARNING:docling.cli.main:Document /var/folders/d7/dsfkllxs0xs8x2t4fcjknj4c0000gn/T/tmp9v8ng4n3/2206.01062v1.pdf failed to convert.
INFO:docling.cli.main:Processed 1 docs, of which 1 failed
INFO:docling.cli.main:All documents were converted in 10.65 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062
INFO:docling.document_converter:Going to convert document batch...
Fetching 9 files: 100%|█████████████████████████████████████████████| 9/9 [00:00<00:00, 85792.58it/s]
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 21.84 sec.
INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 21.85 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling

 Usage: docling [OPTIONS] source

╭─ Arguments ───────────────────────────────────────────────────────────────────────────────────────╮
│ *    input_sources      source  PDF files to convert. Can be local file / directory paths or URL. │
│                                 [default: None]                                                   │
│                                 [required]                                                        │
╰───────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────╮
│ --from                                       [docx|pptx|html|image|pd  Specify input formats to   │
│                                              f|asciidoc|md]            convert from. Defaults to  │
│                                                                        all formats.               │
│                                                                        [default: None]            │
│ --to                                         [md|json|text|doctags]    Specify output formats.    │
│                                                                        Defaults to Markdown.      │
│                                                                        [default: None]            │
│ --ocr                 --no-ocr                                         If enabled, the bitmap     │
│                                                                        content will be processed  │
│                                                                        using OCR.                 │
│                                                                        [default: ocr]             │
│ --force-ocr           --no-force-ocr                                   Replace any existing text  │
│                                                                        with OCR generated text    │
│                                                                        over the full content.     │
│                                                                        [default: no-force-ocr]    │
│ --ocr-engine                                 [easyocr|tesseract_cli|t  The OCR engine to use.     │
│                                              esseract]                 [default: easyocr]         │
│ --pdf-backend                                [pypdfium2|dlparse_v1|dl  The PDF backend to use.    │
│                                              parse_v2]                 [default: dlparse_v1]      │
│ --table-mode                                 [fast|accurate]           The mode to use in the     │
│                                                                        table structure model.     │
│                                                                        [default: fast]            │
│ --artifacts-path                             PATH                      If provided, the location  │
│                                                                        of the model artifacts.    │
│                                                                        [default: None]            │
│ --abort-on-error      --no-abort-on-error                              If enabled, the bitmap     │
│                                                                        content will be processed  │
│                                                                        using OCR.                 │
│                                                                        [default:                  │
│                                                                        no-abort-on-error]         │
│ --output                                     PATH                      Output directory where     │
│                                                                        results are saved.         │
│                                                                        [default: .]               │
│ --version                                                              Show version information.  │
│ --document-timeout                           FLOAT                     The timeout for processing │
│                                                                        each document, in seconds. │
│                                                                        [default: None]            │
│ --help                                                                 Show this message and      │
│                                                                        exit.                      │
╰───────────────────────────────────────────────────────────────────────────────────────────────────╯
ab-shrek pushed a commit to ab-shrek/docling that referenced this issue Nov 23, 2024
ab-shrek added a commit to ab-shrek/docling that referenced this issue Dec 2, 2024
Signed-off-by: Abhishek Kumar <[email protected]>

ab-shrek added a commit to ab-shrek/docling that referenced this issue Dec 6, 2024
Signed-off-by: Abhishek Kumar <[email protected]>

Testing:
(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=100.123
INFO:docling.document_converter:Going to convert document batch...
Fetching 9 files: 100%|█████████████████████████████████████████████| 9/9 [00:00<00:00, 27513.66it/s]
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 23.67 sec.
INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 23.68 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=5.4567
INFO:docling.document_converter:Going to convert document batch...
Fetching 9 files: 100%|█████████████████████████████████████████████| 9/9 [00:00<00:00, 50805.84it/s]
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
WARNING:docling.pipeline.base_pipeline:Document processing time (6.477 seconds) exceeded the specified timeout of 5.457 seconds
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 10.65 sec.
WARNING:docling.cli.main:Document /var/folders/d7/dsfkllxs0xs8x2t4fcjknj4c0000gn/T/tmp9v8ng4n3/2206.01062v1.pdf failed to convert.
INFO:docling.cli.main:Processed 1 docs, of which 1 failed
INFO:docling.cli.main:All documents were converted in 10.65 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062
INFO:docling.document_converter:Going to convert document batch...
Fetching 9 files: 100%|█████████████████████████████████████████████| 9/9 [00:00<00:00, 85792.58it/s]
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 21.84 sec.
INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 21.85 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling

 Usage: docling [OPTIONS] source

╭─ Arguments ───────────────────────────────────────────────────────────────────────────────────────╮
│ *    input_sources      source  PDF files to convert. Can be local file / directory paths or URL. │
│                                 [default: None]                                                   │
│                                 [required]                                                        │
╰───────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────╮
│ --from                                       [docx|pptx|html|image|pd  Specify input formats to   │
│                                              f|asciidoc|md]            convert from. Defaults to  │
│                                                                        all formats.               │
│                                                                        [default: None]            │
│ --to                                         [md|json|text|doctags]    Specify output formats.    │
│                                                                        Defaults to Markdown.      │
│                                                                        [default: None]            │
│ --ocr                 --no-ocr                                         If enabled, the bitmap     │
│                                                                        content will be processed  │
│                                                                        using OCR.                 │
│                                                                        [default: ocr]             │
│ --force-ocr           --no-force-ocr                                   Replace any existing text  │
│                                                                        with OCR generated text    │
│                                                                        over the full content.     │
│                                                                        [default: no-force-ocr]    │
│ --ocr-engine                                 [easyocr|tesseract_cli|t  The OCR engine to use.     │
│                                              esseract]                 [default: easyocr]         │
│ --pdf-backend                                [pypdfium2|dlparse_v1|dl  The PDF backend to use.    │
│                                              parse_v2]                 [default: dlparse_v1]      │
│ --table-mode                                 [fast|accurate]           The mode to use in the     │
│                                                                        table structure model.     │
│                                                                        [default: fast]            │
│ --artifacts-path                             PATH                      If provided, the location  │
│                                                                        of the model artifacts.    │
│                                                                        [default: None]            │
│ --abort-on-error      --no-abort-on-error                              If enabled, the bitmap     │
│                                                                        content will be processed  │
│                                                                        using OCR.                 │
│                                                                        [default:                  │
│                                                                        no-abort-on-error]         │
│ --output                                     PATH                      Output directory where     │
│                                                                        results are saved.         │
│                                                                        [default: .]               │
│ --version                                                              Show version information.  │
│ --document-timeout                           FLOAT                     The timeout for processing │
│                                                                        each document, in seconds. │
│                                                                        [default: None]            │
│ --help                                                                 Show this message and      │
│                                                                        exit.                      │
╰───────────────────────────────────────────────────────────────────────────────────────────────────╯
ab-shrek added a commit to ab-shrek/docling that referenced this issue Dec 9, 2024
Signed-off-by: Abhishek Kumar <[email protected]>

Testing:
(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=10 --verbose
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
WARNING:docling.pipeline.base_pipeline:Document processing time (24.555 seconds) exceeded the specified timeout of 10.000 seconds
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 36.29 sec.
WARNING:docling.cli.main:Document /var/folders/d7/dsfkllxs0xs8x2t4fcjknj4c0000gn/T/tmpl6p08u5i/2206.01062v1.pdf failed to convert.
INFO:docling.cli.main:Processed 1 docs, of which 1 failed
INFO:docling.cli.main:All documents were converted in 36.29 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=100 --verbose
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 58.36 sec.
INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 58.56 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --verbose
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 59.82 sec.
INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 59.88 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling

 Usage: docling [OPTIONS] source

╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *    input_sources      source  PDF files to convert. Can be local file / directory paths or URL. [default: None] [required]        │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --from                                                       [docx|pptx|html|image|pdf|asciido  Specify input formats to convert    │
│                                                              c|md|xlsx]                         from. Defaults to all formats.      │
│                                                                                                 [default: None]                     │
│ --to                                                         [md|json|html|text|doctags]        Specify output formats. Defaults to │
│                                                                                                 Markdown.                           │
│                                                                                                 [default: None]                     │
│ --image-export-mode                                          [placeholder|embedded|referenced]  Image export mode for the document  │
│                                                                                                 (only in case of JSON, Markdown or  │
│                                                                                                 HTML). With `placeholder`, only the │
│                                                                                                 position of the image is marked in  │
│                                                                                                 the output. In `embedded` mode, the │
│                                                                                                 image is embedded as base64 encoded │
│                                                                                                 string. In `referenced` mode, the   │
│                                                                                                 image is exported in PNG format and │
│                                                                                                 referenced from the main exported   │
│                                                                                                 document.                           │
│                                                                                                 [default: embedded]                 │
│ --ocr                         --no-ocr                                                          If enabled, the bitmap content will │
│                                                                                                 be processed using OCR.             │
│                                                                                                 [default: ocr]                      │
│ --force-ocr                   --no-force-ocr                                                    Replace any existing text with OCR  │
│                                                                                                 generated text over the full        │
│                                                                                                 content.                            │
│                                                                                                 [default: no-force-ocr]             │
│ --ocr-engine                                                 [easyocr|tesseract_cli|tesseract|  The OCR engine to use.              │
│                                                              ocrmac|rapidocr]                   [default: easyocr]                  │
│ --ocr-lang                                                   TEXT                               Provide a comma-separated list of   │
│                                                                                                 languages used by the OCR engine.   │
│                                                                                                 Note that each OCR engine has       │
│                                                                                                 different values for the language   │
│                                                                                                 names.                              │
│                                                                                                 [default: None]                     │
│ --pdf-backend                                                [pypdfium2|dlparse_v1|dlparse_v2]  The PDF backend to use.             │
│                                                                                                 [default: dlparse_v2]               │
│ --table-mode                                                 [fast|accurate]                    The mode to use in the table        │
│                                                                                                 structure model.                    │
│                                                                                                 [default: fast]                     │
│ --artifacts-path                                             PATH                               If provided, the location of the    │
│                                                                                                 model artifacts.                    │
│                                                                                                 [default: None]                     │
│ --abort-on-error              --no-abort-on-error                                               If enabled, the bitmap content will │
│                                                                                                 be processed using OCR.             │
│                                                                                                 [default: no-abort-on-error]        │
│ --output                                                     PATH                               Output directory where results are  │
│                                                                                                 saved.                              │
│                                                                                                 [default: .]                        │
│ --verbose                 -v                                 INTEGER                            Set the verbosity level. -v for     │
│                                                                                                 info logging, -vv for debug         │
│                                                                                                 logging.                            │
│                                                                                                 [default: 0]                        │
│ --debug-visualize-cells       --no-debug-visualize-cells                                        Enable debug output which           │
│                                                                                                 visualizes the PDF cells            │
│                                                                                                 [default: no-debug-visualize-cells] │
│ --debug-visualize-ocr         --no-debug-visualize-ocr                                          Enable debug output which           │
│                                                                                                 visualizes the OCR cells            │
│                                                                                                 [default: no-debug-visualize-ocr]   │
│ --debug-visualize-layout      --no-debug-visualize-layout                                       Enable debug output which           │
│                                                                                                 visualizes the layour clusters      │
│                                                                                                 [default:                           │
│                                                                                                 no-debug-visualize-layout]          │
│ --debug-visualize-tables      --no-debug-visualize-tables                                       Enable debug output which           │
│                                                                                                 visualizes the table cells          │
│                                                                                                 [default:                           │
│                                                                                                 no-debug-visualize-tables]          │
│ --version                                                                                       Show version information.           │
│ --document-timeout                                           FLOAT                              The timeout for processing each     │
│                                                                                                 document, in seconds.               │
│                                                                                                 [default: None]                     │
│ --help                                                                                          Show this message and exit.         │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
nikos-livathinos pushed a commit that referenced this issue Dec 11, 2024
@nikos-livathinos
Collaborator

This feature has been implemented in PR #552.
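
For readers following the thread, here is a minimal sketch of the batch-level timeout check proposed earlier in this issue: the elapsed time is checked only after each page batch, and the document is reported as a partial success if the limit is exceeded while pages remain. This is not docling's actual implementation; `Status`, `batched`, `convert_with_timeout`, `process_batch` and the injectable `clock` are hypothetical names chosen for illustration, and the batch size of 4 is the default mentioned above.

```python
import time
from enum import Enum
from typing import Callable, Iterator, List, Optional, Sequence, Tuple


class Status(Enum):
    # Hypothetical status values, named after the proposal in this issue.
    SUCCESS = "success"
    PARTIAL_SUCCESS = "partial_success"


def batched(pages: Sequence[int], size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of `pages` with at most `size` items each."""
    for start in range(0, len(pages), size):
        yield pages[start:start + size]


def convert_with_timeout(
    pages: Sequence[int],
    process_batch: Callable[[Sequence[int]], List[int]],
    timeout: Optional[float] = None,
    batch_size: int = 4,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[Status, List[int]]:
    """Process pages batch by batch, checking the timeout between batches.

    A running batch is never interrupted, so the elapsed time reported
    when the check fires can be well above the configured timeout.
    """
    start = clock()
    done: List[int] = []
    for batch in batched(pages, batch_size):
        done.extend(process_batch(batch))
        elapsed = clock() - start
        # Assumption: stop early only if there is still work left to skip.
        if timeout is not None and elapsed > timeout and len(done) < len(pages):
            return Status.PARTIAL_SUCCESS, done
    return Status.SUCCESS, done
```

A check placed between batches would be consistent with the overshoot visible in the test logs, where a 10-second timeout is reported as exceeded at 24.555 seconds. Caller code can then decide whether to export the partial result or drop the document.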

cau-git pushed a commit that referenced this issue Dec 17, 2024
Signed-off-by: Abhishek Kumar <[email protected]>

Testing:
(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=10 --verbose
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
WARNING:docling.pipeline.base_pipeline:Document processing time (24.555 seconds) exceeded the specified timeout of 10.000 seconds
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 36.29 sec.
WARNING:docling.cli.main:Document /var/folders/d7/dsfkllxs0xs8x2t4fcjknj4c0000gn/T/tmpl6p08u5i/2206.01062v1.pdf failed to convert.
INFO:docling.cli.main:Processed 1 docs, of which 1 failed
INFO:docling.cli.main:All documents were converted in 36.29 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=100 --verbose
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 58.36 sec.
INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 58.56 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --verbose
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 59.82 sec.
INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 59.88 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling

 Usage: docling [OPTIONS] source

╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *    input_sources      source  PDF files to convert. Can be local file / directory paths or URL. [default: None] [required]        │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --from                                                       [docx|pptx|html|image|pdf|asciido  Specify input formats to convert    │
│                                                              c|md|xlsx]                         from. Defaults to all formats.      │
│                                                                                                 [default: None]                     │
│ --to                                                         [md|json|html|text|doctags]        Specify output formats. Defaults to │
│                                                                                                 Markdown.                           │
│                                                                                                 [default: None]                     │
│ --image-export-mode                                          [placeholder|embedded|referenced]  Image export mode for the document  │
│                                                                                                 (only in case of JSON, Markdown or  │
│                                                                                                 HTML). With `placeholder`, only the │
│                                                                                                 position of the image is marked in  │
│                                                                                                 the output. In `embedded` mode, the │
│                                                                                                 image is embedded as base64 encoded │
│                                                                                                 string. In `referenced` mode, the   │
│                                                                                                 image is exported in PNG format and │
│                                                                                                 referenced from the main exported   │
│                                                                                                 document.                           │
│                                                                                                 [default: embedded]                 │
│ --ocr                         --no-ocr                                                          If enabled, the bitmap content will │
│                                                                                                 be processed using OCR.             │
│                                                                                                 [default: ocr]                      │
│ --force-ocr                   --no-force-ocr                                                    Replace any existing text with OCR  │
│                                                                                                 generated text over the full        │
│                                                                                                 content.                            │
│                                                                                                 [default: no-force-ocr]             │
│ --ocr-engine                                                 [easyocr|tesseract_cli|tesseract|  The OCR engine to use.              │
│                                                              ocrmac|rapidocr]                   [default: easyocr]                  │
│ --ocr-lang                                                   TEXT                               Provide a comma-separated list of   │
│                                                                                                 languages used by the OCR engine.   │
│                                                                                                 Note that each OCR engine has       │
│                                                                                                 different values for the language   │
│                                                                                                 names.                              │
│                                                                                                 [default: None]                     │
│ --pdf-backend                                                [pypdfium2|dlparse_v1|dlparse_v2]  The PDF backend to use.             │
│                                                                                                 [default: dlparse_v2]               │
│ --table-mode                                                 [fast|accurate]                    The mode to use in the table        │
│                                                                                                 structure model.                    │
│                                                                                                 [default: fast]                     │
│ --artifacts-path                                             PATH                               If provided, the location of the    │
│                                                                                                 model artifacts.                    │
│                                                                                                 [default: None]                     │
│ --abort-on-error              --no-abort-on-error                                               If enabled, the bitmap content will │
│                                                                                                 be processed using OCR.             │
│                                                                                                 [default: no-abort-on-error]        │
│ --output                                                     PATH                               Output directory where results are  │
│                                                                                                 saved.                              │
│                                                                                                 [default: .]                        │
│ --verbose                 -v                                 INTEGER                            Set the verbosity level. -v for     │
│                                                                                                 info logging, -vv for debug         │
│                                                                                                 logging.                            │
│                                                                                                 [default: 0]                        │
│ --debug-visualize-cells       --no-debug-visualize-cells                                        Enable debug output which           │
│                                                                                                 visualizes the PDF cells            │
│                                                                                                 [default: no-debug-visualize-cells] │
│ --debug-visualize-ocr         --no-debug-visualize-ocr                                          Enable debug output which           │
│                                                                                                 visualizes the OCR cells            │
│                                                                                                 [default: no-debug-visualize-ocr]   │
│ --debug-visualize-layout      --no-debug-visualize-layout                                       Enable debug output which           │
│                                                                                                 visualizes the layour clusters      │
│                                                                                                 [default:                           │
│                                                                                                 no-debug-visualize-layout]          │
│ --debug-visualize-tables      --no-debug-visualize-tables                                       Enable debug output which           │
│                                                                                                 visualizes the table cells          │
│                                                                                                 [default:                           │
│                                                                                                 no-debug-visualize-tables]          │
│ --version                                                                                       Show version information.           │
│ --document-timeout                                           FLOAT                              The timeout for processing each     │
│                                                                                                 document, in seconds.               │
│                                                                                                 [default: None]                     │
│ --help                                                                                          Show this message and exit.         │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Signed-off-by: Christoph Auer <[email protected]>
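In the `--document-timeout=10` run logged above, the warning reports 24.555 seconds elapsed against a 10-second limit. That overshoot is consistent with the behaviour proposed earlier in this thread, where the timeout is only checked after each page batch finishes. The following is a minimal sketch of that idea, not docling's actual implementation: `process_in_batches`, its parameters, and the status strings are hypothetical names chosen for illustration, and the batch size of 4 is taken from the defaults mentioned in the discussion.

```python
import time
from typing import Callable, List, Optional, Tuple


def process_in_batches(
    pages: List[int],
    process_batch: Callable[[List[int]], None],
    batch_size: int = 4,
    timeout: Optional[float] = None,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[str, int]:
    """Process pages batch by batch, checking the deadline only between batches.

    Returns a (status, pages_processed) tuple. The status strings are
    illustrative placeholders, not docling's real status enum.
    """
    start = clock()
    done = 0
    for i in range(0, len(pages), batch_size):
        batch = pages[i : i + batch_size]
        # A running batch is never interrupted, so the elapsed time can
        # exceed the timeout by up to one batch's processing time.
        process_batch(batch)
        done += len(batch)
        elapsed = clock() - start
        if timeout is not None and elapsed > timeout:
            # Stop here; the caller decides whether to keep or drop
            # the partial result.
            return ("partial_success", done)
    return ("success", done)
```

With `timeout=None` the loop always runs to completion, matching the run above where `--document-timeout` is omitted. A caller in a bulk job could treat `"partial_success"` as either an exportable partial result or a failure; the CLI run above reports the timed-out document as failed.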