pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at row 656 #470

Closed
guglie opened this issue Nov 29, 2024 · 2 comments · Fixed by #472
Labels
bug Something isn't working

Comments

@guglie
Contributor

guglie commented Nov 29, 2024

Bug

Converting one PDF fails with the following error, although the same options work on other PDFs.
It seems related to pandas.read_csv() parsing the TSV output of Tesseract.

Encountered an error during conversion of document b137be2685712845d8afee55fe6327d2901290f9a852a25b3f7b19010df64e10:
Traceback (most recent call last):

  File ".../docling/pipeline/base_pipeline.py", line 149, in _build_document
    for p in pipeline_pages:  # Must exhaust!
             ^^^^^^^^^^^^^^

  File ".../docling/pipeline/base_pipeline.py", line 116, in _apply_on_pages
    yield from page_batch

  File ".../docling/models/page_assemble_model.py", line 59, in __call__
    for page in page_batch:
                ^^^^^^^^^^

  File ".../docling/models/table_structure_model.py", line 93, in __call__
    for page in page_batch:
                ^^^^^^^^^^

  File ".../docling/models/layout_model.py", line 281, in __call__
    for page in page_batch:
                ^^^^^^^^^^

  File ".../docling/models/tesseract_ocr_cli_model.py", line 140, in __call__
    df = self._run_tesseract(fname)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^

  File ".../docling/models/tesseract_ocr_cli_model.py", line 98, in _run_tesseract
    df = pd.read_csv(io.StringIO(decoded_data), sep="\t")
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File ".../pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File ".../pandas/io/parsers/readers.py", line 626, in _read
    return parser.read(nrows)
           ^^^^^^^^^^^^^^^^^^

  File ".../pandas/io/parsers/readers.py", line 1923, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File ".../pandas/io/parsers/c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "parsers.pyx", line 838, in pandas._libs.parsers.TextReader.read_low_memory

  File "parsers.pyx", line 905, in pandas._libs.parsers.TextReader._read_rows

  File "parsers.pyx", line 874, in pandas._libs.parsers.TextReader._tokenize_rows

  File "parsers.pyx", line 891, in pandas._libs.parsers.TextReader._check_tokenize_status

  File "parsers.pyx", line 2061, in pandas._libs.parsers.raise_parser_error

pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at row 656

Steps to reproduce

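from pathlib import Path

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractCliOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Force OCR on the whole page through the Tesseract CLI backend
ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True)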

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.ocr_options = ocr_options

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
        )
    }
)

conv_res = converter.convert(Path(my_pdf_path))

Docling version

Docling version: 2.5.2
Docling Core version: 2.4.0
Docling IBM Models version: 2.0.3
Docling Parse version: 2.0.4

Python version

Python 3.12.7

guglie added the bug label Nov 29, 2024
guglie added a commit to guglie/docling that referenced this issue Nov 29, 2024
@nikos-livathinos
Collaborator

@guglie could you please provide the input PDF file so we can reproduce the issue?

@guglie
Contributor Author

guglie commented Dec 2, 2024

@nikos-livathinos I cannot share the original document because it is confidential, but I generated one that reproduces the error:
quote-test.pdf

It happens when a text block starts with an opening quote that is never closed.
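
A minimal sketch of that failure mode, assuming a hand-written TSV string (a stand-in, not real Tesseract output) in which one text field begins with a double quote that is never closed. With default settings, pandas treats `"` as a quote character and reads to the end of the input looking for its match. Passing `quoting=csv.QUOTE_NONE` is one possible workaround, shown here only as an illustration and not necessarily the change made in #472:

```python
import csv
import io

import pandas as pd

# Hypothetical stand-in for Tesseract TSV output: the text field of the
# second data row starts with a double quote that is never closed.
tsv = 'level\tconf\ttext\n5\t96\tHello\n5\t91\t"Unclosed\n5\t93\tquote\n'

# Default parsing: the C tokenizer enters a quoted field at the lone quote
# and reaches end of input before finding the closing quote.
try:
    pd.read_csv(io.StringIO(tsv), sep="\t")
    default_error = None
except pd.errors.ParserError as exc:
    default_error = str(exc)

print("default parsing error:", default_error)

# QUOTE_NONE tells pandas to treat quote characters as ordinary text,
# so every tab-separated line becomes exactly one row.
df = pd.read_csv(io.StringIO(tsv), sep="\t", quoting=csv.QUOTE_NONE)
print(df)
```

With `QUOTE_NONE`, the quote character is kept verbatim in the `text` column, so any downstream code that expects unquoted text would need to handle it.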

Maybe @gaspardpetit can share another file, since he ran into the same error.

dolfim-ibm pushed a commit that referenced this issue Dec 3, 2024
ab-shrek pushed a commit to ab-shrek/docling that referenced this issue Dec 6, 2024
lucas-morin pushed a commit to lucas-morin/docling that referenced this issue Dec 10, 2024
cau-git pushed a commit that referenced this issue Dec 17, 2024