pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at row 656 #470

guglie · 2024-11-29T16:21:56Z

Bug

When converting a PDF I get the following error; the same options work on other PDFs.
It seems related to the pandas.read_csv() call on the TSV output of Tesseract.

Encountered an error during conversion of document b137be2685712845d8afee55fe6327d2901290f9a852a25b3f7b19010df64e10:
Traceback (most recent call last):

  File ".../docling/pipeline/base_pipeline.py", line 149, in _build_document
    for p in pipeline_pages:  # Must exhaust!
             ^^^^^^^^^^^^^^

  File ".../docling/pipeline/base_pipeline.py", line 116, in _apply_on_pages
    yield from page_batch

  File ".../docling/models/page_assemble_model.py", line 59, in __call__
    for page in page_batch:
                ^^^^^^^^^^

  File ".../docling/models/table_structure_model.py", line 93, in __call__
    for page in page_batch:
                ^^^^^^^^^^

  File ".../docling/models/layout_model.py", line 281, in __call__
    for page in page_batch:
                ^^^^^^^^^^

  File ".../docling/models/tesseract_ocr_cli_model.py", line 140, in __call__
    df = self._run_tesseract(fname)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^

  File ".../docling/models/tesseract_ocr_cli_model.py", line 98, in _run_tesseract
    df = pd.read_csv(io.StringIO(decoded_data), sep="\t")
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File ".../pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File ".../pandas/io/parsers/readers.py", line 626, in _read
    return parser.read(nrows)
           ^^^^^^^^^^^^^^^^^^

  File ".../pandas/io/parsers/readers.py", line 1923, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File ".../pandas/io/parsers/c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "parsers.pyx", line 838, in pandas._libs.parsers.TextReader.read_low_memory

  File "parsers.pyx", line 905, in pandas._libs.parsers.TextReader._read_rows

  File "parsers.pyx", line 874, in pandas._libs.parsers.TextReader._tokenize_rows

  File "parsers.pyx", line 891, in pandas._libs.parsers.TextReader._check_tokenize_status

  File "parsers.pyx", line 2061, in pandas._libs.parsers.raise_parser_error

pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at row 656

Steps to reproduce

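from pathlib import Path

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractCliOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Force OCR on the whole page through the Tesseract CLI backend
ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True)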

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.ocr_options = ocr_options

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
        )
    }
)

conv_res = converter.convert(Path(my_pdf_path))

Docling version

Docling version: 2.5.2
Docling Core version: 2.4.0
Docling IBM Models version: 2.0.3
Docling Parse version: 2.0.4

Python version

Python 3.12.7


nikos-livathinos · 2024-12-02T14:38:56Z

@guglie could you please provide the input PDF file so we can reproduce the issue?

guglie · 2024-12-02T17:53:02Z

@nikos-livathinos I cannot share the original confidential document, but let me generate one for you:
quote-test.pdf

It happens when a text block starts with an opening quote that is never closed.

Maybe @gaspardpetit can share another file as he had the same error.
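
The behavior described above can be reproduced with pandas alone, without Tesseract. The following is a minimal sketch, assuming the TSV contains a text field that begins with an unmatched double quote. The sample data and the `parse_tsv` helper are hypothetical and only for illustration. Passing `quoting=csv.QUOTE_NONE` is one way to make pandas treat quote characters as literal text; it is not necessarily the exact change merged for this issue.

```python
import csv
import io

import pandas as pd

# Hypothetical sample mimicking Tesseract TSV output: the text field of
# the last row starts with a double quote that is never closed.
SAMPLE_TSV = (
    "level\tconf\ttext\n"
    "5\t96\tHello\n"
    "5\t91\t\"quoted\n"
)


def parse_tsv(data: str, literal_quotes: bool) -> pd.DataFrame:
    """Parse tab-separated text, optionally treating quotes as plain characters."""
    quoting = csv.QUOTE_NONE if literal_quotes else csv.QUOTE_MINIMAL
    return pd.read_csv(io.StringIO(data), sep="\t", quoting=quoting)


# Default quoting: the opening quote starts a quoted field that runs to EOF.
try:
    parse_tsv(SAMPLE_TSV, literal_quotes=False)
    default_failed = False
except pd.errors.ParserError:
    default_failed = True

# Literal quoting: the quote character is kept as part of the text value.
df = parse_tsv(SAMPLE_TSV, literal_quotes=True)
```

Since Tesseract's TSV output does not rely on CSV-style quoting, disabling quote handling should not lose information, but that is an assumption worth checking against your own documents.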


guglie added the bug label Nov 29, 2024

guglie added a commit to guglie/docling that referenced this issue Nov 29, 2024

fix: ParserError EOF inside string (DS4SD#470)

a523059

Signed-off-by: guglie <[email protected]>

This was referenced Nov 29, 2024

fix: ParserError EOF inside string (#470) #472

Merged

fix: tesseract_ocr_cli csv parsing fails when text contains single quotes #482

Closed

dolfim-ibm pushed a commit that referenced this issue Dec 3, 2024

fix: ParserError EOF inside string (#470) (#472)

c90c41c

Signed-off-by: guglie <[email protected]>

dolfim-ibm closed this as completed in #472 Dec 3, 2024

ab-shrek pushed a commit to ab-shrek/docling that referenced this issue Dec 6, 2024

fix: ParserError EOF inside string (DS4SD#470) (DS4SD#472)

598455b

Signed-off-by: guglie <[email protected]>

lucas-morin pushed a commit to lucas-morin/docling that referenced this issue Dec 10, 2024

fix: ParserError EOF inside string (DS4SD#470) (DS4SD#472)

d1244a5

Signed-off-by: guglie <[email protected]>

cau-git pushed a commit that referenced this issue Dec 17, 2024

fix: ParserError EOF inside string (#470) (#472)

a7e3f71

Signed-off-by: guglie <[email protected]> Signed-off-by: Christoph Auer <[email protected]>
