Wrong layout extraction (texts blended between two columns) #660

Closed
jeromemassot opened this issue Dec 30, 2024 · 2 comments
@jeromemassot

Bug

The extracted text interleaves content from the two columns of the original paper.
The paper is available at https://pmc.ncbi.nlm.nih.gov/articles/PMC113777

Output:
=== 0 ===
('Journal of Biomedicine and Biotechnology · 1:2 (2001)89-90 · PII. '
'S111072430100016X · http://jbb.hindawi.com/ CORRESPONDENCEARTICLE')
=== 1 ===
'The role of neuraminidase inhibitors in the treatment'
=== 2 ===
'and prevention of influenza'
=== 3 ===
('Naem Shahrour * American International Health Council, 414 South Craig St. '
'#269, Pittsburgh, PA 15213, USA The causative agents of acute respiratory '
'infections in children and adults are mostly thought to be viruses. Many '
'types of viruses could cause similar symptoms of ARI. Among them, influenza '
'viruses A and B and respiratory syncytial virus (RSV)are thought to be the '
'most important because of the severity of illness after infection and their '
'high communicability in the human population [1]. Type C influenza virus '
'modalities have been investigated and some have already been introduced to '
'practice. The most recent breakthrough in that direction is the introduction '
'of neuraminidase inhibitors for the treatment and prevention of influenza '
'infection. The neuraminidase is a surface glycoprotein that is composed of '
'eleven conserved residues [4] which catalyze the cleavage of sialic acid '
'residues terminally linked to glycopro-')

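The TextItem at index 3 interleaves text from the two columns: after "Type C influenza virus" the text appears to jump from one column to the other ("modalities have been investigated ...").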
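
To make the symptom concrete, here is a toy sketch of how column blending can arise in general when text cells are ordered only by vertical position. It is not the docling or docling-parse algorithm, and it does not claim to be the cause of this bug. The `Cell` type, the coordinates, and the half-page column split are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """A text fragment with the top-left corner of its box (hypothetical type)."""
    text: str
    x: float  # left edge, in page units
    y: float  # top edge, measured downward from the top of the page


def naive_order(cells):
    """Sort top-to-bottom, then left-to-right; blends columns on a two-column page."""
    return [c.text for c in sorted(cells, key=lambda c: (c.y, c.x))]


def column_order(cells, page_width):
    """Read the whole left column first, then the right column.

    Assumes exactly two columns split at the middle of the page.
    """
    mid = page_width / 2
    left = sorted((c for c in cells if c.x < mid), key=lambda c: c.y)
    right = sorted((c for c in cells if c.x >= mid), key=lambda c: c.y)
    return [c.text for c in left + right]


# Two lines in each column of a made-up 600-unit-wide page.
cells = [
    Cell("L1", 50, 100), Cell("R1", 320, 100),
    Cell("L2", 50, 120), Cell("R2", 320, 120),
]
print(naive_order(cells))                   # ['L1', 'R1', 'L2', 'R2']
print(column_order(cells, page_width=600))  # ['L1', 'L2', 'R1', 'R2']
```

The first ordering mixes the columns, which is the kind of result reported above; the second keeps each column contiguous.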

Steps to reproduce

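import os
import pprint

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    AcceleratorDevice,
    AcceleratorOptions,
    PdfPipelineOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling_core.types.doc import TextItem


def convert_doc(path):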

    # define the conversion options
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.do_table_structure = True
    pipeline_options.table_structure_options.do_cell_matching = True
    pipeline_options.ocr_options.lang = ["en"]
    pipeline_options.accelerator_options = AcceleratorOptions(
        num_threads=8, device=AcceleratorDevice.AUTO
    )

    # create the doc converter
    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )

    # convert the document
    result = doc_converter.convert(path)
    return result

doc = convert_doc(os.path.join("PMC113777.pdf")).document

i = 0
for item, level in doc.iterate_items():
    if isinstance(item, TextItem):
        print(f"=== {i} ===")
        pprint.pprint(item.text)
        i += 1

Docling version

2.14.0

Python version

3.12.3

@jeromemassot added the bug label Dec 30, 2024
@MahmoudAtef4499

Me too.

@cau-git self-assigned this Jan 6, 2025
@cau-git
Contributor

cau-git commented Jan 8, 2025

@jeromemassot confirmed, this is apparently a bug in docling-parse. I opened an issue to track it there, and I will close this issue.

For verification, you can run this to visualize the problem:

docling --debug-visualize-cells --debug-visualize-layout S111072430100016X.pdf

This workaround avoids the problem by using the pypdfium2 backend instead:

docling --debug-visualize-cells --debug-visualize-layout --pdf-backend pypdfium2 S111072430100016X.pdf
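
For anyone using the Python API as in the reproduction script, the same backend switch can probably be made through the `backend` argument of `PdfFormatOption`. This is a minimal sketch, assuming the `PyPdfiumDocumentBackend` class from `docling.backend.pypdfium2_backend` is the Python counterpart of `--pdf-backend pypdfium2`. The function name `build_pypdfium_converter` is made up for this example, and the imports sit inside it so that docling is only loaded when the function is called.

```python
def build_pypdfium_converter():
    """Return a DocumentConverter that parses PDFs with the pypdfium2 backend.

    Assumption: this mirrors the `--pdf-backend pypdfium2` CLI workaround.
    """
    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.do_table_structure = True

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
                # Swap the default PDF backend for pypdfium2.
                backend=PyPdfiumDocumentBackend,
            )
        }
    )
```

Calling `build_pypdfium_converter().convert("PMC113777.pdf").document` should then return the document parsed with the alternative backend, so the TextItem output can be compared against the one above.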

@cau-git closed this as completed Jan 8, 2025