Output:
=== 0 ===
('Journal of Biomedicine and Biotechnology · 1:2 (2001)89-90 · PII. '
'S111072430100016X · http://jbb.hindawi.com/ CORRESPONDENCEARTICLE')
=== 1 ===
'The role of neuraminidase inhibitors in the treatment'
=== 2 ===
'and prevention of influenza'
=== 3 ===
('Naem Shahrour * American International Health Council, 414 South Craig St. '
'#269, Pittsburgh, PA 15213, USA The causative agents of acute respiratory '
'infections in children and adults are mostly thought to be viruses. Many '
'types of viruses could cause similar symptoms of ARI. Among them, influenza '
'viruses A and B and respiratory syncytial virus (RSV)are thought to be the '
'most important because of the severity of illness after infection and their '
'high communicability in the human population [1]. Type C influenza virus '
'modalities have been investigated and some have already been introduced to '
'practice. The most recent breakthrough in that direction is the introduction '
'of neuraminidase inhibitors for the treatment and prevention of influenza '
'infection. The neuraminidase is a surface glycoprotein that is composed of '
'eleven conserved residues [4] which catalyze the cleavage of sialic acid '
'residues terminally linked to glycopro-')
The TextItem at index 3 (the fourth item) interleaves text from the two columns.
Steps to reproduce
import os
import pprint

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    AcceleratorDevice,
    AcceleratorOptions,
    PdfPipelineOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling_core.types.doc import TextItem


def convert_doc(path):
    # define the conversion options
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.do_table_structure = True
    pipeline_options.table_structure_options.do_cell_matching = True
    pipeline_options.ocr_options.lang = ["en"]
    pipeline_options.accelerator_options = AcceleratorOptions(
        num_threads=8, device=AcceleratorDevice.AUTO
    )
    # create the doc converter
    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
    # convert the document
    result = doc_converter.convert(path)
    return result


doc = convert_doc(os.path.join("PMC113777.pdf")).document

# print every TextItem in reading order, numbered from 0
i = 0
for item, level in doc.iterate_items():
    if isinstance(item, TextItem):
        print(f"=== {i} ===")
        pprint.pprint(item.text)
        i += 1
Docling version
2.14.0
Python version
3.12.3
Bug
The extracted text interleaves the two columns of the original paper instead of reading each column in turn.
You can retrieve the paper at https://pmc.ncbi.nlm.nih.gov/articles/PMC113777
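One way to narrow this down might be to check whether the offending TextItem's bounding box stretches across both columns, which would suggest the layout step merged them into one block. The sketch below is only a rough illustration: `Box`, `crosses_gutter`, and the 5% `margin` are hypothetical names and values, not part of Docling, and it assumes the page splits into two columns at its horizontal midpoint.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Horizontal extent of a text block in page units (hypothetical stand-in)."""
    left: float
    right: float


def crosses_gutter(box: Box, page_width: float, margin: float = 0.05) -> bool:
    """Return True if the box reaches well into both halves of the page.

    Assumes a two-column page whose gutter sits at the horizontal midpoint;
    `margin` is the fraction of the page width allowed as slack around it.
    """
    mid = page_width / 2
    slack = page_width * margin
    return box.left < mid - slack and box.right > mid + slack


if __name__ == "__main__":
    page_width = 600.0
    print(crosses_gutter(Box(50, 550), page_width))   # block spanning both columns
    print(crosses_gutter(Box(50, 290), page_width))   # left-column block
    print(crosses_gutter(Box(310, 550), page_width))  # right-column block
```

With Docling, the left and right values would presumably come from each item's provenance bounding box and the width from the page size; I have not verified the exact attribute names against 2.14.0, so treat that mapping as an assumption.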