Commit

TST: Add test for layout_mode_font_height_weight of ``PageObject.extract_text()``
hpierre001 committed Oct 23, 2024
1 parent dad1788 commit c3dae7b
Showing 3 changed files with 85 additions and 0 deletions.
19 changes: 19 additions & 0 deletions resources/crazyones_layout_vertical_space.txt
@@ -0,0 +1,19 @@
The Crazy Ones
October 14, 1998

Heres to the crazy ones. The misfits. The rebels. The troublemakers.
The round pegs in the square holes.
The ones who see things differently. Theyre not fond of rules. And
they have no respect for the status quo. You can quote them,
disagree with them, glorify or vilify them.
About the only thing you cant do is ignore them. Because they change
things. They invent. They imagine. They heal. They explore. They
create. They inspire. They push the human race forward.
Maybe they have to be crazy.
How else can you stare at an empty canvas and see a work of art? Or
sit in silence and hear a song thats never been written? Or gaze at
a red planet and see a laboratory on wheels?
We make tools for these kinds of people.
While some see them as the crazy ones, we see genius. Because the
people who are crazy enough to think they can change the world,
are the ones who do.
25 changes: 25 additions & 0 deletions resources/crazyones_layout_vertical_space_font_height_weight.txt
@@ -0,0 +1,25 @@
The Crazy Ones
October 14, 1998

Heres to the crazy ones. The misfits. The rebels. The troublemakers.
The round pegs in the square holes.

The ones who see things differently. Theyre not fond of rules. And
they have no respect for the status quo. You can quote them,
disagree with them, glorify or vilify them.

About the only thing you cant do is ignore them. Because they change
things. They invent. They imagine. They heal. They explore. They
create. They inspire. They push the human race forward.

Maybe they have to be crazy.

How else can you stare at an empty canvas and see a work of art? Or
sit in silence and hear a song thats never been written? Or gaze at
a red planet and see a laboratory on wheels?

We make tools for these kinds of people.

While some see them as the crazy ones, we see genius. Because the
people who are crazy enough to think they can change the world,
are the ones who do.
41 changes: 41 additions & 0 deletions tests/test_text_extraction.py
@@ -219,3 +219,44 @@ def test_text_leading_height_unit():
    page = reader.pages[0]
    extracted = page.extract_text()
    assert "Something[cited]\n" in extracted


def test_layout_mode_space_vertically_font_height_weight():
    """Tests layout mode with vertical space and font height weight (issue #2915)"""
    with open(RESOURCE_ROOT / "crazyones.pdf", "rb") as inputfile:
        # Load PDF file from file
        reader = PdfReader(inputfile)
        page = reader.pages[0]

        # Normal behaviour
        with open(RESOURCE_ROOT / "crazyones_layout_vertical_space.txt", "rb") as pdftext_file:
            pdftext = pdftext_file.read()

        text = page.extract_text(extraction_mode="layout", layout_mode_space_vertically=True).encode("utf-8")

        # Compare the text of the PDF to a known source
        for expected_line, actual_line in zip(text.splitlines(), pdftext.splitlines()):
            assert expected_line == actual_line

        pdftext = pdftext.replace(b"\r\n", b"\n")  # fix for windows
        assert text == pdftext, (
            "PDF extracted text differs from expected value.\n\n"
            "Expected:\n\n%r\n\nExtracted:\n\n%r\n\n" % (pdftext, text)
        )

        # Blank lines are added to truly separate paragraphs
        with open(RESOURCE_ROOT / "crazyones_layout_vertical_space_font_height_weight.txt", "rb") as pdftext_file:
            pdftext = pdftext_file.read()

        text = page.extract_text(extraction_mode="layout", layout_mode_space_vertically=True,
                                 layout_mode_font_height_weight=0.85).encode("utf-8")

        # Compare the text of the PDF to a known source
        for expected_line, actual_line in zip(text.splitlines(), pdftext.splitlines()):
            assert expected_line == actual_line

        pdftext = pdftext.replace(b"\r\n", b"\n")  # fix for windows
        assert text == pdftext, (
            "PDF extracted text differs from expected value.\n\n"
            "Expected:\n\n%r\n\nExtracted:\n\n%r\n\n" % (pdftext, text)
        )
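
The test compares extracted text with a stored file line by line, then byte for byte after normalizing Windows line endings. A minimal sketch of how that comparison could be factored into a diagnostic helper follows. `normalize_newlines` and `first_mismatch` are hypothetical names, not part of pypdf or of this commit, and the sketch uses only the standard library.

```python
from itertools import zip_longest
from typing import Optional, Tuple


def normalize_newlines(data: bytes) -> bytes:
    """Convert Windows line endings to Unix ones, as the test does before comparing."""
    return data.replace(b"\r\n", b"\n")


def first_mismatch(
    actual: bytes, expected: bytes
) -> Optional[Tuple[int, Optional[bytes], Optional[bytes]]]:
    """Return (line_number, actual_line, expected_line) for the first differing line.

    Hypothetical helper for illustration only. Unlike a plain zip(), which stops
    at the shorter input, zip_longest() also reports a mismatch when one side
    has extra lines; the missing side is reported as None.
    """
    actual_lines = normalize_newlines(actual).splitlines()
    expected_lines = normalize_newlines(expected).splitlines()
    pairs = zip_longest(actual_lines, expected_lines, fillvalue=None)
    for number, (actual_line, expected_line) in enumerate(pairs, start=1):
        if actual_line != expected_line:
            return number, actual_line, expected_line
    return None
```

Because `splitlines()` ignores a single trailing newline, this helper is only a diagnostic aid. The full `text == pdftext` assertion in the test is still needed to catch differences at the end of the file.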
