PageObject.extract_text's text_visitor reports a wrong matrix for some text nodes #2513
Comments
After doing some debugging, I found that the
We see that for the text node that has the correct text matrix, both Lines 1798 to 1800 in 6cf47c5
This
It seems to me that the condition
Git blame shows that this line was last modified in commit bcd85c4. Reverting to 3.16.2 (the last release before this change) gives the correct output for the example, but the output is broken in 3.16.3. Since this commit is the only one that touched text extraction between 3.16.2 and 3.16.3, I think it's safe to say that this issue is a regression caused by commit bcd85c4.
Thanks for the analysis. This appears to be a duplicate of #2353 in this case.
While trying to extract lemmas from this page, I found that some text "nodes" (not sure what the technical term is, I'll refer to them as nodes in this issue) are passed to visitor_text with seemingly wrong matrix values.
Environment
Code + PDF
This is a minimal, complete example that shows the issue. Observe (using a PDF reader) that the nodes ZURRA˓A, KHIRBE and T EL appear next to each other. Also save the script below (to example.py, for example) and run it, passing the path to the attached PDF as the first parameter.
Observe that the output is:
I expected the last two elements of the matrix reported for the T EL node to be the x and y position of the node (which pdfbox shows to be 177.92 and 687.12 respectively).
I also noticed that pdfbox seems to indicate the text in the node is T EL, but pypdf reports " T EL" (note the leading space). Is pypdf mistakenly adding a leading space?
Files
The sample PDF used with this is a page from a PDF version of the Anchor Bible Dictionary: zurra_page.pdf
This page in pdfbox's debugger, which clearly shows the coordinates of the T EL node:
Traceback
There is no exception raised, so there also is no traceback.
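For readers who want to reproduce the check described under "Code + PDF", here is a minimal sketch of what such a script could look like. It is not the original example.py. It assumes pypdf's documented visitor_text callback, which receives the text, the current transformation matrix, the text matrix, the font dictionary and the font size, and it treats the last two elements of the text matrix as the node position, as the issue does. The names make_collector and dump_text_nodes are hypothetical helpers introduced only for illustration.

```python
def make_collector():
    """Build a callback with the five-argument shape pypdf passes to visitor_text."""
    nodes = []

    def visitor(text, cm, tm, font_dict, font_size):
        # tm is a 6-element matrix [a, b, c, d, e, f]; e and f are its
        # translation part, which this issue expects to be the node position.
        if text.strip():  # skip nodes that contain only whitespace
            nodes.append({"text": text, "tm": list(tm), "x": tm[4], "y": tm[5]})

    return visitor, nodes


def dump_text_nodes(pdf_path, page_index=0):
    """Print every non-blank text node of one page together with its text matrix."""
    # Imported here so the collector above can be exercised without pypdf installed.
    from pypdf import PdfReader

    visitor, nodes = make_collector()
    PdfReader(pdf_path).pages[page_index].extract_text(visitor_text=visitor)
    for node in nodes:
        # repr() makes a leading space in the node text visible.
        print(repr(node["text"]), node["tm"])
    return nodes
```

Calling dump_text_nodes("zurra_page.pdf") should list each node with its matrix, so the entry for the T EL node can be compared against the coordinates pdfbox shows. Printing with repr also makes it easy to see whether the reported text starts with a space.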