ENH: Addition of optional visitor-functions in extract_text() #1252
You may use these callbacks to visit all operators and their arguments and to get the positions of the text objects. For example, you can use them to extract the rectangles of a table and the texts in its cells in some PDF files.
The test in tests/test_page.py extracts the labels of the rectangles in Figure 2 of GeoBase_NHNC1_Data_Model_UML_EN.
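A minimal sketch of how the text callback might be used to record where each text fragment sits. It assumes the callback is invoked as `visitor_text(text, cm, tm, font_dict, font_size)` and that `tm` is a six-element text matrix whose last two entries are the x/y translation. `make_text_collector` and `extract_positioned_text` are hypothetical helper names, not part of PyPDF2, and the sketch ignores `cm`, so the coordinates are only approximate on pages that scale or rotate their content.

```python
from typing import Any, Callable, List, Tuple

PositionedFragment = Tuple[float, float, str]


def make_text_collector() -> Tuple[List[PositionedFragment], Callable[..., None]]:
    """Return a list and a visitor that appends (x, y, text) tuples to it."""
    fragments: List[PositionedFragment] = []

    def visitor_text(text: str, cm: Any, tm: Any, font_dict: Any, font_size: Any) -> None:
        # Assumption: tm is [a, b, c, d, e, f]; e and f are the translation.
        if text.strip():
            fragments.append((float(tm[4]), float(tm[5]), text))

    return fragments, visitor_text


def extract_positioned_text(pdf_path: str, page_index: int = 0) -> List[PositionedFragment]:
    """Hypothetical helper: run extract_text() with the collector attached."""
    # Imported lazily so the collector itself can be used without PyPDF2.
    from PyPDF2 import PdfReader

    fragments, visitor_text = make_text_collector()
    page = PdfReader(pdf_path).pages[page_index]
    page.extract_text(visitor_text=visitor_text)
    return fragments
```

The collector is a closure so that no global state is needed; each call to `make_text_collector()` yields an independent list.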
Thank you for the contribution ❤️ I didn't expect that we could get text-tokens and their positions in the document in such a rather easy extension. Nice! I still need to think about this PR / check if there is a performance impact and look at it from a maintenance perspective. In the meantime, would you mind running |
I executed black, it reformatted my changes in _page.py and test_page.py :-). |
The function extractTable(listTexts, listRects) uses the function extractTextAndRectangles(page, rectFilter) which uses the function extract_text with visitors to extract text in cells of a table.
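The idea behind these test helpers can be sketched roughly as follows. This is not the code from tests/test_page.py: `Rect` and `TableVisitor` are hypothetical names, and the sketch assumes that operators arrive as bytes (so the rectangle operator is `b"re"` with operands `x y w h`) and that text positions can be read from `tm[4]` and `tm[5]`. It also skips the transformation by `cm`, which a robust implementation would probably need.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass(frozen=True)
class Rect:
    x: float
    y: float
    w: float
    h: float

    def contains(self, px: float, py: float) -> bool:
        return self.x <= px <= self.x + self.w and self.y <= py <= self.y + self.h


class TableVisitor:
    """Collects rectangles and text fragments, then groups texts by cell."""

    def __init__(self) -> None:
        self.rects: List[Rect] = []
        self.texts: List[Tuple[float, float, str]] = []

    def visitor_operand_before(self, op: bytes, args: Any, cm: Any, tm: Any) -> None:
        # Assumption: "re" appends a rectangle with operands x, y, width, height.
        if op == b"re" and len(args) >= 4:
            x, y, w, h = (float(a) for a in args[:4])
            # Normalize negative widths/heights so contains() stays simple.
            if w < 0:
                x, w = x + w, -w
            if h < 0:
                y, h = y + h, -h
            self.rects.append(Rect(x, y, w, h))

    def visitor_text(self, text: str, cm: Any, tm: Any, font_dict: Any, font_size: Any) -> None:
        if text.strip():
            self.texts.append((float(tm[4]), float(tm[5]), text.strip()))

    def cells(self) -> List[Tuple[Rect, List[str]]]:
        """Assign each text to the smallest rectangle containing its origin."""
        buckets: List[List[str]] = [[] for _ in self.rects]
        for x, y, text in self.texts:
            hits = [i for i, r in enumerate(self.rects) if r.contains(x, y)]
            if hits:
                best = min(hits, key=lambda i: self.rects[i].w * self.rects[i].h)
                buckets[best].append(text)
        return [(r, t) for r, t in zip(self.rects, buckets) if t]
```

With a page object at hand, the visitor would presumably be attached as `page.extract_text(visitor_operand_before=v.visitor_operand_before, visitor_text=v.visitor_text)`, after which `v.cells()` returns the non-empty cells.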
@srogmann, |
When executing extract_text(...) the optional visitor-function visitor_text gets the font-dictionary and the font-size. The font-dictionary contains the font-name and other font properties.
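As a rough illustration of what the font information allows, the sketch below tallies characters per font name and keeps fragments above a size threshold. It assumes that `font_dict` behaves like a dictionary with a `/BaseFont` entry (it may also be `None`) and that `font_size` is a plain number. `make_font_visitor` and the `min_size` parameter are hypothetical, and the reported size may differ from the rendered size when the page scales its text.

```python
from collections import Counter
from typing import Any, Callable, List, Tuple


def make_font_visitor(min_size: float = 0.0) -> Tuple[Counter, List[str], Callable[..., None]]:
    """Return (characters per font, texts with size >= min_size, visitor)."""
    usage: Counter = Counter()
    large: List[str] = []

    def visitor_text(text: str, cm: Any, tm: Any, font_dict: Any, font_size: Any) -> None:
        if not text.strip():
            return
        # Assumption: the font name is stored under "/BaseFont".
        name = "unknown"
        if font_dict is not None:
            name = str(font_dict.get("/BaseFont", "unknown"))
        usage[name] += len(text)
        if font_size is not None and float(font_size) >= min_size:
            large.append(text)

    return usage, large, visitor_text
```

Passing the returned `visitor_text` to `extract_text(visitor_text=...)` would then fill `usage` and `large` as a side effect of the extraction.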
@pubpub-zz
[...]
|
@pubpub-zz
|
@srogmann What is the state of this PR? Do you need help to resolve the merge conflicts / the failing test? Besides those, is the PR ready in your opinion? |
@MartinThoma In my opinion the PR was ready 22 days ago. I will have a look at the current state. |
I'm sorry for the delay; I thought there still was something to be done 🙈 |
@MartinThoma I merged and resolved conflicts. The PR should work. You may have a look at it. Each change of the output-result in _extract_text requires a visitor-call in _extract_text:
There are quite a lot of variables describing the current state, cmap[3] contains the font-dictionary. Perhaps in future there will be one object describing the state. In tests/test_page.py there are some functions which might be of interest in PyPDF2 or pdfly in future to support using a visitor.
|
@MartinThoma I added DictionaryObject in cmaps (my last commit-comment is wrong). |
Codecov Report
Base: 94.53% // Head: 94.10% // Decreases project coverage by
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1252      +/-   ##
==========================================
- Coverage   94.53%   94.10%   -0.44%
==========================================
  Files          28       28
  Lines        5035     5068      +33
  Branches     1035     1051      +16
==========================================
+ Hits         4760     4769       +9
- Misses        165      177      +12
- Partials      110      122      +12
@srogmann Thank you for your contribution! If you want, I can add you to https://pypdf2.readthedocs.io/en/latest/meta/CONTRIBUTORS.html :-) |
Yes, I think that would make sense. Also getting rid of the use of global / non-local variables and passing data explicitly around might help. |
Looks good to me!
@pubpub-zz Did you have a look? What do you think about the changes? If you're good with them as well, I would merge + release :-) |
This sounds good. I had some requests earlier that have been fulfilled.
@srogmann Very nice work 🥳 I think this has a lot of potential, but it's also pretty hard to use. We will need to add documentation and typical examples. I'm super excited to see how people will use it 🎉 |
@MartinThoma Thanks for merging! An addition to CONTRIBUTORS.html would be fine.
In tests/test_page.py there are two util-classes, PositionedText and Rectangle. After renaming, they might be useful when one wants to write one's own visitor. Documentation and typical examples would be nice. But docs/user/ is contained in the repository, too, so I can think about another pull request containing some documentation and examples (e.g. the util-classes mentioned and a sample to extract tables or to ignore page headers).
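A sample for ignoring page headers and footers could look roughly like this. It is only a sketch: `make_body_visitor` is a hypothetical helper, the vertical limits are page-specific values in PDF units that have to be chosen by inspection, and the y-coordinate is read from `tm[5]` under the assumption that the text matrix carries the vertical position.

```python
from typing import Any, Callable, List, Tuple


def make_body_visitor(y_min: float, y_max: float) -> Tuple[List[str], Callable[..., None]]:
    """Return a list and a visitor that keeps only text with y_min < y < y_max."""
    parts: List[str] = []

    def visitor_text(text: str, cm: Any, tm: Any, font_dict: Any, font_size: Any) -> None:
        # Assumption: tm[5] is the vertical position of the text fragment.
        if y_min < float(tm[5]) < y_max:
            parts.append(text)

    return parts, visitor_text


def join_body(parts: List[str]) -> str:
    """Join the kept fragments into one string."""
    return "".join(parts)
```

After `page.extract_text(visitor_text=visitor_text)`, calling `join_body(parts)` would give the page text without the fragments that fall outside the chosen band.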
New Features (ENH):
- Addition of optional visitor-functions in extract_text() (#1252)
- Add metadata.creation_date and modification_date (#1364)
- Add PageObject.images attribute (#1330)

Bug Fixes (BUG):
- Lookup index in _xobj_to_image can be ByteStringObject (#1366)
- 'IndexError: index out of range' when using extract_text (#1361)
- Errors in transfer_rotation_to_content() (#1356)

Robustness (ROB):
- Ensure update_page_form_field_values does not fail if no fields (#1346)

Testing (TST):
- read_string_from_stream performance (#1355)

Full Changelog: 2.10.9...2.11.0
This request adds optional visitor-callbacks in extract_text(). _extract_text() calls these visitor-methods while scanning the text-objects of a page. So one can analyze the operations in the page and the positions of the texts. tests/test_page.py extracts the texts of labels in a Figure and serves as an example of how to use this enhancement.