ENH: Addition of optional visitor-functions in extract_text() #1252

Merged
merged 28 commits into from
Sep 25, 2022


srogmann
Contributor

@srogmann srogmann commented Aug 18, 2022

This request adds optional visitor-callbacks in extract_text().

_extract_text() calls these visitor-methods while scanning the text-objects of a page. So one can analyze the operations in the page and the positions of the texts.

tests/test_page.py extracts the texts of labels in a figure and serves as an example of how to use this enhancement.

You may use these callbacks to visit all operators and their arguments
and to get the positions of the text-objects.
You may use this to extract the rectangles of a table and the texts
in its cells in some PDF files.
It extracts labels of rectangles in Figure 2 of GeoBase_NHNC1_Data_Model_UML_EN.
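
A minimal sketch of how an operator visitor might collect the rectangles of a table. It assumes the callback is passed to extract_text() as `visitor_operand_before` and is called with `(operator, operands, cm_matrix, tm_matrix)`, with the operator given as bytes such as `b"re"`. `Rect` and `make_rect_collector` are hypothetical names, and the current transformation matrix is ignored for simplicity.

```python
from typing import Any, Callable, List, NamedTuple, Sequence


class Rect(NamedTuple):
    """Hypothetical helper: one rectangle taken from a PDF `re` operator."""

    x: float
    y: float
    w: float
    h: float


def make_rect_collector(rects: List[Rect]) -> Callable[..., None]:
    """Return an operator visitor that appends every `re` rectangle to `rects`."""

    def visit_operand(
        operator: bytes, operands: Sequence[Any], cm_matrix: Any, tm_matrix: Any
    ) -> None:
        # Assumption: operators arrive as bytes and `re` carries
        # x, y, width, height as its first four operands.
        if operator == b"re" and len(operands) >= 4:
            x, y, w, h = (float(v) for v in operands[:4])
            rects.append(Rect(x, y, w, h))

    return visit_operand


# Possible usage (not executed here; `page` is assumed to be a page object):
#   rects: List[Rect] = []
#   page.extract_text(visitor_operand_before=make_rect_collector(rects))
```

Returning a closure keeps the collected list explicit instead of relying on a module-level variable.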
@MartinThoma
Member

Thank you for the contribution ❤️

I didn't expect that we could get text-tokens and their positions in the document in such a rather easy extension. Nice!

I still need to think about this PR / check if there is a performance impact and look at it from a maintenance perspective.

In the meantime, would you mind running black .? You need pip install black; it's a code-formatter that fixes all of the Flake8 issues.

@srogmann
Contributor Author

I executed black, it reformatted my changes in _page.py and test_page.py :-).

The function extractTable(listTexts, listRects) uses the function
extractTextAndRectangles(page, rectFilter) which uses the function
extract_text with visitors to extract text in cells of a table.
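
The implementation of extractTable and extractTextAndRectangles is not shown in this conversation. As a rough sketch of the underlying idea only, the following assigns each collected text to the rectangle that contains its position. All names (`Box`, `Label`, `contains`, `group_texts_by_box`) are hypothetical, and rectangles are assumed to have non-negative width and height.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)
Label = Tuple[str, float, float]  # (text, x, y)


def contains(box: Box, x: float, y: float) -> bool:
    """Return True if the point (x, y) lies inside the box or on its border."""
    bx, by, bw, bh = box
    return bx <= x <= bx + bw and by <= y <= by + bh


def group_texts_by_box(texts: List[Label], boxes: List[Box]) -> Dict[Box, List[str]]:
    """Assign each text to the first box that contains its position."""
    cells: Dict[Box, List[str]] = {box: [] for box in boxes}
    for text, x, y in texts:
        for box in boxes:
            if contains(box, x, y):
                cells[box].append(text)
                break  # texts outside every box are silently dropped
    return cells
```

Sorting the resulting cells by their coordinates would then give rows and columns; that step is left out here.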
@pubpub-zz
Collaborator

@srogmann,
some extra parameters passed to the functions could be useful for filtering: BaseFont name and size (rescaled to the page); this would be useful for title extraction, for example.

When executing extract_text(...) the optional visitor-function visitor_text
gets the font-dictionary and the font-size.
The font-dictionary contains the font-name and other font properties.
@srogmann
Contributor Author

@pubpub-zz
I added the font-dictionary and the font-size in the text-visitor-function. I added the reference to the font-dictionary instead of the BaseFont because I didn't know what might be of further interest.

    def print_visi(text, cm_matrix, tm_matrix, font_dict, font_size):
        if text.strip() != "":
            listTexts.append(
                PositionedText(
                    text, tm_matrix[4], tm_matrix[5], font_dict, font_size
                )
            )

[...]

# Check the fonts. We check: /F2 9.96 Tf [...] [(Dat)-2(e)] TJ
textDatOfDate = listRows[0][0][0]
assert textDatOfDate.font_dict is not None
assert textDatOfDate.font_dict["/Name"] == "/F2"
assert textDatOfDate.font_dict["/BaseFont"] == "/Arial,Bold"
assert textDatOfDate.font_dict["/Encoding"] == "/WinAnsiEncoding"
assert textDatOfDate.font_size == 9.96
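
The title-extraction use case suggested above could be sketched as a text visitor that keeps only text set in a large font. The visitor signature follows print_visi; `make_heading_collector` and the threshold of 14.0 are hypothetical, and the sketch assumes the reported font size can be compared directly without applying any scaling from the text matrix.

```python
from typing import Any, Callable, List, Optional


def make_heading_collector(
    headings: List[str], min_font_size: float = 14.0
) -> Callable[..., None]:
    """Return a text visitor that collects non-empty text at or above a font size."""

    def visit_text(
        text: str,
        cm_matrix: Any,
        tm_matrix: Any,
        font_dict: Optional[dict],
        font_size: float,
    ) -> None:
        stripped = text.strip()
        # Assumption: font_size is usable as-is for a simple threshold test.
        if stripped and font_size >= min_font_size:
            headings.append(stripped)

    return visit_text


# Possible usage (not executed here; `page` is assumed to be a page object):
#   headings: List[str] = []
#   page.extract_text(visitor_text=make_heading_collector(headings))
```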

@srogmann
Contributor Author

@pubpub-zz
One could add helper classes like PositionedText to support parsing of formatted texts. I used tests/test_page.py as some kind of incubator ;-).

assert textDat.get_base_font() == "/Arial,Bold"
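
A helper class of this kind might look roughly as follows. This is a sketch based only on the constructor call and the get_base_font() assertion shown in this conversation; the actual class in tests/test_page.py may differ.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass
class PositionedText:
    """A piece of text with its position and the font it was set in."""

    text: str
    x: float
    y: float
    font_dict: Optional[Mapping[str, Any]] = None
    font_size: Optional[float] = None

    def get_base_font(self) -> Optional[str]:
        """Return the /BaseFont entry of the font dictionary, if present."""
        if self.font_dict is None:
            return None
        value = self.font_dict.get("/BaseFont")
        return None if value is None else str(value)
```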

@MartinThoma
Member

@srogmann What is the state of this PR? Do you need help to resolve the merge conflicts / the failing test?

Besides those, is the PR ready in your opinion?

@srogmann
Contributor Author

@MartinThoma In my opinion the PR was ready 22 days ago. I will have a look at the current state.

@MartinThoma
Member

I'm sorry for the delay; I thought there still was something to be done 🙈

@srogmann
Contributor Author

@MartinThoma I merged and resolved conflicts. The PR should work. You may have a look at it.

Each change of the output-result in _extract_text requires a visitor-call in _extract_text:

                    if visitor_text is not None:
                        visitor_text(text, cm_matrix, tm_matrix, cmap[3], font_size)

There are quite a lot of variables describing the current state, cmap[3] contains the font-dictionary. Perhaps in future there will be one object describing the state.
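
To illustrate the idea of a single state object, here is a purely hypothetical sketch that bundles the five values currently passed to visitor_text. Neither `TextState` nor `as_state_visitor` exists in PyPDF2; the adapter only shows how such an object could be layered on top of the existing callback signature.

```python
from dataclasses import dataclass
from typing import Any, Callable, Mapping, Optional, Sequence, Tuple


@dataclass(frozen=True)
class TextState:
    """Hypothetical bundle of the values passed to a text visitor."""

    text: str
    cm_matrix: Sequence[float]
    tm_matrix: Sequence[float]
    font_dict: Optional[Mapping[str, Any]]
    font_size: float

    @property
    def position(self) -> Tuple[float, float]:
        # Elements 4 and 5 of the text matrix hold the x and y translation.
        return (self.tm_matrix[4], self.tm_matrix[5])


def as_state_visitor(handler: Callable[[TextState], None]) -> Callable[..., None]:
    """Wrap a handler taking one TextState as a five-argument text visitor."""

    def visit_text(text, cm_matrix, tm_matrix, font_dict, font_size) -> None:
        handler(TextState(text, cm_matrix, tm_matrix, font_dict, font_size))

    return visit_text
```

With one object, new state fields could be added later without changing the signature of every visitor.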

In tests/test_page.py there are some functions which might be of interest in PyPDF2 or pdfly in future to support using a visitor.
For example one might try to create a svg file:

    def exportSvgFile(listTexts, listRects, fileName):
        import svgwrite

        dwg = svgwrite.Drawing(fileName, profile="tiny")
        color = svgwrite.rgb(255, 0, 0, "%")
        for r in listRects:
            dwg.add(dwg.rect((r.x, r.y), (r.w, r.h), stroke=color, fill_opacity=0.05))
        for t in listTexts:
            dwg.add(dwg.text(t.text, insert=(t.x, t.y), fill="blue"))
        dwg.save()

@srogmann
Contributor Author

@MartinThoma I added DictionaryObject in cmaps (my last commit-comment is wrong).

@codecov

codecov bot commented Sep 24, 2022

Codecov Report

Base: 94.53% // Head: 94.10% // Decreases project coverage by -0.43% ⚠️

Coverage data is based on head (1969c9f) compared to base (2845c6d).
Patch coverage: 35.13% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1252      +/-   ##
==========================================
- Coverage   94.53%   94.10%   -0.44%     
==========================================
  Files          28       28              
  Lines        5035     5068      +33     
  Branches     1035     1051      +16     
==========================================
+ Hits         4760     4769       +9     
- Misses        165      177      +12     
- Partials      110      122      +12     
Impacted Files Coverage Δ
PyPDF2/_cmap.py 95.08% <ø> (ø)
PyPDF2/_page.py 91.67% <35.13%> (-3.46%) ⬇️


@MartinThoma
Member

@srogmann Thank you for your contribution! If you want, I can add you to https://pypdf2.readthedocs.io/en/latest/meta/CONTRIBUTORS.html :-)

@py-pdf py-pdf deleted a comment from srogmann Sep 25, 2022
@MartinThoma
Member

There are quite a lot of variables describing the current state, cmap[3] contains the font-dictionary. Perhaps in future there will be one object describing the state.

Yes, I think that would make sense. Also getting rid of the use of global / non-local variables and passing data explicitly around might help.

Member

@MartinThoma MartinThoma left a comment

Looks good to me!

@MartinThoma
Member

@pubpub-zz Did you have a look? What do you think about the changes?

If you're good with them as well, I would merge + release :-)

@MartinThoma MartinThoma added is-feature A feature request workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow labels Sep 25, 2022
@pubpub-zz
Collaborator

This sounds good. I had some requests earlier that have been fulfilled.
I agree that it is time to release it for user feedback.

@MartinThoma MartinThoma merged commit ebb3b83 into py-pdf:main Sep 25, 2022
@MartinThoma
Member

@srogmann Very nice work 🥳

I think this has a lot of potential, but it's also pretty hard to use. We will need to add documentation and typical examples. I'm super excited to see how people will use it 🎉

@srogmann
Contributor Author

@MartinThoma Thanks for merging!

An addition to CONTRIBUTORS.html would be fine.

I think this has a lot of potential, but it's also pretty hard to use. We will need to add documentation and typical examples.

In tests/test_page.py there are two utility classes, PositionedText and Rectangle. After renaming, they might be useful when one wants to write one's own visitor. Documentation and typical examples would be nice. Since docs/user/ is contained in the repository too, I can think about another pull request containing some documentation and examples (e.g. the utility classes mentioned and a sample to extract tables or to ignore page headers).
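
A sample for ignoring page headers could be sketched as a text visitor that keeps only text whose y coordinate lies within a body band. `make_body_collector` and the limits 50 and 720 are hypothetical and would need to be tuned per document; the y coordinate is read from tm_matrix[5], as in print_visi above.

```python
from typing import Any, Callable, List


def make_body_collector(
    parts: List[str], y_min: float = 50.0, y_max: float = 720.0
) -> Callable[..., None]:
    """Return a text visitor that keeps text positioned between y_min and y_max."""

    def visit_text(
        text: str, cm_matrix: Any, tm_matrix: Any, font_dict: Any, font_size: Any
    ) -> None:
        y = tm_matrix[5]
        # Text above y_max (header) or below y_min (footer) is skipped.
        if y_min < y < y_max:
            parts.append(text)

    return visit_text


# Possible usage (not executed here; `page` is assumed to be a page object):
#   parts: List[str] = []
#   page.extract_text(visitor_text=make_body_collector(parts))
#   body = "".join(parts)
```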

MartinThoma added a commit that referenced this pull request Sep 25, 2022
New Features (ENH):
-  Addition of optional visitor-functions in extract_text() (#1252)
-  Add metadata.creation_date and modification_date (#1364)
-  Add PageObject.images attribute (#1330)

Bug Fixes (BUG):
-  Lookup index in _xobj_to_image can be ByteStringObject (#1366)
-  'IndexError: index out of range' when using extract_text (#1361)
-  Errors in transfer_rotation_to_content() (#1356)

Robustness (ROB):
-  Ensure update_page_form_field_values does not fail if no fields (#1346)

Testing (TST):
-  read_string_from_stream performance (#1355)

Full Changelog: 2.10.9...2.11.0
@srogmann srogmann mentioned this pull request Sep 27, 2022