ENH: Improve space setting for text extraction #922

MartinThoma · 2022-05-29T10:46:12Z

Full credit to pubpub-zz who introduced this change in #881

Co-authored-by: pubpub-zz <[email protected]>

MartinThoma · 2022-05-29T10:50:23Z

Overall, this is a big improvement. However, for the following files it became worse


codecov · 2022-05-29T10:54:52Z

Codecov Report

Merging #922 (449442e) into main (c59224a) will decrease coverage by 0.28%.
The diff coverage is 30.00%.

@@            Coverage Diff             @@
##             main     #922      +/-   ##
==========================================
- Coverage   77.82%   77.54%   -0.29%     
==========================================
  Files          16       16              
  Lines        4329     4346      +17     
  Branches      813      821       +8     
==========================================
+ Hits         3369     3370       +1     
- Misses        788      796       +8     
- Partials      172      180       +8

| Impacted Files | Coverage Δ | |
|---|---|---|
| PyPDF2/_page.py | `76.49% <30.00%> (-2.88%)` | ⬇️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
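The percentages in the report above are consistent with coverage being hits divided by total lines. A small sketch to check that arithmetic, assuming this formula; the helper name `coverage_percent` is hypothetical and not part of Codecov:

```python
def coverage_percent(hits: int, lines: int) -> float:
    """Return line coverage as a percentage (assumed formula: hits / lines)."""
    return 100.0 * hits / lines


# Numbers taken from the "Hits" and "Lines" rows of the coverage diff.
base = coverage_percent(3369, 4329)  # main (c59224a)
head = coverage_percent(3370, 4346)  # #922 (449442e)

# The PR adds 17 lines but only 1 additional hit, so the ratio drops.
delta = head - base

print(f"main: {base:.2f}%  #922: {head:.2f}%  change: {delta:+.2f} points")
```

Rounded to two decimals, the two values match the 77.82% and 77.54% shown in the report.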

The 2.0.0 release of PyPDF2 includes three core changes:

1. Dropping support for Python 3.5 and older.
2. Introducing type annotations.
3. Interface changes, mostly to have PEP8-compliant names.

We introduced a [deprecation process](#930) that hopefully helps users to avoid unexpected breaking changes.

Breaking Changes (DEP):

- PyPDF2 2.0 requires Python 3.6+. Support for Python 2.7 and 3.5 was dropped.
- PdfFileReader: The "warndest" parameter was removed.
- PdfFileReader and PdfFileMerger no longer have the `overwriteWarnings` parameter. The new behavior is `overwriteWarnings=False`.
- merger: OutlinesObject was removed without replacement.
- merger.py ➔ _merger.py: You must import PdfFileMerger from PyPDF2 directly.
- utils:
  * `ConvertFunctionsToVirtualList` was removed
  * `formatWarning` was removed
  * `isInt(obj)`: Use `isinstance(obj, int)` instead
  * `u_(s)`: Use `s` directly
  * `chr_(c)`: Use `chr(c)` instead
  * `barray(b)`: Use `bytearray(b)` instead
  * `isBytes(b)`: Use `isinstance(b, type(bytes()))` instead
  * `xrange_fn`: Use `range` instead
  * `string_type`: Use `str` instead
  * `isString(s)`: Use `isinstance(s, str)` instead
  * `_basestring`: Use `str` instead
  * All Exceptions are now in `PyPDF2.errors`:
    - PageSizeNotDefinedError
    - PdfReadError
    - PdfReadWarning
    - PyPdfError
- `PyPDF2.pdf` (the `pdf` module) no longer exists. The contents were moved within the library. You should most likely import directly from `PyPDF2` instead. The `RectangleObject` is in `PyPDF2.generic`.
- The `Resources`, `Scripts`, and `Tests` will no longer be part of the distribution files on PyPI. This should have little to no impact on most people. The `Tests` are renamed to `tests`, the `Resources` are renamed to `resources`. Both are still in the git repository. The `Scripts` are now in https://github.com/py-pdf/cpdf. `Sample_Code` was moved to the `docs`.

For a full list of deprecated functions, please see the changelog of version 1.28.0.

New Features (ENH):

- Improve space setting for text extraction (#922)
- Allow setting the decryption password in PdfReader.__init__ (#920)
- Add Page.add_transformation (#883)

Bug Fixes (BUG):

- Fix error adding transformation to page without /Contents (#908)

Robustness (ROB):

- Cope with invalid length in streams (#861)

Documentation (DOC):

- Fix style of 1.25 and 1.27 patch notes (#927)
- Transformation (#907)

Developer Experience (DEV):

- Create flake8 config file (#916)
- Use relative imports (#875)

Maintenance (MAINT):

- Use Python 3.6 language features (#849)
- Add wrapper function for PendingDeprecationWarnings (#928)
- Use new PEP8 compliant names (#884)
- Explicitly represent transformation matrix (#878)
- Inline PAGE_RANGE_HELP string (#874)
- Remove unnecessary generics imports (#873)
- Remove star imports (#865)
- merger.py ➔ _merger.py (#864)
- Type annotations for all functions/methods (#854)
- Add initial type support with mypy (#853)

Testing (TST):

- Regression test for xmp_metadata converter (#923)
- Checkout submodule sample-files for benchmark
- Add text extracting performance benchmark
- Use new PyPDF2 API in benchmark (#902)
- Make test suite fail for uncaught warnings (#892)
- Remove -OO testrun from CI (#901)
- Improve tests for convert_to_int (#899)

Full Changelog: 1.28.4...2.0.0
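The `utils` helpers removed in the breaking changes above all map onto Python 3 built-ins. A minimal migration sketch, assuming Python 3.6+; the variable names and sample values are made up for illustration, and the removed helper names appear only in comments:

```python
# Sample values (hypothetical, for illustration only).
value = 42
raw = b"%PDF-1.7"
text = "PyPDF2"

# Type checks: the removed helpers become plain isinstance() calls.
is_int = isinstance(value, int)      # was: isInt(value)
is_bytes = isinstance(raw, bytes)    # was: isBytes(raw)
is_string = isinstance(text, str)    # was: isString(text)

# Conversions: use the built-ins directly.
header = bytearray(raw)              # was: barray(raw)
letter = chr(80)                     # was: chr_(80)
indices = list(range(3))             # was: xrange_fn(3)

# u_(s) simply becomes s; string_type and _basestring become str.
same_text = text                     # was: u_(text)
string_type = str                    # was: utils.string_type

print(is_int, is_bytes, is_string, header, letter, indices)
```

One detail to keep in mind when migrating: `isinstance(True, int)` is also true in Python, so code that must exclude booleans needs an extra check.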

MartinThoma mentioned this pull request May 29, 2022

Improve Text Extraction #881

Closed

ENH: Improve space setting for text extraction

dee0106

Full credit to pubpub-zz who introduced this change in #881 Co-authored-by: pubpub-zz <[email protected]>

MartinThoma force-pushed the space-improvements branch from fc10e6e to dee0106 May 29, 2022 10:52

Add references

449442e

MartinThoma merged commit c008b0f into main May 29, 2022

MartinThoma deleted the space-improvements branch May 29, 2022 12:14
