ENH: Improve space setting for text extraction #922

Merged
merged 2 commits into from
May 29, 2022

MartinThoma
Member

Full credit to pubpub-zz who introduced this change in
#881

Co-authored-by: pubpub-zz [email protected]

@MartinThoma
Member Author

Overall, this is a big improvement. However, for some files it became worse.

@MartinThoma MartinThoma force-pushed the space-improvements branch from fc10e6e to dee0106 Compare May 29, 2022 10:52
@codecov

codecov bot commented May 29, 2022

Codecov Report

Merging #922 (449442e) into main (c59224a) will decrease coverage by 0.28%.
The diff coverage is 30.00%.

@@            Coverage Diff             @@
##             main     #922      +/-   ##
==========================================
- Coverage   77.82%   77.54%   -0.29%     
==========================================
  Files          16       16              
  Lines        4329     4346      +17     
  Branches      813      821       +8     
==========================================
+ Hits         3369     3370       +1     
- Misses        788      796       +8     
- Partials      172      180       +8     
| Impacted Files | Coverage Δ |
|---|---|
| PyPDF2/_page.py | 76.49% <30.00%> (-2.88%) ⬇️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@MartinThoma MartinThoma merged commit c008b0f into main May 29, 2022
@MartinThoma MartinThoma deleted the space-improvements branch May 29, 2022 12:14
MartinThoma added a commit that referenced this pull request Jun 1, 2022
The 2.0.0 release of PyPDF2 includes three core changes:

1. Dropping support for Python 3.5 and older.
2. Introducing type annotations.
3. Interface changes, mostly to have PEP8-compliant names.

We introduced a [deprecation process](#930)
that hopefully helps users to avoid unexpected breaking changes.

Breaking Changes (DEP):
- PyPDF2 2.0 requires Python 3.6+. Python 2.7 and 3.5 support were dropped.
- PdfFileReader: The "warndest" parameter was removed
- PdfFileReader and PdfFileMerger no longer have the `overwriteWarnings`
  parameter. The new behavior is `overwriteWarnings=False`.
- merger: OutlinesObject was removed without replacement.
- merger.py ➔ _merger.py: You must import PdfFileMerger from PyPDF2 directly.
- utils:
  * `ConvertFunctionsToVirtualList` was removed
  * `formatWarning` was removed
  * `isInt(obj)`: Use `isinstance(obj, int)` instead
  * `u_(s)`: Use `s` directly
  * `chr_(c)`: Use `chr(c)` instead
  * `barray(b)`: Use `bytearray(b)` instead
  * `isBytes(b)`: Use `isinstance(b, bytes)` instead
  * `xrange_fn`: Use `range` instead
  * `string_type`: Use `str` instead
  * `isString(s)`: Use `isinstance(s, str)` instead
  * `_basestring`: Use `str` instead
  * All Exceptions are now in `PyPDF2.errors`:
    - PageSizeNotDefinedError
    - PdfReadError
    - PdfReadWarning
    - PyPdfError
- `PyPDF2.pdf` (the `pdf` module) no longer exists. The contents were moved within
  the library. You should most likely import directly from `PyPDF2` instead.
  The `RectangleObject` is in `PyPDF2.generic`.
- The `Resources`, `Scripts`, and `Tests` will no longer be part of the distribution
  files on PyPI. This should have little to no impact on most people. The
  `Tests` are renamed to `tests`, the `Resources` are renamed to `resources`.
  Both are still in the git repository. The `Scripts` are now in
  https://github.com/py-pdf/cpdf. `Sample_Code` was moved to the `docs`.
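
The `utils` replacements listed above all map onto Python 3 built-ins. A
minimal sketch of what call sites might look like after migrating; the
variable names and sample values are hypothetical and not part of PyPDF2:

```python
# Hypothetical sample values, chosen only for illustration.
number = 7
raw = b"%PDF-1.7"
text = "page"

# isInt(obj)  -> isinstance(obj, int)
number_is_int = isinstance(number, int)

# isBytes(b)  -> isinstance(b, bytes)
raw_is_bytes = isinstance(raw, bytes)

# isString(s) -> isinstance(s, str)
text_is_str = isinstance(text, str)

# u_(s)       -> use the string directly
same_text = text

# chr_(c)     -> chr(c)
letter = chr(80)

# barray(b)   -> bytearray(b)
buffer = bytearray(raw)

# xrange_fn   -> range
indices = list(range(3))
```

Since only built-ins are involved, code migrated this way no longer needs
to import anything from `PyPDF2.utils` for these checks.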

For a full list of deprecated functions, please see the changelog of version
1.28.0.

New Features (ENH):
-  Improve space setting for text extraction (#922)
-  Allow setting the decryption password in PdfReader.__init__ (#920)
-  Add Page.add_transformation (#883)

Bug Fixes (BUG):
-  Fix error adding transformation to page without /Contents (#908)

Robustness (ROB):
-  Cope with invalid length in streams (#861)

Documentation (DOC):
-  Fix style of 1.25 and 1.27 patch notes (#927)
-  Transformation (#907)

Developer Experience (DEV):
-  Create flake8 config file (#916)
-  Use relative imports (#875)

Maintenance (MAINT):
-  Use Python 3.6 language features (#849)
-  Add wrapper function for PendingDeprecationWarnings (#928)
-  Use new PEP8 compliant names (#884)
-  Explicitly represent transformation matrix (#878)
-  Inline PAGE_RANGE_HELP string (#874)
-  Remove unnecessary generics imports (#873)
-  Remove star imports (#865)
-  merger.py ➔ _merger.py (#864)
-  Type annotations for all functions/methods (#854)
-  Add initial type support with mypy (#853)

Testing (TST):
-  Regression test for xmp_metadata converter (#923)
-  Checkout submodule sample-files for benchmark
-  Add text extracting performance benchmark
-  Use new PyPDF2 API in benchmark (#902)
-  Make test suite fail for uncaught warnings (#892)
-  Remove -OO testrun from CI (#901)
-  Improve tests for convert_to_int (#899)

Full Changelog: 1.28.4...2.0.0