Comparing changes

base repository: py-pdf/pypdf
base: 5.1.0
head repository: py-pdf/pypdf
compare: 5.2.0
Showing 66 changed files with 1,469 additions and 888 deletions.
  1. +19 −6 .github/workflows/github-ci.yaml
  2. +20 −20 .github/workflows/title-check.yml
  3. +6 −25 .pre-commit-config.yaml
  4. +56 −0 CHANGELOG.md
  5. +1 −0 CONTRIBUTORS.md
  6. +7 −1 docs/dev/pdf-format.md
  7. +2 −2 docs/meta/project-governance.md
  8. +0 −5 docs/modules/PdfDocCommon.md
  9. +12 −0 docs/modules/PdfDocCommon.rst
  10. +15 −0 docs/user/adding-pdf-annotations.md
  11. +4 −0 docs/user/file-size.md
  12. +11 −12 docs/user/reading-pdf-annotations.md
  13. +1 −2 docs/user/streaming-data.md
  14. +2 −2 make_release.py
  15. +9 −9 pypdf/__init__.py
  16. +15 −7 pypdf/_cmap.py
  17. +5 −5 pypdf/_codecs/__init__.py
  18. +7 −7 pypdf/_crypt_providers/__init__.py
  19. +1 −1 pypdf/_crypt_providers/_fallback.py
  20. +86 −36 pypdf/_doc_common.py
  21. +12 −11 pypdf/_encryption.py
  22. +24 −26 pypdf/_page.py
  23. +51 −51 pypdf/_reader.py
  24. +1 −1 pypdf/_text_extraction/_layout_mode/__init__.py
  25. +5 −2 pypdf/_text_extraction/_layout_mode/_fixed_width_page.py
  26. +17 −21 pypdf/_utils.py
  27. +1 −1 pypdf/_version.py
  28. +165 −154 pypdf/_writer.py
  29. +15 −20 pypdf/_xobj_image_helpers.py
  30. +4 −7 pypdf/annotations/__init__.py
  31. +4 −4 pypdf/annotations/_base.py
  32. +3 −3 pypdf/annotations/_markup_annotations.py
  33. +11 −24 pypdf/constants.py
  34. +2 −2 pypdf/errors.py
  35. +73 −53 pypdf/filters.py
  36. +20 −26 pypdf/generic/__init__.py
  37. +22 −12 pypdf/generic/_base.py
  38. +20 −4 pypdf/generic/_data_structures.py
  39. +2 −1 pypdf/generic/_image_inline.py
  40. +2 −2 pypdf/generic/_outline.py
  41. +31 −31 pypdf/generic/_viewerpref.py
  42. +87 −73 pyproject.toml
  43. +1 −1 requirements/ci-3.11.txt
  44. +1 −2 requirements/dev.in
  45. +39 −54 requirements/dev.txt
  46. +1 −1 requirements/docs.txt
  47. BIN resources/bytes.pdf
  48. +5 −5 tests/__init__.py
  49. +12 −18 tests/example_files.yaml
  50. +21 −0 tests/scripts/test_make_release.py
  51. +3 −2 tests/test_annotations.py
  52. +35 −1 tests/test_cmap.py
  53. +2 −1 tests/test_constants.py
  54. +71 −0 tests/test_doc_common.py
  55. +56 −6 tests/test_filters.py
  56. +33 −9 tests/test_generic.py
  57. +11 −0 tests/test_images.py
  58. +12 −12 tests/test_merger.py
  59. +12 −14 tests/test_page.py
  60. +26 −15 tests/test_reader.py
  61. +73 −2 tests/test_text_extraction.py
  62. +8 −6 tests/test_utils.py
  63. +49 −50 tests/test_workflows.py
  64. +141 −15 tests/test_writer.py
  65. +5 −4 tests/test_xmp.py
  66. +1 −1 tests/test_xobject_image_helpers.py
25 changes: 19 additions & 6 deletions .github/workflows/github-ci.yaml
Original file line number Diff line number Diff line change
@@ -30,7 +30,7 @@ jobs:
- name: Setup Python (3.11+)
uses: actions/setup-python@v5
with:
python-version: 3.12 # latest stable python
python-version: 3.13 # latest stable python
allow-prereleases: true
- name: Upgrade pip
run: |
@@ -54,7 +54,7 @@ jobs:
tests:
name: "pytest on ${{ matrix.python-version }} (crypto-lib: ${{ matrix.use-crypto-lib }})"
runs-on: ubuntu-20.04
runs-on: ubuntu-22.04
strategy:
matrix:
python-version: ["3.8", "3.9", "3.10", "3.11", "3.12", "3.13"]
@@ -70,7 +70,7 @@ jobs:
sudo apt-get update
- name: Install APT dependencies
run:
sudo apt-get install ghostscript
sudo apt-get install ghostscript poppler-utils
- name: Checkout Code
uses: actions/checkout@v4
with:
@@ -121,12 +121,25 @@ jobs:
- name: Install pypdf
run: |
pip install .
- name: Prepare
- name: Download test files
run: |
python -c "from tests import download_test_pdfs; download_test_pdfs()"
- name: Setup sitecustomize.py for coverage
run: |
SITE_PACKAGES="$(python -m site --user-site)"
SITECUSTOMIZE_PATH="$SITE_PACKAGES/sitecustomize.py"
mkdir -p $SITE_PACKAGES
touch $SITECUSTOMIZE_PATH
echo "try:" >> $SITECUSTOMIZE_PATH
echo " import coverage" >> $SITECUSTOMIZE_PATH
echo " coverage.process_startup()" >> $SITECUSTOMIZE_PATH
echo "except ImportError:" >> $SITECUSTOMIZE_PATH
echo " pass" >> $SITECUSTOMIZE_PATH
- name: Test with pytest
run: |
python -m pytest tests --cov=pypdf --cov-append -n auto -vv
env:
COVERAGE_PROCESS_START: 'pyproject.toml'
- name: Rename coverage data file
run: mv .coverage ".coverage.$RANDOM"
- name: Upload coverage data
@@ -139,7 +152,7 @@ jobs:

codestyle:
name: Check code style issues
runs-on: ubuntu-20.04
runs-on: ubuntu-22.04
steps:
- name: Checkout Code
uses: actions/checkout@v4
@@ -228,7 +241,7 @@ jobs:
python -m coverage combine
python -m coverage xml
- name: Upload Coverage to Codecov
uses: codecov/codecov-action@v4
uses: codecov/codecov-action@v5
with:
token: ${{ secrets.CODECOV_TOKEN }}
files: ./coverage.xml
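
The "Setup sitecustomize.py for coverage" step in the workflow above builds a small `sitecustomize.py` line by line with `echo`, so that subprocesses started during the tests can also report coverage once `COVERAGE_PROCESS_START` is set. As a rough sketch of what that step produces, the same file can be written from Python. The helper name `write_sitecustomize` and the temporary directory are hypothetical and only serve the illustration; the indentation of the generated file is normalized to four spaces here.

```python
import tempfile
from pathlib import Path

# Content that the workflow step appends to sitecustomize.py.
# coverage.process_startup() only starts measurement when the
# COVERAGE_PROCESS_START environment variable points at a config file.
SITECUSTOMIZE_BODY = """\
try:
    import coverage
    coverage.process_startup()
except ImportError:
    pass
"""


def write_sitecustomize(site_packages: Path) -> Path:
    """Append the coverage hook to sitecustomize.py (hypothetical helper)."""
    site_packages.mkdir(parents=True, exist_ok=True)  # mirrors `mkdir -p`
    target = site_packages / "sitecustomize.py"
    with target.open("a", encoding="utf-8") as handle:  # mirrors `>>`
        handle.write(SITECUSTOMIZE_BODY)
    return target


# Demonstration in a throwaway directory instead of the real user site.
with tempfile.TemporaryDirectory() as tmp:
    path = write_sitecustomize(Path(tmp) / "site-packages")
    # Check that the generated file is syntactically valid Python.
    compile(path.read_text(encoding="utf-8"), str(path), "exec")
    print(path.name)
```

The `try`/`except ImportError` guard keeps the interpreter startup working in environments where `coverage` is not installed.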
40 changes: 20 additions & 20 deletions .github/workflows/title-check.yml
@@ -1,20 +1,20 @@
name: 'PR Title Check'
on:
pull_request:
# check when PR
# * is created,
# * title is edited, and
# * new commits are added (to ensure failing title blocks merging)
types: [opened, reopened, edited, synchronize]

jobs:
title-check:
name: Title check
runs-on: ubuntu-latest
steps:
- name: Checkout Code
uses: actions/checkout@v4
- name: Check PR title
env:
PR_TITLE: ${{ github.event.pull_request.title }}
run: python .github/scripts/check_pr_title.py
name: 'PR Title Check'
on:
pull_request:
# check when PR
# * is created,
# * title is edited, and
# * new commits are added (to ensure failing title blocks merging)
types: [opened, reopened, edited, synchronize]

jobs:
title-check:
name: Title check
runs-on: ubuntu-latest
steps:
- name: Checkout Code
uses: actions/checkout@v4
- name: Check PR title
env:
PR_TITLE: ${{ github.event.pull_request.title }}
run: python .github/scripts/check_pr_title.py
31 changes: 6 additions & 25 deletions .pre-commit-config.yaml
@@ -1,7 +1,7 @@
# pre-commit run --all-files
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.1.0
rev: v5.0.0
hooks:
- id: check-ast
- id: check-byte-order-marker
@@ -18,40 +18,21 @@ repos:
- id: check-added-large-files
args: ['--maxkb=1000']

- repo: https://github.com/psf/black
rev: 23.3.0
hooks:
- id: black
args: [--target-version, py37]

- repo: https://github.com/asottile/blacken-docs
rev: 1.14.0
hooks:
- id: blacken-docs
additional_dependencies: [black==22.1.0]
exclude: "docs/user/robustness.md"

- repo: https://github.com/charliermarsh/ruff-pre-commit
rev: v0.1.9
rev: v0.7.0
hooks:
- id: ruff
args: ['--fix']

- repo: https://github.com/asottile/pyupgrade
rev: v3.3.2
rev: v3.19.0
hooks:
- id: pyupgrade
args: [--py37-plus]

- repo: https://github.com/pycqa/flake8
rev: 5.0.4
hooks:
- id: flake8
args: ["--ignore", "E,W,F"]
args: [--py38-plus]

- repo: https://github.com/pre-commit/mirrors-mypy
rev: 'v1.4.0'
rev: 'v1.13.0'
hooks:
- id: mypy
additional_dependencies: [types-Pillow==10.0.0.2]
additional_dependencies: [types-Pillow==10.2.0.20240822]
files: ^pypdf/.*
56 changes: 56 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,61 @@
# CHANGELOG

## Version 5.2.0, 2025-01-26

### Deprecations (DEP)
- Deprecate with replacement CCITParameters (#3019)
- Correct deprecation of interiour_color (#2947)

### New Features (ENH)
- Support alternative (U)F names for embedded file retrieval (#3072)
- Adding support for reading .metadata.keywords (#2939)

### Bug Fixes (BUG)
- Handle further Tf operators in text extraction layout mode (#3073)
- Ensure `add_metadata` can deal with `_info = None` (#3040)
- Handle IndirectObject in CCITTFaxDecode filter (#2965)
- Handle chained colorspace for inline images when no filter is set (#3008)
- Avoid extracting inline images twice and dropping other operators (#3002)
- Fixed reference of value with `str.__new__` in TextStringObject (#2952)
- Handle indirect objects in font width calculations (#2967)
- Title sometimes is bytes and not str (#2930)
- Fix undefined variable for text extraction (regression) (#2934)
- Don't close stream passed to PdfWriter.write() (#2909)

### Robustness (ROB)
- Handle zero height fonts when extracting text (#3075)
- Deal with content streams not containing streams (#3005)
- Gracefully handle some text operators when the operands are missing (#3006)
- Fall back to non-Adobe Ascii85 format for missing end markers (#3007)
- Ignore odd-length strings when processing cmap lines (#3009)
- Skip annotation destination being NullObject in PdfWriter (#2964)
- Skip destination page being None in PdfWriter (#2963)
- Fix infinite loop case when reading null objects within an Array
- Fixing infinite loop in ArrayObject read_from_stream (#2928)

### Documentation (DOC)
- Add note about default line colors (#3014)

### Developer Experience (DEV)
- Remove ignoring Ruff rule PGH004 (#3071)
- Tidy ignore array in tool.ruff.lint (#3069)
- Move Windows CI to Python 3.13 (#3003)
- Move to Ubuntu 22.04 (#3004)

### Maintenance (MAINT)
- Fix formatting of warning message and include exception message (#3076)
- Narrow return type for `ContentStream.operations` (#2941)

### Testing (TST)
- Fix image similarity for upcoming Ubuntu 24.04 (#3039)
- Replace broken Apache Tika Corpora urls (#3041)

### Code Style (STY)
- Add form feed to WHITESPACES (#3054)
- Lots of small internal changes

[Full Changelog](https://github.com/py-pdf/pypdf/compare/5.1.0...5.2.0)

## Version 5.1.0, 2024-10-27

### New Features (ENH)
1 change: 1 addition & 0 deletions CONTRIBUTORS.md
@@ -36,6 +36,7 @@ history and [GitHub's 'Contributors' feature](https://github.com/py-pdf/pypdf/gr
* [maxbeer99](https://github.com/maxbeer99)
* [McNeil, Karen](https://github.com/karenlmcneil): Arabic Language Support
* [Mérino, Antoine](https://github.com/Merinorus)
* [Murphy, Kevin](https://github.com/kmurphy4)
* [nalin-udhaar](https://github.com/nalin-udhaar)
* [Paramonov, Alexey](https://github.com/alexey-v-paramonov)
* [Paternault, Louis](https://framagit.org/spalax)
8 changes: 7 additions & 1 deletion docs/dev/pdf-format.md
@@ -1,7 +1,13 @@
# The PDF Format

It is recommended to look in the PDF specification for details and clarifications.
This is only intended to give a very rough overview of the format.

* [PDF Specification Archive](https://pdfa.org/resource/pdf-specification-archive/)
* [Portable Document Format Reference Manual, 1993. ISBN 0-201-62628-4](https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.0.pdf)
* [ISO 32000-1:2008 (PDF 1.7)](https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf)
* ISO 32000-2:2020 (PDF 2.0)

Below is only intended to give a very rough overview of the format.

## Overall Structure

4 changes: 2 additions & 2 deletions docs/meta/project-governance.md
@@ -19,7 +19,7 @@ capable of splitting, merging, cropping, and transforming the pages of PDF files
request, but that is up to the maintainer. Other contributors describe issues,
help to ask questions on existing issues to make them easier to answer,
participate in discussions, and help to improve the documentation. Contributors
are similar to maintainers, but without technial permissions.
are similar to maintainers, but without technical permissions.
* A **user** is a person who imports pypdf into their code. All pypdf users
are developers, but not developers who know the internals of pypdf. They only
use the public interface of pypdf. They will likely have less knowledge about
@@ -111,7 +111,7 @@ An issue is any technical description that aims at bringing pypdf forward:
* Performance tickets: pypdf could be faster - let us know about your specific
scenario.

Any comment that is in those technial descriptions which is not helping the
Any comment that is in those technical descriptions which is not helping the
discussion can be deleted. This is especially true for "me too" comments on bugs
or "bump" comments for desired features. People can express this with 👍 / 👎
reactions.
5 changes: 0 additions & 5 deletions docs/modules/PdfDocCommon.md

This file was deleted.

12 changes: 12 additions & 0 deletions docs/modules/PdfDocCommon.rst
@@ -0,0 +1,12 @@
The PdfDocCommon Class
----------------------

**PdfDocCommon** is an abstract class which is inherited by :class:`~pypdf.PdfReader` and :class:`~pypdf.PdfWriter`.

Where identified in the API, you can use any of the derived classes.

.. autoclass:: pypdf._doc_common.PdfDocCommon
:members:
:inherited-members:
:undoc-members:
:show-inheritance:
15 changes: 15 additions & 0 deletions docs/user/adding-pdf-annotations.md
@@ -1,5 +1,15 @@
# Adding PDF Annotations

```{note}
By default, some annotations might be invisible, for example polylines, as the default color is "transparent".
To circumvent this, make sure to add the `/C` entry to the annotation, being an array and each array value being in the range 0.0 to 1.0:
* With one element, a grayscale value.
* With three elements, a RGB definition.
* With four elements, a CMYK definition.
```
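
As a small illustration of the rule in the note above, the following sketch maps the number of `/C` components to the color model and checks the 0.0 to 1.0 range. It is plain Python and does not use pypdf; the helper name `describe_annotation_color` is hypothetical, and it only covers the three cases the note lists.

```python
from typing import Sequence

# Number of /C array elements -> color model, as listed in the note.
_COLOR_MODELS = {1: "grayscale", 3: "RGB", 4: "CMYK"}


def describe_annotation_color(components: Sequence[float]) -> str:
    """Return the color model implied by a /C array (hypothetical helper)."""
    if len(components) not in _COLOR_MODELS:
        raise ValueError(f"expected 1, 3 or 4 components, got {len(components)}")
    for value in components:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"component {value!r} is outside 0.0 to 1.0")
    return _COLOR_MODELS[len(components)]


print(describe_annotation_color([0.5]))                 # grayscale
print(describe_annotation_color([0.9, 0.1, 0.0]))       # RGB
print(describe_annotation_color([0.0, 0.2, 0.4, 0.1]))  # CMYK
```

A validated list like this would then be wrapped in pypdf's `ArrayObject` and `FloatObject` before being stored under `/C`.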

## Attachments

```python
@@ -108,6 +118,7 @@ you can use {class}`~pypdf.annotations.PolyLine`:
```python
from pypdf import PdfReader, PdfWriter
from pypdf.annotations import PolyLine
from pypdf.generic import ArrayObject, FloatObject, NameObject

pdf_path = os.path.join(RESOURCE_ROOT, "crazyones.pdf")
reader = PdfReader(pdf_path)
@@ -116,9 +127,13 @@ writer = PdfWriter()
writer.add_page(page)

# Add the polyline
# By default, the line will be transparent. Set an explicit color.
annotation = PolyLine(
vertices=[(50, 550), (200, 650), (70, 750), (50, 700)],
)
annotation[NameObject("/C")] = ArrayObject(
[FloatObject(0.9), FloatObject(0.1), FloatObject(0)]
)
writer.add_annotation(page_number=0, annotation=annotation)

# Write the annotated file to disk
4 changes: 4 additions & 0 deletions docs/user/file-size.md
@@ -105,3 +105,7 @@ becomes useless because there is only one source for all pages.
Cropping is an ineffective method for reducing the file size because it only
adjusts the viewboxes and not the external parts of the source image. Therefore,
the content that is no longer visible will still be present in the PDF.

## Going Further

The presentation [Putting a Squeeze on Your PDF](https://youtube.com/watch?v=tgOABUhVwFs) has other suggestions. One takeaway is that most of the significant size optimizations usually come from image and font modification. However, font optimization, such as replacing, merging, and subsetting, is not within the functionality of pypdf at the moment.
23 changes: 11 additions & 12 deletions docs/user/reading-pdf-annotations.md
@@ -40,10 +40,9 @@ reader = PdfReader("annotated.pdf")

for page in reader.pages:
if "/Annots" in page:
for annot in page["/Annots"]:
obj = annot.get_object()
annotation = {"subtype": obj["/Subtype"], "location": obj["/Rect"]}
print(annotation)
for annotation in page["/Annots"]:
obj = annotation.get_object()
print({"subtype": obj["/Subtype"], "location": obj["/Rect"]})
```

Examples of reading three of the most common annotations:
@@ -57,10 +56,10 @@ reader = PdfReader("example.pdf")

for page in reader.pages:
if "/Annots" in page:
for annot in page["/Annots"]:
subtype = annot.get_object()["/Subtype"]
for annotation in page["/Annots"]:
subtype = annotation.get_object()["/Subtype"]
if subtype == "/Text":
print(annot.get_object()["/Contents"])
print(annotation.get_object()["/Contents"])
```

## Highlights
@@ -72,10 +71,10 @@ reader = PdfReader("example.pdf")

for page in reader.pages:
if "/Annots" in page:
for annot in page["/Annots"]:
subtype = annot.get_object()["/Subtype"]
for annotation in page["/Annots"]:
subtype = annotation.get_object()["/Subtype"]
if subtype == "/Highlight":
coords = annot.get_object()["/QuadPoints"]
coords = annotation.get_object()["/QuadPoints"]
x1, y1, x2, y2, x3, y3, x4, y4 = coords
```
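
The eight values unpacked in the highlight example are four (x, y) corner points of one quadrilateral. A highlight that spans several lines may carry more than one quadrilateral, in which case `/QuadPoints` holds a multiple of eight numbers and the direct unpacking no longer fits. A minimal sketch, in plain Python without pypdf, of turning such a list into one bounding box per quadrilateral; the helper name `quad_bounding_boxes` and the sample coordinates are hypothetical.

```python
from typing import List, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def quad_bounding_boxes(quad_points: Sequence[float]) -> List[Rect]:
    """Return one bounding box per group of eight QuadPoints values."""
    if len(quad_points) % 8 != 0:
        raise ValueError("QuadPoints length must be a multiple of 8")
    boxes: List[Rect] = []
    for start in range(0, len(quad_points), 8):
        quad = [float(v) for v in quad_points[start:start + 8]]
        xs, ys = quad[0::2], quad[1::2]  # even indices are x, odd are y
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes


# Sample values standing in for annotation.get_object()["/QuadPoints"].
coords = [72, 720, 200, 720, 72, 700, 200, 700]
print(quad_bounding_boxes(coords))  # [(72.0, 700.0, 200.0, 720.0)]
```

Taking the minimum and maximum avoids relying on the order in which a PDF producer lists the four corners.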

@@ -90,8 +89,8 @@ attachments = {}
for page in reader.pages:
if "/Annots" in page:
for annotation in page["/Annots"]:
subtype = annot.get_object()["/Subtype"]
subtype = annotation.get_object()["/Subtype"]
if subtype == "/FileAttachment":
fileobj = annot.get_object()["/FS"]
fileobj = annotation.get_object()["/FS"]
attachments[fileobj["/F"]] = fileobj["/EF"]["/F"].get_data()
```