DOC: Document core mechanics of pypdf #1783

Merged
merged 13 commits into from
Apr 19, 2023
26 changes: 26 additions & 0 deletions docs/dev/pypdf-parsing.md
@@ -0,0 +1,26 @@
# How pypdf parses PDF files

pypdf uses {py:class}`PdfReader <pypdf.PdfReader>` to parse PDF files.
The method {py:meth}`PdfReader.read <pypdf.PdfReader.read>` shows the basic
structure of parsing:

1. **Finding and reading the cross-reference tables / trailer**: The
cross-reference table (xref table) is a table of byte offsets that indicate
the locations of objects within the file. The trailer provides additional
information such as the root object (Catalog) and the Info object containing
metadata.
2. **Parsing the objects**: After locating the xref table and the trailer, pypdf
proceeds to parse the objects in the PDF. Objects in a PDF can be of various
types such as dictionaries, arrays, streams, and simple data types (e.g.,
integers, strings). pypdf parses these objects and stores them in
{py:meth}`PdfReader.resolved_objects <pypdf.PdfReader.resolved_objects>`
via {py:meth}`cache_indirect_object <pypdf.PdfReader.cache_indirect_object>`.
3. **Decoding content streams**: The content of a PDF is typically stored in
content streams (see "7.8 Content Streams and Resources" in the PDF
specification), which are sequences of PDF operators and operands. pypdf
decodes these content streams by applying filters (e.g., FlateDecode,
LZWDecode) specified in the stream's dictionary. This is only done when the
object is requested via {py:meth}`PdfReader.get_object
<pypdf.PdfReader.get_object>` in the
{py:meth}`PdfReader._get_object_from_stream
<pypdf.PdfReader._get_object_from_stream>` method.
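
The first step can be illustrated without pypdf. The following is a minimal, stdlib-only sketch, assuming a classic (non-stream) xref table with a single subsection. The helper names `build_tiny_pdf`, `find_startxref`, and `read_xref_table` are hypothetical and are not part of the pypdf API; pypdf's own reader handles many more cases (xref streams, multiple subsections, broken files).

```python
import re


def build_tiny_pdf() -> bytes:
    """Assemble a tiny PDF-like file with a classic xref table (illustration only)."""
    objects = [
        b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n",
        b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n",
    ]
    out = bytearray(b"%PDF-1.3\n")
    offsets = []
    for obj in objects:
        offsets.append(len(out))  # byte offset where this object starts
        out += obj
    xref_location = len(out)
    out += f"xref\n0 {len(objects) + 1}\n".encode()
    out += b"0000000000 65535 f \n"  # entry 0 is the head of the free list
    for offset in offsets:
        out += f"{offset:0>10} {0:0>5} n \n".encode()
    out += f"trailer\n<< /Size {len(objects) + 1} /Root 1 0 R >>\n".encode()
    out += f"startxref\n{xref_location}\n%%EOF\n".encode()
    return bytes(out)


def find_startxref(data: bytes) -> int:
    """Return the byte offset stored after the last 'startxref' keyword."""
    pos = data.rfind(b"startxref")
    if pos == -1:
        raise ValueError("startxref not found")
    match = re.match(rb"startxref\s+(\d+)", data[pos:])
    if match is None:
        raise ValueError("malformed startxref")
    return int(match.group(1))


def read_xref_table(data: bytes, xref_location: int) -> dict:
    """Map object number -> byte offset for the in-use entries of one subsection."""
    lines = data[xref_location:].split(b"\n")
    if lines[0].strip() != b"xref":
        raise ValueError("no xref table at the given offset")
    first, count = (int(x) for x in lines[1].split())
    table = {}
    for i, line in enumerate(lines[2 : 2 + count]):
        offset, _generation, kind = line.split()
        if kind == b"n":  # 'n' marks an in-use object, 'f' a free one
            table[first + i] = int(offset)
    return table


pdf = build_tiny_pdf()
xref = read_xref_table(pdf, find_startxref(pdf))
```

With the offsets in `xref`, a reader can seek directly to any object (for example `pdf[xref[1]:]` starts with `1 0 obj`) instead of scanning the whole file.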
68 changes: 68 additions & 0 deletions docs/dev/pypdf-writing.md
@@ -0,0 +1,68 @@
# How pypdf writes PDF files

pypdf uses {py:class}`PdfWriter <pypdf.PdfWriter>` to write PDF files. pypdf has
{py:class}`PdfObject <pypdf.generic.PdfObject>` and several subclasses with the
{py:meth}`write_to_stream <pypdf.generic.PdfObject.write_to_stream>` method.
The {py:meth}`PdfWriter.write <pypdf.PdfWriter.write>` method uses the
`write_to_stream` methods of the referenced objects.
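
The delegation pattern described above can be sketched as follows. This is an illustration only: `ToyObject`, `ToyName`, `ToyNumber`, and `ToyDictionary` are hypothetical classes made up for this example, and the real pypdf `write_to_stream` methods take additional arguments (for example for encryption) that are left out here.

```python
import io


class ToyObject:
    """Hypothetical base class; stands in for the idea behind PdfObject."""

    def write_to_stream(self, stream) -> None:
        raise NotImplementedError


class ToyName(ToyObject):
    def __init__(self, name: str) -> None:
        self.name = name

    def write_to_stream(self, stream) -> None:
        stream.write(self.name.encode("ascii"))


class ToyNumber(ToyObject):
    def __init__(self, value: int) -> None:
        self.value = value

    def write_to_stream(self, stream) -> None:
        stream.write(str(self.value).encode("ascii"))


class ToyDictionary(ToyObject):
    def __init__(self, items: dict) -> None:
        self.items = items

    def write_to_stream(self, stream) -> None:
        # A container does not serialize its children itself; it delegates
        # to the write_to_stream method of every key and value.
        stream.write(b"<<\n")
        for key, value in self.items.items():
            key.write_to_stream(stream)
            stream.write(b" ")
            value.write_to_stream(stream)
            stream.write(b"\n")
        stream.write(b">>")


buffer = io.BytesIO()
pages = ToyDictionary(
    {ToyName("/Type"): ToyName("/Pages"), ToyName("/Count"): ToyNumber(0)}
)
pages.write_to_stream(buffer)
serialized = buffer.getvalue()
```

Because every object knows how to serialize itself, the writer only has to walk its list of objects and call the same method on each one.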

The {py:meth}`PdfWriter.write_stream <pypdf.PdfWriter.write_stream>` method
has the following core steps:

1. **Resolve Circular References** with `_sweep_indirect_references`: This step
ensures that any circular references
to objects are correctly handled. It adds the object reference numbers of any
circularly referenced objects to an external reference map, so that
self-page-referencing trees can reference the correct new object location,
rather than copying in a new copy of the page object.
2. **Write the File Header and Body** with `_write_pdf_structure`: In this step,
the PDF header and objects are written to the output stream. This includes
the PDF version (e.g., %PDF-1.7) and the objects that make up the content of
the PDF, such as pages, annotations, and form fields. The locations (byte
offsets) of these objects are stored for later use in generating the xref
table.
3. **Write the Cross-Reference Table** with `_write_xref_table`: Using the stored
object locations, this step generates and writes the cross-reference table
(xref table) to the output stream. The cross-reference table contains the
byte offsets for each object in the PDF file, allowing for quick random
access to objects when reading the PDF.
4. **Write the File Trailer** with `_write_trailer`: The trailer is written to
the output stream in this step. The trailer contains essential information,
such as the number of objects in the PDF, the location of the root object
(Catalog), and the Info object containing metadata. The trailer also
specifies the location of the xref table.
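
Steps 2 to 4 can be sketched with the standard library alone. This is a simplified illustration, not pypdf's implementation: the function names mirror the private methods listed above but are hypothetical stand-ins, the objects are given as pre-serialized bytes, and step 1 (sweeping indirect references) is skipped because the toy objects are already consistent.

```python
import io
from typing import List


def write_structure(stream, bodies: List[bytes]) -> List[int]:
    """Write the header and the objects; return each object's byte offset."""
    stream.write(b"%PDF-1.3\n")
    positions = []
    for idnum, body in enumerate(bodies, start=1):
        positions.append(stream.tell())  # remember where this object starts
        stream.write(f"{idnum} 0 obj\n".encode() + body + b"\nendobj\n")
    return positions


def write_xref_table(stream, positions: List[int]) -> int:
    """Write the xref table; return the byte offset at which it starts."""
    xref_location = stream.tell()
    stream.write(f"xref\n0 {len(positions) + 1}\n".encode())
    stream.write(f"{0:0>10} {65535:0>5} f \n".encode())  # free-list head
    for offset in positions:
        stream.write(f"{offset:0>10} {0:0>5} n \n".encode())
    return xref_location


def write_trailer(stream, size: int, xref_location: int) -> None:
    """Write the trailer dictionary, the xref location and the EOF marker."""
    stream.write(f"trailer\n<< /Size {size} /Root 1 0 R >>".encode())
    stream.write(f"\nstartxref\n{xref_location}\n%%EOF\n".encode())


stream = io.BytesIO()
bodies = [
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [] /Count 0 >>",
]
positions = write_structure(stream, bodies)
xref_location = write_xref_table(stream, positions)
write_trailer(stream, len(bodies) + 1, xref_location)
output = stream.getvalue()
```

The order matters: the xref table can only be written after the body, because the byte offsets are not known earlier, and the trailer comes last because it must record where the xref table starts.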


## How others do it

Looking at alternative software designs and implementations can help us improve
our own choices.

### fpdf

[fpdf](https://pypi.org/project/fpdf2/) has a [`PDFObject` class](https://github.com/PyFPDF/fpdf2/blob/master/fpdf/syntax.py)
with a serialize method which roughly maps to `pypdf.PdfObject.write_to_stream`.
Some other similarities include:

* [fpdf.output.OutputProducer.bufferize](https://github.com/PyFPDF/fpdf2/blob/master/fpdf/output.py#L370-L485) vs {py:meth}`pypdf.PdfWriter.write_stream <pypdf.PdfWriter.write_stream>`
* [fpdf.syntax.Name](https://github.com/PyFPDF/fpdf2/blob/master/fpdf/syntax.py#L124) vs {py:class}`pypdf.generic.NameObject <pypdf.generic.NameObject>`
* [fpdf.syntax.build_obj_dict](https://github.com/PyFPDF/fpdf2/blob/master/fpdf/syntax.py#L222) vs {py:class}`pypdf.generic.DictionaryObject <pypdf.generic.DictionaryObject>`
* [fpdf.structure_tree.NumberTree](https://github.com/PyFPDF/fpdf2/blob/master/fpdf/structure_tree.py#L17) vs
{py:class}`pypdf.generic.TreeObject <pypdf.generic.TreeObject>`


### pdfrw

[pdfrw](https://pypi.org/project/pdfrw/), in contrast, seems to work more with
the standard Python objects (bool, float, string) and not wrap them in custom
objects, if possible. It still has:

* [PdfArray](https://github.com/pmaupin/pdfrw/blob/master/pdfrw/objects/pdfarray.py#L13)
* [PdfDict](https://github.com/pmaupin/pdfrw/blob/master/pdfrw/objects/pdfdict.py#L49)
* [PdfName](https://github.com/pmaupin/pdfrw/blob/master/pdfrw/objects/pdfname.py#L65)
* [PdfString](https://github.com/pmaupin/pdfrw/blob/master/pdfrw/objects/pdfstring.py#L322)
* [PdfIndirect](https://github.com/pmaupin/pdfrw/blob/master/pdfrw/objects/pdfindirect.py#L10)

The core classes of pdfrw are
[PdfReader](https://github.com/pmaupin/pdfrw/blob/master/pdfrw/pdfreader.py#L26)
and
[PdfWriter](https://github.com/pmaupin/pdfrw/blob/master/pdfrw/pdfwriter.py#L224).
2 changes: 2 additions & 0 deletions docs/index.rst
@@ -63,6 +63,8 @@ You can contribute to `pypdf on GitHub <https://github.com/py-pdf/pypdf>`_.

dev/intro
dev/pdf-format
dev/pypdf-parsing
dev/pypdf-writing
dev/cmaps
dev/deprecations
dev/documentation
66 changes: 49 additions & 17 deletions pypdf/_writer.py
@@ -163,8 +163,13 @@ def __init__(
clone_from: Union[None, PdfReader, StrByteType, Path] = None,
) -> None:
self._header = b"%PDF-1.3"
self._objects: List[PdfObject] = [] # array of indirect objects

self._objects: List[PdfObject] = []
"""The indirect objects in the PDF."""

self._idnum_hash: Dict[bytes, IndirectObject] = {}
"""Maps hash values of indirect objects to their IndirectObject instances."""

self._id_translated: Dict[int, Dict[int, int]] = {}

# The root of our page tree node.
@@ -198,6 +203,7 @@ def __init__(
}
)
self._root = self._add_object(self._root_object)

if clone_from is not None:
if not isinstance(clone_from, PdfReader):
clone_from = PdfReader(clone_from)
@@ -1135,20 +1141,11 @@ def write_stream(self, stream: StreamType) -> None:
if not self._root:
self._root = self._add_object(self._root_object)

# PDF objects sometimes have circular references to their /Page objects
# inside their object tree (for example, annotations). Those will be
# indirect references to objects that we've recreated in this PDF. To
# address this problem, PageObject's store their original object
# reference number, and we add it to the external reference map before
# we sweep for indirect references. This forces self-page-referencing
# trees to reference the correct new object location, rather than
# copying in a new copy of the page object.
self._sweep_indirect_references(self._root)

object_positions = self._write_pdf_structure(stream)
xref_location = self._write_xref_table(stream, object_positions)
self._write_trailer(stream)
stream.write(b_(f"\nstartxref\n{xref_location}\n%%EOF\n")) # eof
self._write_trailer(stream, xref_location)

def write(self, stream: Union[Path, StrByteType]) -> Tuple[bool, IO]:
"""
@@ -1212,7 +1209,14 @@ def _write_xref_table(self, stream: StreamType, object_positions: List[int]) ->
stream.write(b_(f"{offset:0>10} {0:0>5} n \n"))
return xref_location

def _write_trailer(self, stream: StreamType) -> None:
def _write_trailer(self, stream: StreamType, xref_location: int) -> None:
"""
Write the PDF trailer to the stream.

To quote the PDF specification:
[The] trailer [gives] the location of the cross-reference table and
of certain special objects within the body of the file.
"""
stream.write(b"trailer\n")
trailer = DictionaryObject()
trailer.update(
@@ -1227,6 +1231,7 @@ def _write_trailer(self, stream: StreamType) -> None:
if hasattr(self, "_encrypt"):
trailer[NameObject(TK.ENCRYPT)] = self._encrypt
trailer.write_to_stream(stream, None)
stream.write(b_(f"\nstartxref\n{xref_location}\n%%EOF\n")) # eof

def add_metadata(self, infos: Dict[str, Any]) -> None:
"""
@@ -1265,6 +1270,21 @@ def _sweep_indirect_references(
NullObject,
],
) -> None:
"""
Resolve any circular references to Page objects.

Circular references to Page objects can arise when objects such as
annotations refer to their associated page. If these references are not
properly handled, the PDF file will contain multiple copies of the same
Page object. To address this problem, Page objects store their original
object reference number. This method adds the reference number of any
circularly referenced Page objects to an external reference map. This
ensures that self-referencing trees reference the correct new object
location, rather than copying in a new copy of the Page object.

Args:
root: The root of the PDF object tree to sweep.
"""
stack: Deque[
Tuple[
Any,
@@ -1333,16 +1353,28 @@ def _sweep_indirect_references(

def _resolve_indirect_object(self, data: IndirectObject) -> IndirectObject:
"""
Resolves indirect object to this pdf indirect objects.
Resolves an indirect object to an indirect object in this PDF file.

If the input indirect object already belongs to this PDF file, it is
returned directly. Otherwise, the object is retrieved from the input
object's PDF file using the object's ID number and generation number. If
the object cannot be found, a warning is logged and a `NullObject` is
returned.

If it is a new object then it is added to self._objects
and new idnum is given and generation is always 0.
If the object is not already in this PDF file, it is added to the file's
list of objects and assigned a new ID number and generation number of 0.
The hash value of the object is then added to the `_idnum_hash`
dictionary, with the corresponding `IndirectObject` reference as the
value.

Args:
data:
data: The `IndirectObject` to resolve.

Returns:
The resolved indirect object
The resolved `IndirectObject` in this PDF file.

Raises:
ValueError: If the input stream is closed.
"""
if hasattr(data.pdf, "stream") and data.pdf.stream.closed:
raise ValueError(f"I/O operation on closed file: {data.pdf.stream.name}")