py-pdf · MartinThoma · Apr 19, 2023 · Apr 10, 2023 · Apr 16, 2023 · Apr 16, 2023
diff --git a/docs/dev/pypdf-parsing.md b/docs/dev/pypdf-parsing.md
@@ -0,0 +1,26 @@
+# How pypdf parses PDF files
+
+pypdf uses {py:class}`PdfReader <pypdf.PdfReader>` to parse PDF files.
+The method {py:meth}`PdfReader.read <pypdf.PdfReader.read>` shows the basic
+structure of parsing:
+
+1. **Finding and reading the cross-reference tables / trailer**: The
+   cross-reference table (xref table) is a table of byte offsets that indicate
+   the locations of objects within the file. The trailer provides additional
+   information such as the root object (Catalog) and the Info object containing
+   metadata.
+2. **Parsing the objects**: After locating the xref table and the trailer, pypdf
+   proceeds to parse the objects in the PDF. Objects in a PDF can be of various
+   types such as dictionaries, arrays, streams, and simple data types (e.g.,
+   integers, strings). pypdf parses these objects and stores them in
+   {py:attr}`PdfReader.resolved_objects <pypdf.PdfReader.resolved_objects>`
+   via {py:meth}`cache_indirect_object <pypdf.PdfReader.cache_indirect_object>`.
+3. **Decoding content streams**: The content of a PDF is typically stored in
+   content streams (see "7.8 Content Streams and Resources" in the PDF
+   specification), which are sequences of PDF operators and operands. pypdf
+   decodes these content streams by applying filters (e.g., FlateDecode,
+   LZWDecode) specified in the stream's dictionary. This is only done when the
+   object is requested via {py:meth}`PdfReader.get_object
+   <pypdf.PdfReader.get_object>` in the
+   {py:meth}`PdfReader._get_object_from_stream
+   <pypdf.PdfReader._get_object_from_stream>` method.
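As an illustration of the three parsing steps listed in `docs/dev/pypdf-parsing.md` above, here is a minimal standard-library sketch that locates `startxref`, reads a classic xref table, and inflates a FlateDecode stream. The helper names (`find_startxref`, `parse_xref_table`, `decode_flate`) and the sample bytes are assumptions made for this example and are not pypdf APIs. Real files may also use xref streams, incremental updates, and predictors, which the sketch ignores.

```python
import re
import zlib
from typing import Dict


def find_startxref(data: bytes) -> int:
    """Return the offset that follows the last ``startxref`` keyword."""
    idx = data.rfind(b"startxref")
    if idx == -1:
        raise ValueError("startxref keyword not found")
    match = re.match(rb"startxref\s+(\d+)", data[idx:])
    if match is None:
        raise ValueError("malformed startxref section")
    return int(match.group(1))


def parse_xref_table(data: bytes, offset: int) -> Dict[int, int]:
    """Map object numbers to byte offsets for a classic xref table."""
    lines = data[offset:].splitlines()
    if not lines or lines[0].strip() != b"xref":
        raise ValueError("no xref table at the given offset")
    offsets: Dict[int, int] = {}
    i = 1
    while i < len(lines) and lines[i].strip() != b"trailer":
        # Each subsection starts with "<first object number> <entry count>".
        first, count = (int(part) for part in lines[i].split())
        for n in range(count):
            position, _generation, kind = lines[i + 1 + n].split()
            if kind == b"n":  # "n" marks an in-use entry, "f" a free one
                offsets[first + n] = int(position)
        i += 1 + count
    return offsets


def decode_flate(raw: bytes) -> bytes:
    """Undo a FlateDecode filter (zlib/deflate), ignoring predictors."""
    return zlib.decompress(raw)


# Hand-built sample (an assumption for this demo, not a complete PDF).
header = b"%PDF-1.3\n"
obj1_offset = len(header)
body = b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
xref_offset = len(header) + len(body)
xref = f"xref\n0 2\n{0:010d} 65535 f \n{obj1_offset:010d} 00000 n \n".encode("ascii")
trailer = (
    f"trailer\n<< /Size 2 /Root 1 0 R >>\nstartxref\n{xref_offset}\n%%EOF\n"
).encode("ascii")
sample = header + body + xref + trailer

table = parse_xref_table(sample, find_startxref(sample))
content = decode_flate(zlib.compress(b"BT /F1 12 Tf ET"))
```

Reading the file from its end, as `find_startxref` does, mirrors the order of the steps in the list: the trailer and xref table are located first, and objects are looked up by offset only afterwards.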
diff --git a/docs/dev/pypdf-writing.md b/docs/dev/pypdf-writing.md
@@ -0,0 +1,68 @@
+# How pypdf writes PDF files
+
+pypdf uses {py:class}`PdfWriter <pypdf.PdfWriter>` to write PDF files. pypdf has
+{py:class}`PdfObject <pypdf.generic.PdfObject>` and several subclasses with the
+{py:meth}`write_to_stream <pypdf.generic.PdfObject.write_to_stream>` method.
+The {py:meth}`PdfWriter.write <pypdf.PdfWriter.write>` method uses the
+`write_to_stream` methods of the referenced objects.
+
+The {py:meth}`PdfWriter.write_stream <pypdf.PdfWriter.write_stream>` method
+has the following core steps:
+
+1. **Resolve circular references** with `_sweep_indirect_references`: This
+   step ensures that circular references to Page objects are handled
+   correctly. It adds the reference numbers of circularly referenced Page
+   objects to an external reference map, so that self-page-referencing trees
+   point at the correct new object location instead of copying the page again.
+2. **Write the File Header and Body** with `_write_pdf_structure`: In this step,
+   the PDF header and objects are written to the output stream. This includes
+   the PDF version (e.g., %PDF-1.7) and the objects that make up the content of
+   the PDF, such as pages, annotations, and form fields. The locations (byte
+   offsets) of these objects are stored for later use in generating the xref
+   table.
+3. **Write the Cross-Reference Table** with `_write_xref_table`: Using the stored
+   object locations, this step generates and writes the cross-reference table
+   (xref table) to the output stream. The cross-reference table contains the
+   byte offsets for each object in the PDF file, allowing for quick random
+   access to objects when reading the PDF.
+4. **Write the File Trailer** with `_write_trailer`: The trailer is written to
+   the output stream in this step. The trailer contains essential information,
+   such as the number of objects in the PDF, the location of the root object
+   (Catalog), and the Info object containing metadata. The trailer also
+   specifies the location of the xref table.
+
+
+## How others do it
+
+Looking at alternative software designs and implementations can help to improve
+our choices.
+
+### fpdf
+
+[fpdf](https://pypi.org/project/fpdf2/) has a [`PDFObject` class](https://github.com/PyFPDF/fpdf2/blob/master/fpdf/syntax.py)
+with a `serialize` method which roughly maps to `pypdf.generic.PdfObject.write_to_stream`.
+Some other similarities include:
+
+* [fpdf.output.OutputProducer.bufferize](https://github.com/PyFPDF/fpdf2/blob/master/fpdf/output.py#L370-L485) vs {py:meth}`pypdf.PdfWriter.write_stream <pypdf.PdfWriter.write_stream>`
+* [fpdf.syntax.Name](https://github.com/PyFPDF/fpdf2/blob/master/fpdf/syntax.py#L124) vs {py:class}`pypdf.generic.NameObject <pypdf.generic.NameObject>`
+* [fpdf.syntax.build_obj_dict](https://github.com/PyFPDF/fpdf2/blob/master/fpdf/syntax.py#L222) vs {py:class}`pypdf.generic.DictionaryObject <pypdf.generic.DictionaryObject>`
+* [fpdf.structure_tree.NumberTree](https://github.com/PyFPDF/fpdf2/blob/master/fpdf/structure_tree.py#L17) vs
+ {py:class}`pypdf.generic.TreeObject <pypdf.generic.TreeObject>`
+
+
+### pdfrw
+
+[pdfrw](https://pypi.org/project/pdfrw/), in contrast, seems to work more with
+the standard Python objects (bool, float, string) and not wrap them in custom
+objects, if possible. It still has:
+
+* [PdfArray](https://github.com/pmaupin/pdfrw/blob/master/pdfrw/objects/pdfarray.py#L13)
+* [PdfDict](https://github.com/pmaupin/pdfrw/blob/master/pdfrw/objects/pdfdict.py#L49)
+* [PdfName](https://github.com/pmaupin/pdfrw/blob/master/pdfrw/objects/pdfname.py#L65)
+* [PdfString](https://github.com/pmaupin/pdfrw/blob/master/pdfrw/objects/pdfstring.py#L322)
+* [PdfIndirect](https://github.com/pmaupin/pdfrw/blob/master/pdfrw/objects/pdfindirect.py#L10)
+
+The core classes of pdfrw are
+[PdfReader](https://github.com/pmaupin/pdfrw/blob/master/pdfrw/pdfreader.py#L26)
+and
+[PdfWriter](https://github.com/pmaupin/pdfrw/blob/master/pdfrw/pdfwriter.py#L224).
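The write steps listed in `docs/dev/pypdf-writing.md` above (header and body, xref table, trailer) can be sketched without pypdf. This is a simplified illustration, not the library's implementation: `write_minimal_pdf` is a hypothetical helper, objects are passed in as pre-serialized bytes, and the sweep for indirect references is left out. The xref entry and `startxref` formats follow the f-strings visible in the `_writer.py` hunks of this diff.

```python
import io
from typing import BinaryIO, List


def write_minimal_pdf(stream: BinaryIO, objects: List[bytes]) -> int:
    """Write header, body, xref table and trailer; return the xref offset.

    Hypothetical helper: ``objects`` holds already-serialized object bodies,
    numbered from 1 in list order, and object 1 is assumed to be the Catalog.
    """
    # Header and body: remember where each object starts.
    stream.write(b"%PDF-1.3\n")
    positions: List[int] = []
    for idnum, body in enumerate(objects, start=1):
        positions.append(stream.tell())
        stream.write(f"{idnum} 0 obj\n".encode("ascii") + body + b"\nendobj\n")

    # Cross-reference table: one free entry for object 0, then one per object.
    xref_location = stream.tell()
    stream.write(f"xref\n0 {len(objects) + 1}\n".encode("ascii"))
    stream.write(f"{0:0>10} {65535:0>5} f \n".encode("ascii"))
    for offset in positions:
        stream.write(f"{offset:0>10} {0:0>5} n \n".encode("ascii"))

    # Trailer: object count, root reference, then the xref location.
    stream.write(b"trailer\n")
    stream.write(f"<< /Size {len(objects) + 1} /Root 1 0 R >>".encode("ascii"))
    stream.write(f"\nstartxref\n{xref_location}\n%%EOF\n".encode("ascii"))
    return xref_location


buffer = io.BytesIO()
objects = [
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [] /Count 0 >>",
]
xref_location = write_minimal_pdf(buffer, objects)
pdf_bytes = buffer.getvalue()
```

Because the xref offset is only known after the body has been written, the trailer has to come last, which is why passing `xref_location` into the trailer step keeps the end-of-file marker in one place.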
diff --git a/docs/index.rst b/docs/index.rst
@@ -63,6 +63,8 @@ You can contribute to `pypdf on GitHub <https://github.com/py-pdf/pypdf>`_.
 
    dev/intro
    dev/pdf-format
+   dev/pypdf-parsing
+   dev/pypdf-writing
    dev/cmaps
    dev/deprecations
    dev/documentation

diff --git a/pypdf/_writer.py b/pypdf/_writer.py
@@ -163,8 +163,13 @@ def __init__(
         clone_from: Union[None, PdfReader, StrByteType, Path] = None,
     ) -> None:
         self._header = b"%PDF-1.3"
-        self._objects: List[PdfObject] = []  # array of indirect objects
+
+        self._objects: List[PdfObject] = []
+        """The indirect objects in the PDF."""
+
         self._idnum_hash: Dict[bytes, IndirectObject] = {}
+        """Maps hash values of indirect objects to their IndirectObject instances."""
+
         self._id_translated: Dict[int, Dict[int, int]] = {}
 
         # The root of our page tree node.
@@ -198,6 +203,7 @@ def __init__(
             }
         )
         self._root = self._add_object(self._root_object)
+
         if clone_from is not None:
             if not isinstance(clone_from, PdfReader):
                 clone_from = PdfReader(clone_from)
@@ -1135,20 +1141,11 @@ def write_stream(self, stream: StreamType) -> None:
         if not self._root:
             self._root = self._add_object(self._root_object)
 
-        # PDF objects sometimes have circular references to their /Page objects
-        # inside their object tree (for example, annotations).  Those will be
-        # indirect references to objects that we've recreated in this PDF.  To
-        # address this problem, PageObject's store their original object
-        # reference number, and we add it to the external reference map before
-        # we sweep for indirect references.  This forces self-page-referencing
-        # trees to reference the correct new object location, rather than
-        # copying in a new copy of the page object.
         self._sweep_indirect_references(self._root)
 
         object_positions = self._write_pdf_structure(stream)
         xref_location = self._write_xref_table(stream, object_positions)
-        self._write_trailer(stream)
-        stream.write(b_(f"\nstartxref\n{xref_location}\n%%EOF\n"))  # eof
+        self._write_trailer(stream, xref_location)
 
     def write(self, stream: Union[Path, StrByteType]) -> Tuple[bool, IO]:
         """
@@ -1212,7 +1209,14 @@ def _write_xref_table(self, stream: StreamType, object_positions: List[int]) ->
             stream.write(b_(f"{offset:0>10} {0:0>5} n \n"))
         return xref_location
 
-    def _write_trailer(self, stream: StreamType) -> None:
+    def _write_trailer(self, stream: StreamType, xref_location: int) -> None:
+        """
+        Write the PDF trailer to the stream.
+
+        To quote the PDF specification:
+            [The] trailer [gives] the location of the cross-reference table and
+            of certain special objects within the body of the file.
+        """
         stream.write(b"trailer\n")
         trailer = DictionaryObject()
         trailer.update(
@@ -1227,6 +1231,7 @@ def _write_trailer(self, stream: StreamType) -> None:
         if hasattr(self, "_encrypt"):
             trailer[NameObject(TK.ENCRYPT)] = self._encrypt
         trailer.write_to_stream(stream, None)
+        stream.write(b_(f"\nstartxref\n{xref_location}\n%%EOF\n"))  # eof
 
     def add_metadata(self, infos: Dict[str, Any]) -> None:
         """
@@ -1265,6 +1270,21 @@ def _sweep_indirect_references(
             NullObject,
         ],
     ) -> None:
+        """
+        Resolve any circular references to Page objects.
+
+        Circular references to Page objects can arise when objects such as
+        annotations refer to their associated page. If these references are not
+        properly handled, the PDF file will contain multiple copies of the same
+        Page object. To address this problem, Page objects store their original
+        object reference number. This method adds the reference number of any
+        circularly referenced Page objects to an external reference map. This
+        ensures that self-referencing trees reference the correct new object
+        location, rather than copying in a new copy of the Page object.
+
+        Args:
+            root: The root of the PDF object tree to sweep.
+        """
         stack: Deque[
             Tuple[
                 Any,
@@ -1333,16 +1353,28 @@ def _sweep_indirect_references(
 
     def _resolve_indirect_object(self, data: IndirectObject) -> IndirectObject:
         """
-        Resolves indirect object to this pdf indirect objects.
+        Resolves an indirect object to an indirect object in this PDF file.
+
+        If the input indirect object already belongs to this PDF file, it is
+        returned directly. Otherwise, the object is retrieved from the input
+        object's PDF file using the object's ID number and generation number. If
+        the object cannot be found, a warning is logged and a `NullObject` is
+        returned.
 
-        If it is a new object then it is added to self._objects
-        and new idnum is given and generation is always 0.
+        If the object is not already in this PDF file, it is added to the file's
+        list of objects and assigned a new ID number and generation number of 0.
+        The hash value of the object is then added to the `_idnum_hash`
+        dictionary, with the corresponding `IndirectObject` reference as the
+        value.
 
         Args:
-            data:
+            data: The `IndirectObject` to resolve.
 
         Returns:
-            The resolved indirect object
+            The resolved `IndirectObject` in this PDF file.
+
+        Raises:
+            ValueError: If the input stream is closed.
         """
         if hasattr(data.pdf, "stream") and data.pdf.stream.closed:
             raise ValueError(f"I/O operation on closed file: {data.pdf.stream.name}")
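The behaviour described in the `_resolve_indirect_object` docstring above can be illustrated with a toy model. `Ref`, `ToyDocument` and the SHA-1 key are hypothetical stand-ins chosen for this sketch. pypdf's own object classes, its hashing, and the warning path with the `NullObject` fallback are not reproduced here.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Ref:
    """Hypothetical stand-in for an indirect reference."""

    idnum: int
    generation: int
    owner: "ToyDocument"


class ToyDocument:
    """Toy model of a writer that imports objects from other documents."""

    def __init__(self) -> None:
        self.objects: List[bytes] = []          # plays the role of _objects
        self.idnum_hash: Dict[bytes, Ref] = {}  # plays the role of _idnum_hash

    def add_object(self, obj: bytes) -> Ref:
        """Append an object and hand out a new ID with generation 0."""
        self.objects.append(obj)
        return Ref(len(self.objects), 0, self)

    def get_object(self, ref: Ref) -> bytes:
        return self.objects[ref.idnum - 1]

    def resolve(self, ref: Ref) -> Ref:
        """Return a reference that is valid inside this document."""
        if ref.owner is self:
            return ref  # already ours, nothing to copy
        real_obj = ref.owner.get_object(ref)
        key = hashlib.sha1(real_obj).digest()
        if key not in self.idnum_hash:
            # First time this content is seen: copy it in once.
            self.idnum_hash[key] = self.add_object(real_obj)
        return self.idnum_hash[key]


source = ToyDocument()
source_ref = source.add_object(b"<< /Type /Page >>")

target = ToyDocument()
first = target.resolve(source_ref)
second = target.resolve(source_ref)
```

Resolving the same foreign reference twice yields the same local reference, so the target document keeps a single copy of the object.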