Documentation: Updates to include PyMuPDF4LLM docs.
jamie-lemon committed Aug 8, 2024
1 parent afff7b5 commit 07b57a6
Showing 6 changed files with 266 additions and 13 deletions.
10 changes: 9 additions & 1 deletion docs/header.rst
@@ -29,6 +29,14 @@

<cite>PDF</cite>

.. |PyMuPDF4LLM| raw:: html

<cite>PyMuPDF4LLM</cite>

.. |Markdown| raw:: html

<cite>Markdown</cite>

.. raw:: html

<style>
@@ -93,7 +101,7 @@
<div style="display:flex;justify-content:space-between;align-items:center;margin-top:20px;">
<div class="discordLink" style="display:flex;align-items:center;margin-top: -5px;">
<a href="https://discord.gg/TSpYGBW4eq" id="findOnDiscord" target=_blank>Find <b>#pymupdf</b> on <b>Discord</b></a>
<a href="https://discord.gg/TSpYGBW4eq" target=_blank><img src="_images/discord-mark-blue.svg" alt="Discord logo" /></a>
<a href="https://discord.gg/TSpYGBW4eq" target=_blank><img src="https://pymupdf.readthedocs.io/en/latest/_images/discord-mark-blue.svg" alt="Discord logo" /></a>
</div>

<div class="feedbackLink"><a id="feedbackLinkTop" target=_blank>Do you have any feedback on this page?</b></a></div>
1 change: 1 addition & 0 deletions docs/index.rst
@@ -35,6 +35,7 @@ This documentation covers all versions up to |version|.
:maxdepth: 1

about.rst
pymupdf4llm/index.rst
pymupdf-pro.rst


29 changes: 20 additions & 9 deletions docs/pymupdf-pro.rst
@@ -22,10 +22,26 @@ Office file support

In addition to the `standard file types supported by PyMuPDF <Supported_File_Types>`, |PyMuPDF Pro| supports:

- DOC/DOCX
- PPT/PPTX
- XLS/XLSX
- HWP/HWPX
.. list-table::
   :header-rows: 1

   * - **DOC/DOCX**
     - **XLS/XLSX**
     - **PPT/PPTX**
     - **HWP/HWPX**
   * - .. image:: images/icons/icon-docx.svg
          :width: 40
          :height: 40
     - .. image:: images/icons/icon-xlsx.svg
          :width: 40
          :height: 40
     - .. image:: images/icons/icon-pptx.svg
          :width: 40
          :height: 40
     - .. image:: images/icons/icon-hangul.svg
          :width: 40
          :height: 40



Usage
@@ -34,11 +50,6 @@ Usage
Installation
~~~~~~~~~~~~~~~~~~

.. note::

|PyMuPDF Pro| is only available for Linux & Windows platforms.


Install via pip with:

.. code-block:: bash
82 changes: 82 additions & 0 deletions docs/pymupdf4llm/api.rst
@@ -0,0 +1,82 @@
.. include:: ../header.rst



.. _pymupdf4llm-api:


API
===========================================================================

The |PyMuPDF4LLM| API
--------------------------

.. property:: version

The version of the library.

.. method:: to_markdown(doc: pymupdf.Document | str, *, pages: list | range | None = None, hdr_info: Any = None, write_images: bool = False, margins=(0, 50, 0, 50), page_chunks: bool = False) -> str | list[dict]

Reads the pages of the file and outputs their text in |Markdown| format. A number of parameters control how this happens in detail. The |Markdown| text can also be returned as page chunks.

:arg Document,str doc: the file, to be specified either as a file path string, or as a |PyMuPDF| Document (created via `pymupdf.open`).

:arg list,range pages: optional, the pages to consider for output. If omitted all pages are processed.

:arg hdr_info: optional, a callable (or an object having a method named `hdr_info`) which accepts a text span and delivers a string of 0 up to 6 "#" characters which should be used to identify headers in the markdown text. If omitted, a full document scan will be performed to find the most popular font sizes and derive header levels based on this. For instance, to avoid generating any lines tagged as headers specify `hdr_info=lambda s: ""`.

:arg bool write_images: when encountering images or vector graphics, PNG images will be generated from the respective page area. Markdown references will be generated pointing to these images. Any text contained in these areas will not be included in the output. Therefore, if your document has text written on full page images, make sure to set this parameter to `False`.

:arg float,list margins: a float or a list of up to 4 floats specifying page borders. If 4 floats are provided, they are assumed to be the values left, top, right, bottom, in this sequence. Only content below top and above bottom, etc. will be considered for processing. If a single float value is provided, it will be taken as the value for all 4 border values. A pair of numbers is assumed to specify top and bottom.

:arg bool page_chunks: if `True` the output will be a list of `Document.page_count` dictionaries (one per page). Each dictionary has the following structure:

- **"metadata"** - a dictionary consisting of the document's metadata `Document.metadata <https://pymupdf.readthedocs.io/en/latest/document.html#Document.metadata>`_, enriched with additional keys **"file_path"** (the file name), **"page_count"** (number of pages in document), and **"page_number"** (1-based page number).

- **"toc_items"** - a list of Table of Contents items pointing to this page. Each item of this list has the format `[lvl, title, pagenumber]`, where `lvl` is the hierarchy level, `title` a string and `pagenumber` the 1-based page number.

- **"tables"** - a list of tables on this page. Each item is a dictionary with keys "bbox", "row_count" and "col_count". Key "bbox" is a `pymupdf.Rect` in tuple format of the table's position on the page.

- **"images"** - a list of images on the page. This is a copy of the output of page method :meth:`Page.get_image_info`.

- **"graphics"** - a list of vector graphics rectangles on the page. This is a list of boundary boxes of clustered vector graphics as delivered by method :meth:`Page.cluster_drawings`.

- **"text"** - page content as |Markdown| text.

:returns: Either a string of the combined text of all selected document pages or a list of dictionaries.
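
As a rough sketch of how the page chunk output described above might be consumed, the helper below walks a list of chunk dictionaries and collects a short summary per page. The function name `summarize_chunks` and the sample data are hypothetical; only the dictionary keys are taken from the description above. In real use the list would presumably come from `pymupdf4llm.to_markdown("input.pdf", page_chunks=True)`.

```python
def summarize_chunks(chunks):
    """Return one (page_number, table_count, text_length) tuple per page chunk."""
    summary = []
    for chunk in chunks:
        meta = chunk["metadata"]  # document metadata plus the per-page keys
        summary.append(
            (meta["page_number"], len(chunk["tables"]), len(chunk["text"]))
        )
    return summary


# Hypothetical sample that mimics the documented chunk structure.
sample_chunks = [
    {
        "metadata": {"file_path": "input.pdf", "page_count": 1, "page_number": 1},
        "toc_items": [[1, "Introduction", 1]],
        "tables": [{"bbox": (72, 100, 500, 200), "row_count": 3, "col_count": 2}],
        "images": [],
        "graphics": [],
        "text": "# Introduction\n",
    }
]

print(summarize_chunks(sample_chunks))
```

Keeping the helper independent of the library makes it easy to try out on hand-written data before running it on a real document.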

.. method:: LlamaMarkdownReader(*args, **kwargs)

Create a `pdf_markdown_reader.PDFMarkdownReader` using the `LlamaIndex`_ package. Please note that this package will **not automatically be installed** when installing **pymupdf4llm**.

For details on the possible arguments, please consult the LlamaIndex documentation [#f1]_.

:raises: `NotImplementedError`: Please install required `LlamaIndex`_ package.
:returns: a `pdf_markdown_reader.PDFMarkdownReader` and issues message "Successfully imported LlamaIndex". Please note that this method needs several seconds to execute. For details on using the markdown reader please see below.

----


.. class:: pdf_markdown_reader.PDFMarkdownReader

.. method:: load_data(file_path: Union[Path, str], extra_info: Optional[Dict] = None, **load_kwargs: Any) -> List[LlamaIndexDocument]

This is currently the only method of the markdown reader that you should use to extract markdown data. Ignore the methods `aload_data()` and `lazy_load_data()`. Other methods such as `use_doc_meta()` may or may not be useful. For more information, please consult the LlamaIndex documentation [#f1]_.

Under the hood the method will execute `to_markdown()`.

:returns: a list of `LlamaIndexDocument` documents - one for each page.


.. rubric:: Footnotes

.. [#f1] `LlamaIndex documentation <https://docs.llamaindex.ai/en/stable/>`_
.. include:: ../footer.rst

.. _LlamaIndex: https://pypi.org/project/llama-index/



151 changes: 151 additions & 0 deletions docs/pymupdf4llm/index.rst
@@ -0,0 +1,151 @@

.. include:: ../header.rst

.. _pymupdf4llm:

PyMuPDF4LLM
===========================================================================

|PyMuPDF4LLM| aims to make it easier to extract **PDF** content in the format you need for **LLM** & **RAG** environments. It supports :ref:`Markdown extraction <extracting_as_md>` as well as :ref:`LlamaIndex document output <extracting_as_llamaindex>`.

.. important::

You can extend the supported file types to also include **Office** document formats (DOC/DOCX, XLS/XLSX, PPT/PPTX, HWP/HWPX) by :ref:`using PyMuPDF Pro with PyMuPDF4LLM <using_pymupdf4llm_withpymupdfpro>`.

Features
-------------------------------

- Support for multi-column pages
- Support for image and vector graphics extraction (and inclusion of references in the MD text)
- Support for page chunking output.
- Direct support for output as :ref:`LlamaIndex Documents <extracting_as_llamaindex>`.


Functionality
--------------------

- This package converts the pages of a file to text in **Markdown** format using |PyMuPDF|.

- Standard text and tables are detected, placed in the correct reading order and then converted together to **GitHub**-compatible **Markdown** text.

- Header lines are identified via the font size and appropriately prefixed with one or more `#` tags.

- Bold, italic, mono-spaced text and code blocks are detected and formatted accordingly. The same applies to ordered and unordered lists.

- By default, all document pages are processed. If desired, a subset of pages can be specified by providing a list of `0`-based page numbers.
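
The font-size based header detection mentioned above can be illustrated with a small sketch. This is not the library's implementation: `make_hdr_info` and the size table are hypothetical, and the span is assumed to be a dictionary with a numeric `"size"` key, as in |PyMuPDF| text span dictionaries. The returned string contains between 0 and 6 `#` characters.

```python
def make_hdr_info(size_to_level):
    """Build a callable that maps a span's font size to a Markdown header prefix.

    `size_to_level` is a hypothetical mapping of rounded font size -> header level.
    """

    def hdr_info(span, **kwargs):
        # Assumption: `span` is a dict with a numeric "size" entry.
        level = size_to_level.get(round(span["size"]), 0)
        # Markdown supports at most six header levels.
        return "#" * min(level, 6)

    return hdr_info


# Hypothetical size table: 24pt text becomes level 1, 18pt level 2, 14pt level 3.
hdr_info = make_hdr_info({24: 1, 18: 2, 14: 3})

print(repr(hdr_info({"size": 18.2})))
```

A callable like this could then be tried as the `hdr_info` argument of `to_markdown()`. Whether the library expects additional arguments or a trailing space in the returned prefix is not stated here, so treat the exact signature as an assumption.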


Installation
----------------


Install the package via **pip** with:


.. code-block:: bash

    pip install pymupdf4llm

.. _extracting_as_md:

Extracting a file as **Markdown**
--------------------------------------------------------------

To retrieve your document content in **Markdown**, install the package and then use a couple of lines of **Python** code to get results.



Then in your **Python** script do:


.. code-block:: python

    import pymupdf4llm
    md_text = pymupdf4llm.to_markdown("input.pdf")

.. note::

    Instead of a filename string as above, you can also pass a :ref:`PyMuPDF Document <Document>`. The `pages` keyword argument accepts a list of `0`-based page numbers, e.g. `pages=[0, 1]` selects only the first and second pages of the document.
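
Because page numbers here are `0`-based while readers usually think in `1`-based pages, a small conversion helper can avoid off-by-one mistakes. The sketch below is not part of the package; the name `zero_based_pages` and the `"1-3,5"` spec format are assumptions made for illustration.

```python
def zero_based_pages(spec):
    """Convert a 1-based page spec such as "1-3,5" to a sorted list of 0-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            first, last = (int(p) for p in part.split("-", 1))
        else:
            first = last = int(part)
        if first < 1 or last < first:
            raise ValueError(f"invalid page range: {part!r}")
        # Page N (1-based) is index N - 1; range() excludes its end value.
        pages.update(range(first - 1, last))
    return sorted(pages)


print(zero_based_pages("1-3,5"))
```

The resulting list could then be passed on, for example as `pymupdf4llm.to_markdown("input.pdf", pages=zero_based_pages("1-2"))`.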


To store the **Markdown** output, for example as a UTF-8 encoded file, do:


.. code-block:: python

    import pathlib
    pathlib.Path("output.md").write_bytes(md_text.encode())

.. _extracting_as_llamaindex:

Extracting a file as a **LlamaIndex** document
--------------------------------------------------------------

|PyMuPDF4LLM| supports direct conversion to a **LlamaIndex** document. A document is first converted into **Markdown** format and then a **LlamaIndex** document is returned as follows:



.. code-block:: python

    import pymupdf4llm
    llama_reader = pymupdf4llm.LlamaMarkdownReader()
    llama_docs = llama_reader.load_data("input.pdf")

.. _using_pymupdf4llm_withpymupdfpro:

Using with |PyMuPDF Pro|
---------------------------


For **Office** document support, |PyMuPDF4LLM| works seamlessly with |PyMuPDF Pro|. Assuming you have :doc:`../pymupdf-pro` installed you will be able to work with **Office** documents as expected:


.. code-block:: python

    import pymupdf4llm
    import pymupdf.pro
    pymupdf.pro.unlock()
    md_text = pymupdf4llm.to_markdown("sample.doc")

As you can see |PyMuPDF Pro| functionality will be available within the |PyMuPDF4LLM| context!



API
-------

See :ref:`the PyMuPDF4LLM API <pymupdf4llm-api>`.

Further Resources
-------------------


Sample code
~~~~~~~~~~~~~~~

- `Command line RAG Chatbot with PyMuPDF <https://github.com/pymupdf/RAG/tree/main/country-capitals>`_
- `Example of a Browser Application using Langchain and PyMuPDF <https://github.com/pymupdf/RAG/tree/main/GUI>`_


Blogs
~~~~~~~~~~~~~~

- `RAG/LLM and PDF: Enhanced Text Extraction <https://artifex.com/blog/rag-llm-and-pdf-enhanced-text-extraction>`_
- `Creating a RAG Chatbot with ChatGPT and PyMuPDF <https://artifex.com/blog/creating-a-rag-chatbot-with-chatgpt-and-pymupdf>`_
- `Building a RAG Chatbot GUI with the ChatGPT API and PyMuPDF <https://artifex.com/blog/building-a-rag-chatbot-gui-with-the-chatgpt-api-and-pymupdf>`_
- `RAG/LLM and PDF: Conversion to Markdown Text with PyMuPDF <https://artifex.com/blog/rag-llm-and-pdf-conversion-to-markdown-text-with-pymupdf>`_







.. include:: ../footer.rst
6 changes: 3 additions & 3 deletions docs/rag.rst
@@ -14,7 +14,7 @@ If you need to export to :title:`Markdown` or obtain a :title:`LlamaIndex` Docum

.. raw:: html

<button id="pymupdf4llmButton" class="cta orange" style="text-transform: none;" onclick="window.location='https://pymupdf4llm.readthedocs.io'">Try PyMuPDF4LLM</button>
<button id="pymupdf4llmButton" class="cta orange" style="text-transform: none;" onclick="window.location='pymupdf4llm/'">Try PyMuPDF4LLM</button>
<p></p>

<script>
@@ -70,7 +70,7 @@ Chunking (or splitting) data is essential to give context to your :title:`LLM` d
Outputting as :title:`Markdown`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In order to export your document in :title:`Markdown` format you will need a separate helper. Package `pymupdf4llm <https://pypi.org/project/pymupdf4llm/>`_ is a high-level wrapper of |PyMuPDF| functions which for each page outputs standard and table text in an integrated Markdown-formatted string across all document pages:
In order to export your document in :title:`Markdown` format you will need a separate helper. Package :doc:`pymupdf4llm/index` is a high-level wrapper of |PyMuPDF| functions which for each page outputs standard and table text in an integrated Markdown-formatted string across all document pages:


.. code-block:: python
@@ -84,7 +84,7 @@ In order to export your document in :title:`Markdown` format you will need a sep
pathlib.Path("output.md").write_bytes(md_text.encode())
For further information please refer to: `pymupdf4llm documentation <https://pymupdf4llm.readthedocs.io>`_
For further information please refer to: :doc:`pymupdf4llm/index`.


How to use :title:`Markdown` output