Commit
Documentation: Updates to include PyMuPDF4LLM docs.
1 parent
afff7b5
commit 07b57a6
Showing 6 changed files with 266 additions and 13 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,82 @@ | ||
.. include:: ../header.rst | ||
|
||
|
||
|
||
.. _pymupdf4llm-api: | ||
|
||
|
||
API | ||
=========================================================================== | ||
|
||
The |PyMuPDF4LLM| API | ||
-------------------------- | ||
|
||
.. property:: version | ||
|
||
The version of the library, as a string. | ||
|
||
.. method:: to_markdown(doc: pymupdf.Document | str, *, pages: list | range | None = None, hdr_info: Any = None, write_images: bool = False, margins=(0, 50, 0, 50), page_chunks: bool = False) -> str | list[dict] | ||
|
||
Reads the pages of the file and outputs their text in |Markdown| format. A number of parameters control how this happens in detail. The |Markdown| text can also be returned as page chunks, one per page. | ||
|
||
:arg Document,str doc: the file, to be specified either as a file path string, or as a |PyMuPDF| Document (created via `pymupdf.open`). | ||
|
||
:arg list,range pages: optional, the pages to consider for output. If omitted all pages are processed. | ||
|
||
:arg hdr_info: optional, a callable (or an object having a method named `hdr_info`) which accepts a text span and delivers a string of 0 up to 6 "#" characters which should be used to identify headers in the markdown text. If omitted, a full document scan will be performed to find the most popular font sizes and derive header levels based on this. For instance, to avoid generating any lines tagged as headers specify `hdr_info=lambda s: ""`. | ||
|
||
:arg bool write_images: when encountering images or vector graphics, PNG images will be generated from the respective page area. Markdown references will be generated pointing to these images. Any text contained in these areas will not be included in the output. Therefore, if your document has text written on full page images, make sure to set this parameter to `False`. | ||
|
||
:arg float,list margins: a float or a list of up to 4 floats specifying page borders. If 4 floats are provided, they are assumed to be the values left, top, right, bottom, in this sequence. Only content below top and above bottom, etc. will be considered for processing. If a single float value is provided, it will be taken as the value for all 4 border values. A pair of numbers is assumed to specify top and bottom. | ||
|
||
:arg bool page_chunks: if `True` the output will be a list of `Document.page_count` dictionaries (one per page). Each dictionary has the following structure: | ||
|
||
- **"metadata"** - a dictionary consisting of the document's metadata `Document.metadata <https://pymupdf.readthedocs.io/en/latest/document.html#Document.metadata>`_, enriched with additional keys **"file_path"** (the file name), **"page_count"** (number of pages in document), and **"page_number"** (1-based page number). | ||
|
||
- **"toc_items"** - a list of Table of Contents items pointing to this page. Each item of this list has the format `[lvl, title, pagenumber]`, where `lvl` is the hierarchy level, `title` a string and `pagenumber` the 1-based page number. | ||
|
||
- **"tables"** - a list of tables on this page. Each item is a dictionary with keys "bbox", "row_count" and "col_count". Key "bbox" is a `pymupdf.Rect` in tuple format of the table's position on the page. | ||
|
||
- **"images"** - a list of images on the page. This is a copy of the output of page method :meth:`Page.get_image_info`. | ||
|
||
- **"graphics"** - a list of vector graphics rectangles on the page. This is a list of boundary boxes of clustered vector graphics as delivered by method :meth:`Page.cluster_drawings`. | ||
|
||
- **"text"** - page content as |Markdown| text. | ||
|
||
:returns: Either a string of the combined text of all selected document pages or a list of dictionaries. | ||
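As a minimal sketch of how the page-chunk output described above might be consumed, the following condenses each per-page dictionary into a small record. The helper name `summarize_chunks` and the sample data are hypothetical. The sample is a hand-made stand-in that only mirrors the documented keys, so the snippet runs without a PDF file.

```python
def summarize_chunks(chunks):
    """Condense page chunks into one small record per page."""
    records = []
    for chunk in chunks:
        meta = chunk["metadata"]
        records.append(
            {
                "source": meta["file_path"],
                "page": meta["page_number"],  # 1-based page number
                "page_count": meta["page_count"],
                # each toc_items entry has the format [lvl, title, pagenumber]
                "headings": [title for _lvl, title, _pno in chunk["toc_items"]],
                "table_count": len(chunk["tables"]),
                "text_length": len(chunk["text"]),
            }
        )
    return records


# Hypothetical stand-in for the result of
# pymupdf4llm.to_markdown("input.pdf", page_chunks=True)
sample_chunks = [
    {
        "metadata": {"file_path": "input.pdf", "page_count": 2, "page_number": 1},
        "toc_items": [[1, "Introduction", 1]],
        "tables": [{"bbox": (72.0, 100.0, 540.0, 300.0), "row_count": 3, "col_count": 2}],
        "images": [],
        "graphics": [],
        "text": "# Introduction\n\nHello world.\n",
    },
    {
        "metadata": {"file_path": "input.pdf", "page_count": 2, "page_number": 2},
        "toc_items": [],
        "tables": [],
        "images": [],
        "graphics": [],
        "text": "More text on the second page.\n",
    },
]

summary = summarize_chunks(sample_chunks)
for record in summary:
    print(record["page"], record["headings"], record["table_count"])
```

With real output, `sample_chunks` would simply be replaced by the list returned from `to_markdown` when `page_chunks=True`.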
|
||
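The rules for the `margins` argument can be illustrated with a small standalone sketch. This is not the library's own code: the helper names `expand_margins` and `content_box` are hypothetical, and rejecting lengths other than 2 or 4 is an assumption. The second helper shows which area of a page of a given size would remain for processing.

```python
def expand_margins(margins):
    """Expand a margins value to (left, top, right, bottom)."""
    if isinstance(margins, (int, float)):
        # a single number applies to all four borders
        return (float(margins),) * 4
    values = tuple(float(v) for v in margins)
    if len(values) == 2:
        # a pair is taken as (top, bottom); left and right stay at 0
        return (0.0, values[0], 0.0, values[1])
    if len(values) == 4:
        return values
    # assumption: other lengths are treated as an error in this sketch
    raise ValueError("margins must be one, two or four numbers")


def content_box(page_width, page_height, margins=(0, 50, 0, 50)):
    """Return (x0, y0, x1, y1) of the page area inside the margins."""
    left, top, right, bottom = expand_margins(margins)
    return (left, top, page_width - right, page_height - bottom)


# A Letter-sized page (612 x 792 points) with the default margins
box = content_box(612, 792)
print(box)  # (0.0, 50.0, 612.0, 742.0)
```

With the defaults, only content between 50 points from the top and 50 points from the bottom of the page is considered, which typically skips running headers and footers.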
.. method:: LlamaMarkdownReader(*args, **kwargs) | ||
|
||
Create a `pdf_markdown_reader.PDFMarkdownReader` using the `LlamaIndex`_ package. Please note that this package will **not automatically be installed** when installing **pymupdf4llm**. | ||
|
||
For details on the possible arguments, please consult the LlamaIndex documentation [#f1]_. | ||
|
||
:raises: `NotImplementedError`: Please install required `LlamaIndex`_ package. | ||
:returns: a `pdf_markdown_reader.PDFMarkdownReader` and issues message "Successfully imported LlamaIndex". Please note that this method needs several seconds to execute. For details on using the markdown reader please see below. | ||
|
||
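Because LlamaIndex is not installed automatically, calling code may want to handle its absence. The following is a minimal sketch under that assumption. The helper name `make_llama_reader` is hypothetical, and it also tolerates a missing **pymupdf4llm** installation so that it can run anywhere.

```python
def make_llama_reader():
    """Return a LlamaMarkdownReader, or None if a dependency is missing."""
    try:
        import pymupdf4llm
    except ImportError:
        # pymupdf4llm itself is not installed
        return None
    try:
        # may take several seconds, as noted above
        return pymupdf4llm.LlamaMarkdownReader()
    except (NotImplementedError, ImportError):
        # the LlamaIndex package is not installed
        return None


reader = make_llama_reader()
if reader is None:
    print("LlamaIndex support is not available")
else:
    print("Reader created:", type(reader).__name__)
```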
---- | ||
|
||
|
||
.. class:: pdf_markdown_reader.PDFMarkdownReader | ||
|
||
.. method:: load_data(file_path: Union[Path, str], extra_info: Optional[Dict] = None, **load_kwargs: Any) -> List[LlamaIndexDocument] | ||
|
||
This is currently the only method of the markdown reader that you should use to extract markdown data. In particular, ignore the methods `aload_data()` and `lazy_load_data()`. Other methods like `use_doc_meta()` may or may not make sense. For more information, please consult the LlamaIndex documentation [#f1]_. | ||
|
||
Under the hood the method will execute `to_markdown()`. | ||
|
||
:returns: a list of `LlamaIndexDocument` documents - one for each page. | ||
|
||
|
||
.. rubric:: Footnotes | ||
|
||
.. [#f1] `LlamaIndex documentation <https://docs.llamaindex.ai/en/stable/>`_ | ||
.. include:: ../footer.rst | ||
|
||
.. _LlamaIndex: https://pypi.org/project/llama-index/ | ||
|
||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,151 @@ | ||
|
||
.. include:: ../header.rst | ||
|
||
.. _pymupdf4llm:

PyMuPDF4LLM | ||
=========================================================================== | ||
|
||
|PyMuPDF4LLM| aims to make it easier to extract **PDF** content in the format you need for **LLM** & **RAG** environments. It supports :ref:`Markdown extraction <extracting_as_md>` as well as :ref:`LlamaIndex document output <extracting_as_llamaindex>`. | ||
|
||
.. important:: | ||
|
||
You can extend the supported file types to also include **Office** document formats (DOC/DOCX, XLS/XLSX, PPT/PPTX, HWP/HWPX) by :ref:`using PyMuPDF Pro with PyMuPDF4LLM <using_pymupdf4llm_withpymupdfpro>`. | ||
|
||
Features | ||
------------------------------- | ||
|
||
- Support for multi-column pages | ||
- Support for image and vector graphics extraction (and inclusion of references in the MD text) | ||
- Support for page chunking output. | ||
- Direct support for output as :ref:`LlamaIndex Documents <extracting_as_llamaindex>`. | ||
|
||
|
||
Functionality | ||
-------------------- | ||
|
||
- This package converts the pages of a file to text in **Markdown** format using |PyMuPDF|. | ||
|
||
- Standard text and tables are detected, arranged in the correct reading order, and then converted together to **GitHub**-compatible **Markdown** text. | ||
|
||
- Header lines are identified via the font size and appropriately prefixed with one or more `#` tags. | ||
|
||
- Bold, italic, mono-spaced text and code blocks are detected and formatted accordingly. The same applies to ordered and unordered lists. | ||
|
||
- By default, all document pages are processed. If desired, a subset of pages can be specified by providing a list of `0`-based page numbers. | ||
|
||
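The font-size based header detection can be replaced by your own logic through the `hdr_info` parameter of `to_markdown`, which accepts a callable that receives a text span and returns a string of "#" characters. The sketch below is one possible callable. The name `header_prefix`, the size thresholds, the optional `page` argument, and the trailing space after the "#" characters are all assumptions made for illustration. The `"size"` key is the font size entry of a |PyMuPDF| text span dictionary.

```python
def header_prefix(span, page=None):
    """Map a text span's font size to a Markdown header prefix."""
    size = span["size"]  # font size of the span in points
    if size >= 20:
        return "# "
    if size >= 16:
        return "## "
    if size >= 13:
        return "### "
    return ""  # body text: no header tag


# Hypothetical spans, reduced to the one key this sketch uses
title_prefix = header_prefix({"size": 24.0, "text": "Annual Report"})
body_prefix = header_prefix({"size": 10.5, "text": "Some paragraph text"})
print(repr(title_prefix), repr(body_prefix))
```

Such a callable would then be passed as `pymupdf4llm.to_markdown("input.pdf", hdr_info=header_prefix)`.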
|
||
Installation | ||
---------------- | ||
|
||
|
||
Install the package via **pip** with: | ||
|
||
|
||
.. code-block:: bash

    pip install pymupdf4llm

.. _extracting_as_md: | ||
|
||
Extracting a file as **Markdown** | ||
-------------------------------------------------------------- | ||
|
||
To retrieve your document content in **Markdown** simply install the package and then use a couple of lines of **Python** code to get results. | ||
|
||
|
||
|
||
Then in your **Python** script do: | ||
|
||
|
||
.. code-block:: python

    import pymupdf4llm
    md_text = pymupdf4llm.to_markdown("input.pdf")

.. note:: | ||
|
||
Instead of the filename string as above, one can also provide a :ref:`PyMuPDF Document <Document>`. The keyword argument `pages` may be a list of `0`-based page numbers, e.g. `pages=[0, 1]` selects just the first and second pages of the document. | ||
|
||
|
||
To store the **Markdown** output, e.g. as a UTF-8 encoded file, do: | ||
|
||
|
||
.. code-block:: python

    import pathlib
    pathlib.Path("output.md").write_bytes(md_text.encode())

.. _extracting_as_llamaindex: | ||
|
||
Extracting a file as a **LlamaIndex** document | ||
-------------------------------------------------------------- | ||
|
||
|PyMuPDF4LLM| supports direct conversion to a **LlamaIndex** document. A document is first converted into **Markdown** format and then a **LlamaIndex** document is returned as follows: | ||
|
||
|
||
|
||
.. code-block:: python

    import pymupdf4llm
    llama_reader = pymupdf4llm.LlamaMarkdownReader()
    llama_docs = llama_reader.load_data("input.pdf")

.. _using_pymupdf4llm_withpymupdfpro: | ||
|
||
Using with |PyMuPDF Pro| | ||
--------------------------- | ||
|
||
|
||
For **Office** document support, |PyMuPDF4LLM| works seamlessly with |PyMuPDF Pro|. Assuming you have :doc:`../pymupdf-pro` installed you will be able to work with **Office** documents as expected: | ||
|
||
|
||
.. code-block:: python

    import pymupdf4llm
    import pymupdf.pro
    pymupdf.pro.unlock()
    md_text = pymupdf4llm.to_markdown("sample.doc")

As you can see |PyMuPDF Pro| functionality will be available within the |PyMuPDF4LLM| context! | ||
|
||
|
||
|
||
API | ||
------- | ||
|
||
See :ref:`the PyMuPDF4LLM API <pymupdf4llm-api>`. | ||
|
||
Further Resources | ||
------------------- | ||
|
||
|
||
Sample code | ||
~~~~~~~~~~~~~~~ | ||
|
||
- `Command line RAG Chatbot with PyMuPDF <https://github.com/pymupdf/RAG/tree/main/country-capitals>`_ | ||
- `Example of a Browser Application using Langchain and PyMuPDF <https://github.com/pymupdf/RAG/tree/main/GUI>`_ | ||
|
||
|
||
Blogs | ||
~~~~~~~~~~~~~~ | ||
|
||
- `RAG/LLM and PDF: Enhanced Text Extraction <https://artifex.com/blog/rag-llm-and-pdf-enhanced-text-extraction>`_ | ||
- `Creating a RAG Chatbot with ChatGPT and PyMuPDF <https://artifex.com/blog/creating-a-rag-chatbot-with-chatgpt-and-pymupdf>`_ | ||
- `Building a RAG Chatbot GUI with the ChatGPT API and PyMuPDF <https://artifex.com/blog/building-a-rag-chatbot-gui-with-the-chatgpt-api-and-pymupdf>`_ | ||
- `RAG/LLM and PDF: Conversion to Markdown Text with PyMuPDF <https://artifex.com/blog/rag-llm-and-pdf-conversion-to-markdown-text-with-pymupdf>`_ | ||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
.. include:: ../footer.rst |