Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC: Project scope #1743

Merged
merged 4 commits into from
Mar 26, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions docs/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -77,6 +77,7 @@ You can contribute to `pypdf on GitHub <https://github.com/py-pdf/pypdf>`_.
meta/project-governance
meta/history
meta/CONTRIBUTORS
meta/scope-of-pypdf
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not see other pages where such information could be added, however I would have raised this into user guide section: inhere the "standard" users may not read it.

meta/comparisons
meta/faq

Expand Down
71 changes: 71 additions & 0 deletions docs/meta/scope-of-pypdf.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,71 @@
# Scope of pypdf

What features should pypdf have and which features will it never have?

pypdf aims at making interactions with PDF documents simpler. Core tasks that
pypdf can perform are:

* Document manipulation: Splitting, merging, cropping, and transforming the pages of PDF files
* Data Extraction: Extract text and metadata from PDF documents
* Security: Decrypt / encrypt PDF documents

Typical indicators that something should be done by pypdf:

* The task needs in-depth knowledge of the PDF format
* It currently requires a lot of code or even is impossible to do with pypdf
* It's neither mentioned in "belongs in user code" nor in "out of scope"
* It already is in the issue list with the [is-feature tag](https://github.com/py-pdf/pypdf/labels/is-feature).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be interresting also to add ref to the roadmap and requested feature thread

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just linked the moonshot extensions. Is this what you were thinking about?


The [moonshot extensions](https://github.com/py-pdf/pypdf/discussions/1181) are
features we would like to have, but are currently not able to add (PRs are
welcome 😉)

## Belongs in user code

Here are a few indicators that a feature belongs into users code (and not into pypdf):

1. The use-case is very specific. Most people will not encounter the same need.
2. It can be done without knowledge of the PDF specification
3. It cannot be done without (non-pdf) domain knowledge. Anything that is
specific to your industry.

## Out of scope

While this list is infinitely long, there are a few topics that are asked
multiple times.

Those topics are out of scope for pypdf. They will never be part of pypdf:

1. **Optical Character Recognition (OCR)**: OCR is about extracting text from
images. That is very different from the kind of text extraction pypdf is
doing. Please note that images can be within PDF documents. In the case of
scanned documents, the whole page is an image. Some scanners automatically
execute OCR and add a text-layer behind the scanned page. That is something
pypdf can use, if it's present. As a rule-of-thumb: If you cannot mark/copy
the text, it's likely an image. A noteworthy open source OCR project is
[tesseract](https://github.com/tesseract-ocr/tesseract).
2. **Format Conversion**: Converting docx / HTML to PDF or PDF to those formats.
You might want to have a look at [`pdfkit`](https://pypi.org/project/pdfkit/)
and similar projects.

Out of scope for the moment, but might be added if there are enough contributors:

* **Digital Signature Support** ([reference
ticket](https://github.com/py-pdf/pypdf/issues/302)): Cryptography is
complicated. It's important to get it right. pypdf currently doesn't have
enough active contributors to properly add digital signautre support. For the
moment, [pyhanko](https://pypi.org/project/pyHanko/) seems to be the best
choice.
* **PDF Generation from Scratch**: pypdf can manipulate existing PDF documents,
add annotations, combine / split / crop / transform. It can add blank pages.
But if you want to generate invoices, you might want to have a look at
[`reportlab`](https://pypi.org/project/reportlab/) /
[`fpdf2`](https://pypi.org/project/fpdf2/) or document conversion tools like
[`pdfkit`](https://pypi.org/project/pdfkit/).
* **Replacing words within a PDF**: [Extracting text from PDF is hard](https://pypdf.readthedocs.io/en/stable/user/extract-text.html#why-text-extraction-is-hard).
Replacing text in a reliable way is even harder. For example, one word might
be split into multiple tokens. Hence it's not a simple "search and replace"
in some cases.
* **(Not) Extracting headers/footers/page numbers**: While you can apply
heuristics, there is no way to always make it work. PDF documents simply
don't contain the information what a header/footer/page number is.