DOC: Project scope #1743
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,71 @@ | ||
# Scope of pypdf | ||
|
||
What features should pypdf have and which features will it never have? | ||
|
||
pypdf aims at making interactions with PDF documents simpler. Core tasks that | ||
pypdf can perform are: | ||
|
||
* Document manipulation: Splitting, merging, cropping, and transforming the pages of PDF files | ||
* Data Extraction: Extract text and metadata from PDF documents | ||
* Security: Decrypt / encrypt PDF documents | ||
|
||
Typical indicators that something should be done by pypdf: | ||
|
||
* The task needs in-depth knowledge of the PDF format | ||
* It currently requires a lot of code, or is even impossible, with pypdf | ||
* It's neither mentioned in "belongs in user code" nor in "out of scope" | ||
* It already is in the issue list with the [is-feature tag](https://github.com/py-pdf/pypdf/labels/is-feature). | ||
Review comment: It might also be interesting to add a reference to the roadmap and the requested-features thread.
Reply: I've just linked the moonshot extensions. Is this what you were thinking about?
||
|
||
The [moonshot extensions](https://github.com/py-pdf/pypdf/discussions/1181) are | ||
features we would like to have, but are currently not able to add (PRs are | ||
welcome 😉) | ||
|
||
## Belongs in user code | ||
|
||
Here are a few indicators that a feature belongs in user code (and not in pypdf): | ||
|
||
1. The use-case is very specific. Most people will not encounter the same need. | ||
2. It can be done without knowledge of the PDF specification | ||
3. It cannot be done without (non-PDF) domain knowledge. Anything that is | ||
specific to your industry. | ||
|
||
## Out of scope | ||
|
||
While this list is infinitely long, there are a few topics that come up | ||
repeatedly. | ||
|
||
Those topics are out of scope for pypdf. They will never be part of pypdf: | ||
|
||
1. **Optical Character Recognition (OCR)**: OCR is about extracting text from | ||
images. That is very different from the kind of text extraction pypdf is | ||
doing. Please note that images can be within PDF documents. In the case of | ||
scanned documents, the whole page is an image. Some scanners automatically | ||
execute OCR and add a text-layer behind the scanned page. That is something | ||
pypdf can use, if it's present. As a rule-of-thumb: If you cannot mark/copy | ||
the text, it's likely an image. A noteworthy open source OCR project is | ||
[tesseract](https://github.com/tesseract-ocr/tesseract). | ||
2. **Format Conversion**: Converting docx / HTML to PDF or PDF to those formats. | ||
You might want to have a look at [`pdfkit`](https://pypi.org/project/pdfkit/) | ||
and similar projects. | ||
|
||
Out of scope for the moment, but might be added if there are enough contributors: | ||
|
||
* **Digital Signature Support** ([reference | ||
ticket](https://github.com/py-pdf/pypdf/issues/302)): Cryptography is | ||
complicated. It's important to get it right. pypdf currently doesn't have | ||
enough active contributors to properly add digital signature support. For the | ||
moment, [pyhanko](https://pypi.org/project/pyHanko/) seems to be the best | ||
choice. | ||
* **PDF Generation from Scratch**: pypdf can manipulate existing PDF documents, | ||
add annotations, combine / split / crop / transform. It can add blank pages. | ||
But if you want to generate invoices, you might want to have a look at | ||
[`reportlab`](https://pypi.org/project/reportlab/) / | ||
[`fpdf2`](https://pypi.org/project/fpdf2/) or document conversion tools like | ||
[`pdfkit`](https://pypi.org/project/pdfkit/). | ||
* **Replacing words within a PDF**: [Extracting text from PDF is hard](https://pypdf.readthedocs.io/en/stable/user/extract-text.html#why-text-extraction-is-hard). | ||
Replacing text in a reliable way is even harder. For example, one word might | ||
be split into multiple tokens. Hence it's not a simple "search and replace" | ||
in some cases. | ||
* **(Not) Extracting headers/footers/page numbers**: While you can apply | ||
heuristics, there is no way to always make it work. PDF documents simply | ||
don't record which content is a header, footer, or page number. |
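
The "one word might be split into multiple tokens" point above can be illustrated with a small sketch. It does not use pypdf and is not how pypdf represents text internally. The `tokens` list is a hypothetical, simplified model of a text-showing operation in which string fragments are interleaved with numeric positioning offsets, and `naive_replace` and `visible_text` are helper names made up for this illustration.

```python
# Hypothetical, simplified model: the visible text "Hello World" is stored as
# string fragments separated by numeric positioning adjustments.
tokens = ["Hel", -20, "lo ", "Wor", -15, "ld"]


def visible_text(tokens):
    """Join the string fragments, ignoring the numeric offsets."""
    return "".join(t for t in tokens if isinstance(t, str))


def naive_replace(tokens, old, new):
    """Run a plain search-and-replace on each string fragment separately."""
    return [t.replace(old, new) if isinstance(t, str) else t for t in tokens]


before = visible_text(tokens)
after = visible_text(naive_replace(tokens, "World", "Earth"))

# The reader sees "World", but no single fragment contains it, so the
# per-fragment replacement leaves the text unchanged.
print(before)
print(after)
```

A reliable replacement would have to match across fragment boundaries and then decide what to do with the offsets between them, which is part of why this is not a simple "search and replace".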
Review comment: I do not see another page where this information could be added. However, I would have moved it into the user guide section; here, "standard" users may not read it.