py-pdf/benchmarks

Latest commit: 24c51dd · Oct 31, 2023


PDF Library Benchmarks

This benchmark is about reading pure PDF files: not scanned documents, and not documents to which OCR has been applied.

Benchmarking machine

Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz

Input Documents

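| # | Name | File Size | Pages |
|---|------|-----------|-------|
| 1 | 2201.00214 | 2.4MiB | 22 |
| 2 | GeoTopo-book | 5.1MiB | 117 |
| 3 | 2201.00151 | 1.5MiB | 12 |
| 4 | 1707.09725 | 7.0MiB | 134 |
| 5 | 2201.00021 | 2.6MiB | 10 |
| 6 | 2201.00037 | 2.9MiB | 33 |
| 7 | 2201.00069 | 14.7MiB | 15 |
| 8 | 2201.00178 | 2.3MiB | 16 |
| 9 | 2201.00201 | 1.3MiB | 9 |
| 10 | 1602.06541 | 2.9MiB | 16 |
| 11 | 2201.00200 | 284.8KiB | 7 |
| 12 | 2201.00022 | 1.1MiB | 11 |
| 13 | 2201.00029 | 797.6KiB | 12 |
| 14 | 1601.03642 | 1004.9KiB | 8 |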

Libraries

| Name | Last PyPI Release | License | Version | Dependencies |
|------|-------------------|---------|---------|--------------|
| Borb | 2023-06-23 | AGPL/Commercial | 2.1.16 | |
| pypdfium2 | 2023-07-04 | Apache-2.0 or BSD-3-Clause | 4.18.0 | PDFium (Foxit/Google) |
| pdfminer.six | 2022-11-05 | MIT/X | 20221105 | |
| pdfplumber | 2023-07-29 | MIT | 0.10.2 | pdfminer.six |
| pdfrw | 2017-09-18 | MIT | 0.4 | |
| pdftotext | - | GPL | 0.86.1 | build-essential libpoppler-cpp-dev pkg-config python3-dev |
| PyMuPDF | 2023-08-24 | GNU AFFERO GPL 3.0 / Commercial | 1.23.1 | MuPDF |
| pypdf | 2023-08-26 | BSD 3-Clause | 3.15.4 | |
| Tika | 2023-01-01 | Apache v2 | 2.6.0 | Apache Tika |
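When reproducing a comparison like this one, it helps to record which versions are actually installed in the environment. The following is a minimal sketch using the standard library's `importlib.metadata`. The distribution names in `DISTRIBUTIONS` are assumed to match the PyPI names of the libraries above, and `installed_versions` is a hypothetical helper, not part of the benchmark itself.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed PyPI distribution names for the libraries listed in the table.
DISTRIBUTIONS = [
    "borb", "pypdfium2", "pdfminer.six", "pdfplumber", "pdfrw",
    "pdftotext", "PyMuPDF", "pypdf", "tika",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    result = {}
    for name in names:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            result[name] = None
    return result


if __name__ == "__main__":
    for name, ver in installed_versions(DISTRIBUTIONS).items():
        print(f"{name}: {ver or 'not installed'}")
```

Comparing this output with the Version column shows whether local results can be expected to line up with the numbers reported here.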

Text Extraction Speed

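| # | Library | Average | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---------|---------|---|---|---|---|---|---|---|---|---|----|----|----|----|----|
| 1 | PyMuPDF | 0.1s | 0.4s | 0.2s | 0.2s | 0.2s | 0.0s | 0.1s | 0.0s | 0.0s | 0.0s | 0.0s | 0.0s | 0.0s | 0.0s | 0.0s |
| 2 | pypdfium2 | 0.2s | 1.9s | 0.2s | 0.2s | 0.2s | 0.0s | 0.1s | 0.1s | 0.1s | 0.0s | 0.1s | 0.0s | 0.0s | 0.0s | 0.0s |
| 3 | pdftotext | 0.3s | 0.8s | 1.0s | 0.3s | 0.8s | 0.1s | 0.2s | 0.2s | 0.1s | 0.0s | 0.1s | 0.1s | 0.1s | 0.0s | 0.0s |
| 4 | Tika | 1.1s | 12.9s | 0.9s | 0.6s | 0.4s | 0.1s | 0.3s | 0.2s | 0.1s | 0.1s | 0.1s | 0.1s | 0.1s | 0.0s | 0.0s |
| 5 | pypdf | 2.6s | 18.7s | 4.8s | 5.3s | 2.3s | 0.7s | 0.9s | 0.4s | 0.5s | 0.3s | 0.6s | 0.5s | 0.4s | 0.4s | 0.2s |
| 6 | pdfminer.six | 4.5s | 26.0s | 12.9s | 8.0s | 4.6s | 1.3s | 2.1s | 1.0s | 1.2s | 0.8s | 1.5s | 0.9s | 0.9s | 0.6s | 0.6s |
| 7 | pdfplumber | 6.7s | 41.7s | 10.9s | 11.5s | 8.4s | 2.4s | 4.3s | 2.0s | 1.9s | 1.9s | 2.7s | 1.8s | 1.7s | 1.0s | 1.2s |
| 8 | Borb | 34.7s | 111.2s | 105.0s | 1.4s | 87.2s | 21.1s | 7.4s | 83.5s | 16.4s | 20.3s | 5.4s | 3.4s | 18.8s | 3.2s | 2.1s |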
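Wall-clock timings like these can be collected with a small harness. The sketch below is only an illustration: this section does not say whether each figure is a single run or the best of several, so the `repeats` parameter is an assumption, and `time_call`, `fake_extract`, and the file name are hypothetical stand-ins for a real extraction call.

```python
import time


def time_call(func, *args, repeats=3):
    """Run func(*args) `repeats` times; return (last result, best seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        # Keep the fastest run to reduce noise from other processes.
        best = min(best, time.perf_counter() - start)
    return result, best


def fake_extract(path):
    # Stand-in for a real extractor; a real run would open `path`
    # with one of the PDF libraries and return its text.
    return f"text of {path}"


text, seconds = time_call(fake_extract, "2201.00214.pdf")
print(f"{seconds:.1f}s")
```

Using `time.perf_counter` rather than `time.time` gives a monotonic, high-resolution clock, which matters for the many entries in this table that round to 0.0s.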

Image Extraction Speed

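| # | Library | Average | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---------|---------|---|---|---|---|---|---|---|---|---|----|----|----|----|----|
| 1 | PyMuPDF | 0.5s | 0.3s | 0.5s | 0.0s | 1.7s | 0.4s | 0.0s | 3.2s | 0.4s | 0.4s | 0.1s | 0.0s | 0.3s | 0.2s | 0.0s |
| 2 | pypdf | 2.8s | 16.4s | 2.1s | 0.8s | 9.2s | 1.1s | 0.0s | 6.7s | 0.9s | 0.9s | 0.4s | 0.0s | 0.7s | 0.2s | 0.1s |
| 3 | pdfminer.six | 6.5s | 31.8s | 13.7s | 9.2s | 24.0s | 1.5s | 2.3s | 1.5s | 1.4s | 0.9s | 1.5s | 0.9s | 1.0s | 0.6s | 0.5s |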
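The Average column appears to be the arithmetic mean over the 14 documents. That is an inference from the numbers, not something the source states. The sketch below checks it for the pypdf row and also prints the median, which is less sensitive to a single slow document such as document 1.

```python
from statistics import mean, median

# pypdf image-extraction times in seconds for documents 1-14,
# copied from the table above.
pypdf_image_times = [
    16.4, 2.1, 0.8, 9.2, 1.1, 0.0, 6.7,
    0.9, 0.9, 0.4, 0.0, 0.7, 0.2, 0.1,
]

avg = mean(pypdf_image_times)
med = median(pypdf_image_times)

print(f"mean:   {avg:.1f}s")   # mean:   2.8s
print(f"median: {med:.2f}s")   # median: 0.85s
```

The mean matches the 2.8s in the table, while the median is well under a second, so the ranking by average is driven largely by a few expensive documents.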

Watermarking Speed

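| # | Library | Average | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---------|---------|---|---|---|---|---|---|---|---|---|----|----|----|----|----|
| 1 | PyMuPDF | 0.0s | 0.0s | 0.1s | 0.0s | 0.1s | 0.0s | 0.0s | 0.0s | 0.0s | 0.0s | 0.0s | 0.0s | 0.0s | 0.0s | 0.0s |
| 2 | pdfrw | 0.1s | 0.0s | 0.4s | 0.0s | 0.3s | 0.1s | 0.1s | 0.1s | 0.1s | 0.1s | 0.1s | 0.0s | 0.1s | 0.0s | 0.0s |
| 3 | pypdf | 0.4s | 0.6s | 1.7s | 0.4s | 0.9s | 0.2s | 0.3s | 0.4s | 0.3s | 0.2s | 0.3s | 0.1s | 0.2s | 0.0s | 0.2s |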

Watermarking File Size

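| # | Library | Average | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---------|---------|---|---|---|---|---|---|---|---|---|----|----|----|----|----|
| 1 | pdfrw | 3.4MB | 2.5MB | 5.7MB | 1.6MB | 7.3MB | 2.7MB | 3.1MB | 15.4MB | 2.4MB | 1.3MB | 3.0MB | 0.3MB | 1.1MB | 0.8MB | 1.0MB |
| 2 | pypdf | 3.5MB | 2.5MB | 5.7MB | 1.6MB | 7.3MB | 2.7MB | 3.1MB | 15.4MB | 2.4MB | 1.3MB | 3.0MB | 0.3MB | 1.1MB | 0.8MB | 1.0MB |
| 3 | PyMuPDF | 3.7MB | 2.7MB | 6.8MB | 1.7MB | 8.5MB | 2.8MB | 3.4MB | 15.5MB | 2.5MB | 1.4MB | 3.2MB | 0.3MB | 1.2MB | 0.9MB | 1.0MB |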
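The input sizes are listed in MiB while the output sizes here are in MB. Assuming both labels are used in their strict senses (1 MiB = 1024² bytes, 1 MB = 1000² bytes), which the source does not confirm, the two tables can be compared after a unit conversion. A minimal sketch, with `mib_to_mb` as a hypothetical helper:

```python
MIB = 1024 ** 2  # bytes in one mebibyte
MB = 1000 ** 2   # bytes in one megabyte


def mib_to_mb(size_mib):
    """Convert a size given in MiB to MB."""
    return size_mib * MIB / MB


# Input sizes in MiB for documents 1, 4 and 7, from the Input Documents table.
for doc, size_mib in [(1, 2.4), (4, 7.0), (7, 14.7)]:
    print(f"document {doc}: {size_mib} MiB = {mib_to_mb(size_mib):.1f} MB")
```

Under that assumption the inputs come out at 2.5 MB, 7.3 MB and 15.4 MB, the same as the pdfrw and pypdf outputs for those documents, which suggests the watermark adds little to the file size for these two libraries.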

Text Extraction Quality

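| # | Library | Average | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---------|---------|---|---|---|---|---|---|---|---|---|----|----|----|----|----|
| 1 | pypdfium2 | 98% | 99% | 97% | 94% | 99% | 98% | 96% | 99% | 98% | 99% | 99% | 98% | 98% | 99% | 99% |
| 2 | pypdf | 97% | 98% | 93% | 94% | 98% | 98% | 96% | 97% | 98% | 99% | 99% | 98% | 98% | 98% | 99% |
| 3 | PyMuPDF | 97% | 98% | 96% | 93% | 97% | 98% | 96% | 98% | 98% | 98% | 98% | 97% | 97% | 98% | 99% |
| 4 | Tika | 96% | 99% | 98% | 92% | 97% | 98% | 96% | 93% | 97% | 98% | 93% | 98% | 93% | 98% | 96% |
| 5 | pdftotext | 93% | 96% | 93% | 91% | 94% | 92% | 96% | 96% | 96% | 97% | 83% | 94% | 96% | 96% | 79% |
| 6 | pdfminer.six | 90% | 95% | 79% | 86% | 92% | 86% | 93% | 95% | 93% | 92% | 92% | 93% | 86% | 98% | 86% |
| 7 | pdfplumber | 75% | 94% | 84% | 61% | 97% | 61% | 93% | 61% | 89% | 57% | 59% | 67% | 59% | 98% | 67% |
| 8 | Borb | 45% | 70% | 79% | 0% | 40% | 48% | 92% | 0% | 64% | 51% | 41% | 55% | 43% | 0% | 53% |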
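This section does not say how the quality percentages are computed. Purely as an illustration of how extracted text can be scored against a reference, the sketch below uses `difflib.SequenceMatcher` from the standard library. The metric, the whitespace normalisation, and the name `similarity_percent` are all assumptions and may differ from what the benchmark actually uses.

```python
from difflib import SequenceMatcher


def similarity_percent(extracted, reference):
    """Return a 0-100 similarity score between extracted and reference text."""
    # Collapse runs of whitespace so pure layout differences are not penalised.
    a = " ".join(extracted.split())
    b = " ".join(reference.split())
    if not a and not b:
        return 100.0
    return 100.0 * SequenceMatcher(None, a, b, autojunk=False).ratio()


reference = "Portable Document Format"
print(similarity_percent("Portable  Document\nFormat", reference))  # 100.0
print(similarity_percent("", reference))                            # 0.0
```

With a score of this kind, an extractor that returns no text at all for a document scores 0%, and small character-level errors lower the score gradually.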