diff --git a/DEVELOPER.md b/DEVELOPER.md
deleted file mode 100644
index d5d69b17..00000000
--- a/DEVELOPER.md
+++ /dev/null
@@ -1,57 +0,0 @@
-# Developers
-
-## .editorconfig
-
-Please make sure your editor uses our `.editorconfig` file. It contains rules about our coding styles.
-
-## Development Tools and Tests
-
-Our test related files are located in `tests` folder.
-Tests are written using PHPUnit.
-
-To install (and update) development tools like PHPUnit or PHP-CS-Fixer run:
-
-> make install-dev-tools
-
-Development tools are getting installed in `dev-tools/vendor`.
-Please check `dev-tools/composer.json` for more information about versions etc.
-To run a tool manually you use `dev-tools/vendor/bin`, for instance:
-
-> dev-tools/vendor/bin/php-cs-fixer fix --verbose --dry-run
-
-Below are a few shortcuts to improve your developer experience.
-
-### PHPUnit
-
-To run all tests run:
-
-> make run-phpunit
-
-### PHP-CS-Fixer
-
-To check coding styles run:
-
-> make run-php-cs-fixer
-
-### PHPStan
-
-To run a static code analysis use:
-
-> make run-phpstan
-
-## Base64 encoded PDFs
-
-If working with [Base64](https://en.wikipedia.org/wiki/Base64) encoded PDFs you might want to parse the PDF without saving the file on disk. This sample will parse the Base64 encoded PDF and extract text from each page.
-
-```php
-<?php
-$parser = new \Smalot\PdfParser\Parser();
-$pdf = $parser->parseContent(base64_decode($base64PDF));
-
-$text = $pdf->getText();
-echo $text;
-```
diff --git a/README.md b/README.md
index 293be93f..f277c81d 100644
--- a/README.md
+++ b/README.md
@@ -1,64 +1,58 @@
-# PdfParser #
-
-Pdf Parser, a standalone PHP library, provides various tools to extract data from a PDF file.
+# PDF parser
 
+[![Version](https://poser.pugx.org/smalot/pdfparser/v)](//packagist.org/packages/smalot/pdfparser)
 ![CI](https://github.com/smalot/pdfparser/workflows/CI/badge.svg)
+![CS](https://github.com/smalot/pdfparser/workflows/CS/badge.svg)
 [![Scrutinizer Code Quality](https://scrutinizer-ci.com/g/smalot/pdfparser/badges/quality-score.png?b=master)](https://scrutinizer-ci.com/g/smalot/pdfparser/?branch=master)
-[![Code Coverage](https://scrutinizer-ci.com/g/smalot/pdfparser/badges/coverage.png?b=master)](https://scrutinizer-ci.com/g/smalot/pdfparser/?branch=master)
-[![License](https://poser.pugx.org/smalot/pdfparser/license)](//packagist.org/packages/smalot/pdfparser)
-
-[![Latest Stable Version](https://poser.pugx.org/smalot/pdfparser/v)](//packagist.org/packages/smalot/pdfparser)
-[![Total Downloads](https://poser.pugx.org/smalot/pdfparser/downloads)](//packagist.org/packages/smalot/pdfparser)
-[![Monthly Downloads](https://poser.pugx.org/smalot/pdfparser/d/monthly)](//packagist.org/packages/smalot/pdfparser)
-[![Daily Downloads](https://poser.pugx.org/smalot/pdfparser/d/daily)](//packagist.org/packages/smalot/pdfparser)
-
-Website : [https://www.pdfparser.org](https://www.pdfparser.org/?utm_source=GitHub&utm_medium=website&utm_campaign=GitHub)
-
-Test the API on our [demo page](https://www.pdfparser.org/demo).
+[![Downloads](https://poser.pugx.org/smalot/pdfparser/downloads)](//packagist.org/packages/smalot/pdfparser)
 
-This project is supported by [Actualys](http://www.actualys.com).
+The `smalot/pdfparser` is a standalone PHP package that provides various tools to extract data from PDF files.
 
-## Features ##
+This library is under **active maintenance**.
+There is no active development by the author of this library (at the moment), but we welcome any pull request adding/extending functionality!
 
-Features included :
+## Features
 
 - Load/parse objects and headers
-- Extract meta data (author, description, ...)
+- Extract metadata (author, description, ...)
 - Extract text from ordered pages
-- Support of compressed pdf
+- Support of compressed PDFs
 - Support of MAC OS Roman charset encoding
 - Handling of hexa and octal encoding in text sections
-- PSR-0 compliant ([autoloader](https://github.com/php-fig/fig-standards/blob/master/accepted/PSR-0.md))
-- PSR-1 compliant ([code styling](https://github.com/php-fig/fig-standards/blob/master/accepted/PSR-1-basic-coding-standard.md))
+- Create custom configurations (see [CustomConfig.md](/doc/CustomConfig.md)).
 
-Currently, secured documents are not supported.
+Currently, secured documents and extracting form data are not supported.
 
-**This Library is under active maintenance.**
-There is no active development by the author of this library (at the moment), but we welcome any pull request adding/extending functionality!
+## License
 
-## Documentation ##
+This library is under the [LGPLv3 license](https://github.com/smalot/pdfparser/blob/master/LICENSE.txt).
 
-[Read the documentation on the wiki](https://github.com/smalot/pdfparser/wiki).
+## Install
 
-Original PDF References files can be downloaded from this url: http://www.adobe.com/devnet/pdf/pdf_reference_archive.html
+This library requires PHP 7.1+ since [v1](https://github.com/smalot/pdfparser/releases/tag/v1.0.0).
+You can install it via [Composer](https://getcomposer.org/):
 
-### For developers
+```bash
+composer require smalot/pdfparser
+```
 
-Please read [DEVELOPER.md](DEVELOPER.md) for more information about local development of the PDFParser library. Here you will also find information about how to handle Base63 encoded PDFs.
+In case you can't use Composer, you can include `alt_autoload.php-dist`. It will include all required files automatically.
 
-## Installation
+## Quick example
 
-### Using Composer
+```php
+<?php
+$parser = new \Smalot\PdfParser\Parser();
+$pdf = $parser->parseFile('/path/to/document.pdf');
 
-### Use alternate file loader
+$text = $pdf->getText();
+echo $text;
+```
 
-In case you can't use Composer, you can include `alt_autoload.php-dist` into your project.
-It will load all required files at once.
-Afterwards you can use `PDFParser` class and others.
+Further usage information can be found [here](/doc/Usage.md).
 
-## License ##
+## Documentation
 
-This library is under the [LGPLv3 license](https://github.com/smalot/pdfparser/blob/master/LICENSE.txt).
+Documentation can be found in the [doc](/doc) folder.
diff --git a/doc/CustomConfig.md b/doc/CustomConfig.md
new file mode 100644
index 00000000..34d5c1cf
--- /dev/null
+++ b/doc/CustomConfig.md
@@ -0,0 +1,65 @@
+# Configuring the behavior of the parser
+
+To change the behavior of the parser, create a `Config` object and pass it to the parser.
+In this case, we're setting the font space limit.
+Changing this value can be helpful when `getText()` returns text with too many spaces.
+
+```php
+$config = new \Smalot\PdfParser\Config();
+$config->setFontSpaceLimit(-60);
+$parser = new \Smalot\PdfParser\Parser([], $config);
+$pdf = $parser->parseFile('document.pdf');
+// output extracted text
+// echo $pdf->getText();
+```
+
+## Config options overview
+
+The `Config` class has the following options:
+
+| Option                   | Type    | Default         | Description |
+|--------------------------|---------|-----------------|-------------|
+| `setDecodeMemoryLimit`   | Integer | `0`             | If parsing fails because of memory exhaustion, you can set a lower memory limit for decoding operations. |
+| `setFontSpaceLimit`      | Integer | `-50`           | Changing the font space limit can be helpful when `getText()` returns text with too many spaces. |
+| `setHorizontalOffset`    | String  | ` `             | When words are broken up or when the structure of a table is not preserved, you may get better results when adapting `setHorizontalOffset`. |
+| `setPdfWhitespaces`      | String  | `\0\t\n\f\r `   | |
+| `setPdfWhitespacesRegex` | String  | `[\0\t\n\f\r ]` | |
+| `setRetainImageContent`  | Boolean | `true`          | If parsing fails because of memory exhaustion, you can set the value to `false`. It won't retain image content anymore, but will use less memory too. |
+
+
+## option setDecodeMemoryLimit + setRetainImageContent (manage memory usage)
+
+If parsing fails because of memory exhaustion, you can use the following options.
+
+```php
+$config = new \Smalot\PdfParser\Config();
+// Whether to retain raw image data as content or discard it to save memory
+$config->setRetainImageContent(false);
+// Memory limit to use when de-compressing files, in bytes
+$config->setDecodeMemoryLimit(1000000);
+$parser = new \Smalot\PdfParser\Parser([], $config);
+```
+
+## option setHorizontalOffset
+
+When words are broken up or when the structure of a table is not preserved, you can use `setHorizontalOffset`.
+
+```php
+$config = new \Smalot\PdfParser\Config();
+// An empty string can prevent words from breaking up
+$config->setHorizontalOffset('');
+// Alternatively, a tab can help preserve the structure of your document
+$config->setHorizontalOffset("\t");
+$parser = new \Smalot\PdfParser\Parser([], $config);
+```
+
+## option setFontSpaceLimit
+
+Changing the font space limit can be helpful when `getText()` returns text with too many spaces.
+
+```php
+$config = new \Smalot\PdfParser\Config();
+$config->setFontSpaceLimit(-60);
+$parser = new \Smalot\PdfParser\Parser([], $config);
+$pdf = $parser->parseFile('document.pdf');
+```
diff --git a/doc/Developer.md b/doc/Developer.md
new file mode 100644
index 00000000..e108a82e
--- /dev/null
+++ b/doc/Developer.md
@@ -0,0 +1,57 @@
+# Developers
+
+Here you will find information about our development tools and how to use them.
+
+## .editorconfig
+
+Please make sure your editor uses our `.editorconfig` file. It contains rules about our coding styles.
+
+## GitHub Action Workflows
+
+We use GitHub Actions to run our continuous integration as well as other tasks after pushing changes.
+You will find related files in `.github/workflows/`.
+
+## Development Tools and Tests
+
+Our test-related files are located in the `tests` folder.
+Tests are written using PHPUnit.
+
+To install (and update) development tools like PHPUnit or PHP-CS-Fixer, run:
+
+```bash
+make install-dev-tools
+```
+
+Development tools are installed in `dev-tools/vendor`.
+Please check `dev-tools/composer.json` for more information about versions etc.
+To run a tool manually, use `dev-tools/vendor/bin`, for instance:
+
+```bash
+dev-tools/vendor/bin/php-cs-fixer fix --verbose --dry-run
+```
+
+Below are a few shortcuts to improve your developer experience.
+
+### PHPUnit
+
+To run all tests, run:
+
+```bash
+make run-phpunit
+```
+
+### PHP-CS-Fixer
+
+To check coding styles, run:
+
+```bash
+make run-php-cs-fixer
+```
+
+### PHPStan
+
+To run a static code analysis, use:
+
+```bash
+make run-phpstan
+```
diff --git a/doc/Usage.md b/doc/Usage.md
new file mode 100644
index 00000000..db8528af
--- /dev/null
+++ b/doc/Usage.md
@@ -0,0 +1,52 @@
+# Usage
+
+First create a parser object and point it to a file.
+
+```php
+$parser = new \Smalot\PdfParser\Parser();
+
+$pdf = $parser->parseFile('document.pdf');
+// ... or ...
+$pdf = $parser->parseContent(file_get_contents('document.pdf'));
+```
+
+## Extract text
+
+A common scenario is to extract text.
+
+```php
+$text = $pdf->getText();
+
+// or extract the text of a specific page (in this case the first page)
+$text = $pdf->getPages()[0]->getText();
+```
+
+## Extract metadata
+
+You can also extract metadata. The available data varies from PDF to PDF.
+
+```php
+$metaData = $pdf->getDetails();
+
+// Array
+// (
+//     [Producer] => Adobe Acrobat
+//     [CreatedOn] => 2022-01-28T16:36:11+00:00
+//     [Pages] => 35
+// )
+```
+
+## Read Base64 encoded PDFs
+
+If working with [Base64](https://en.wikipedia.org/wiki/Base64) encoded PDFs, you might want to parse the PDF without saving the file to disk.
+This sample will parse the Base64 encoded PDF and extract text from each page.
+
+```php
+<?php
+
+$parser = new \Smalot\PdfParser\Parser();
+$pdf = $parser->parseContent(base64_decode($base64PDF));
+
+$text = $pdf->getText();
+echo $text;
+```
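The Base64 sample in `doc/Usage.md` passes the result of `base64_decode()` straight to `parseContent()`. A minimal sketch of checking the payload first might look like the following. It assumes the decoded data should begin with the `%PDF-` header (some real-world files carry leading bytes before it, so the check may need relaxing), and `decodeBase64Pdf` is a hypothetical helper name, not part of `smalot/pdfparser`.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of smalot/pdfparser): decodes a Base64
 * string and checks that the result looks like a PDF before parsing.
 */
function decodeBase64Pdf(string $base64Pdf): string
{
    // With the strict flag, base64_decode() returns false when the input
    // contains characters outside the Base64 alphabet.
    $binary = base64_decode($base64Pdf, true);
    if (false === $binary) {
        throw new InvalidArgumentException('Input is not valid Base64.');
    }

    // Assumption: the payload starts with the "%PDF-" header.
    if (0 !== strncmp($binary, '%PDF-', 5)) {
        throw new InvalidArgumentException('Decoded data does not start with a PDF header.');
    }

    return $binary;
}

// Usage with the parser (requires smalot/pdfparser to be installed):
// $parser = new \Smalot\PdfParser\Parser();
// $pdf = $parser->parseContent(decodeBase64Pdf($base64PDF));
// echo $pdf->getText();
```

Failing early with an exception keeps malformed input away from the parser, so a bad upload surfaces as a clear error rather than an opaque parse failure.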