smalot · k00ni · Mar 7, 2022 · Feb 12, 2022 · Feb 14, 2022 · Feb 14, 2022
diff --git a/DEVELOPER.md b/DEVELOPER.md
diff --git a/README.md b/README.md
@@ -1,64 +1,58 @@
-# PdfParser #
-
-Pdf Parser, a standalone PHP library, provides various tools to extract data from a PDF file.
+# PDF parser
 
+[![Version](https://poser.pugx.org/smalot/pdfparser/v)](//packagist.org/packages/smalot/pdfparser)
 ![CI](https://github.com/smalot/pdfparser/workflows/CI/badge.svg)
+![CS](https://github.com/smalot/pdfparser/workflows/CS/badge.svg)
 [![Scrutinizer Code Quality](https://scrutinizer-ci.com/g/smalot/pdfparser/badges/quality-score.png?b=master)](https://scrutinizer-ci.com/g/smalot/pdfparser/?branch=master)
-[![Code Coverage](https://scrutinizer-ci.com/g/smalot/pdfparser/badges/coverage.png?b=master)](https://scrutinizer-ci.com/g/smalot/pdfparser/?branch=master)
-[![License](https://poser.pugx.org/smalot/pdfparser/license)](//packagist.org/packages/smalot/pdfparser)
-
-[![Latest Stable Version](https://poser.pugx.org/smalot/pdfparser/v)](//packagist.org/packages/smalot/pdfparser)
-[![Total Downloads](https://poser.pugx.org/smalot/pdfparser/downloads)](//packagist.org/packages/smalot/pdfparser)
-[![Monthly Downloads](https://poser.pugx.org/smalot/pdfparser/d/monthly)](//packagist.org/packages/smalot/pdfparser)
-[![Daily Downloads](https://poser.pugx.org/smalot/pdfparser/d/daily)](//packagist.org/packages/smalot/pdfparser)
-
-Website : [https://www.pdfparser.org](https://www.pdfparser.org/?utm_source=GitHub&utm_medium=website&utm_campaign=GitHub)
-
-Test the API on our [demo page](https://www.pdfparser.org/demo).
+[![Downloads](https://poser.pugx.org/smalot/pdfparser/downloads)](//packagist.org/packages/smalot/pdfparser)
 
-This project is supported by [Actualys](http://www.actualys.com).
+`smalot/pdfparser` is a standalone PHP package that provides various tools to extract data from PDF files.
 
-## Features ##
+This library is under **active maintenance**.
+There is no active development by the author of this library (at the moment), but we welcome any pull request adding/extending functionality!
 
-Features included :
+## Features
 
 - Load/parse objects and headers
-- Extract meta data (author, description, ...)
+- Extract metadata (author, description, ...)
 - Extract text from ordered pages
-- Support of compressed pdf
+- Support of compressed PDFs
 - Support of MAC OS Roman charset encoding
 - Handling of hexa and octal encoding in text sections
-- PSR-0 compliant ([autoloader](https://github.com/php-fig/fig-standards/blob/master/accepted/PSR-0.md))
-- PSR-1 compliant ([code styling](https://github.com/php-fig/fig-standards/blob/master/accepted/PSR-1-basic-coding-standard.md))
+- Create custom configurations (see [CustomConfig.md](/doc/CustomConfig.md))
 
-Currently, secured documents are not supported.
+Currently, secured documents and form data extraction are not supported.
 
-**This Library is under active maintenance.**
-There is no active development by the author of this library (at the moment), but we welcome any pull request adding/extending functionality!
+## License
 
-## Documentation ##
+This library is under the [LGPLv3 license](https://github.com/smalot/pdfparser/blob/master/LICENSE.txt).
 
-[Read the documentation on the wiki](https://github.com/smalot/pdfparser/wiki).
+## Install
 
-Original PDF References files can be downloaded from this url: http://www.adobe.com/devnet/pdf/pdf_reference_archive.html
+This library requires PHP 7.1+ since [v1](https://github.com/smalot/pdfparser/releases/tag/v1.0.0).
+You can install it via [Composer](https://getcomposer.org/):
 
-### For developers
+```bash
+composer require smalot/pdfparser
+```
 
-Please read [DEVELOPER.md](DEVELOPER.md) for more information about local development of the PDFParser library. Here you will also find information about how to handle Base63 encoded PDFs.
+In case you can't use Composer, you can include `alt_autoload.php-dist`. It will include all required files automatically.
 
-## Installation
+## Quick example
 
-### Using Composer
+```php
+<?php
 
-* Obtain [Composer](https://getcomposer.org)
-* Run `composer require smalot/pdfparser`
+// Parse PDF file and build necessary objects.
+$parser = new \Smalot\PdfParser\Parser();
+$pdf = $parser->parseFile('/path/to/document.pdf');
 
-### Use alternate file loader
+$text = $pdf->getText();
+echo $text;
+```
 
-In case you can't use Composer, you can include `alt_autoload.php-dist` into your project.
-It will load all required files at once.
-Afterwards you can use `PDFParser` class and others.
+Further usage information can be found [here](/doc/Usage.md).
 
-## License ##
+## Documentation
 
-This library is under the [LGPLv3 license](https://github.com/smalot/pdfparser/blob/master/LICENSE.txt).
+Documentation can be found in the [doc](/doc) folder.
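Note on the Install section of the README above: a script that has to work both with and without Composer could choose its autoloader at runtime. The following is only a sketch. The function name `findPdfParserAutoloader` is hypothetical, and it assumes Composer's default `vendor/` directory and a copy of `alt_autoload.php-dist` placed directly in the project root.

```php
<?php

/**
 * Returns the autoloader file to include for a project rooted at $root.
 *
 * Assumptions (not taken from the library): Composer installs into
 * "$root/vendor", and alt_autoload.php-dist was copied into $root.
 */
function findPdfParserAutoloader(string $root): string
{
    $candidates = [
        $root . '/vendor/autoload.php',   // preferred: Composer autoloader
        $root . '/alt_autoload.php-dist', // fallback: alternate file loader
    ];

    foreach ($candidates as $file) {
        if (is_file($file)) {
            return $file;
        }
    }

    throw new RuntimeException('No autoloader found in ' . $root);
}

// Typical use at the top of a script:
// require_once findPdfParserAutoloader(__DIR__);
```

The Composer autoloader is tried first, so the alternate loader is only used when no `vendor/autoload.php` exists.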
diff --git a/doc/CustomConfig.md b/doc/CustomConfig.md
@@ -0,0 +1,65 @@
+# Configuring the behavior of the parser
+
+To change the behavior of the parser, create a `Config` object and pass it to the parser.
+The example below sets the font space limit.
+Changing this value can be helpful when `getText()` returns text with too many spaces.
+
+```php
+$config = new \Smalot\PdfParser\Config();
+$config->setFontSpaceLimit(-60);
+$parser = new \Smalot\PdfParser\Parser([], $config);
+$pdf = $parser->parseFile('document.pdf');
+// output extracted text
+// echo $pdf->getText();
+```
+
+## Config options overview
+
+The `Config` class has the following options:
+
+| Option                   | Type    | Default         | Description                                                                                                                                          |
+|--------------------------|---------|-----------------|------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `setDecodeMemoryLimit`   | Integer | `0`             | If parsing fails because of memory exhaustion, you can set a lower memory limit for decoding operations.                                             |
+| `setFontSpaceLimit`      | Integer | `-50`           | Changing the font space limit can be helpful when `getText()` returns text with too many spaces.                                                     |
+| `setHorizontalOffset`    | String  | ` `             | When words are broken up or when the structure of a table is not preserved, you may get better results when adapting `setHorizontalOffset`.          |
+| `setPdfWhitespaces`      | String  | `\0\t\n\f\r `   |                                                                                                                                                      |
+| `setPdfWhitespacesRegex` | String  | `[\0\t\n\f\r ]` |                                                                                                                                                      |
+| `setRetainImageContent`  | Boolean | `true`          | If parsing fails because of memory exhaustion, you can set the value to `false`. It won't retain image content anymore, but will use less memory.    |
+
+
+## option setDecodeMemoryLimit + setRetainImageContent (manage memory usage)
+
+If parsing fails because of memory exhaustion, you can use the following options.
+
+```php
+$config = new \Smalot\PdfParser\Config();
+// Whether to retain raw image data as content or discard it to save memory
+$config->setRetainImageContent(false);
+// Memory limit to use when de-compressing files, in bytes
+$config->setDecodeMemoryLimit(1000000);
+$parser = new \Smalot\PdfParser\Parser([], $config);
+```
+
+## option setHorizontalOffset
+
+When words are broken up or when the structure of a table is not preserved, you can use `setHorizontalOffset`.
+
+```php
+$config = new \Smalot\PdfParser\Config();
+// An empty string can prevent words from breaking up
+$config->setHorizontalOffset('');
+// Alternatively, a tab can help preserve the structure of your document (this call overrides the one above)
+$config->setHorizontalOffset("\t");
+$parser = new \Smalot\PdfParser\Parser([], $config);
+```
+
+## option setFontSpaceLimit
+
+Changing the font space limit can be helpful when `getText()` returns text with too many spaces.
+
+```php
+$config = new \Smalot\PdfParser\Config();
+$config->setFontSpaceLimit(-60);
+$parser = new \Smalot\PdfParser\Parser([], $config);
+$pdf = $parser->parseFile('document.pdf');
+```
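Note on the `setFontSpaceLimit` section of `doc/CustomConfig.md` above: the documentation does not say how to find a suitable value. One possible approach, sketched here under assumptions, is to extract the text with several candidate limits and compare how many spaces each result contains. The function `pickFontSpaceLimit` and the callback `$extract` are hypothetical names, and "fewest spaces" is only a rough heuristic.

```php
<?php

/**
 * Returns the candidate limit whose extracted text contains the fewest spaces.
 *
 * $extract is a hypothetical callback: it receives a font space limit and
 * returns the text extracted with that limit. Ties go to the earlier candidate.
 */
function pickFontSpaceLimit(callable $extract, array $candidates): int
{
    if ($candidates === []) {
        throw new InvalidArgumentException('At least one candidate limit is required.');
    }

    $best = null;
    $bestScore = PHP_INT_MAX;

    foreach ($candidates as $limit) {
        // Score = number of space characters in the extracted text.
        $score = substr_count($extract($limit), ' ');
        if ($score < $bestScore) {
            $best = $limit;
            $bestScore = $score;
        }
    }

    return $best;
}

// With the library installed, the callback could wrap the calls shown above:
// $extract = function (int $limit): string {
//     $config = new \Smalot\PdfParser\Config();
//     $config->setFontSpaceLimit($limit);
//     $parser = new \Smalot\PdfParser\Parser([], $config);
//     return $parser->parseFile('document.pdf')->getText();
// };
// echo pickFontSpaceLimit($extract, [-50, -60, -70]);
```

A limit that yields the fewest spaces may also merge words that belong apart, so the chosen value should still be checked against the actual output.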
diff --git a/doc/Developer.md b/doc/Developer.md
@@ -0,0 +1,57 @@
+# Developers
+
+Here you will find information about our development tools and how to use them.
+
+## .editorconfig
+
+Please make sure your editor uses our `.editorconfig` file. It contains rules about our coding styles.
+
+## GitHub Action Workflows
+
+We use GitHub Actions to run our continuous integration as well as other tasks after pushing changes.
+You will find related files in `.github/workflows/`.
+
+## Development Tools and Tests
+
+Our test-related files are located in the `tests` folder.
+Tests are written using PHPUnit.
+
+To install (and update) development tools like PHPUnit or PHP-CS-Fixer, run:
+
+```bash
+make install-dev-tools
+```
+
+Development tools are installed in `dev-tools/vendor`.
+Please check `dev-tools/composer.json` for more information about versions etc.
+To run a tool manually, call it from `dev-tools/vendor/bin`, for instance:
+
+```bash
+dev-tools/vendor/bin/php-cs-fixer fix --verbose --dry-run
+```
+
+Below are a few shortcuts to improve your developer experience.
+
+### PHPUnit
+
+To run all tests, use:
+
+```bash
+make run-phpunit
+```
+
+### PHP-CS-Fixer
+
+To check coding styles, run:
+
+```bash
+make run-php-cs-fixer
+```
+
+### PHPStan
+
+To run a static code analysis, use:
+
+```bash
+make run-phpstan
+```
diff --git a/doc/Usage.md b/doc/Usage.md
@@ -0,0 +1,52 @@
+# Usage
+
+First create a parser object and point it to a file.
+
+```php
+$parser = new \Smalot\PdfParser\Parser();
+
+$pdf = $parser->parseFile('document.pdf');
+// ... or ...
+$pdf = $parser->parseContent(file_get_contents('document.pdf'));
+```
+
+## Extract text
+
+A common scenario is to extract text.
+
+```php
+$text = $pdf->getText();
+
+// or extract the text of a specific page (in this case the first page)
+$text = $pdf->getPages()[0]->getText();
+```
+
+## Extract metadata
+
+You can also extract metadata. The available data varies from PDF to PDF.
+
+```php
+$metaData = $pdf->getDetails();
+print_r($metaData); // prints something like:
+// Array
+// (
+//     [Producer] => Adobe Acrobat
+//     [CreatedOn] => 2022-01-28T16:36:11+00:00
+//     [Pages] => 35
+// )
+```
+
+## Read Base64 encoded PDFs
+
+If working with [Base64](https://en.wikipedia.org/wiki/Base64) encoded PDFs, you might want to parse the PDF without saving the file to disk.
+This sample parses a Base64 encoded PDF string and extracts the text of all pages.
+
+```php
+<?php
+// Parse Base64 encoded PDF string and build necessary objects.
+$parser = new \Smalot\PdfParser\Parser();
+$pdf = $parser->parseContent(base64_decode($base64PDF));
+
+$text = $pdf->getText();
+echo $text;
+```
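Note on the Base64 section of `doc/Usage.md` above: the sample passes the result of `base64_decode()` straight to `parseContent()`, so malformed input only shows up later as a parser problem. A minimal sketch of validating the input first could look like this. The helper `decodeBase64Pdf` is a hypothetical name, not part of the library, and the `%PDF-` check is a simple sanity test that may reject unusual files.

```php
<?php

/**
 * Decodes a Base64 string and checks that the result looks like a PDF.
 * Hypothetical helper for illustration; not part of smalot/pdfparser.
 */
function decodeBase64Pdf(string $base64): string
{
    // Strict mode makes base64_decode() return false when the input
    // contains characters outside the Base64 alphabet.
    $binary = base64_decode($base64, true);
    if ($binary === false) {
        throw new InvalidArgumentException('Input is not valid Base64.');
    }

    // Simple sanity check: PDF files normally start with "%PDF-".
    if (strncmp($binary, '%PDF-', 5) !== 0) {
        throw new InvalidArgumentException('Decoded data does not start with a PDF header.');
    }

    return $binary;
}

// With the library installed, the decoded data can then be parsed as shown above:
// $parser = new \Smalot\PdfParser\Parser();
// $pdf = $parser->parseContent(decodeBase64Pdf($base64PDF));
// echo $pdf->getText();
```

Failing early with a clear exception keeps input errors separate from genuine parsing failures.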