Rework documentation #513

Merged
merged 7 commits into from
Mar 7, 2022
57 changes: 0 additions & 57 deletions DEVELOPER.md

This file was deleted.

72 changes: 33 additions & 39 deletions README.md
@@ -1,64 +1,58 @@
# PdfParser #

Pdf Parser, a standalone PHP library, provides various tools to extract data from a PDF file.
# PDF parser

[![Version](https://poser.pugx.org/smalot/pdfparser/v)](//packagist.org/packages/smalot/pdfparser)
![CI](https://github.com/smalot/pdfparser/workflows/CI/badge.svg)
![CS](https://github.com/smalot/pdfparser/workflows/CS/badge.svg)
[![Scrutinizer Code Quality](https://scrutinizer-ci.com/g/smalot/pdfparser/badges/quality-score.png?b=master)](https://scrutinizer-ci.com/g/smalot/pdfparser/?branch=master)
[![Code Coverage](https://scrutinizer-ci.com/g/smalot/pdfparser/badges/coverage.png?b=master)](https://scrutinizer-ci.com/g/smalot/pdfparser/?branch=master)
[![License](https://poser.pugx.org/smalot/pdfparser/license)](//packagist.org/packages/smalot/pdfparser)

[![Latest Stable Version](https://poser.pugx.org/smalot/pdfparser/v)](//packagist.org/packages/smalot/pdfparser)
[![Total Downloads](https://poser.pugx.org/smalot/pdfparser/downloads)](//packagist.org/packages/smalot/pdfparser)
[![Monthly Downloads](https://poser.pugx.org/smalot/pdfparser/d/monthly)](//packagist.org/packages/smalot/pdfparser)
[![Daily Downloads](https://poser.pugx.org/smalot/pdfparser/d/daily)](//packagist.org/packages/smalot/pdfparser)

Website : [https://www.pdfparser.org](https://www.pdfparser.org/?utm_source=GitHub&utm_medium=website&utm_campaign=GitHub)

Test the API on our [demo page](https://www.pdfparser.org/demo).
[![Downloads](https://poser.pugx.org/smalot/pdfparser/downloads)](//packagist.org/packages/smalot/pdfparser)

This project is supported by [Actualys](http://www.actualys.com).
The `smalot/pdfparser` is a standalone PHP package that provides various tools to extract data from PDF files.

## Features ##
This library is under **active maintenance**.
There is no active development by the author of this library (at the moment), but we welcome any pull request adding/extending functionality!

Features included :
## Features

- Load/parse objects and headers
- Extract meta data (author, description, ...)
- Extract metadata (author, description, ...)
- Extract text from ordered pages
- Support of compressed pdf
- Support of compressed PDFs
- Support of Mac OS Roman charset encoding
- Handling of hexadecimal and octal encoding in text sections
- PSR-0 compliant ([autoloader](https://github.com/php-fig/fig-standards/blob/master/accepted/PSR-0.md))
- PSR-1 compliant ([code styling](https://github.com/php-fig/fig-standards/blob/master/accepted/PSR-1-basic-coding-standard.md))
- Create custom configurations (see [CustomConfig.md](/doc/CustomConfig.md)).

Currently, secured documents are not supported.
Currently, secured documents and extracting form data are not supported.

**This Library is under active maintenance.**
There is no active development by the author of this library (at the moment), but we welcome any pull request adding/extending functionality!
## License

## Documentation ##
This library is under the [LGPLv3 license](https://github.com/smalot/pdfparser/blob/master/LICENSE.txt).

[Read the documentation on the wiki](https://github.com/smalot/pdfparser/wiki).
## Install

Original PDF References files can be downloaded from this url: http://www.adobe.com/devnet/pdf/pdf_reference_archive.html
This library requires PHP 7.1+ since [v1](https://github.com/smalot/pdfparser/releases/tag/v1.0.0).
You can install it via [Composer](https://getcomposer.org/):

### For developers
```bash
composer require smalot/pdfparser
```

Please read [DEVELOPER.md](DEVELOPER.md) for more information about local development of the PDFParser library. Here you will also find information about how to handle Base64 encoded PDFs.
In case you can't use Composer, you can include `alt_autoload.php-dist`. It will include all required files automatically.

## Installation
## Quick example

### Using Composer
```php
<?php

* Obtain [Composer](https://getcomposer.org)
* Run `composer require smalot/pdfparser`
// Parse PDF file and build necessary objects.
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('/path/to/document.pdf');

### Use alternate file loader
$text = $pdf->getText();
echo $text;
```

In case you can't use Composer, you can include `alt_autoload.php-dist` into your project.
It will load all required files at once.
Afterwards you can use `PDFParser` class and others.
Further usage information can be found [here](/doc/Usage.md).

## License ##
## Documentation

This library is under the [LGPLv3 license](https://github.com/smalot/pdfparser/blob/master/LICENSE.txt).
Documentation can be found in the [doc](/doc) folder.
65 changes: 65 additions & 0 deletions doc/CustomConfig.md
@@ -0,0 +1,65 @@
# Configuring the behavior of the parser

To change the behavior of the parser, create a `Config` object and pass it to the parser.
In this case, we're setting the font space limit.
Changing this value can be helpful when `getText()` returns a text with too many spaces.

```php
$config = new \Smalot\PdfParser\Config();
$config->setFontSpaceLimit(-60);
$parser = new \Smalot\PdfParser\Parser([], $config);
$pdf = $parser->parseFile('document.pdf');
// output extracted text
// echo $pdf->getText();
```

## Config options overview

The `Config` class has the following options:

| Option | Type | Default | Description |
|--------------------------|---------|-----------------|------------------------------------------------------------------------------------------------------------------------------------------------------|
| `setDecodeMemoryLimit` | Integer | `0` | If parsing fails because of memory exhaustion, you can set a lower memory limit for decoding operations. |
| `setFontSpaceLimit` | Integer | `-50` | Changing font space limit can be helpful when `Document::getText()` returns a text with too many spaces. |
| `setHorizontalOffset` | String | ` ` | When words are broken up or when the structure of a table is not preserved, you may get better results when adapting `setHorizontalOffset`. |
| `setPdfWhitespaces` | String | `\0\t\n\f\r ` | |
| `setPdfWhitespacesRegex` | String | `[\0\t\n\f\r ]` | |
| `setRetainImageContent` | Boolean | `true` | If parsing fails because of memory exhaustion, you can set the value to `false`. It won't retain image content anymore, but will use less memory too. |


## option setDecodeMemoryLimit + setRetainImageContent (manage memory usage)

If parsing fails because of memory exhaustion, you can use the following options.

```php
$config = new \Smalot\PdfParser\Config();
// Whether to retain raw image data as content or discard it to save memory
$config->setRetainImageContent(false);
// Memory limit to use when de-compressing files, in bytes
$config->setDecodeMemoryLimit(1000000);
$parser = new \Smalot\PdfParser\Parser([], $config);
```

## option setHorizontalOffset

When words are broken up or when the structure of a table is not preserved, you can use `setHorizontalOffset`.

```php
$config = new \Smalot\PdfParser\Config();
// An empty string can prevent words from breaking up
$config->setHorizontalOffset('');
// Alternatively, a tab can help preserve the structure of your document
// (a second call replaces the value set above, so pick one of the two)
$config->setHorizontalOffset("\t");
$parser = new \Smalot\PdfParser\Parser([], $config);
```

## option setFontSpaceLimit

Changing font space limit can be helpful when `getText()` returns a text with too many spaces.

```php
$config = new \Smalot\PdfParser\Config();
$config->setFontSpaceLimit(-60);
$parser = new \Smalot\PdfParser\Parser([], $config);
$pdf = $parser->parseFile('document.pdf');
```
57 changes: 57 additions & 0 deletions doc/Developer.md
@@ -0,0 +1,57 @@
# Developers

Here you will find information about our development tools and how to use them.

## .editorconfig

Please make sure your editor uses our `.editorconfig` file. It contains rules about our coding styles.

## GitHub Action Workflows

We use GitHub Actions to run our continuous integration as well as other tasks after pushing changes.
You will find related files in `.github/workflows/`.

## Development Tools and Tests

Our test-related files are located in the `tests` folder.
Tests are written using PHPUnit.

To install (and update) development tools like PHPUnit or PHP-CS-Fixer run:

```bash
make install-dev-tools
```

Development tools are installed in `dev-tools/vendor`.
Please check `dev-tools/composer.json` for more information about versions etc.
To run a tool manually, call it from `dev-tools/vendor/bin`, for instance:

```bash
dev-tools/vendor/bin/php-cs-fixer fix --verbose --dry-run
```

Below are a few shortcuts to improve your developer experience.

### PHPUnit

To run all tests, use:

```bash
make run-phpunit
```

### PHP-CS-Fixer

To check coding styles, run:

```bash
make run-php-cs-fixer
```

### PHPStan

To run a static code analysis, use:

```bash
make run-phpstan
```
52 changes: 52 additions & 0 deletions doc/Usage.md
@@ -0,0 +1,52 @@
# Usage

First create a parser object and point it to a file.

```php
$parser = new \Smalot\PdfParser\Parser();

$pdf = $parser->parseFile('document.pdf');
// ... or ...
$pdf = $parser->parseContent(file_get_contents('document.pdf'));
```

## Extract text

A common scenario is to extract text.

```php
$text = $pdf->getText();

// or extract the text of a specific page (in this case the first page)
$text = $pdf->getPages()[0]->getText();
```
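
To process pages one at a time, the `getPages()` call can be combined with a loop. The following is a minimal sketch, assuming every page object exposes `getText()` as in the example above. The helper name `collectPageTexts` is hypothetical and not part of the library.

```php
<?php

/**
 * Hypothetical helper (not part of smalot/pdfparser): returns the text of
 * each page, keyed by 1-based page number.
 *
 * @param iterable $pages objects exposing getText(), e.g. $pdf->getPages()
 */
function collectPageTexts(iterable $pages): array
{
    $texts = [];
    $number = 1;
    foreach ($pages as $page) {
        $texts[$number++] = $page->getText();
    }

    return $texts;
}

// Usage, assuming $pdf was created by the parser as shown earlier:
// foreach (collectPageTexts($pdf->getPages()) as $number => $text) {
//     echo "--- Page $number ---\n", $text, "\n";
// }
```

Keying by page number keeps the result easy to map back to the document, at the cost of holding all page texts in memory at once.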

## Extract metadata

You can also extract metadata. The available data varies from PDF to PDF.

```php
$metaData = $pdf->getDetails();
print_r($metaData);

// Example output:
// Array
// (
//     [Producer] => Adobe Acrobat
//     [CreatedOn] => 2022-01-28T16:36:11+00:00
//     [Pages] => 35
// )
```
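
Because the available keys differ between PDFs, it can be convenient to loop over whatever `getDetails()` returns instead of reading fixed keys. The sketch below uses a hypothetical helper, `formatDetails`, which is not part of the library. It assumes that some values may be arrays and joins them for display.

```php
<?php

/**
 * Hypothetical helper (not part of smalot/pdfparser): turns a details array,
 * such as the one returned by getDetails(), into "Key => value" lines.
 */
function formatDetails(array $details): string
{
    $lines = [];
    foreach ($details as $property => $value) {
        // Assumption: some entries may be flat arrays; join them for display.
        if (is_array($value)) {
            $value = implode(', ', $value);
        }
        $lines[] = $property.' => '.$value;
    }

    return implode("\n", $lines);
}

// Sample data mirroring the keys shown above; with the library this would
// be formatDetails($pdf->getDetails()).
echo formatDetails([
    'Producer' => 'Adobe Acrobat',
    'Pages' => 35,
]), "\n";
```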

## Read Base64 encoded PDFs

If working with [Base64](https://en.wikipedia.org/wiki/Base64) encoded PDFs, you might want to parse the PDF without saving the file to disk.
This sample parses the Base64 encoded PDF and extracts its text.

```php
<?php
// Parse Base64 encoded PDF string and build necessary objects.
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseContent(base64_decode($base64PDF));

$text = $pdf->getText();
echo $text;
```
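
If the Base64 string comes from an untrusted source, it may be worth validating it before handing it to `parseContent()`. The following is a minimal sketch under that assumption. The helper `decodeBase64Pdf` is hypothetical and not part of the library. It relies only on PHP's built-in `base64_decode()` in strict mode and on the `%PDF-` marker that PDF files carry near their start.

```php
<?php

/**
 * Hypothetical helper (not part of smalot/pdfparser): strictly decodes a
 * Base64 string and checks that the result looks like a PDF.
 *
 * @throws InvalidArgumentException if the input is not valid Base64 or the
 *                                  decoded data lacks a "%PDF-" marker
 */
function decodeBase64Pdf(string $encoded): string
{
    // In strict mode, base64_decode() returns false when the input contains
    // characters outside the Base64 alphabet.
    $binary = base64_decode($encoded, true);
    if (false === $binary) {
        throw new InvalidArgumentException('Input is not valid Base64.');
    }

    // Look for the "%PDF-" marker within the first 1024 bytes.
    if (false === strpos(substr($binary, 0, 1024), '%PDF-')) {
        throw new InvalidArgumentException('Decoded data is not a PDF.');
    }

    return $binary;
}

// Usage, assuming $parser and $base64PDF exist as in the sample above:
// $pdf = $parser->parseContent(decodeBase64Pdf($base64PDF));
```

Failing early with a clear exception message makes it easier to tell a broken upload apart from a PDF the parser cannot handle.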