Ignore encryption #653

Merged (9 commits) on Dec 1, 2023
15 changes: 15 additions & 0 deletions doc/CustomConfig.md
@@ -21,6 +21,7 @@ The `Config` class has the following options:
|--------------------------|---------|-----------------|------------------------------------------------------------------------------------------------------------------------------------------------------|
| `setDecodeMemoryLimit` | Integer | `0` | If parsing fails because of memory exhaustion, you can set a lower memory limit for decoding operations. |
| `setFontSpaceLimit` | Integer | `-50` | Changing font space limit can be helpful when `Parser::getText()` returns a text with too many spaces. |
| `setIgnoreEncryption` | Boolean | `false` | Read PDFs that are not encrypted but have the encryption flag set. This is a temporary workaround; do not rely on it. |
| `setHorizontalOffset` | String | ` ` | When words are broken up or when the structure of a table is not preserved, you may get better results when adapting `setHorizontalOffset`. |
| `setPdfWhitespaces` | String | `\0\t\n\f\r ` | |
| `setPdfWhitespacesRegex` | String | `[\0\t\n\f\r ]` | |
@@ -63,3 +64,17 @@ $config->setFontSpaceLimit(-60);
$parser = new \Smalot\PdfParser\Parser([], $config);
$pdf = $parser->parseFile('document.pdf');
```

## option setIgnoreEncryption

Some PDF files are internally marked as encrypted even though their content is not encrypted and can be read.
This can happen when the PDF was created by a tool that does not set the encryption flag properly.
If you are sure that the PDF is not encrypted, you can make the parser ignore the encryption flag by calling `setIgnoreEncryption(true)` on a custom `Config` instance.

```php
$config = new \Smalot\PdfParser\Config();
$config->setIgnoreEncryption(true);

$parser = new \Smalot\PdfParser\Parser([], $config);
$pdf = $parser->parseFile('document.pdf');
```
11 changes: 11 additions & 0 deletions doc/Usage.md
@@ -229,3 +229,14 @@ foreach ($pages as $page) {
];
}
```

## PDF encryption

This library cannot currently read encrypted PDF files, i.e. those with
a read password. Attempting to do so produces this error:
```
Exception: Secured pdf file are currently not supported.
```

See the `setIgnoreEncryption` option in [CustomConfig.md](CustomConfig.md)
for how to override this check in specific cases.
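
As an illustration of how a caller might combine the error above with the override, here is a minimal sketch of a retry helper. `parseWithEncryptionFallback` and the `$parse` callable are hypothetical names, not part of the library. The sketch assumes the exception message contains the text quoted above, which is a brittle match. A retry like this only makes sense for files you already know are not really encrypted.

```php
<?php
/**
 * Hypothetical helper, not part of smalot/pdfparser.
 *
 * $parse is any callable that receives the ignore-encryption flag
 * and returns the parsed document.
 */
function parseWithEncryptionFallback(callable $parse)
{
    try {
        // First attempt: keep the encryption check enabled.
        return $parse(false);
    } catch (\Exception $e) {
        // Re-throw anything that is not the "secured PDF" error.
        // Matching on the message text is an assumption and may break
        // if the wording changes.
        if (false === strpos($e->getMessage(), 'Secured pdf file')) {
            throw $e;
        }

        // Second attempt: ignore the encryption flag.
        return $parse(true);
    }
}
```

With the library, `$parse` would typically create a `Config`, pass the flag to `setIgnoreEncryption()`, and return the result of `parseFile()` from a `Parser` built with that config.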
Binary file added samples/not_really_encrypted.pdf
Binary file not shown.
21 changes: 21 additions & 0 deletions src/Smalot/PdfParser/Config.php
@@ -82,6 +82,13 @@ class Config
*/
private $dataTmFontInfoHasToBeIncluded = false;

/**
* Whether to attempt to read PDFs even if they are marked as encrypted.
*
* @var bool
*/
private $ignoreEncryption = false;

public function getFontSpaceLimit()
{
return $this->fontSpaceLimit;
@@ -151,4 +158,18 @@ public function setDataTmFontInfoHasToBeIncluded(bool $dataTmFontInfoHasToBeIncl
{
$this->dataTmFontInfoHasToBeIncluded = $dataTmFontInfoHasToBeIncluded;
}

public function getIgnoreEncryption(): bool
{
return $this->ignoreEncryption;
}

/**
* @deprecated this is a temporary workaround, don't rely on it
* @see https://github.com/smalot/pdfparser/pull/653
*/
public function setIgnoreEncryption(bool $ignoreEncryption): void
{
$this->ignoreEncryption = $ignoreEncryption;
}
}
2 changes: 1 addition & 1 deletion src/Smalot/PdfParser/Parser.php
@@ -101,7 +101,7 @@ public function parseContent(string $content): Document
// Create structure from raw data.
list($xref, $data) = $this->rawDataParser->parseData($content);

- if (isset($xref['trailer']['encrypt'])) {
+ if (isset($xref['trailer']['encrypt']) && false === $this->config->getIgnoreEncryption()) {
throw new \Exception('Secured pdf file are currently not supported.');
}
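
The effect of the changed condition can be sketched in isolation. The function below is a simplified stand-in for the check in `parseContent()`, not the library's code. `assertNotSecured` and its parameters are hypothetical names, and the `$xref` array only mimics the `trailer`/`encrypt` keys visible in the diff.

```php
<?php
// Simplified stand-in for the guard in Parser::parseContent().
// Function and parameter names are hypothetical.
function assertNotSecured(array $xref, bool $ignoreEncryption): void
{
    // Throw only when the trailer has an encrypt entry AND the caller
    // has not opted in to ignoring it.
    if (isset($xref['trailer']['encrypt']) && false === $ignoreEncryption) {
        throw new \Exception('Secured pdf file are currently not supported.');
    }
}
```

Under these assumptions, a trailer without an `encrypt` entry always passes, and a trailer with one passes only when the flag is `true`.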

35 changes: 35 additions & 0 deletions tests/PHPUnit/Integration/ParserTest.php
@@ -403,6 +403,41 @@ public function testRetainImageContentImpact(): void
$this->assertTrue($usedMemory < ($baselineMemory * 1.05), 'Memory is '.$usedMemory);
$this->assertTrue('' !== $document->getText());
}

/**
* Tests handling of encrypted PDF.
*
* @see https://github.com/smalot/pdfparser/pull/653
*/
public function testNoIgnoreEncryption(): void
{
$filename = $this->rootDir.'/samples/not_really_encrypted.pdf';
$threw = false;
try {
(new Parser([]))->parseFile($filename);
} catch (\Exception $e) {
// we expect an exception to be thrown if an encrypted PDF is encountered.
$threw = true;
}
$this->assertTrue($threw);
}

/**
* Tests behavior if encryption is ignored.
*
* @see https://github.com/smalot/pdfparser/pull/653
*/
public function testIgnoreEncryption(): void
{
$config = new Config();
$config->setIgnoreEncryption(true);

$filename = $this->rootDir.'/samples/not_really_encrypted.pdf';

$this->assertTrue((new Parser([], $config))->parseFile($filename) instanceof Document);

// without the configuration option set, an exception would be thrown.
}
}

class ParserSub extends Parser