Allowed memory exhausted when parse the PDF file. #631

Open
durifal opened this issue Aug 8, 2023 · 18 comments

durifal commented Aug 8, 2023

  • PHP Version: 7.4.33
  • PDFParser Version: 2.5.0

Description:

Trying to parse this PDF always results in an "Allowed memory size exhausted" error.

Error: Allowed memory size of 1077936128 bytes exhausted (tried to allocate 335544320 bytes) in ...../smalot/pdfparser/src/Smalot/PdfParser/Font.php, line 223

Raising the PHP memory limit to 4 GB did not help either.
I also tried lowering the limit with setDecodeMemoryLimit, but still hit the same memory error. The error only goes away when I set the decode memory limit to 1000 or lower. So maybe the value is expected in MB rather than bytes, or there is a bug in the code.

PDF input

test_pdf.pdf

Expected output & actual output

The parser should either extract the text from the PDF, return an empty string, or throw an exception, rather than fail with a memory error.

Code

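use Smalot\PdfParser\Config;
use Smalot\PdfParser\Parser;

// Do not retain image content, and lower the decode memory limit.
$config = new Config();
$config->setRetainImageContent(false);
$config->setDecodeMemoryLimit(1000000);

$url = 'path_to_PDF_folder/test_pdf.pdf';
$parser = new Parser([], $config);
$pdf = $parser->parseFile($url);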

k00ni added the bug label Aug 8, 2023

Collaborator

k00ni commented Aug 8, 2023

Thanks for reporting. What program did you use to generate the PDF? To be sure the error still exists, please try again with the latest master branch.

Author

durifal commented Aug 8, 2023

I do not know what generated the PDF: visitors to our sites uploaded it as a cover letter, and we parse it so that full-text search also covers attachments. I only edited the PDF in the Adobe PDF editor to anonymize the data.

We have hit this problem several times while parsing PDFs, so I can anonymize more examples if necessary. It is pretty rare, though (about 10 PDFs out of 1 000 000). All of them had text on a colored background on one page.

I have tested the problematic PDF on the master branch as well, with the same result:

Fatal error: Allowed memory size of 1077936128 bytes exhausted (tried to allocate 335544320 bytes) in ........../smalot/pdfparser/src/Smalot/PdfParser/Font.php on line 230

Collaborator

k00ni commented Aug 9, 2023

Thank you for the feedback.

@denydias

A similar issue also hit me. I'll post it here because this looks like a common unhandled exception, but let me know if you need a separate issue. Just like for the OP, only a small portion of a much larger batch appears to be affected.

As for the PDF creator:

Creator: Adobe Acrobat 7.0
Producer: Adobe Acrobat 7.0 Paper Capture Plug-in

PdfParser exception:

[2023-10-21 07:48:18] ERROR: Allowed memory size of 134217728 bytes exhausted (tried to allocate 204800 bytes) {
  "userId":2,"exception":"[object] (
    Symfony\\Component\\ErrorHandler\\Error\\FatalError(code: 0):
    Allowed memory size of 134217728 bytes exhausted (tried to allocate 204800 bytes) at
    vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/FilterHelper.php:239
  )
  [stacktrace]
  #0 {main}"
}
[2023-10-21 07:48:20] ERROR: Allowed memory size of 134217728 bytes exhausted (tried to allocate 188416 bytes) {
  "userId":2,"exception":"[object] (
    Symfony\\Component\\ErrorHandler\\Error\\FatalError(code: 0):
    Allowed memory size of 134217728 bytes exhausted (tried to allocate 188416 bytes) at
    vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/FilterHelper.php:239
  )
  [stacktrace]
  #0 {main}"
}

Collaborator

k00ni commented Oct 31, 2023

@denydias can you provide the PDFs that cause this exception?

Also, try #634 and check if the exception remains.

@denydias

Thank you for the quick reply, @k00ni! I'll try the PR and let you know the results. Please expect some delay as these are very busy days here.


denydias commented Oct 31, 2023

@k00ni is there a way to send the source document for your eyes only? It cannot be shared publicly.

As for the tests with #634, before (using v2.7.0):

PHP Fatal error:  Allowed memory size of 134217728 bytes exhausted (tried to allocate 12288 bytes) in tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php on line 775
PHP Stack trace:
PHP   1. {main}() tests/pdfparser/test.php:0
PHP   2. getPDFPageCount($file = 'test.pdf', $origin = 'test') tests/pdfparser/test.php:19
PHP   3. Smalot\PdfParser\Parser->parseFile($filename = 'test.pdf') tests/pdfparser/test.php:29
PHP   4. Smalot\PdfParser\Parser->parseContent([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php:90
PHP   5. Smalot\PdfParser\RawData\RawDataParser->parseData([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php:102
PHP   6. Smalot\PdfParser\RawData\RawDataParser->getIndirectObject([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php:945
PHP   7. Smalot\PdfParser\RawData\RawDataParser->getRawObject([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php:557
PHP   8. substr([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php:775

After (using master+#634):

PHP Fatal error:  Allowed memory size of 134217728 bytes exhausted (tried to allocate 32768 bytes) in tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php on line 104
PHP Stack trace:
PHP   1. {main}() tests/pdfparser/test.php:0
PHP   2. getPDFPageCount($file = 'test.pdf', $origin = 'test') tests/pdfparser/test.php:10
PHP   3. Smalot\PdfParser\Parser->parseFile($filename = 'test.pdf') tests/pdfparser/test.php:20
PHP   4. Smalot\PdfParser\Parser->parseContent([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php:91
PHP   5. Smalot\PdfParser\RawData\RawDataParser->parseData([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php:103
PHP   6. Smalot\PdfParser\RawData\RawDataParser->getIndirectObject([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php:958
PHP   7. Smalot\PdfParser\RawData\RawDataParser->decodeStream([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php:104

My env:

$> php --version
PHP 8.2.12 (cli) (built: Oct 26 2023 18:01:05) (ZTS)
Copyright (c) The PHP Group
Zend Engine v4.2.12, Copyright (c) Zend Technologies
    with Zend OPcache v8.2.12, Copyright (c), by Zend Technologies
    with Xdebug v3.2.2, Copyright (c) 2002-2023, by Derick Rethans
$> composer --version
Composer version 2.6.5 2023-10-06 10:11:52

Test script:

<?php

ini_set("memory_limit", "128M");

require __DIR__ . '/vendor/autoload.php';

use Smalot\PdfParser\Config;
use Smalot\PdfParser\Parser;

$pages = getPDFPageCount('test.pdf', 'test');
echo "File has $pages pages\n";

function getPDFPageCount(string $file, string $origin): mixed
{
    $config = new Config();
    $config->setRetainImageContent(false);
    $parser = new Parser([], $config);
    try {
        $pdf = $parser->parseFile($file);
        $details = $pdf->getDetails();
        return $details['Pages'];
    } catch (Exception $e) {
        $pages = 0;
        echo $e->getMessage();
        return $pages;
    }
}

Collaborator

k00ni commented Nov 1, 2023

@denydias Thank you for your detailed answer. Please don't send me the PDF privately; I don't do private support via mail.

#634 is the latest big set of changes, so there was a chance that it might cover this case. The problem with these errors is that they seem to be very PDF-dependent. We need further work on the parsing part to avoid endless loops/recursion.


denydias commented Nov 1, 2023

@k00ni I understand you don't provide private support and I'm not asking you to. I'm reporting an issue and looking to privately provide you with the entity where the problem occurs in the hope you can improve your product, but asking no warranties or even replies on that matter.

In most cases I would agree with the PDF-dependent claim, but this particular file is part of a set of 1.706 files produced by a "pretty standard" (TM) workflow. As just this one triggers an exception, it looks like a perfect candidate for an edge case worth looking into. But this is not my call.

Collaborator

k00ni commented Nov 7, 2023

As just this one triggers an exception, it looks like a perfect candidate for an edge case worth looking into. But this is not my call.

You are right. Would you create a pull request and help us solve the issue?


denydias commented Nov 7, 2023

I'll dive into it when I get the time, @k00ni.

@kreuss90

I have the same issue (memory exhausted, in my case at 500 MB), also with just one PDF on my website. I will provide a link to the document at the end of this post.
Another thing is similar to what @durifal wrote: the document has a colored background (unlike all my other documents).

Creator: Microsoft PowerPoint 2016
Link: https://memoone.de/Materialien/5.%20Fortbildungsmaterialien/1.%20Rechnernetze/1.%20Vortrag/1_MAT_Vortrag.pdf

I hope this helps you find the bug. Thanks for providing that great library!

Kind regards
Kevin


sj-i commented Dec 8, 2023

To test a development version of our memory profiler, I've tried to investigate the leak in the original issue.

Test script

<?php

use Smalot\PdfParser\Config;
use Smalot\PdfParser\Parser;

include "vendor/autoload.php";

ini_set('memory_limit', '128M');

register_shutdown_function(
    function (): void {
        $error = error_get_last();
        if (is_null($error)) {
            return;
        }
        if (strpos($error['message'], 'Allowed memory size of') !== 0) {
            return;
        }
        $pid = getmypid();
        $file_opt = '--memory-limit-error-file=' . escapeshellarg($error['file']);
        $line_opt = '--memory-limit-error-line=' . escapeshellarg($error['line']);
        system("sudo reli i:m -p {$pid} --no-stop-process {$file_opt} {$line_opt} >memory_analyzed.json");
    }
);

$config = new Config();
$url = __DIR__ . '/test_pdf.pdf';
$config->setRetainImageContent(false);
$config->setDecodeMemoryLimit(1000000);
$parser = new Parser([], $config);
$pdf = $parser->parseFile($url);

The summary of the memory usage

~/work/oss/tmp/pdfparser_test$ cat memory_analyzed.json | jq .summary
[
  {
    "zend_mm_heap_total": 130023424,
    "zend_mm_heap_usage": 128245688,
    "zend_mm_chunk_total": 46137344,
    "zend_mm_chunk_usage": 44359608,
    "zend_mm_huge_total": 83886080,
    "zend_mm_huge_usage": 83886080,
    "vm_stack_total": 262144,
    "vm_stack_usage": 1632,
    "compiler_arena_total": 458752,
    "compiler_arena_usage": 7264,
    "possible_allocation_overhead_total": 3893453,
    "possible_array_overhead_total": 248704,
    "memory_get_usage": 128276816,
    "memory_get_real_usage": 130023424,
    "cached_chunks_size": 0,
    "heap_memory_analyzed_percentage": 99.97573372884466,
    "php_version": "v82",
    "analyzer": "reli 0.11.0"
  }
]
~/work/oss/tmp/pdfparser_test$ cat memory_analyzed.json | jq .location_types_summary | jq -r '(["location_type", "count", "memory_usage"] | (., map(length*"="))),(to_entries[]|[.key,.value.count,.value.memory_usage])|@tsv' | column -t -o ' | '
location_type                        | count   | memory_usage
=============                        | =====   | ============
ZendArrayTableMemoryLocation         | 600     | 84052280
ZendStringMemoryLocation             | 1049683 | 38511955
ZendObjectMemoryLocation             | 10278   | 742320
ZendArrayTableOverheadMemoryLocation | 595     | 159296
ObjectsStoreMemoryLocation           | 1       | 131072
ZendArrayMemoryLocation              | 602     | 33712
RuntimeCacheMemoryLocation           | 101     | 7360
CallFrameVariableTableMemoryLocation | 9       | 832
CallFrameHeaderMemoryLocation        | 10      | 800
ZendOpArrayHeaderMemoryLocation      | 1       | 248
StaticMembersTableMemoryLocation     | 5       | 176
ZendResourceMemoryLocation           | 3       | 72
ZendReferenceMemoryLocation          | 2       | 64
ZendMmHugeListMemoryLocation         | 2       | 48

As you can see above, arrays and strings account for most of the memory consumption.
The number of arrays is small, so I suspect that just a few arrays are taking up most of that space.
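
To put those two rows in proportion, here is a small PHP sketch (not part of the original report) that computes each type's share of zend_mm_heap_usage from the figures above. The helper name sharePercent is made up for this illustration.

```php
<?php
// Figures copied from the two summaries above (all in bytes).
$heapUsage = 128245688; // zend_mm_heap_usage
$byType = [
    'ZendArrayTableMemoryLocation' => 84052280,
    'ZendStringMemoryLocation'     => 38511955,
];

/** Share of $part in $total as a percentage, rounded to one decimal place. */
function sharePercent(int $part, int $total): float
{
    return round($part / $total * 100, 1);
}

foreach ($byType as $type => $bytes) {
    printf("%-30s %5.1f%%\n", $type, sharePercent($bytes, $heapUsage));
}
```

With these figures the array tables come to roughly 65% of the heap in use and the strings to roughly 30%, which is why the next step looks at individual arrays.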

Finding the culprit arrays

Let's extract the 20 largest ones in order of size.

~/work/oss/tmp/pdfparser_test$ cat memory_analyzed.json | jq '. as $root | path(..|objects|select(."#type"=="ArrayElementsContext"))| . as $path | $root|getpath($path) as $elements | {path: $path|join("."), size: $elements."#locations"[0].size, count: $elements."#count"}' | jq -rs '(["size", "count", "path"] | (., map(length*"="))),(sort_by(.size) | .[-20:] | reverse | .[] | [.size, .count, .path])|@tsv' | column -t -o ' | '
size     | count   | path
====     | =====   | ====
41943040 | 1048576 | context.class_table.smalot\\pdfparser\\font.static_properties.uchrCache.array_elements
41913376 | 1047649 | context.call_frames.3.this.object_properties.table.array_elements
36552    | 2284    | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.34_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements
13720    | 857     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.26_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements
8840     | 552     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements
6536     | 408     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.545.value.object_properties.value.array_elements
3496     | 218     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.36.value.object_properties.value.array_elements
3480     | 217     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.148.value.object_properties.value.array_elements
3352     | 209     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.428.value.object_properties.value.array_elements
3264     | 70      | context.call_frames.9.symbol_table.array_elements._SERVER.value.array_elements
2216     | 138     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.110.value.object_properties.value.array_elements
1864     | 116     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.413.value.object_properties.value.array_elements
1784     | 111     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.123.value.object_properties.value.array_elements
1696     | 37      | context.call_frames.7.local_variables.xref.array_elements.xref.value.array_elements
1688     | 105     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.34_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.572.value.object_properties.value.array_elements
1672     | 104     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.26.value.object_properties.value.array_elements
1608     | 100     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.299.value.object_properties.value.array_elements
1600     | 34      | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements
1496     | 93      | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.489.value.object_properties.value.array_elements
1432     | 89      | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.34_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.2108.value.object_properties.value.array_elements

Two arrays are the culprits.

Dumping the real stack trace on memory_limit violations is the new feature I wanted to test in this trial (so it is not yet released), and it seems to work well.

~/work/oss/tmp/pdfparser_test$  cat memory_analyzed.json | jq -r '(["frame_no", "function", "line"] | (., map(length*"="))),(path(.context.call_frames[]|objects) as $path | [$path[2], getpath($path).function_name, getpath($path).lineno])|@tsv' | column -t
frame_no  function                                                         line
========  ========                                                         ====
0         system                                                           4
1         {closure}(/home/sji/work/oss/tmp/pdfparser_test/test.php:11-21)  20
2         Smalot\\PdfParser\\Font::uchr                                    150
3         Smalot\\PdfParser\\Font::loadTranslateTable                      230
4         Smalot\\PdfParser\\Font::init                                    78
5         Smalot\\PdfParser\\Document::init                                90
6         Smalot\\PdfParser\\Document::setObjects                          316
7         Smalot\\PdfParser\\Parser::parseContent                          122
8         Smalot\\PdfParser\\Parser::parseFile                             90
9         <main>                                                           29

So, two arrays, Font::$uchrCache and Font::$table, are the culprits. Also, the memory_limit violation seems to occur at the point where Font::uchr() is called from Font::loadTranslateTable() at line 230.

Why these arrays grow so large

Then let's also dump some seemingly related local variables.

~/work/oss/tmp/pdfparser_test$ cat memory_analyzed.json | jq '.context.call_frames."3".local_variables |{char: .char, char_from: .char_from, char_to: .char_to, offset: .offset, key: .key}'
{
  "char": {
    "#node_id": 2147647,
    "#type": "ScalarValueContext",
    "value": 1047644
  },
  "char_from": {
    "#node_id": 2147644,
    "#type": "ScalarValueContext",
    "value": 64287
  },
  "char_to": {
    "#node_id": 2147645,
    "#type": "ScalarValueContext",
    "value": 4276029042
  },
  "offset": {
    "#node_id": 2147646,
    "#type": "ScalarValueContext",
    "value": 4276094578
  },
  "key": {
    "#node_id": 2147638,
    "#type": "ScalarValueContext",
    "value": 50
  }
}
~/work/oss/tmp/pdfparser_test$ cat memory_analyzed.json | jq -r '.context.call_frames."3".local_variables |.matches.referenced.array_elements."0".value.array_elements."50".value'
{
  "#node_id": 2147143,
  "#type": "StringContext",
  "#locations": [
    {
      "address": 139914548230560,
      "size": 53,
      "refcount": 1,
      "type_info": 22,
      "value": "<FB1F> <FEDF0672> <FEE00672> "
    }
  ]
}

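It seems that one of the $char_to values in the beginbfrange sections is a ligature, i.e. a multi-character value: <FEDF0672> is read as the single integer 4276029042, while $char_from is only 64287 (<FB1F>). As a result, both the translation table and the character cache grow far larger than intended.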
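
The arithmetic behind that can be checked with a few lines of PHP. This is a sketch under two assumptions that the dump suggests but does not prove: that the three hex tokens are converted with something like hexdec(), and that the loop visits every code from $char_from to $char_to inclusive. The variable names are illustrative, and a 64-bit PHP build is assumed.

```php
<?php
// The three hex tokens captured by the profiler: "<FB1F> <FEDF0672> <FEE00672>".
$tokens = ['FB1F', 'FEDF0672', 'FEE00672'];

// hexdec() turns each token into one integer, however many characters it encodes.
[$charFrom, $charTo, $offset] = array_map('hexdec', $tokens);

// Number of codes an inclusive loop from $charFrom to $charTo would visit (assumption).
$span = $charTo - $charFrom + 1;

// Highest valid Unicode code point, for comparison.
$maxCodePoint = 0x10FFFF;

printf("char_from = %d\nchar_to   = %d\noffset    = %d\n", $charFrom, $charTo, $offset);
printf("span      = %d codes (highest Unicode code point: %d)\n", $span, $maxCodePoint);
```

The decoded values match the char_from, char_to and offset seen in the dump, and the span comes to about 4.28 billion codes, far more than the 1,114,112 code points Unicode defines.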

I am not familiar with the PDF specification, so I cannot send a PR to fix it. Sorry.

I am already happy with the successful testing of my tool, and I hope this report can make someone else happy too.

Changelog

  • 2023-12-18 18:30 (UTC) -- Reli 0.11.0 is out, and the feature for capturing the real stack trace on memory_limit violations is included in it. So I changed the test script a bit to fit that version.


denydias commented Dec 8, 2023

...I hope this report can make someone else happy too.

I am! Superb debug job, @sj-i! 👏


4ndrzej commented Mar 11, 2024

We are experiencing the same issue. Any news on this?


intrak commented May 22, 2024

Hi there! Any news on this bug?
The file from the first post is still problematic.
I'm on the newest version, 2.10.0.


Buschdieb commented Jul 20, 2024

On version 2.10.0 I get "Fatal error: Maximum execution time of 20 seconds exceeded in /vendor/smalot/pdfparser/src/Smalot/PdfParser/Font.php on line 150" if I set a time limit. My approach is not to process the ‘faulty PDFs’ indefinitely but to reject them up front.

In Font.php, I added the following Unicode range check to public static function uchr($code): string:

if ($code < 0 || $code > 0x10FFFF) { throw new \Exception('Invalid Unicode character code: ' . $code); }

That makes sure it doesn't run indefinitely for me. I mainly have problems with PDFs that contain attachments, such as ZUGFeRD invoices (PDF + XML attachment). I have tested it with my own PDFs and with test_pdf.pdf from this issue.

Error log for test_pdf.pdf: Exception: Invalid Unicode character code.
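
For anyone who wants to try the same range check outside the library first, here is a minimal standalone sketch. The function name assertValidCodePoint and the use of OutOfRangeException are choices made for this illustration; they are not part of smalot/pdfparser, and the sample values are simply the ones from the profiler dump earlier in this thread.

```php
<?php
// Hypothetical standalone helper mirroring the check above; not part of smalot/pdfparser.
function assertValidCodePoint(int $code): void
{
    // Unicode code points run from 0 to 0x10FFFF inclusive.
    if ($code < 0 || $code > 0x10FFFF) {
        throw new OutOfRangeException('Invalid Unicode character code: ' . $code);
    }
}

// Values taken from the profiler dump earlier in this thread (64-bit PHP assumed).
$candidates = [64287, 4276029042, 4276094578];

foreach ($candidates as $code) {
    try {
        assertValidCodePoint($code);
        echo "$code: ok\n";
    } catch (OutOfRangeException $e) {
        echo $e->getMessage(), "\n";
    }
}
```

Only the first value passes. The other two are rejected immediately, so code that applies such a check per character would stop at the first out-of-range value instead of building a huge table.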

@AykutCevik

Still an issue with v2.11.0
