Allowed memory exhausted when parse the PDF file. #631

durifal · 2023-08-08T12:47:39Z

PHP Version: 7.4.33
PDFParser Version: 2.5.0

Description:

Trying to parse this PDF always results in an "Allowed memory size exhausted" error.

Error: Allowed memory size of 1077936128 bytes exhausted (tried to
allocate 335544320 bytes) in
...../smalot/pdfparser/src/Smalot/PdfParser/Font.php,
line 223

Raising the PHP memory limit to 4 GB did not help either.
I also tried lowering the limit with setDecodeMemoryLimit, but still hit the same memory error. Setting the decode memory limit prevents the error only when I set it to 1000 or lower. So maybe it should be set in MB rather than bytes, or there is a bug in the code.
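For scale, the byte counts in the error message can be converted to MiB. The snippet below is only a small illustration; `bytesToMiB` is a hypothetical helper and not part of PDFParser, and it assumes the decode limit is read as a byte count, as the report does.

```php
<?php
// Hypothetical helper (not part of PDFParser): convert a byte count to MiB.
function bytesToMiB(int $bytes): float
{
    return $bytes / (1024 * 1024);
}

// Figures copied from the error message in this report.
printf("memory limit:      %.0f MiB\n", bytesToMiB(1077936128));
printf("failed allocation: %.0f MiB\n", bytesToMiB(335544320));

// The value 1000000 used with setDecodeMemoryLimit() in this report,
// assuming it is interpreted as bytes.
printf("decode limit:      %.2f MiB\n", bytesToMiB(1000000));
```

Under that assumption, a decode limit of 1000000 is just under 1 MiB, and 1000 is under 1 KiB.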

PDF input

test_pdf.pdf

Expected output & actual output

The parser should either extract the text from the PDF, return an empty string, or throw an exception, rather than fail with a memory error.

Code

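use Smalot\PdfParser\Config;
use Smalot\PdfParser\Parser;

$config = new Config();
$url = 'path_to_PDF_folder/test_pdf.pdf';
// Do not keep image content; set the decode memory limit.
$config->setRetainImageContent(false);
$config->setDecodeMemoryLimit(1000000);
$parser = new Parser([], $config);
$pdf = $parser->parseFile($url);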

k00ni · 2023-08-08T14:21:59Z

Thanks for reporting. What program did you use to generate the PDF? To be sure the error still exists, please try again with the latest master branch.

durifal · 2023-08-08T14:46:15Z

I do not know what generated the PDF. Visitors to our sites uploaded it as a cover letter, which we try to parse so that full-text search also covers attachments. I only edited the PDF in the Adobe PDF editor to anonymize the data.

We have hit this problem multiple times while parsing PDFs, so if necessary I can anonymize more examples. But it is pretty rare (about 10 PDFs out of 1 000 000). All of them had, on one page, text on a colored background.

I tested the problematic PDF on the master branch as well, with the same result:

Fatal error: Allowed memory size of 1077936128 bytes exhausted (tried to allocate 335544320 bytes) in ........../smalot/pdfparser/src/Smalot/PdfParser/Font.php on line 230

k00ni · 2023-08-09T06:36:15Z

Thank you for the feedback.

denydias · 2023-10-31T12:26:19Z

A similar issue hit me too. I'll post it here since this looks like a common unhandled exception, but let me know if you need a separate issue. Just like the OP, only a small portion of a much larger batch appears to be affected.

As for the PDF creator:

Creator: Adobe Acrobat 7.0
Producer: Adobe Acrobat 7.0 Paper Capture Plug-in

PdfParser exception:

[2023-10-21 07:48:18] ERROR: Allowed memory size of 134217728 bytes exhausted (tried to allocate 204800 bytes) {
  "userId":2,"exception":"[object] (
    Symfony\\Component\\ErrorHandler\\Error\\FatalError(code: 0):
    Allowed memory size of 134217728 bytes exhausted (tried to allocate 204800 bytes) at
    vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/FilterHelper.php:239
  )
  [stacktrace]
  #0 {main}"
}
[2023-10-21 07:48:20] ERROR: Allowed memory size of 134217728 bytes exhausted (tried to allocate 188416 bytes) {
  "userId":2,"exception":"[object] (
    Symfony\\Component\\ErrorHandler\\Error\\FatalError(code: 0):
    Allowed memory size of 134217728 bytes exhausted (tried to allocate 188416 bytes) at
    vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/FilterHelper.php:239
  )
  [stacktrace]
  #0 {main}"
}

k00ni · 2023-10-31T13:04:19Z

@denydias can you provide the PDFs that cause this exception?

Also, try #634 and check if the exception remains.

denydias · 2023-10-31T13:32:43Z

Thank you for the quick reply, @k00ni! I'll try the PR and let you know the results. Please expect some delay as these are very busy days here.

denydias · 2023-10-31T21:40:39Z

@k00ni is there a way to send the source document for your eyes only? It cannot be shared publicly.

As for the tests with #634, before (using v2.7.0):

PHP Fatal error:  Allowed memory size of 134217728 bytes exhausted (tried to allocate 12288 bytes) in tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php on line 775
PHP Stack trace:
PHP   1. {main}() tests/pdfparser/test.php:0
PHP   2. getPDFPageCount($file = 'test.pdf', $origin = 'test') tests/pdfparser/test.php:19
PHP   3. Smalot\PdfParser\Parser->parseFile($filename = 'test.pdf') tests/pdfparser/test.php:29
PHP   4. Smalot\PdfParser\Parser->parseContent([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php:90
PHP   5. Smalot\PdfParser\RawData\RawDataParser->parseData([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php:102
PHP   6. Smalot\PdfParser\RawData\RawDataParser->getIndirectObject([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php:945
PHP   7. Smalot\PdfParser\RawData\RawDataParser->getRawObject([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php:557
PHP   8. substr([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php:775

After (using master+#634):

PHP Fatal error:  Allowed memory size of 134217728 bytes exhausted (tried to allocate 32768 bytes) in tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php on line 104
PHP Stack trace:
PHP   1. {main}() tests/pdfparser/test.php:0
PHP   2. getPDFPageCount($file = 'test.pdf', $origin = 'test') tests/pdfparser/test.php:10
PHP   3. Smalot\PdfParser\Parser->parseFile($filename = 'test.pdf') tests/pdfparser/test.php:20
PHP   4. Smalot\PdfParser\Parser->parseContent([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php:91
PHP   5. Smalot\PdfParser\RawData\RawDataParser->parseData([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php:103
PHP   6. Smalot\PdfParser\RawData\RawDataParser->getIndirectObject([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php:958
PHP   7. Smalot\PdfParser\RawData\RawDataParser->decodeStream([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php:104

My env:

$> php --version
PHP 8.2.12 (cli) (built: Oct 26 2023 18:01:05) (ZTS)
Copyright (c) The PHP Group
Zend Engine v4.2.12, Copyright (c) Zend Technologies
    with Zend OPcache v8.2.12, Copyright (c), by Zend Technologies
    with Xdebug v3.2.2, Copyright (c) 2002-2023, by Derick Rethans
$> composer --version
Composer version 2.6.5 2023-10-06 10:11:52

Test script:

<?php

ini_set("memory_limit", "128M");

require __DIR__ . '/vendor/autoload.php';

use Smalot\PdfParser\Config;
use Smalot\PdfParser\Parser;

$pages = getPDFPageCount('test.pdf', 'test');
echo "File has $pages pages\n";

function getPDFPageCount(string $file, string $origin): mixed
{
    $config = new Config();
    $config->setRetainImageContent(false);
    $parser = new Parser([], $config);
    try {
        $pdf = $parser->parseFile($file);
        $details = $pdf->getDetails();
        return $details['Pages'];
    } catch (Exception $e) {
        $pages = 0;
        echo $e->getMessage();
        return $pages;
    }
}

k00ni · 2023-11-01T07:21:08Z

@denydias Thank you for your detailed answer. Don't send me the PDF privately; I don't do private support via mail.

#634 is the latest big set of changes, so there was a chance that it might cover this case. The problem with these errors is that they seem to be very PDF-dependent. We need further work on the parsing part to avoid endless loops/recursion.

denydias · 2023-11-01T08:06:08Z

@k00ni I understand you don't provide private support and I'm not asking you to. I'm reporting an issue and looking to privately provide you with the entity where the problem occurs in the hope you can improve your product, but asking no warranties or even replies on that matter.

In most cases I agree with you for the PDF-dependent claim, but this particular one is part of a set with 1.706 files produced by a "pretty standard" (TM) workflow. As just this one triggers an exception, it looks like a perfect candidate for an edge case worth looking into. But this is not my call.

k00ni · 2023-11-07T07:26:20Z

> As just this one triggers an exception, it looks like a perfect candidate for an edge case worth looking into. But this is not my call.

You are right. Would you create a pull request and help us solve the issue?

denydias · 2023-11-07T11:44:15Z

I'll dive into it when I get the time, @k00ni.

kreuss90 · 2023-11-25T19:08:01Z

I have the same issue (memory exhausted, in my case at 500 MB), also with just one PDF on my website. I will provide a link to the document at the end of this post.
Another thing is similar to what @durifal wrote: the document has a colored background (unlike all the other documents).

Creator: Microsoft PowerPoint 2016
Link: https://memoone.de/Materialien/5.%20Fortbildungsmaterialien/1.%20Rechnernetze/1.%20Vortrag/1_MAT_Vortrag.pdf

I hope this helps you find the bug. Thanks for providing that great library!

Kind regards
Kevin

sj-i · 2023-12-08T20:00:07Z

To test a development version of our memory profiler, I've tried to investigate the leak in the original issue.

Test script

<?php

use Smalot\PdfParser\Config;
use Smalot\PdfParser\Parser;

include "vendor/autoload.php";

ini_set('memory_limit', '128M');

register_shutdown_function(
    function (): void {
        $error = error_get_last();
        if (is_null($error)) {
            return;
        }
        if (strpos($error['message'], 'Allowed memory size of') !== 0) {
            return;
        }
        $pid = getmypid();
        $file_opt = '--memory-limit-error-file=' . escapeshellarg($error['file']);
        $line_opt = '--memory-limit-error-line=' . escapeshellarg($error['line']);
        system("sudo reli i:m -p {$pid} --no-stop-process {$file_opt} {$line_opt} >memory_analyzed.json");
    }
);

$config = new Config();
$url = __DIR__ . '/test_pdf.pdf';
$config->setRetainImageContent(false);
$config->setDecodeMemoryLimit(1000000);
$parser = new Parser([], $config);
$pdf = $parser->parseFile($url);

The summary of the memory usage

~/work/oss/tmp/pdfparser_test$ cat memory_analyzed.json | jq .summary
[
  {
    "zend_mm_heap_total": 130023424,
    "zend_mm_heap_usage": 128245688,
    "zend_mm_chunk_total": 46137344,
    "zend_mm_chunk_usage": 44359608,
    "zend_mm_huge_total": 83886080,
    "zend_mm_huge_usage": 83886080,
    "vm_stack_total": 262144,
    "vm_stack_usage": 1632,
    "compiler_arena_total": 458752,
    "compiler_arena_usage": 7264,
    "possible_allocation_overhead_total": 3893453,
    "possible_array_overhead_total": 248704,
    "memory_get_usage": 128276816,
    "memory_get_real_usage": 130023424,
    "cached_chunks_size": 0,
    "heap_memory_analyzed_percentage": 99.97573372884466,
    "php_version": "v82",
    "analyzer": "reli 0.11.0"
  }
]

~/work/oss/tmp/pdfparser_test$ cat memory_analyzed.json | jq .location_types_summary | jq -r '(["location_type", "count", "memory_usage"] | (., map(length*"="))),(to_entries[]|[.key,.value.count,.value.memory_usage])|@tsv' | column -t -o ' | '
location_type                        | count   | memory_usage
=============                        | =====   | ============
ZendArrayTableMemoryLocation         | 600     | 84052280
ZendStringMemoryLocation             | 1049683 | 38511955
ZendObjectMemoryLocation             | 10278   | 742320
ZendArrayTableOverheadMemoryLocation | 595     | 159296
ObjectsStoreMemoryLocation           | 1       | 131072
ZendArrayMemoryLocation              | 602     | 33712
RuntimeCacheMemoryLocation           | 101     | 7360
CallFrameVariableTableMemoryLocation | 9       | 832
CallFrameHeaderMemoryLocation        | 10      | 800
ZendOpArrayHeaderMemoryLocation      | 1       | 248
StaticMembersTableMemoryLocation     | 5       | 176
ZendResourceMemoryLocation           | 3       | 72
ZendReferenceMemoryLocation          | 2       | 64
ZendMmHugeListMemoryLocation         | 2       | 48

As you can see above, arrays and strings account for most of the memory consumption.
The number of arrays is small, so I suspect that just a few arrays are taking up most of that space.

Finding the culprit arrays

Let's extract the 20 largest ones in order of size.

~/work/oss/tmp/pdfparser_test$ cat memory_analyzed.json | jq '. as $root | path(..|objects|select(."#type"=="ArrayElementsContext"))| . as $path | $root|getpath($path) as $elements | {path: $path|join("."), size: $elements."#locations"[0].size, count: $elements."#count"}' | jq -rs '(["size", "count", "path"] | (., map(length*"="))),(sort_by(.size) | .[-20:] | reverse | .[] | [.size, .count, .path])|@tsv' | column -t -o ' | '
size     | count   | path
====     | =====   | ====
41943040 | 1048576 | context.class_table.smalot\\pdfparser\\font.static_properties.uchrCache.array_elements
41913376 | 1047649 | context.call_frames.3.this.object_properties.table.array_elements
36552    | 2284    | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.34_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements
13720    | 857     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.26_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements
8840     | 552     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements
6536     | 408     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.545.value.object_properties.value.array_elements
3496     | 218     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.36.value.object_properties.value.array_elements
3480     | 217     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.148.value.object_properties.value.array_elements
3352     | 209     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.428.value.object_properties.value.array_elements
3264     | 70      | context.call_frames.9.symbol_table.array_elements._SERVER.value.array_elements
2216     | 138     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.110.value.object_properties.value.array_elements
1864     | 116     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.413.value.object_properties.value.array_elements
1784     | 111     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.123.value.object_properties.value.array_elements
1696     | 37      | context.call_frames.7.local_variables.xref.array_elements.xref.value.array_elements
1688     | 105     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.34_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.572.value.object_properties.value.array_elements
1672     | 104     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.26.value.object_properties.value.array_elements
1608     | 100     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.299.value.object_properties.value.array_elements
1600     | 34      | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements
1496     | 93      | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.489.value.object_properties.value.array_elements
1432     | 89      | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.34_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.2108.value.object_properties.value.array_elements

Two arrays are the culprits.

Dumping the real stack trace on memory_limit violations is the new feature I wanted to test in this trial, and it seems to work well.

~/work/oss/tmp/pdfparser_test$  cat memory_analyzed.json | jq -r '(["frame_no", "function", "line"] | (., map(length*"="))),(path(.context.call_frames[]|objects) as $path | [$path[2], getpath($path).function_name, getpath($path).lineno])|@tsv' | column -t
frame_no  function                                                         line
========  ========                                                         ====
0         system                                                           4
1         {closure}(/home/sji/work/oss/tmp/pdfparser_test/test.php:11-21)  20
2         Smalot\\PdfParser\\Font::uchr                                    150
3         Smalot\\PdfParser\\Font::loadTranslateTable                      230
4         Smalot\\PdfParser\\Font::init                                    78
5         Smalot\\PdfParser\\Document::init                                90
6         Smalot\\PdfParser\\Document::setObjects                          316
7         Smalot\\PdfParser\\Parser::parseContent                          122
8         Smalot\\PdfParser\\Parser::parseFile                             90
9         <main>                                                           29

So, two arrays, Font::$uchrCache and Font::$table, are the culprits. Also, the memory_limit violation seems to occur at the point where Font::uchr() is called from Font::loadTranslateTable() at line 230.

Why these arrays grow so large

Then let's also dump some seemingly related local variables.

~/work/oss/tmp/pdfparser_test$ cat memory_analyzed.json | jq '.context.call_frames."3".local_variables |{char: .char, char_from: .char_from, char_to: .char_to, offset: .offset, key: .key}'
{
  "char": {
    "#node_id": 2147647,
    "#type": "ScalarValueContext",
    "value": 1047644
  },
  "char_from": {
    "#node_id": 2147644,
    "#type": "ScalarValueContext",
    "value": 64287
  },
  "char_to": {
    "#node_id": 2147645,
    "#type": "ScalarValueContext",
    "value": 4276029042
  },
  "offset": {
    "#node_id": 2147646,
    "#type": "ScalarValueContext",
    "value": 4276094578
  },
  "key": {
    "#node_id": 2147638,
    "#type": "ScalarValueContext",
    "value": 50
  }
}

~/work/oss/tmp/pdfparser_test$ cat memory_analyzed.json | jq -r '.context.call_frames."3".local_variables |.matches.referenced.array_elements."0".value.array_elements."50".value'
{
  "#node_id": 2147143,
  "#type": "StringContext",
  "#locations": [
    {
      "address": 139914548230560,
      "size": 53,
      "refcount": 1,
      "type_info": 22,
      "value": "<FB1F> <FEDF0672> <FEE00672> "
    }
  ]
}

It seems that one of the $char_to values in the beginbfrange sections contains a ligature, so both the translation table and the character cache have grown unintentionally large.

I am not familiar with the PDF specification, so I cannot send a PR to fix it. Sorry.
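To make the numbers concrete, here is a rough sketch that decodes the three hex groups from the dumped string and shows how large a range they would describe if a loop runs from $char_from to $char_to, as the dump suggests. The regular expression and the `sameWidth` check are illustrative assumptions, not the library's code, and the heuristic (both ends of a range should have the same number of hex digits) is only one possible way to spot such an entry. It assumes 64-bit PHP so that `hexdec` returns integers here.

```php
<?php
// The string captured in the memory dump above.
$line = '<FB1F> <FEDF0672> <FEE00672> ';

// Illustrative pattern only: three hex groups in angle brackets.
preg_match('/<([0-9A-F]+)> *<([0-9A-F]+)> *<([0-9A-F]+)>/i', $line, $m);

$charFrom = hexdec($m[1]); // 64287
$charTo   = hexdec($m[2]); // 4276029042
$offset   = hexdec($m[3]); // 4276094578

// Number of entries a loop from $charFrom to $charTo (inclusive) would create.
$span = $charTo - $charFrom + 1;

// Hypothetical heuristic (an assumption, not library code): treat the entry as
// suspicious when the two range ends are written with different hex widths.
function sameWidth(string $fromHex, string $toHex): bool
{
    return strlen($fromHex) === strlen($toHex);
}

printf("from=%d to=%d offset=%d span=%d\n", $charFrom, $charTo, $offset, $span);
var_dump(sameWidth($m[1], $m[2])); // bool(false): 4 hex digits vs 8
```

A span of more than four billion entries cannot fit in memory, which matches the two arrays stopping at roughly one million elements when the 128M limit was reached.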

I am already happy with the successful testing of my tool, and I hope this report can make someone else happy too.

Changelog

2023-12-18 18:30 (UTC) -- Reli 0.11.0 is out, and the feature for capturing the real stack trace on memory_limit violations is included in it. So I changed the test script a bit to fit that version.

denydias · 2023-12-08T21:46:01Z

> ...I hope this report can make someone else happy too.

I am! Superb debug job, @sj-i! 👏

4ndrzej · 2024-03-11T17:19:47Z

We are experiencing the same issue. Any news on this?

intrak · 2024-05-22T20:01:23Z

Hi there! Any news on this bug?
The file from the first post is still problematic.
I'm on the newest version, 2.10.0.

Buschdieb · 2024-07-20T09:12:31Z

On version 2.10.0 I get "Fatal error: Maximum execution time of 20 seconds exceeded in /vendor/smalot/pdfparser/src/Smalot/PdfParser/Font.php on line 150" if I set a time limit. My approach is not to process the 'faulty PDFs' indefinitely but to reject them beforehand.

To the Font.php method public static function uchr($code): string I added the following check on the Unicode code point.

if ($code < 0 || $code > 0x10FFFF) { throw new \Exception('Invalid Unicode character code: ' . $code); }

That makes sure that it doesn't go on indefinitely for me. I mainly have problems with PDFs that contain attachments like ZUGFeRD invoices (PDF + XML Attachment). I have tested it with my PDFs and with the test_pdf.pdf from this issue above.

ERROR LOG for test_pdf.pdf Exception: Invalid Unicode character code.
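The same idea can be tried outside the library. The following is a minimal stand-alone sketch, assuming the mbstring extension and 64-bit PHP; `safeUchr` is a hypothetical name and is not the library's `Font::uchr()`. It shows how a caller could catch the exception and skip the file instead of running until the time or memory limit is hit.

```php
<?php
// Hypothetical stand-alone helper, not the library's Font::uchr().
function safeUchr(int $code): string
{
    // Valid Unicode code points lie in the range 0..0x10FFFF.
    if ($code < 0 || $code > 0x10FFFF) {
        throw new \InvalidArgumentException('Invalid Unicode character code: ' . $code);
    }

    $char = mb_chr($code, 'UTF-8');
    if ($char === false) {
        // mb_chr() returns false for values it cannot encode (e.g. surrogates).
        throw new \InvalidArgumentException('Unencodable character code: ' . $code);
    }

    return $char;
}

try {
    echo safeUchr(0x41), "\n";       // a valid code point
    echo safeUchr(4276094578), "\n"; // the offset value from the dump in this thread
} catch (\InvalidArgumentException $e) {
    echo 'Skipped: ', $e->getMessage(), "\n";
}
```

Whether rejecting the whole document or only ignoring the offending range is the better behaviour is a design choice this sketch does not settle.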

AykutCevik · 2024-12-11T21:23:54Z

Still an issue with v2.11.0

k00ni added the bug label Aug 8, 2023
