Allowed memory exhausted when parse the PDF file. #631
Comments
Thanks for reporting. What program did you use to generate the PDF? To be sure the error still exists, please try again with the latest master branch. |
I do not know what generated the PDF, because visitors of our sites uploaded it as a cover letter, which we try to parse so that full-text search also covers attachments. I have only edited the PDF in Adobe PDF editor to anonymize the data. We hit this problem multiple times while parsing PDFs, so if necessary I can anonymize more examples. But it is pretty rare (about 10 PDFs out of 1 000 000). All of them had, on one page, text on a colored background. I have tested the problematic PDF on the master branch as well, with the same result: Fatal error: Allowed memory size of 1077936128 bytes exhausted (tried to allocate 335544320 bytes) in ........../smalot/pdfparser/src/Smalot/PdfParser/Font.php on line 230 |
Thank you for the feedback. |
A similar issue also hit me. I'll post this here as this looks like a common unhandled exception, but let me know if you need a specific issue. Just like the OP, only a small portion of a much larger batch appears to be affected. As for the PDF creator:
PdfParser exception:
|
Thank you for the quick reply, @k00ni! I'll try the PR and let you know the results. Please expect some delay as these are very busy days here. |
@k00ni is there a way to send the source document for your eyes only? It cannot be shared in public. As for the tests with #634, before (using v2.7.0):
After (using master+#634):
My env:
Test script:
<?php
ini_set("memory_limit", "128M");
require __DIR__ . '/vendor/autoload.php';
use Smalot\PdfParser\Config;
use Smalot\PdfParser\Parser;
$pages = getPDFPageCount('test.pdf', 'test');
echo "File has $pages pages\n";
function getPDFPageCount(string $file, string $origin): mixed
{
    $config = new Config();
    $config->setRetainImageContent(false);
    $parser = new Parser([], $config);
    try {
        $pdf = $parser->parseFile($file);
        $details = $pdf->getDetails();
        return $details['Pages'];
    } catch (Exception $e) {
        $pages = 0;
        echo $e->getMessage();
        return $pages;
    }
} |
@denydias Thank you for your detailed answer. Don't send me the PDF privately, I don't do private support via mail. #634 is the latest big set of changes, so there was a chance that it might cover this case. The problem with these errors is that they seem to be very PDF-dependent. We need further work on the parsing part to avoid endless loops/recursion. |
@k00ni I understand you don't provide private support and I'm not asking you to. I'm reporting an issue and looking to privately provide you with the entity where the problem occurs in the hope you can improve your product, but asking no warranties or even replies on that matter. In most cases I agree with you for the PDF-dependent claim, but this particular one is part of a set with 1.706 files produced by a "pretty standard" (TM) workflow. As just this one triggers an exception, it looks like a perfect candidate for an edge case worth looking into. But this is not my call. |
You are right. Would you create a pull request and help us solve the issue? |
I'll dive into it when I get the time, @k00ni. |
I have the same issue (memory exhausted [in my case 500MB]), also with just one PDF on my website. I will provide a link to the document at the end of this post. Creator: Microsoft PowerPoint 2016. I hope this helps you find the bug. Thanks for providing that great library! Kind regards |
To test a development version of our memory profiler, I've tried to investigate the leak in the original issue.
Test script
<?php
use Smalot\PdfParser\Config;
use Smalot\PdfParser\Parser;
include "vendor/autoload.php";
ini_set('memory_limit', '128M');
register_shutdown_function(
    function (): void {
        $error = error_get_last();
        if (is_null($error)) {
            return;
        }
        if (strpos($error['message'], 'Allowed memory size of') !== 0) {
            return;
        }
        $pid = getmypid();
        $file_opt = '--memory-limit-error-file=' . escapeshellarg($error['file']);
        $line_opt = '--memory-limit-error-line=' . escapeshellarg($error['line']);
        system("sudo reli i:m -p {$pid} --no-stop-process {$file_opt} {$line_opt} >memory_analyzed.json");
    }
);
$config = new Config();
$url = __DIR__ . '/test_pdf.pdf';
$config->setRetainImageContent(false);
$config->setDecodeMemoryLimit(1000000);
$parser = new Parser([], $config);
$pdf = $parser->parseFile($url);
The summary of the memory usage
~/work/oss/tmp/pdfparser_test$ cat memory_analyzed.json | jq .summary
[
{
"zend_mm_heap_total": 130023424,
"zend_mm_heap_usage": 128245688,
"zend_mm_chunk_total": 46137344,
"zend_mm_chunk_usage": 44359608,
"zend_mm_huge_total": 83886080,
"zend_mm_huge_usage": 83886080,
"vm_stack_total": 262144,
"vm_stack_usage": 1632,
"compiler_arena_total": 458752,
"compiler_arena_usage": 7264,
"possible_allocation_overhead_total": 3893453,
"possible_array_overhead_total": 248704,
"memory_get_usage": 128276816,
"memory_get_real_usage": 130023424,
"cached_chunks_size": 0,
"heap_memory_analyzed_percentage": 99.97573372884466,
"php_version": "v82",
"analyzer": "reli 0.11.0"
}
]
~/work/oss/tmp/pdfparser_test$ cat memory_analyzed.json | jq .location_types_summary | jq -r '(["location_type", "count", "memory_usage"] | (., map(length*"="))),(to_entries[]|[.key,.value.count,.value.memory_usage])|@tsv' | column -t -o ' | '
location_type | count | memory_usage
============= | ===== | ============
ZendArrayTableMemoryLocation | 600 | 84052280
ZendStringMemoryLocation | 1049683 | 38511955
ZendObjectMemoryLocation | 10278 | 742320
ZendArrayTableOverheadMemoryLocation | 595 | 159296
ObjectsStoreMemoryLocation | 1 | 131072
ZendArrayMemoryLocation | 602 | 33712
RuntimeCacheMemoryLocation | 101 | 7360
CallFrameVariableTableMemoryLocation | 9 | 832
CallFrameHeaderMemoryLocation | 10 | 800
ZendOpArrayHeaderMemoryLocation | 1 | 248
StaticMembersTableMemoryLocation | 5 | 176
ZendResourceMemoryLocation | 3 | 72
ZendReferenceMemoryLocation | 2 | 64
ZendMmHugeListMemoryLocation | 2 | 48
As you can see above, arrays and strings account for the majority of memory consumption.
Finding the culprit arrays
Let's extract the 20 largest ones in order of size.
~/work/oss/tmp/pdfparser_test$ cat memory_analyzed.json | jq '. as $root | path(..|objects|select(."#type"=="ArrayElementsContext"))| . as $path | $root|getpath($path) as $elements | {path: $path|join("."), size: $elements."#locations"[0].size, count: $elements."#count"}' | jq -rs '(["size", "count", "path"] | (., map(length*"="))),(sort_by(.size) | .[-20:] | reverse | .[] | [.size, .count, .path])|@tsv' | column -t -o ' | '
size | count | path
==== | ===== | ====
41943040 | 1048576 | context.class_table.smalot\\pdfparser\\font.static_properties.uchrCache.array_elements
41913376 | 1047649 | context.call_frames.3.this.object_properties.table.array_elements
36552 | 2284 | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.34_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements
13720 | 857 | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.26_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements
8840 | 552 | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements
6536 | 408 | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.545.value.object_properties.value.array_elements
3496 | 218 | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.36.value.object_properties.value.array_elements
3480 | 217 | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.148.value.object_properties.value.array_elements
3352 | 209 | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.428.value.object_properties.value.array_elements
3264 | 70 | context.call_frames.9.symbol_table.array_elements._SERVER.value.array_elements
2216 | 138 | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.110.value.object_properties.value.array_elements
1864 | 116 | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.413.value.object_properties.value.array_elements
1784 | 111 | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.123.value.object_properties.value.array_elements
1696 | 37 | context.call_frames.7.local_variables.xref.array_elements.xref.value.array_elements
1688 | 105 | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.34_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.572.value.object_properties.value.array_elements
1672 | 104 | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.26.value.object_properties.value.array_elements
1608 | 100 | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.299.value.object_properties.value.array_elements
1600 | 34 | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements
1496 | 93 | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.489.value.object_properties.value.array_elements
1432 | 89 | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.34_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.2108.value.object_properties.value.array_elements
Two arrays are the culprits.
Dumping the real stack trace on memory_limit violations is the new feature I want to test in this trial.
~/work/oss/tmp/pdfparser_test$ cat memory_analyzed.json | jq -r '(["frame_no", "function", "line"] | (., map(length*"="))),(path(.context.call_frames[]|objects) as $path | [$path[2], getpath($path).function_name, getpath($path).lineno])|@tsv' | column -t
frame_no function line
======== ======== ====
0 system 4
1 {closure}(/home/sji/work/oss/tmp/pdfparser_test/test.php:11-21) 20
2 Smalot\\PdfParser\\Font::uchr 150
3 Smalot\\PdfParser\\Font::loadTranslateTable 230
4 Smalot\\PdfParser\\Font::init 78
5 Smalot\\PdfParser\\Document::init 90
6 Smalot\\PdfParser\\Document::setObjects 316
7 Smalot\\PdfParser\\Parser::parseContent 122
8 Smalot\\PdfParser\\Parser::parseFile 90
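9 <main> 29
So, the two arrays, the static property uchrCache of Smalot\PdfParser\Font and the table property of $this in frame 3 (Font::loadTranslateTable), grow while Font::loadTranslateTable calls Font::uchr.
Why these arrays grow so large
Then let's also dump some seemingly related local variables.
~/work/oss/tmp/pdfparser_test$ cat memory_analyzed.json | jq '.context.call_frames."3".local_variables |{char: .char, char_from: .char_from, char_to: .char_to, offset: .offset, key: .key}'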
{
"char": {
"#node_id": 2147647,
"#type": "ScalarValueContext",
"value": 1047644
},
"char_from": {
"#node_id": 2147644,
"#type": "ScalarValueContext",
"value": 64287
},
"char_to": {
"#node_id": 2147645,
"#type": "ScalarValueContext",
"value": 4276029042
},
"offset": {
"#node_id": 2147646,
"#type": "ScalarValueContext",
"value": 4276094578
},
"key": {
"#node_id": 2147638,
"#type": "ScalarValueContext",
"value": 50
}
}
~/work/oss/tmp/pdfparser_test$ cat memory_analyzed.json | jq -r '.context.call_frames."3".local_variables |.matches.referenced.array_elements."0".value.array_elements."50".value'
{
"#node_id": 2147143,
"#type": "StringContext",
"#locations": [
{
"address": 139914548230560,
"size": 53,
"refcount": 1,
"type_info": 22,
"value": "<FB1F> <FEDF0672> <FEE00672> "
}
]
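}
It seems that one of the matched entries, "<FB1F> <FEDF0672> <FEE00672> ", is read as a range from char_from = 0xFB1F (64287) to char_to = 0xFEDF0672 (4276029042), so Font::loadTranslateTable apparently tries to fill in about 4.3 billion table entries, and char had only reached 1047644 when the memory limit was hit.
I am not familiar with the PDF specification, so I cannot send a PR to fix it. Sorry. I am already happy with the successful testing of my tool, and I hope this report can make someone else happy too.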
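The numbers in the dump can be checked by hand. Below is a minimal sketch, assuming the three hex groups of the matched string map to char_from, char_to and offset in that order (which is what the dumped values suggest). The helper name describeRange is hypothetical and is not part of smalot/pdfparser; it only does the arithmetic and does not reproduce the library's loop. It also assumes 64-bit PHP, where hexdec returns an integer for these values.

```php
<?php
// Hypothetical helper for illustration only; not part of smalot/pdfparser.
// It decodes the hex groups of one matched entry and reports how many
// code points an inclusive from..to range would cover.
function describeRange(string $entry): array
{
    // Collect every <HEX> group in the entry.
    preg_match_all('/<([0-9A-Fa-f]+)>/', $entry, $m);
    [$from, $to, $offset] = array_map('hexdec', $m[1]);

    return [
        'from' => $from,
        'to' => $to,
        'offset' => $offset,
        // Size of the inclusive range from..to.
        'entries' => $to - $from + 1,
    ];
}

// The string taken from the memory dump in this thread.
$r = describeRange('<FB1F> <FEDF0672> <FEE00672> ');
printf(
    "from=%d to=%d offset=%d entries=%d\n",
    $r['from'],
    $r['to'],
    $r['offset'],
    $r['entries']
);
```

The decoded values match char_from, char_to and offset from the local variable dump, and a range of that size cannot fit into a PHP array under any realistic memory_limit.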
|
I am! Superb debug job, @sj-i! 👏 |
We are experiencing the same issue. Any news on this? |
Hi there! Any news on that bug? |
On version 2.10.0 I get "Fatal error: Maximum execution time of 20 seconds exceeded in /vendor/smalot/pdfparser/src/Smalot/PdfParser/Font.php on line 150" if I set a time limit. My approach is not to process the ‘faulty PDFs’ indefinitely but to reject them early. In Font.php, I added the following check on the Unicode code to public static function uchr($code): string.
That makes sure that it doesn't go on indefinitely for me. I mainly have problems with PDFs that contain attachments, like ZUGFeRD invoices (PDF + XML attachment). I have tested it with my PDFs and with the test_pdf.pdf from this issue above. ERROR LOG for test_pdf.pdf: Exception: Invalid Unicode character code. |
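The check itself is not shown in the comment above. As a rough illustration of what a guard of that kind could look like, here is a minimal sketch that rejects any code outside the Unicode range 0 to 0x10FFFF. The function name assertValidCodePoint, the constant, and the choice of InvalidArgumentException are assumptions made for this example; this is not the commenter's patch and not code from smalot/pdfparser. Only the exception message is taken from the error log quoted above.

```php
<?php
// Hypothetical guard for illustration only; not the commenter's actual
// patch and not part of smalot/pdfparser.
const MAX_UNICODE_CODE_POINT = 0x10FFFF; // highest valid Unicode code point

function assertValidCodePoint(int $code): void
{
    if ($code < 0 || $code > MAX_UNICODE_CODE_POINT) {
        // Message copied from the error log quoted in this thread.
        throw new InvalidArgumentException('Invalid Unicode character code.');
    }
}

// 0xFEE00672 is the offset value seen in the memory dump earlier in
// this thread; it lies far outside the Unicode range.
$rejected = false;
try {
    assertValidCodePoint(0xFEE00672);
} catch (InvalidArgumentException $e) {
    $rejected = true;
}

// 0xFB1F is an ordinary code point and passes without an exception.
assertValidCodePoint(0xFB1F);

echo $rejected ? "out-of-range code rejected\n" : "code accepted\n";
```

A guard like this fails fast on the first out-of-range code, so a caller can catch the exception and skip the PDF rather than run into the memory or time limit.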
Still an issue with |
Description:
Trying to parse this PDF always results in an "Allowed memory exhausted" error.
Error: Allowed memory size of 1077936128 bytes exhausted (tried to
allocate 335544320 bytes) in
...../smalot/pdfparser/src/Smalot/PdfParser/Font.php,
line 223
Raising the PHP memory limit to 4GB did not help either.
I have also tried lowering the value passed to setDecodeMemoryLimit, but still had the same memory issue. Setting the decode memory limit prevents the error only when I set it to 1000 or lower. So maybe it should be set in MB and not in bytes, or there is a bug in the code.
PDF input
test_pdf.pdf
Expected output & actual output
The parser should either parse the text from the PDF, return an empty string, or throw an exception, but not fail with a memory error.
Code