Problems with getText() on PDF documents with UTF16BE encoding #734

SeedDMS · 2024-09-06T10:32:17Z

PHP Version: 8.2
PDFParser Version: 2.11.0

Description:

PDF input

The test file is attached to a pdftotext (Poppler) bug report: https://gitlab.freedesktop.org/poppler/poppler/-/issues/332

2004.pdf

Expected output & actual output

getText() returns text that is mostly UTF-16 encoded, but the parser appears to have added some non-UTF-16 characters.
Besides that, is there any way to determine which encoding is used? Or could the parser convert the text to UTF-8?

Code

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile($infile);
$t = $pdf->getText();
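
As a possible workaround on the caller's side, the returned string could be checked and converted after the fact. The following is only a minimal sketch, not something PDFParser provides: `toUtf8IfUtf16be()` is a hypothetical helper, the NUL-byte heuristic and its 0.5 threshold are assumptions that only fit mostly-Latin text, and it relies on PHP's mbstring extension. It will not repair the stray non-UTF-16 characters described above; it only handles cleanly encoded UTF-16BE input.

```php
<?php
// Hypothetical helper (not part of PDFParser): guesses whether a string is
// UTF-16BE and, if so, converts it to UTF-8 with the mbstring extension.
function toUtf8IfUtf16be(string $text): string
{
    // A leading byte order mark FE FF is the clearest sign of UTF-16BE.
    $hasBom = strncmp($text, "\xFE\xFF", 2) === 0;

    // Heuristic (assumption): in mostly-Latin UTF-16BE text, the first byte
    // of nearly every 2-byte unit is NUL.
    $pairs = intdiv(strlen($text), 2);
    $nulHigh = 0;
    for ($i = 0; $i < $pairs; $i++) {
        if ($text[2 * $i] === "\x00") {
            $nulHigh++;
        }
    }
    $looksUtf16be = $pairs > 0 && ($nulHigh / $pairs) > 0.5;

    if (!$hasBom && !$looksUtf16be) {
        return $text; // Probably not UTF-16BE: leave the string untouched.
    }

    if ($hasBom) {
        $text = substr($text, 2); // Drop the BOM before converting.
    }

    return mb_convert_encoding($text, 'UTF-8', 'UTF-16BE');
}
```

With the code above, this would be used as `$t = toUtf8IfUtf16be($pdf->getText());`. Text that does not look like UTF-16BE is passed through unchanged.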

k00ni added bug de-/encoding issue labels Sep 6, 2024
