Problems with getText() on PDF documents with UTF16BE encoding #734

Open
SeedDMS opened this issue Sep 6, 2024 · 0 comments

SeedDMS commented Sep 6, 2024

  • PHP Version: 8.2
  • PDFParser Version: 2.11.0

Description:

PDF input

The file is attached to a bug report for pdftotext (poppler): https://gitlab.freedesktop.org/poppler/poppler/-/issues/332

2004.pdf

Expected output & actual output

getText() returns text that is mostly UTF-16 encoded, but the parser appears to have added some non-UTF-16 characters to it.
Besides that, is there any way to determine which encoding is used? Or could the parser convert the text to UTF-8?

Code

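// $infile holds the path to the PDF under test (2004.pdf from the report above).
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile($infile);
$t = $pdf->getText(); // mostly UTF-16, mixed with some non-UTF-16 characters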
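
On the encoding question, one possible workaround outside the parser is to look for a byte order mark and convert with PHP's mbstring extension. The following is only a sketch: guessUtf16Encoding() and toUtf8() are hypothetical helpers, not part of PDFParser, and they assume the whole string is consistently encoded and starts with a BOM. They would not repair output in which UTF-16 and non-UTF-16 bytes are mixed, which is what this report describes.

```php
<?php
// Hypothetical helper (not a PDFParser API): guess the UTF-16 flavour
// of a string from its byte order mark. Returns null if no BOM is found.
function guessUtf16Encoding(string $text): ?string
{
    if (strncmp($text, "\xFE\xFF", 2) === 0) {
        return 'UTF-16BE'; // big-endian BOM
    }
    if (strncmp($text, "\xFF\xFE", 2) === 0) {
        return 'UTF-16LE'; // little-endian BOM
    }
    return null;
}

// Hypothetical helper: convert BOM-prefixed UTF-16 text to UTF-8.
// Text without a BOM is returned unchanged rather than guessed at.
function toUtf8(string $text): string
{
    $encoding = guessUtf16Encoding($text);
    if ($encoding === null) {
        return $text;
    }
    // Drop the 2-byte BOM, then convert (requires the mbstring extension).
    return mb_convert_encoding(substr($text, 2), 'UTF-8', $encoding);
}
```

With these helpers, toUtf8($t) could be applied to the getText() result; if no BOM is present the text comes back untouched, so the call is safe on output that is already UTF-8.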
