Major Update to PDFObject.php + Ancillary #634

Merged
merged 37 commits into from
Nov 7, 2023
Changes from 31 commits
8afd5f1
Rewrite PDFObject.php + ancillary
GreyWyvern Aug 14, 2023
b8cd57d
Add test, check for text/plain
GreyWyvern Aug 18, 2023
170af63
Modify some code comments + q and Q update
GreyWyvern Aug 18, 2023
bced73a
Update Font.php
GreyWyvern Aug 18, 2023
7e78a24
PHP-cs-fixer edits
GreyWyvern Aug 18, 2023
449cdde
Address some reviewed changes
GreyWyvern Aug 21, 2023
2167541
Use text matrix 'i' instead of 'b'
GreyWyvern Aug 21, 2023
42d3ec6
Use mb_detect_encoding() instead of finfo()
GreyWyvern Aug 22, 2023
393c084
Update Font.php
GreyWyvern Aug 22, 2023
097bea3
Sort switch statement cases in getTextArray()
GreyWyvern Aug 22, 2023
2ab9be2
Merge branch 'smalot:master' into master
GreyWyvern Aug 23, 2023
cb04a08
Merge branch 'smalot:master' into master
GreyWyvern Aug 24, 2023
055ace6
Edit documentation comments, add tests
GreyWyvern Aug 24, 2023
d9044e1
Update ParserTest.php
GreyWyvern Aug 24, 2023
81bcfd3
Restore original cleanContent()
GreyWyvern Aug 24, 2023
c2453a8
added a few PDFs generated with various tools to check conformance
k00ni Aug 25, 2023
dcc566e
fixed CS issue
k00ni Aug 25, 2023
01f7868
Remove textMatrix from decodeText()
GreyWyvern Aug 25, 2023
89f2459
Account for text leading, font-size factor
GreyWyvern Aug 28, 2023
457002b
Update FontTest.php
GreyWyvern Aug 28, 2023
b6a4ba2
Misc fixes
GreyWyvern Aug 29, 2023
8e3da84
Add SmallPDF sample PDF and unit test
GreyWyvern Sep 7, 2023
5a3a39b
Pass horizontal fontFactor to decodeText()
GreyWyvern Sep 11, 2023
94d748b
Update PDFObject.php
GreyWyvern Sep 13, 2023
9acf3fc
moved new PDFs into special folder
k00ni Sep 19, 2023
d7c0869
split DocumentTest class
k00ni Sep 19, 2023
fa4b2e9
added 3 PDFs generated by different generators + tests
k00ni Sep 19, 2023
999ed7d
More accounting for horizontal font factor
GreyWyvern Sep 19, 2023
2795a78
Revert decodeOctal(), add @internal
GreyWyvern Sep 20, 2023
101aa6d
Proper use of PHPDoc "internal"
GreyWyvern Sep 20, 2023
2ccf0b8
Disable a failing test
GreyWyvern Sep 20, 2023
0064386
Update PDFObject.php
GreyWyvern Sep 21, 2023
7fb1398
Update PDFObject.php
GreyWyvern Sep 21, 2023
5455990
Return empty array if no valid command detected
GreyWyvern Sep 26, 2023
1677af7
Merge branch 'smalot:master' into master
GreyWyvern Sep 26, 2023
59c5a5a
Update DocumentGeneratorFocusTest.php
GreyWyvern Sep 26, 2023
eedf888
Make formatContent() private
GreyWyvern Sep 27, 2023
Binary file added samples/bugs/Issue585.pdf
Binary file not shown.
Binary file added samples/bugs/Issue608.pdf
Binary file not shown.
File renamed without changes.
40 changes: 35 additions & 5 deletions src/Smalot/PdfParser/Font.php
@@ -391,7 +391,7 @@ public static function decodeEntities(string $text): string
      */
     public static function decodeUnicode(string $text): string
     {
-        if (preg_match('/^\xFE\xFF/i', $text)) {
+        if ("\xFE\xFF" === substr($text, 0, 2)) {
             // Strip U+FEFF byte order marker.
             $decode = substr($text, 2);
             $text = '';
@@ -416,16 +416,17 @@ protected function getFontSpaceLimit(): int
     /**
      * Decode text by commands array.
      */
-    public function decodeText(array $commands): string
+    public function decodeText(array $commands, float $fontFactor = 4): string
     {
         $word_position = 0;
         $words = [];
-        $font_space = $this->getFontSpaceLimit();
+        $font_space = $this->getFontSpaceLimit() * abs($fontFactor) / 4;

         foreach ($commands as $command) {
             switch ($command[PDFObject::TYPE]) {
                 case 'n':
-                    if ((float) trim($command[PDFObject::COMMAND]) < $font_space) {
+                    $offset = (float) trim($command[PDFObject::COMMAND]);
+                    if ($offset - (float) $font_space < 0) {
                         $word_position = \count($words);
                     }
                     continue 2;
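The threshold change in this hunk can be sketched in isolation. The helper names `scaledSpaceLimit()` and `startsNewWord()` are hypothetical, and the base limit of -50 is an assumed value chosen for illustration; only the scaling expression and the comparison are taken from the diff.

```php
<?php
// Illustrative sketch only: these helpers are hypothetical and are not
// part of the library. The base limit of -50 used below is an assumption.

/**
 * Scale a base space limit by the horizontal font factor, using the same
 * expression as the diff: limit * abs(fontFactor) / 4.
 */
function scaledSpaceLimit(int $baseLimit, float $fontFactor = 4): float
{
    return $baseLimit * abs($fontFactor) / 4;
}

/**
 * True when a numeric offset falls below the scaled limit, which is the
 * condition under which the diff moves $word_position to a new word.
 */
function startsNewWord(float $offset, int $baseLimit, float $fontFactor = 4): bool
{
    return $offset - scaledSpaceLimit($baseLimit, $fontFactor) < 0;
}

var_dump(startsNewWord(-80.0, -50));      // bool(true): -80 is below -50
var_dump(startsNewWord(-80.0, -50, 8.0)); // bool(false): limit scales to -100
```

With the default factor of 4 the scaled limit equals the base limit, so callers that do not pass `$fontFactor` keep the earlier behaviour.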
@@ -456,9 +457,32 @@ public function decodeText(array $commands): string

         foreach ($words as &$word) {
             $word = $this->decodeContent($word);
+            $word = str_replace("\t", ' ', $word);
         }

-        return implode(' ', $words);
+        // Remove internal "words" that are just spaces, but leave them
+        // if they are at either end of the array of words. This fixes,
+        // for example, lines that are justified to fill
+        // a whole row.
+        for ($x = \count($words) - 2; $x >= 1; --$x) {
+            if ('' === trim($words[$x], ' ')) {
+                unset($words[$x]);
+            }
+        }
+        $words = array_values($words);
+
+        // Cut down on the number of unnecessary internal spaces by
+        // imploding the string on the null byte, and checking if the
+        // text includes extra spaces on either side. If so, merge
+        // where appropriate.
+        $words = implode("\x00\x00", $words);
+        $words = str_replace(
+            [" \x00\x00 ", "\x00\x00 ", " \x00\x00", "\x00\x00"],
+            [' ', ' ', ' ', ' '],
+            $words
+        );
+
+        return $words;
     }

     /**
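To see what the new joining logic does compared with the old `implode(' ', $words)`, here is a minimal stand-alone sketch. The function name `joinWords()` and the sample input are hypothetical; the loop, the `"\x00\x00"` sentinel and the search patterns follow the hunk above.

```php
<?php
// Illustrative sketch only: joinWords() is a hypothetical stand-alone copy
// of the post-processing added to decodeText(), not a library function.

function joinWords(array $words): string
{
    // Drop interior entries made up only of spaces; the first and last
    // entries are never examined, so spaces at either end survive.
    for ($x = count($words) - 2; $x >= 1; --$x) {
        if ('' === trim($words[$x], ' ')) {
            unset($words[$x]);
        }
    }
    $words = array_values($words);

    // Join on a two-null-byte sentinel, then collapse the sentinel together
    // with any space directly next to it into a single space.
    $joined = implode("\x00\x00", $words);

    return str_replace(
        [" \x00\x00 ", "\x00\x00 ", " \x00\x00", "\x00\x00"],
        ' ',
        $joined
    );
}

$words = ['Hello', ' ', 'world ', 'again'];
echo json_encode(implode(' ', $words)), "\n"; // "Hello   world  again"
echo json_encode(joinWords($words)), "\n";    // "Hello world again"
```

The order of the search patterns matters: the longest pattern is replaced first, so a sentinel with spaces on both sides becomes one space rather than two or three.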
@@ -468,6 +492,12 @@ public function decodeText(array $commands): string
      */
     public function decodeContent(string $text, bool &$unicode = null): string
     {
+        // If this string begins with a UTF-16BE BOM, then decode it
+        // directly as Unicode
+        if ("\xFE\xFF" === substr($text, 0, 2)) {
+            return $this->decodeUnicode($text);
+        }
+
         if ($this->has('ToUnicode')) {
             return $this->decodeContentByToUnicodeCMapOrDescendantFonts($text);
         }
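The BOM short-circuit can be tried outside the class with a rough equivalent. This sketch is an assumption-laden stand-in: `decodeIfUtf16Be()` is a hypothetical name, and it uses `mb_convert_encoding()` from ext-mbstring instead of the library's own `decodeUnicode()`, whose body is not shown in this diff.

```php
<?php
// Illustrative sketch only: decodeIfUtf16Be() is hypothetical and relies on
// ext-mbstring; it does not reproduce the library's decodeUnicode().

function decodeIfUtf16Be(string $text): ?string
{
    // Same byte-wise check as the diff: a UTF-16BE byte order mark is the
    // two bytes 0xFE 0xFF at the very start of the string.
    if ("\xFE\xFF" !== substr($text, 0, 2)) {
        return null; // no BOM, so the caller would fall through to other decoders
    }

    // Strip the BOM and convert the remaining bytes to UTF-8.
    return mb_convert_encoding(substr($text, 2), 'UTF-8', 'UTF-16BE');
}

var_dump(decodeIfUtf16Be("\xFE\xFF\x00H\x00i")); // string(2) "Hi"
var_dump(decodeIfUtf16Be('Hi'));                 // NULL
```

Comparing the first two bytes with `substr()` avoids running the regex engine for what is a fixed two-byte prefix test.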