Commit

Filter ElementHexa::decode() of non-hex chars (#687)
* Filter ElementHexa::decode() of non-hex chars

Add a `preg_replace()` to `ElementHexa::decode()` so incoming strings are filtered of all non-hexadecimal characters.

Also remove the BOM (`feff`) if it exists. The function checks for the characters '00' at the beginning of the string to decide whether to decode it in 4-byte or 2-byte fashion. It does not account for the 4-byte BOM, so it decodes such a string in 2-byte fashion and depends on later functions (in this case `Parser::parseHeaderElement()`) to repair the incorrectly decoded contents. Checking for and removing the BOM allows `ElementHexa::decode()` to return correctly decoded contents the first time.

* Update ElementHexa.php

Instead of just deleting/ignoring it, separate out the BE BOM `feff` as an additional check for 4-byte hexadecimal content.

* Cast preg_replace calls to strings
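
The problem described in the commit message can be illustrated with a small standalone sketch. The helper names `decodePairs`, `isFourDigitBefore`, and `isFourDigitAfter` are hypothetical and are not part of the library; they only mirror the branch conditions visible in this commit. Here "four digits" means four hex digits per character, which the commit message calls 4-byte.

```php
<?php

declare(strict_types=1);

// Hypothetical helper: decode a hex string two digits at a time,
// in the manner of the 2-byte branch.
function decodePairs(string $hex): string
{
    $text = '';
    for ($i = 0, $length = strlen($hex); $i < $length; $i += 2) {
        $text .= chr((int) hexdec(substr($hex, $i, 2)));
    }

    return $text;
}

// Hypothetical helper: the branch condition before this commit.
function isFourDigitBefore(string $hex): bool
{
    return '00' === substr($hex, 0, 2);
}

// Hypothetical helper: the branch condition after this commit.
function isFourDigitAfter(string $hex): bool
{
    return '00' === substr($hex, 0, 2)
        || 'feff' === strtolower(substr($hex, 0, 4));
}

$hex = 'feff00700061'; // BE BOM followed by "p" and "a" as four hex digits each

var_dump(isFourDigitBefore($hex)); // bool(false): falls through to pairwise decoding
var_dump(isFourDigitAfter($hex));  // bool(true): recognised via the BOM

// Pairwise decoding keeps the BOM and the zero bytes in the output,
// so the result is raw bytes rather than readable text.
echo bin2hex(decodePairs($hex)), "\n"; // feff00700061
```

With the old condition, a BOM-prefixed string is decoded pairwise and the caller receives the raw bytes, which is why a later repair step was needed.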
GreyWyvern authored Mar 12, 2024
1 parent 6c9617c commit ca3fea6
Showing 2 changed files with 23 additions and 4 deletions.
14 changes: 10 additions & 4 deletions src/Smalot/PdfParser/Element/ElementHexa.php
@@ -64,15 +64,21 @@ public static function parse(string $content, ?Document $document = null, int &$
     public static function decode(string $value): string
     {
         $text = '';
-        $length = \strlen($value);
 
-        if ('00' === substr($value, 0, 2)) {
-            for ($i = 0; $i < $length; $i += 4) {
+        // Filter $value of non-hexadecimal characters
+        $value = (string) preg_replace('/[^0-9a-f]/i', '', $value);
+
+        // Check for leading zeros (4-byte hexadecimal indicator), or
+        // the BE BOM
+        if ('00' === substr($value, 0, 2) || 'feff' === strtolower(substr($value, 0, 4))) {
+            $value = (string) preg_replace('/^feff/i', '', $value);
+            for ($i = 0, $length = \strlen($value); $i < $length; $i += 4) {
                 $hex = substr($value, $i, 4);
                 $text .= '&#'.str_pad(hexdec($hex), 4, '0', \STR_PAD_LEFT).';';
             }
         } else {
-            for ($i = 0; $i < $length; $i += 2) {
+            // Otherwise decode this as 2-byte hexadecimal
+            for ($i = 0, $length = \strlen($value); $i < $length; $i += 2) {
                 $hex = substr($value, $i, 2);
                 $text .= \chr(hexdec($hex));
             }
13 changes: 13 additions & 0 deletions tests/PHPUnit/Integration/Element/ElementHexaTest.php
@@ -117,5 +117,18 @@ public function testParse(): void
         $this->assertTrue($element instanceof ElementDate);
         $this->assertEquals('2013-12-17T13:40:45+00:00', (string) $element);
         $this->assertEquals(49, $offset);
+
+        // Test that a hexadecimal string 'dirty' with extra characters
+        // such as newlines or spaces is properly decoded
+        $element = ElementHexa::decode(' <feff007000610073007100750061002c0020007000720069006d00610076006500720061002c0020
+00720065007 30075007200720065007a0069006f006e0065002c0020006600650073007400610020
+0063007200690073007400690061006e0061002c002000670065007300f9002c00200075006f0076
+0061002000640069 & 002000630069006f00630063006f006c00610074006100 Y 2c00200063006f006e
+00690067006c00690065007400740069002c0020007000750 / 06c00630069006e0069002c00200070
+00610073007100750061006c0065 002c002000630061006d00700061006e0065002c002000640069
+006e006100200072006500620075006300630069002c00200075006f007600610020006400690020
+007000610 073007100750061002c0020> ');
+
+        $this->assertEquals('pasqua, primavera, resurrezione, festa cristiana, gesù, uova di cioccolata, coniglietti, pulcini, pasquale, campane, dina rebucci, uova di pasqua, ', $element);
     }
 }
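
The behaviour the new test exercises can be tried outside PHPUnit with a shorter input. What follows is a minimal sketch under stated assumptions, not the library's implementation: `decodeHexString` is a hypothetical function name, and the final `html_entity_decode()` call is an assumption about how the numeric entities get resolved to UTF-8, since the diff above does not show what happens after the loops.

```php
<?php

declare(strict_types=1);

// Hypothetical standalone function mirroring the steps in the diff:
// filter, detect four-digit content, strip the BOM, decode.
function decodeHexString(string $value): string
{
    // Drop every character that is not a hexadecimal digit
    // (angle brackets, spaces, newlines, stray symbols).
    $hex = (string) preg_replace('/[^0-9a-f]/i', '', $value);

    $isFourDigit = '00' === substr($hex, 0, 2)
        || 'feff' === strtolower(substr($hex, 0, 4));

    $text = '';

    if ($isFourDigit) {
        // Remove a leading BE BOM so it is not decoded as a character.
        $hex = (string) preg_replace('/^feff/i', '', $hex);

        // Build one numeric entity per group of four hex digits.
        for ($i = 0, $length = strlen($hex); $i < $length; $i += 4) {
            $text .= '&#'.hexdec(substr($hex, $i, 4)).';';
        }

        // Assumption: resolve the entities to UTF-8 right away.
        return html_entity_decode($text, ENT_NOQUOTES, 'UTF-8');
    }

    // Otherwise treat every pair of hex digits as one byte.
    for ($i = 0, $length = strlen($hex); $i < $length; $i += 2) {
        $text .= chr((int) hexdec(substr($hex, $i, 2)));
    }

    return $text;
}

// A 'dirty' BOM-prefixed string with brackets, spaces and a newline.
echo decodeHexString(" <feff0070 0061\n0073> "), PHP_EOL; // pas

// Plain pairwise content is still decoded byte by byte.
echo decodeHexString('48 65 6c6c6f'), PHP_EOL; // Hello
```

Filtering before the branch check matters here: without it, the leading space and `<` would hide both the `00` prefix and the BOM from the `substr()` comparisons.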