HorizontalOffset is not supported anymore #736

Open
luigif opened this issue Sep 20, 2024 · 2 comments

@luigif

luigif commented Sep 20, 2024

The config property HorizontalOffset, which was useful for dealing with formatting issues (https://github.com/smalot/pdfparser/blob/v2.11.0/doc/CustomConfig.md), is no longer checked.
It can still be set as described in the docs, but setting it has no effect.

The last version that read and used its value was 2.7.0; every later version ignores the setting.

@k00ni k00ni added the bug label Sep 23, 2024
@GreyWyvern
Contributor

Yeah, my rewrite of the document stream parsing code dropped this config variable off the table. The unit tests only check that the getter returns the configured value; they never test its effect on extracted document text, so my changes sailed through without errors.

One place where this config value definitely could be inserted back is in Font.php near the bottom of the decodeText() function:

// Cut down on the number of unnecessary internal spaces by
// imploding the string on the null byte, and checking if the
// text includes extra spaces on either side. If so, merge
// where appropriate.
$words = implode("\x00\x00", $words);
$hOffset = $this->config->getHorizontalOffset();
$words = str_replace(
    [" \x00\x00 ", "\x00\x00 ", " \x00\x00", "\x00\x00"],
    [' '.$hOffset.' ', $hOffset.' ', ' '.$hOffset, $hOffset],
    $words
);

... but this is probably not going to affect as many places in the generated text as the previous algorithm did. If you can check whether inserting this code solves your particular issue @luigif, we could add this back in as at least a partial fix.

Note: I'm not sure the above is the final fix; I'll have to run it on more test documents.
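
To experiment with this outside the library, the replacement step can be isolated in a standalone function. This is only a sketch: `joinWithOffset()` is a hypothetical helper name, not part of pdfparser, and the offset is passed in as a plain string instead of being read from `$this->config`. It shows how each of the four null-byte patterns picks up the offset.

```php
<?php

/**
 * Hypothetical helper (not part of pdfparser): joins decoded text
 * fragments and substitutes the horizontal offset at each join point,
 * mirroring the str_replace() call proposed above.
 */
function joinWithOffset(array $words, string $hOffset): string
{
    // Mark every join point with a double null byte.
    $joined = implode("\x00\x00", $words);

    // str_replace() applies the patterns in order, so the join points
    // that already have a space on one or both sides are handled
    // before the bare "\x00\x00" pattern catches the rest.
    return str_replace(
        [" \x00\x00 ", "\x00\x00 ", " \x00\x00", "\x00\x00"],
        [' '.$hOffset.' ', $hOffset.' ', ' '.$hOffset, $hOffset],
        $joined
    );
}
```

With `"\t"` as the offset, `['Name', 'Age']` becomes `"Name\tAge"`, while an empty offset joins the fragments directly as `"NameAge"`. Spaces that were already next to a join point are kept, so the offset is added to them and does not replace them.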

@k00ni k00ni added needs more info and removed bug labels Sep 27, 2024
@luigif
Author

luigif commented Sep 27, 2024

The patch in Font.php does not solve my problem.
With previous library versions I was able to fix issues in tables by tweaking HorizontalOffset.

If you need a sample PDF, you can check the tables in the following document:
https://www.figc-sardegna.it/wp-content/plugins/download-attachments/includes/download.php?id=19995
In the converted text, spaces are added or removed seemingly at random, which breaks the table formatting.

If you have any idea where to look or which parameters are relevant to the issue, I can run more tests.
