Original and Converted Document Dimensions Don't Match #626

Closed
deeplow opened this issue Nov 27, 2023 · 3 comments

Comments

@deeplow
Contributor

deeplow commented Nov 27, 2023

Comment by @j75 from another issue:

the original document (https://s1.q4cdn.com/806093406/files/doc_downloads/test.pdf) has Page size: 612 x 792 pts (letter)
the test-safe.pdf document has Page size: 1275 x 1650 pts

I had just come across this as well:

| | original size | converted size |
|---|---|---|
| width (inches) | 8.27 | 17.73 |
| height (inches) | 11.7 | 25.06 |
| width (points) | 595.28 | 1276.46 |
| height (points) | 841.89 | 1804.11 |

@deeplow
Contributor Author

deeplow commented Nov 28, 2023

It turns out that this is caused by a mismatch between the pixel density used when converting the document into pixels and the one used when reassembling it back into a PDF. This is further confirmed by the difference in document size between doing OCR (which uses tesseract under the hood) and no OCR (which uses GraphicsMagick).

| document | width x height |
|---|---|
| original | 612 x 792 pts (letter) |
| with OCR | 1311.43 x 1697.14 pts |
| no OCR | 1275 x 1650 pts |

In Doc to Pixels, it uses pdftoppm with its default configuration (dpi=150).

However, when reassembling it, the OCR path uses dpi=70, while the non-OCR path uses GraphicsMagick with default params (which I assume means dpi=72).
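The arithmetic behind those numbers can be sketched as follows. This is only an illustration of the DPI mismatch described above, not the project's code: the function names are made up for the example, and the 70 and 72 DPI reassembly values are the ones assumed in this comment.

```python
POINTS_PER_INCH = 72  # PDF points are defined as 1/72 of an inch


def rendered_pixels(size_pts, render_dpi):
    """Pixels produced when a length in points is rasterized at render_dpi."""
    return size_pts / POINTS_PER_INCH * render_dpi


def reassembled_points(size_pts, render_dpi, reassembly_dpi):
    """Length in points after rasterizing at render_dpi and rebuilding
    the PDF under the assumption that the image is at reassembly_dpi."""
    pixels = rendered_pixels(size_pts, render_dpi)
    return pixels * POINTS_PER_INCH / reassembly_dpi


if __name__ == "__main__":
    letter = (612, 792)  # original page size in points
    # 70 and 72 are the assumed reassembly DPIs; 150 would match the render step.
    for label, dpi in (("with OCR", 70), ("no OCR", 72), ("matched", 150)):
        width, height = (reassembled_points(side, 150, dpi) for side in letter)
        print(f"{label}: {width:.2f} x {height:.2f} pts")
```

When the reassembly DPI equals the rendering DPI, the page comes back at its original size; any lower value inflates the page by the ratio of the two.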

To fix this, I've had success setting tesseract-ocr to dpi=150. However, for the non-OCR option I don't know how to fix it, since with GraphicsMagick the -density param isn't making a difference. That's not too critical until we decide whether or not we want #627.
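For reference, a minimal sketch of what the OCR-side fix could look like, assuming tesseract is called through its command-line interface with the `--dpi` option. The helper name and file names are hypothetical, and the way the project actually invokes tesseract may differ.

```python
import shlex

RENDER_DPI = 150  # DPI used by the doc-to-pixels step, per this thread


def tesseract_pdf_command(image_path, output_base, dpi=RENDER_DPI):
    """Build (but do not run) a tesseract invocation that writes a PDF
    while telling tesseract the resolution of the input image."""
    # "pdf" is tesseract's config name for PDF output; output_base gets ".pdf" appended.
    return ["tesseract", image_path, output_base, "--dpi", str(dpi), "pdf"]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(shlex.join(tesseract_pdf_command("page-1.png", "page-1")))
```

Keeping the DPI in one shared constant for both the rendering and reassembly steps would make it harder for the two values to drift apart again.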

@j75

j75 commented Dec 1, 2023

"uses tesseract under the hood": well, in this case there might be another issue (at least on Ubuntu): the package does not depend on tesseract!

Depends: python3:any (>= 3.6~), podman, python3, python3-pyside2.qtcore, python3-pyside2.qtgui,
         python3-pyside2.qtwidgets, python3-pyside2.qtsvg, python3-appdirs, python3-click,
         python3-xdg, python3-colorama, python3-requests, python3-markdown, python3-packaging

In my case fortunately I already have this package installed, however in the general case I think it should be part of deb's dependencies.

@deeplow
Contributor Author

deeplow commented Dec 1, 2023

Oh, that part is fine because tesseract comes pre-installed in the container image we use. So even if one's system doesn't have tesseract, it will still work.

deeplow added a commit that referenced this issue Dec 4, 2023
The converted document was larger in dimensions than the original one due
to a mismatch in DPI settings. When converting documents to pixels we
were setting the DPI to 150 pixels per inch. Then when converting back
into a PDF we were using 70 DPI. This difference would result in an
overall larger document in dimensions (though not necessarily in file
size).

Fixes #626
deeplow added a commit that referenced this issue Dec 22, 2023
deeplow closed this as completed in 576cbd3 Jan 4, 2024