Oh, finally I found this issue. I ran into it in paperless-ngx when I tried to upload some documents. Russian Cyrillic text becomes "Äàòà ðàñ÷åòîâ" instead of something meaningful.
If you set --output-type pdf (instead of pdfa), the output should be OK.
I think this could be resolved if OCRmyPDF added a configuration parameter to optionally downgrade from PDF/A to PDF when conversion to PDF/A is not possible without breaking something.
It could be done the following way:
(if the new option --pdfa-compatibility-policy equals 2) change this line to -dPDFACompatibilityPolicy=2. Ghostscript would then fail on bad documents where it can't determine the correct encoding.
(if the new option --pdfa-downgrade-to-pdf-on-error equals true) catch the exception from the generate_pdfa method and ignore it, outputting a plain PDF as if --output-type pdf had been used.
Then OCRmyPDF would output either a valid PDF/A document or an almost untouched PDF (but without breaking the encoding).
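The same fallback can be approximated from the calling side, without changing OCRmyPDF. Below is a minimal sketch. The helper name `ocr_with_pdfa_fallback` is hypothetical, and it assumes that a failed PDF/A conversion surfaces as an exception from the OCR call. As far as this thread describes, that would only happen once something like the proposed -dPDFACompatibilityPolicy=2 is in place, so today the fallback branch may never trigger.

```python
from typing import Any, Callable


def ocr_with_pdfa_fallback(ocr: Callable[..., Any], **options: Any) -> str:
    """Try PDF/A output first; if the call raises, retry as plain PDF.

    `ocr` is any callable that accepts keyword options including
    `output_type` (for example `ocrmypdf.ocr`). Returns the output type
    that was actually produced: "pdfa" or "pdf".
    """
    try:
        ocr(**{**options, "output_type": "pdfa"})
        return "pdfa"
    except Exception:
        # Assumption: a PDF/A conversion failure is reported as an exception.
        # Catching Exception is deliberately broad for this sketch; real code
        # should narrow it to the specific error raised by the PDF/A step.
        ocr(**{**options, "output_type": "pdf"})
        return "pdf"


# Example usage (requires ocrmypdf to be installed):
#   import ocrmypdf
#   used = ocr_with_pdfa_fallback(
#       ocrmypdf.ocr,
#       input_file="TestRus.pdf",
#       output_file="TestRus.out.pdf",
#       language="rus+eng",
#       skip_text=True,
#   )
```

If the second call also fails (for example because the input file is missing), its exception propagates to the caller, so unrelated errors are not silently swallowed.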
How to reproduce the encoding bug:
import ocrmypdf
import os
import pathlib

os.environ['LC_ALL'] = 'C.UTF-8'
os.environ['LANG'] = 'C.UTF-8'

args = {
'input_file': pathlib.Path('./TestRus.pdf'),
'output_file': pathlib.Path('./TestRus.out.pdf'),
'use_threads': True,
'jobs': 6,
'language': 'rus+eng',
'output_type': 'pdf',
'progress_bar': False,
'color_conversion_strategy': 'RGB',
'skip_text': True,
'clean': True,
'deskew': True,
'rotate_pages': True,
'rotate_pages_threshold': 12.0,
'sidecar': pathlib.Path('./sidecar.txt')
}
print(f"Running ocrmypdf with args: {args}")
ocrmypdf.configure_logging(verbosity=2)
ocrmypdf.ocr(**args)
# Run the script like this:
# > python script.py && pdftotext TestRus.out.pdf - | head -n 2
# Expected result: you should see valid Cyrillic, e.g. "Валюта"
# Actual result: the encoding is broken, you see "Äàòà ðàñ÷åòîâ"
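The broken sample in this thread looks like Windows-1251 (cp1251) bytes that were displayed as Latin-1. That makes the bug easy to detect automatically in extracted text. The sketch below uses a hypothetical helper, `repair_cp1251_mojibake`, and only covers this one kind of mis-decoding. It is a diagnostic aid, not a fix for the PDF itself.

```python
from typing import Optional


def repair_cp1251_mojibake(text: str) -> Optional[str]:
    """Return the Cyrillic reading of `text` if it looks like cp1251 bytes
    shown as Latin-1; return None otherwise."""
    try:
        # Undo the wrong decoding: back to raw bytes, then decode as cp1251.
        candidate = text.encode("latin-1").decode("cp1251")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Characters outside Latin-1 (e.g. real Cyrillic) or bytes that are
        # undefined in cp1251: not this kind of mojibake.
        return None
    has_cyrillic = any("\u0400" <= ch <= "\u04ff" for ch in candidate)
    return candidate if has_cyrillic else None


print(repair_cp1251_mojibake("Äàòà ðàñ÷åòîâ"))  # Дата расчетов
print(repair_cp1251_mojibake("Валюта"))         # None (already valid Cyrillic)
```

Running this over the first lines of the pdftotext output gives a quick pass/fail signal: any non-None result means the text layer was mis-encoded.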
Describe the bug
If I run ocrmypdf with --skip-text on some PDF files that contain "real text", then the existing text gets replaced by ��
The reason I run it on a file with real text is that I have a document with sensitive information, similar to this one, that also contains screenshots.
Steps to reproduce
wget https://www.fcp.at/sites/default/files/2019-08/abstract_diplomarbeit_moschen.pdf
ocrmypdf -j 1 --optimize 01 -l deu+eng abstract_diplomarbeit_moschen.pdf output.pdf --skip-text -v1
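To confirm the result of these steps without opening the files, the extracted text of the input and output can be compared. This is a rough sketch that assumes pdftotext (from Poppler) is on PATH; the helper names are hypothetical. It counts U+FFFD replacement characters, which is how the lost text shows up in this report.

```python
import subprocess

REPLACEMENT_CHAR = "\ufffd"


def extract_text(pdf_path: str) -> str:
    """Extract the text layer with pdftotext, writing to stdout ("-")."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        check=True,
        capture_output=True,
        encoding="utf-8",
    )
    return result.stdout


def replacement_count(text: str) -> int:
    """Number of U+FFFD replacement characters in `text`."""
    return text.count(REPLACEMENT_CHAR)


def introduced_replacements(before: str, after: str) -> int:
    """How many more replacement characters `after` has than `before`."""
    return max(0, replacement_count(after) - replacement_count(before))


# Example usage with the files from the steps above:
#   before = extract_text("abstract_diplomarbeit_moschen.pdf")
#   after = extract_text("output.pdf")
#   print(introduced_replacements(before, after))
```

A result greater than zero suggests that text which was readable in the input was damaged in the output.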
Files
abstract_diplomarbeit_moschen.pdf
output.pdf
How did you download and install the software?
Linux package manager (apt, dnf, etc.)
OCRmyPDF version
16.1.1
Relevant log output