[Bug]: real text replaced by � � (visually unchanged, only by copying) #1297

JoKalliauer · 2024-04-24T13:45:48Z

Describe the bug

If I run ocrmypdf with --skip-text on some PDF files that contain "real text", then the existing text gets replaced by ��

The reason I run it on a file with real text is that I have a document with sensitive information, similar to this one, that also contains screenshots.

Steps to reproduce

wget https://www.fcp.at/sites/default/files/2019-08/abstract_diplomarbeit_moschen.pdf
ocrmypdf -j 1 --optimize 01 -l deu+eng abstract_diplomarbeit_moschen.pdf output.pdf --skip-text -v1
Open output.pdf
Copy text into any text-application (notepad++/editor/writer/libre office/...)
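
Because the damage is only visible when the text is copied, it may help to check the text layer programmatically instead of by eye. The following is a minimal sketch, assuming the text of the input and output PDFs has already been extracted into strings (for example with pdftotext). The names `replacement_ratio` and `text_layer_damaged` and the 5% threshold are illustrative choices, not part of OCRmyPDF.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, shown as � when copied text cannot be decoded


def replacement_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are U+FFFD."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch == REPLACEMENT_CHAR for ch in visible) / len(visible)


def text_layer_damaged(before: str, after: str, threshold: float = 0.05) -> bool:
    """Flag the output if it has clearly more U+FFFD than the input had."""
    return replacement_ratio(after) - replacement_ratio(before) > threshold
```

Comparing against the input rather than using an absolute limit avoids flagging files that already contained a few undecodable characters before processing.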

Files

abstract_diplomarbeit_moschen.pdf

output.pdf

How did you download and install the software?

Linux package manager (apt, dnf, etc.)

OCRmyPDF version

16.1.1

Relevant log output

ocrmypdf 16.1.1                                                                                                                                                               __main__.py:59
Running: ['tesseract', '--version']                                                                                                                                          __init__.py:133
Found tesseract 5.3.4.post44                                                                                                                                                 __init__.py:342
Running: ['tesseract', '--version']                                                                                                                                          __init__.py:133
Running: ['gs', '--version']                                                                                                                                                 __init__.py:133
Found gs 10.2.1                                                                                                                                                              __init__.py:342
Running: ['gs', '--version']                                                                                                                                                 __init__.py:133
Running: ['tesseract', '--list-langs']                                                                                                                                       __init__.py:133
stdout/stderr = List of available languages in "/usr/share/tesseract-ocr/5/tessdata/" (3):                                                                                    __init__.py:73
deu
eng
osd

pikepdf mmap enabled                                                                                                                                                          helpers.py:326
os.symlink(abstract_diplomarbeit_moschen_ink.pdf, /tmp/ocrmypdf.io.esbdwxy5/origin)                                                                                           helpers.py:179
os.symlink(/tmp/ocrmypdf.io.esbdwxy5/origin, /tmp/ocrmypdf.io.esbdwxy5/origin.pdf)                                                                                            helpers.py:179
Gathering info with 1 thread workers                                                                                                                                             info.py:772
pikepdf mmap enabled                                                                                                                                                          helpers.py:326
Scanning contents     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 2/2 0:00:00
Using Tesseract OpenMP thread limit 1                                                                                                                                   tesseract_ocr.py:183
pikepdf mmap enabled                                                                                                                                                          helpers.py:326
    1 skipping all processing on this page                                                                                                                                  _pipeline.py:319
    2 skipping all processing on this page                                                                                                                                  _pipeline.py:319
    1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0                                                                                         _graft.py:140
    1 Page rotation: (content, auto) -> page = (0, 0) -> 0                                                                                                                     _graft.py:165
    2 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0                                                                                         _graft.py:140
    2 Page rotation: (content, auto) -> page = (0, 0) -> 0                                                                                                                     _graft.py:165
OCR                   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 2/2 0:00:00
Postprocessing...                                                                                                                                                                 ocr.py:146
os.symlink(/tmp/ocrmypdf.io.esbdwxy5/graft_layers.pdf, /tmp/ocrmypdf.io.esbdwxy5/fix_docinfo.pdf)                                                                             helpers.py:179
Running: ['gs', '--version']                                                                                                                                                 __init__.py:133
Running: ['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None',                                               __init__.py:133
'-sColorConversionStrategy=LeaveColorUnchanged', '-dPDFSTOPONERROR', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2',
'-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', '/tmp/ocrmypdf.io.esbdwxy5/fix_docinfo.pdf', '/tmp/ocrmypdf.io.esbdwxy5/pdfa.ps']
GPL Ghostscript 10.02.1 (2023-11-01)                                                                                                                                         __init__.py:108
Copyright (C) 2023 Artifex Software, Inc.  All rights reserved.                                                                                                              __init__.py:108
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:                                                                                                   __init__.py:108
see the file COPYING for details.                                                                                                                                            __init__.py:108
Processing pages 1 through 2.                                                                                                                                                __init__.py:108
Page 1                                                                                                                                                                       __init__.py:108
Page 2                                                                                                                                                                       __init__.py:108
Running: ['tesseract', '--version']                                                                                                                                          __init__.py:133
Optimizable images: JPEGs: 0 PNGs: 0                                                                                                                                         optimize.py:349
Recompressing JPEGs   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
Deflating JPEGs       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
Optimizable images: JBIG2 groups: 0                                                                                                                                          optimize.py:360
JBIG2                 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
os.symlink(/tmp/ocrmypdf.io.esbdwxy5/optimize.opt.pdf, /tmp/ocrmypdf.io.esbdwxy5/optimize.pdf)                                                                                helpers.py:179
Running: ['jbig2', '--version']                                                                                                                                              __init__.py:133
Running: ['pngquant', '--version']                                                                                                                                           __init__.py:133
Image optimization ratio: 1.00 savings: 0.0%                                                                                                                                _pipeline.py:976
Total file size ratio: 0.73 savings: -37.0%

maksimkurb · 2024-12-16T16:03:16Z

Oh, finally I found this issue. I ran into it in paperless-ngx when I tried to upload some documents. Russian Cyrillic text becomes "Äàòà ðàñ÷åòîâ" instead of something meaningful.

Here is my test file: TestRus.pdf

If you set --output-type pdf (instead of pdfa), the output should be OK.

I think this could be resolved if OCRmyPDF added a configuration parameter to optionally downgrade from pdfa to pdf when conversion to pdfa is not possible without breaking something.

It could be done in the following way:

(if a new option --pdfa-compatibility-policy equals 2) change this line to -dPDFACompatibilityPolicy=2. Ghostscript would then fail on bad documents where it can't determine the correct encoding.
(if a new option --pdfa-downgrade-to-pdf-on-error is true) catch the exception from the generate_pdfa method and ignore it, outputting a plain PDF as if --output-type pdf had been used.

Then OCRmyPDF would output either a valid PDF/A document or an almost untouched PDF (without breaking the encoding).
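
Until something like this exists in OCRmyPDF itself, a similar fallback can be approximated from the calling side. The sketch below is only an illustration of the control flow: `convert_with_fallback`, `convert`, and `output_ok` are hypothetical names, and both hooks have to be supplied by the caller.

```python
from typing import Callable


def convert_with_fallback(
    convert: Callable[[str], None],
    output_ok: Callable[[], bool],
) -> str:
    """Try PDF/A first; redo as plain PDF if it fails or the text is damaged.

    convert(output_type) runs one conversion; output_ok() inspects the result.
    Both are caller-supplied hooks (hypothetical, not OCRmyPDF API).
    Returns the output type that was finally used.
    """
    try:
        convert("pdfa")
        if output_ok():
            return "pdfa"
    except Exception:
        # With a stricter Ghostscript policy the PDF/A step may raise instead
        # of silently producing broken text; treat that as a failed attempt.
        pass
    convert("pdf")
    return "pdf"
```

With OCRmyPDF, `convert` could wrap a call such as `ocrmypdf.ocr(src, dst, output_type=t, skip_text=True)`, and `output_ok` could compare the text extracted from the input and output files. The cost of this approach is that a damaged file is processed twice.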

How to reproduce the encoding bug:

import ocrmypdf
import os
import pathlib

os.environ['LC_ALL'] = 'C.UTF-8'
os.environ['LANG'] = 'C.UTF-8'

args = {
	'input_file': pathlib.Path('./TestRus.pdf'),
	'output_file': pathlib.Path('./TestRus.out.pdf'),
	'use_threads': True,
	'jobs': 6,
	'language': 'rus+eng',
	'output_type': 'pdfa',  # 'pdf' avoids the bug, per the note above
	'progress_bar': False,
	'color_conversion_strategy': 'RGB',
	'skip_text': True,
	'clean': True,
	'deskew': True,
	'rotate_pages': True,
	'rotate_pages_threshold': 12.0,
	'sidecar': pathlib.Path('./sidecar.txt')
}
 
print(f"Running ocrmypdf with args: {args}")
ocrmypdf.configure_logging(verbosity=2)
ocrmypdf.ocr(**args)

# Run the script like this:
# >  python script.py && pdftotext TestRus.out.pdf - | head -n 2
# Expected Result: You should see valid Cyrillic e.g. "Валюта"
# Actual Result: Encoding is broken, you see "Äàòà ðàñ÷åòîâ"
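
As a side note on the garbled output: the string "Äàòà ðàñ÷åòîâ" is what Windows-1251 encoded Cyrillic looks like when its bytes are read as Latin-1. The short sketch below reverses that misreading. It is an observation about the string itself, not a confirmed diagnosis of what happens inside Ghostscript.

```python
garbled = "Äàòà ðàñ÷åòîâ"  # string reported in the comment above

# Recover the raw bytes by encoding as Latin-1, then read them as Windows-1251.
recovered = garbled.encode("latin-1").decode("cp1251")
print(recovered)  # Дата расчетов
```

This suggests the glyphs are still mapped to the original single-byte codes and only the mapping from those codes to Unicode was lost or replaced, which would fit the text looking unchanged on screen.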

JoKalliauer added the bug label Apr 24, 2024

JoKalliauer assigned jbarlow83 Apr 24, 2024

JoKalliauer changed the title ~~[Bug]: real text replaced by � �~~ [Bug]: real text replaced by � � (visually unchanged, only by copying) Apr 29, 2024
