[Bug]: real text replaced by � � (visually unchanged, only by copying) #1297

Open
JoKalliauer opened this issue Apr 24, 2024 · 1 comment

JoKalliauer commented Apr 24, 2024

Describe the bug

If I run ocrmypdf with --skip-text on some PDF files that contain "real text", then the existing text gets replaced by ��. The pages look unchanged; the problem only shows up when the text is copied out.

The reason I run it on a PDF with real text is that I have a document with sensitive information, similar to this one, that also contains screenshots.

Steps to reproduce

  1. wget https://www.fcp.at/sites/default/files/2019-08/abstract_diplomarbeit_moschen.pdf
  2. ocrmypdf -j 1 --optimize 01 -l deu+eng abstract_diplomarbeit_moschen.pdf output.pdf --skip-text -v1
  3. Open output.pdf
  4. Copy text into any text-application (notepad++/editor/writer/libre office/...)
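
Step 4 can also be checked without a viewer. The following is a minimal sketch, not part of OCRmyPDF: it assumes the pdftotext tool (used later in this thread) is installed, and the helper names count_replacement_chars and extract_text are made up for illustration.

```python
import shutil
import subprocess

# U+FFFD is the character that shows up as "�" when text is copied out.
REPLACEMENT_CHAR = "\ufffd"


def count_replacement_chars(text: str) -> int:
    """Return how many U+FFFD characters the extracted text contains."""
    return text.count(REPLACEMENT_CHAR)


def extract_text(pdf_path: str) -> str:
    """Extract text with pdftotext (assumed to be on PATH)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" writes the text to stdout
        capture_output=True,
        check=True,
    )
    # Caveat: errors="replace" would itself insert U+FFFD for invalid UTF-8.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling count_replacement_chars(extract_text("output.pdf")) and doing the same for the input file gives a quick before/after comparison.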

Files

abstract_diplomarbeit_moschen.pdf

output.pdf

How did you download and install the software?

Linux package manager (apt, dnf, etc.)

OCRmyPDF version

16.1.1

Relevant log output

ocrmypdf 16.1.1                                                                                                                                                               __main__.py:59
Running: ['tesseract', '--version']                                                                                                                                          __init__.py:133
Found tesseract 5.3.4.post44                                                                                                                                                 __init__.py:342
Running: ['tesseract', '--version']                                                                                                                                          __init__.py:133
Running: ['gs', '--version']                                                                                                                                                 __init__.py:133
Found gs 10.2.1                                                                                                                                                              __init__.py:342
Running: ['gs', '--version']                                                                                                                                                 __init__.py:133
Running: ['tesseract', '--list-langs']                                                                                                                                       __init__.py:133
stdout/stderr = List of available languages in "/usr/share/tesseract-ocr/5/tessdata/" (3):                                                                                    __init__.py:73
deu
eng
osd

pikepdf mmap enabled                                                                                                                                                          helpers.py:326
os.symlink(abstract_diplomarbeit_moschen_ink.pdf, /tmp/ocrmypdf.io.esbdwxy5/origin)                                                                                           helpers.py:179
os.symlink(/tmp/ocrmypdf.io.esbdwxy5/origin, /tmp/ocrmypdf.io.esbdwxy5/origin.pdf)                                                                                            helpers.py:179
Gathering info with 1 thread workers                                                                                                                                             info.py:772
pikepdf mmap enabled                                                                                                                                                          helpers.py:326
Scanning contents     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 2/2 0:00:00
Using Tesseract OpenMP thread limit 1                                                                                                                                   tesseract_ocr.py:183
pikepdf mmap enabled                                                                                                                                                          helpers.py:326
    1 skipping all processing on this page                                                                                                                                  _pipeline.py:319
    2 skipping all processing on this page                                                                                                                                  _pipeline.py:319
    1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0                                                                                         _graft.py:140
    1 Page rotation: (content, auto) -> page = (0, 0) -> 0                                                                                                                     _graft.py:165
    2 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0                                                                                         _graft.py:140
    2 Page rotation: (content, auto) -> page = (0, 0) -> 0                                                                                                                     _graft.py:165
OCR                   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 2/2 0:00:00
Postprocessing...                                                                                                                                                                 ocr.py:146
os.symlink(/tmp/ocrmypdf.io.esbdwxy5/graft_layers.pdf, /tmp/ocrmypdf.io.esbdwxy5/fix_docinfo.pdf)                                                                             helpers.py:179
Running: ['gs', '--version']                                                                                                                                                 __init__.py:133
Running: ['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None',                                               __init__.py:133
'-sColorConversionStrategy=LeaveColorUnchanged', '-dPDFSTOPONERROR', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2',
'-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', '/tmp/ocrmypdf.io.esbdwxy5/fix_docinfo.pdf', '/tmp/ocrmypdf.io.esbdwxy5/pdfa.ps']
GPL Ghostscript 10.02.1 (2023-11-01)                                                                                                                                         __init__.py:108
Copyright (C) 2023 Artifex Software, Inc.  All rights reserved.                                                                                                              __init__.py:108
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:                                                                                                   __init__.py:108
see the file COPYING for details.                                                                                                                                            __init__.py:108
Processing pages 1 through 2.                                                                                                                                                __init__.py:108
Page 1                                                                                                                                                                       __init__.py:108
Page 2                                                                                                                                                                       __init__.py:108
Running: ['tesseract', '--version']                                                                                                                                          __init__.py:133
Optimizable images: JPEGs: 0 PNGs: 0                                                                                                                                         optimize.py:349
Recompressing JPEGs   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
Deflating JPEGs       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
Optimizable images: JBIG2 groups: 0                                                                                                                                          optimize.py:360
JBIG2                 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
os.symlink(/tmp/ocrmypdf.io.esbdwxy5/optimize.opt.pdf, /tmp/ocrmypdf.io.esbdwxy5/optimize.pdf)                                                                                helpers.py:179
Running: ['jbig2', '--version']                                                                                                                                              __init__.py:133
Running: ['pngquant', '--version']                                                                                                                                           __init__.py:133
Image optimization ratio: 1.00 savings: 0.0%                                                                                                                                _pipeline.py:976
Total file size ratio: 0.73 savings: -37.0%

JoKalliauer changed the title from "[Bug]: real text replaced by � �" to "[Bug]: real text replaced by � � (visually unchanged, only by copying)" on Apr 29, 2024

maksimkurb commented Dec 16, 2024

Oh, finally I found this issue. I ran into it in paperless-ngx when I tried to upload some documents. Russian Cyrillic text becomes "Äàòà ðàñ÷åòîâ" instead of something meaningful.
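
As a side note, the garbled string matches what Windows-1251 bytes look like when they are read as Latin-1. The sketch below only demonstrates that byte-level relationship; it is an assumption that this is what happens inside the pipeline, and the function name undo_mojibake is made up for illustration.

```python
def undo_mojibake(garbled: str, actual: str = "cp1251", assumed: str = "latin-1") -> str:
    """Re-encode with the wrongly assumed codec, then decode with the real one."""
    return garbled.encode(assumed).decode(actual)


# The string quoted above maps back to readable Russian.
print(undo_mojibake("Äàòà ðàñ÷åòîâ"))  # Дата расчетов
```

This is a diagnostic aid only. It does not repair the PDF, and it works only for text where every character fits in Latin-1.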

Here is my test file: TestRus.pdf

If you set --output-type pdf (instead of pdfa), output should be ok.

I think this could be resolved if OCRmyPDF added a configuration parameter to optionally downgrade from PDF/A to plain PDF when conversion to PDF/A is not possible without breaking something.

It could be done the following way:

  1. (if a new option --pdfa-compatibility-policy equals 2) change this line to -dPDFACompatibilityPolicy=2. Ghostscript would then fail on bad documents where it cannot determine the correct encoding.
  2. (if a new option --pdfa-downgrade-to-pdf-on-error is true) catch the exception from the generate_pdfa method and ignore it, outputting a plain PDF as if --output-type pdf had been used.

OCRmyPDF would then output either a valid PDF/A document or an almost untouched PDF, without breaking the encoding.
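
The two steps above can be sketched roughly as follows. This is not OCRmyPDF code: PdfaConversionError, gs_policy_args and produce_output are hypothetical names, the two callables stand in for the real conversion steps, and both options are only proposals from this comment.

```python
from typing import Callable


class PdfaConversionError(Exception):
    """Hypothetical error raised when Ghostscript aborts the PDF/A conversion."""


def gs_policy_args(policy: int) -> list[str]:
    """Build the PDF/A-related Ghostscript arguments for a given policy value."""
    if policy not in (0, 1, 2):
        raise ValueError("PDFACompatibilityPolicy must be 0, 1 or 2")
    # The log above shows -dPDFA=2 with policy 1; step 1 proposes policy 2.
    return ["-dPDFA=2", f"-dPDFACompatibilityPolicy={policy}"]


def produce_output(
    convert_to_pdfa: Callable[[], str],
    passthrough_pdf: Callable[[], str],
    downgrade_on_error: bool,
) -> tuple[str, str]:
    """Try PDF/A first; optionally fall back to plain PDF if conversion fails.

    Returns (output_type, output_path).
    """
    try:
        return "pdfa", convert_to_pdfa()
    except PdfaConversionError:
        if not downgrade_on_error:
            raise  # proposed option is off: surface the failure
        return "pdf", passthrough_pdf()
```

With downgrade_on_error left off, a failed conversion still surfaces as an error, so a user never silently receives a plain PDF when PDF/A was requested.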

How to reproduce the encoding bug:

import os
import pathlib

import ocrmypdf

# Use a UTF-8 locale for this process and the tools it launches.
os.environ['LC_ALL'] = 'C.UTF-8'
os.environ['LANG'] = 'C.UTF-8'

args = {
    'input_file': pathlib.Path('./TestRus.pdf'),
    'output_file': pathlib.Path('./TestRus.out.pdf'),
    'use_threads': True,
    'jobs': 6,
    'language': 'rus+eng',
    # The original script had 'pdf' here, but as noted above 'pdf' avoids
    # the problem, so 'pdfa' is needed to trigger it.
    'output_type': 'pdfa',
    'progress_bar': False,
    'color_conversion_strategy': 'RGB',
    'skip_text': True,
    'clean': True,
    'deskew': True,
    'rotate_pages': True,
    'rotate_pages_threshold': 12.0,
    'sidecar': pathlib.Path('./sidecar.txt'),
}

print(f"Running ocrmypdf with args: {args}")
ocrmypdf.configure_logging(verbosity=2)
ocrmypdf.ocr(**args)

# Run the file like this:
# >  python script.py && pdftotext TestRus.out.pdf - | head -n 2
# Expected result: you should see valid Cyrillic, e.g. "Валюта"
# Actual result: the encoding is broken, you see "Äàòà ðàñ÷åòîâ"
