UnicodeEncodeError when printing a filename with invalid UTF-8 after conversion #768

naglis · 2024-04-08T08:03:41Z

If a filename contains invalid UTF-8 sequences, dangerzone-cli fails to print the filename of the successfully converted/failed document due to an UnicodeEncodeError.

Steps to reproduce (use bash for ANSI-C quoting):

Printing filenames of successfully converted documents:

curl -sSL 'https://raw.githubusercontent.com/freedomofpress/dangerzone/main/tests/test_docs/sample-pdf.pdf' -o $'sample\xf0.pdf'
dangerzone-cli $'sample\xf0.pdf'

Output

╭──────────────────────────╮
│           ▄██▄           │
│          ██████          │
│         ███▀▀▀██         │
│        ███   ████        │
│       ███   ██████       │
│      ███   ▀▀▀▀████      │
│     ███████  ▄██████     │
│    ███████ ▄█████████    │
│   ████████████████████   │
│    ▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀    │
│                          │
│    Dangerzone v0.6.0     │
│ https://dangerzone.rocks │
╰──────────────────────────╯
Assigning ID 'D-TevQ' to doc '/tmp/sample_.pdf'

Converting document to safe PDF
> /usr/bin/podman run --network none -u dangerzone --log-driver none --security-opt no-new-privileges --userns keep-id --cap-drop all --rm -i dangerzone.rocks/dangerzone /usr/bin/python3 -m dangerzone.conversion.doc_to_pixels
[doc D-TevQ] 0% Converting page 1/4 to pixels
[doc D-TevQ] 12% Converting page 2/4 to pixels
[doc D-TevQ] 24% Converting page 3/4 to pixels
[doc D-TevQ] 36% Converting page 4/4 to pixels
[doc D-TevQ] 49% Converted document to pixels
> /usr/bin/podman run --network none -u dangerzone --log-driver none --security-opt no-new-privileges --userns keep-id --cap-drop all --rm -i -v /tmp/tmphot68xih:/safezone:Z -e OCR=0 -e OCR_LANGUAGE=None dangerzone.rocks/dangerzone /usr/bin/python3 -m dangerzone.conversion.pixels_to_pdf
[doc D-TevQ] 50% Converting page 1/4 from pixels to PDF
[doc D-TevQ] 61% Converting page 2/4 from pixels to PDF
[doc D-TevQ] 72% Converting page 3/4 from pixels to PDF
[doc D-TevQ] 83% Converting page 4/4 from pixels to PDF
[doc D-TevQ] 100% Safe PDF created

Safe PDF(s) created successfully
Traceback (most recent call last):
  File "/usr/bin/dangerzone-cli", line 33, in <module>
    sys.exit(load_entry_point('dangerzone==0.6.0', 'console_scripts', 'dangerzone-cli')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/dangerzone/errors.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/dangerzone/cli.py", line 101, in cli_main
    click.echo(document.output_filename)
  File "/usr/lib/python3.11/site-packages/click/utils.py", line 318, in echo
    file.write(out)  # type: ignore
    ^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/colorama/ansitowin32.py", line 47, in write
    self.__convertor.write(text)
  File "/usr/lib/python3.11/site-packages/colorama/ansitowin32.py", line 179, in write
    self.wrapped.write(text)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf0' in position 11: surrogates not allowed

Printing filenames of documents that failed to convert:

curl -sSL 'https://raw.githubusercontent.com/freedomofpress/dangerzone/main/tests/test_docs/sample_bad_pdf.pdf' -o $'sample-bad\xf0.pdf'
dangerzone-cli $'sample-bad\xf0.pdf'

Output

╭──────────────────────────╮
│           ▄██▄           │
│          ██████          │
│         ███▀▀▀██         │
│        ███   ████        │
│       ███   ██████       │
│      ███   ▀▀▀▀████      │
│     ███████  ▄██████     │
│    ███████ ▄█████████    │
│   ████████████████████   │
│    ▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀    │
│                          │
│    Dangerzone v0.6.0     │
│ https://dangerzone.rocks │
╰──────────────────────────╯
Assigning ID 'UWTbXN' to doc '/tmp/sample-bad_.pdf'

Converting document to safe PDF
> /usr/bin/podman run --network none -u dangerzone --log-driver none --security-opt no-new-privileges --userns keep-id --cap-drop all --rm -i dangerzone.rocks/dangerzone /usr/bin/python3 -m dangerzone.conversion.doc_to_pixels
ERROR [doc UWTbXN] 0% The document format is not supported

Failed to convert document(s)
Traceback (most recent call last):
  File "/usr/bin/dangerzone-cli", line 33, in <module>
    sys.exit(load_entry_point('dangerzone==0.6.0', 'console_scripts', 'dangerzone-cli')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/dangerzone/errors.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/dangerzone/cli.py", line 111, in cli_main
    click.echo(document.input_filename)
  File "/usr/lib/python3.11/site-packages/click/utils.py", line 318, in echo
    file.write(out)  # type: ignore
    ^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/colorama/ansitowin32.py", line 47, in write
    self.__convertor.write(text)
  File "/usr/lib/python3.11/site-packages/colorama/ansitowin32.py", line 179, in write
    self.wrapped.write(text)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf0' in position 15: surrogates not allowed

Tested on v0.6.0 on Arch Linux. I was unable to reproduce it via GUI as I was not able to select files with invalid UTF-8 sequences in the filename in the file dialog (such files are not displayed - possible Qt issue?). Update: GUI is not affected if passing the filename when starting dangerzone.

Since the filenames are printed at the end after all conversions, IIUC it does not impact the conversions, but the exception prevents the filename and and any subsequent filenames from being printed, and in case all documents were converted successfully it also impacts the exit code.

The text was updated successfully, but these errors were encountered:

On Unix systems a filename can be a sequence of bytes that is not valid UTF-8. Python uses[1] surrogate escapes to allow to decode such filenames to Unicode (bytes that cannot be decoded are replaced by a surrogate; upon encoding the surrogate is converted to the original byte). From `click` docs[2]: > Invalid bytes or surrogate escapes will raise an error when written > to a stream with `errors="strict"`. This will typically happen with > `stdout` when the locale is something like `en_GB.UTF-8`. To fix that, we use `click.format_filename`[2] before printing the filenames to `stdout` so that surrogate escapes are replaced by �. Fixes freedomofpress#768 [1]: https://peps.python.org/pep-0383/ [2]: https://click.palletsprojects.com/en/8.1.x/api/#click.format_filename

apyrgio · 2024-04-08T10:47:25Z

(such files are not displayed - possible Qt issue?)

Ok, that's unexpected.

Since the filenames are printed at the end after all conversions, IIUC it does not impact the conversions, but the exception prevents the filename and and any subsequent filenames from being printed, and in case all documents were converted successfully it also impacts the exit code.

Nice catch! Thanks a lot for reporting this. That was actually a security concern of ours, and we had added a special function to handle any character in a filename that is not printable (see cfa0c01). So, it seems that we didn't catch everything.

naglis · 2024-04-08T10:59:21Z

and we had added a special function to handle any character in a filename that is not printable (see cfa0c01). So, it seems that we didn't catch everything.

Indeed, I saw replace_control_chars and was wondering why it was not used in all cases where the filenames are printed, just in announce_id and was meaning to ask it in the PR, but ran into a few test failures. IIUC the issue here is writing surrogate escapes to a stream that does not support it, the circumstances are quite nicely summed up in click.format_filename documentation.

On Unix systems a filename can be a sequence of bytes that is not valid UTF-8. Python uses[1] surrogate escapes to allow to decode such filenames to Unicode (bytes that cannot be decoded are replaced by a surrogate; upon encoding the surrogate is converted to the original byte). From `click` docs[2]: > Invalid bytes or surrogate escapes will raise an error when written > to a stream with `errors="strict"`. This will typically happen with > `stdout` when the locale is something like `en_GB.UTF-8`. To fix that, we use `click.format_filename`[2] before printing the filenames to `stdout` so that surrogate escapes are replaced by �. Fixes freedomofpress#768 [1]: https://peps.python.org/pep-0383/ [2]: https://click.palletsprojects.com/en/8.1.x/api/#click.format_filename

On Unix systems a filename can be a sequence of bytes that is not valid UTF-8. Python uses[1] surrogate escapes to allow to decode such filenames to Unicode (bytes that cannot be decoded are replaced by a surrogate; upon encoding the surrogate is converted to the original byte). From `click` docs[2]: > Invalid bytes or surrogate escapes will raise an error when written > to a stream with `errors="strict"`. This will typically happen with > `stdout` when the locale is something like `en_GB.UTF-8`. To fix that, we use `utils.replace_control_chars()` before printing the filenames to `stdout` so that surrogate escapes are replaced by �. Fixes freedomofpress#768

naglis mentioned this issue Apr 8, 2024

Fix printing of filenames with surrogate escapes #769

Merged

naglis mentioned this issue Apr 8, 2024

Adds drag and drop functionality #752

Merged

apyrgio closed this as completed in d632908 Apr 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UnicodeEncodeError when printing a filename with invalid UTF-8 after conversion #768

UnicodeEncodeError when printing a filename with invalid UTF-8 after conversion #768

naglis commented Apr 8, 2024 •

edited

Loading

apyrgio commented Apr 8, 2024

naglis commented Apr 8, 2024

UnicodeEncodeError when printing a filename with invalid UTF-8 after conversion #768

UnicodeEncodeError when printing a filename with invalid UTF-8 after conversion #768

Comments

naglis commented Apr 8, 2024 • edited Loading

apyrgio commented Apr 8, 2024

naglis commented Apr 8, 2024

naglis commented Apr 8, 2024 •

edited

Loading