Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UnicodeEncodeError when printing a filename with invalid UTF-8 after conversion #768

Closed
naglis opened this issue Apr 8, 2024 · 2 comments · Fixed by #769
Closed

UnicodeEncodeError when printing a filename with invalid UTF-8 after conversion #768

naglis opened this issue Apr 8, 2024 · 2 comments · Fixed by #769

Comments

@naglis
Copy link
Contributor

naglis commented Apr 8, 2024

If a filename contains invalid UTF-8 sequences, dangerzone-cli fails to print the filename of the successfully converted/failed document due to an UnicodeEncodeError.

Steps to reproduce (use bash for ANSI-C quoting):

Printing filenames of successfully converted documents:

curl -sSL 'https://raw.githubusercontent.com/freedomofpress/dangerzone/main/tests/test_docs/sample-pdf.pdf' -o $'sample\xf0.pdf'
dangerzone-cli $'sample\xf0.pdf'
Output
╭──────────────────────────╮
│           ▄██▄           │
│          ██████          │
│         ███▀▀▀██         │
│        ███   ████        │
│       ███   ██████       │
│      ███   ▀▀▀▀████      │
│     ███████  ▄██████     │
│    ███████ ▄█████████    │
│   ████████████████████   │
│    ▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀    │
│                          │
│    Dangerzone v0.6.0     │
│ https://dangerzone.rocks │
╰──────────────────────────╯
Assigning ID 'D-TevQ' to doc '/tmp/sample_.pdf'

Converting document to safe PDF
> /usr/bin/podman run --network none -u dangerzone --log-driver none --security-opt no-new-privileges --userns keep-id --cap-drop all --rm -i dangerzone.rocks/dangerzone /usr/bin/python3 -m dangerzone.conversion.doc_to_pixels
[doc D-TevQ] 0% Converting page 1/4 to pixels
[doc D-TevQ] 12% Converting page 2/4 to pixels
[doc D-TevQ] 24% Converting page 3/4 to pixels
[doc D-TevQ] 36% Converting page 4/4 to pixels
[doc D-TevQ] 49% Converted document to pixels
> /usr/bin/podman run --network none -u dangerzone --log-driver none --security-opt no-new-privileges --userns keep-id --cap-drop all --rm -i -v /tmp/tmphot68xih:/safezone:Z -e OCR=0 -e OCR_LANGUAGE=None dangerzone.rocks/dangerzone /usr/bin/python3 -m dangerzone.conversion.pixels_to_pdf
[doc D-TevQ] 50% Converting page 1/4 from pixels to PDF
[doc D-TevQ] 61% Converting page 2/4 from pixels to PDF
[doc D-TevQ] 72% Converting page 3/4 from pixels to PDF
[doc D-TevQ] 83% Converting page 4/4 from pixels to PDF
[doc D-TevQ] 100% Safe PDF created

Safe PDF(s) created successfully
Traceback (most recent call last):
  File "/usr/bin/dangerzone-cli", line 33, in <module>
    sys.exit(load_entry_point('dangerzone==0.6.0', 'console_scripts', 'dangerzone-cli')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/dangerzone/errors.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/dangerzone/cli.py", line 101, in cli_main
    click.echo(document.output_filename)
  File "/usr/lib/python3.11/site-packages/click/utils.py", line 318, in echo
    file.write(out)  # type: ignore
    ^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/colorama/ansitowin32.py", line 47, in write
    self.__convertor.write(text)
  File "/usr/lib/python3.11/site-packages/colorama/ansitowin32.py", line 179, in write
    self.wrapped.write(text)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf0' in position 11: surrogates not allowed

Printing filenames of documents that failed to convert:

curl -sSL 'https://raw.githubusercontent.com/freedomofpress/dangerzone/main/tests/test_docs/sample_bad_pdf.pdf' -o $'sample-bad\xf0.pdf'
dangerzone-cli $'sample-bad\xf0.pdf'
Output
╭──────────────────────────╮
│           ▄██▄           │
│          ██████          │
│         ███▀▀▀██         │
│        ███   ████        │
│       ███   ██████       │
│      ███   ▀▀▀▀████      │
│     ███████  ▄██████     │
│    ███████ ▄█████████    │
│   ████████████████████   │
│    ▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀    │
│                          │
│    Dangerzone v0.6.0     │
│ https://dangerzone.rocks │
╰──────────────────────────╯
Assigning ID 'UWTbXN' to doc '/tmp/sample-bad_.pdf'

Converting document to safe PDF
> /usr/bin/podman run --network none -u dangerzone --log-driver none --security-opt no-new-privileges --userns keep-id --cap-drop all --rm -i dangerzone.rocks/dangerzone /usr/bin/python3 -m dangerzone.conversion.doc_to_pixels
ERROR [doc UWTbXN] 0% The document format is not supported

Failed to convert document(s)
Traceback (most recent call last):
  File "/usr/bin/dangerzone-cli", line 33, in <module>
    sys.exit(load_entry_point('dangerzone==0.6.0', 'console_scripts', 'dangerzone-cli')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/dangerzone/errors.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/dangerzone/cli.py", line 111, in cli_main
    click.echo(document.input_filename)
  File "/usr/lib/python3.11/site-packages/click/utils.py", line 318, in echo
    file.write(out)  # type: ignore
    ^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/colorama/ansitowin32.py", line 47, in write
    self.__convertor.write(text)
  File "/usr/lib/python3.11/site-packages/colorama/ansitowin32.py", line 179, in write
    self.wrapped.write(text)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf0' in position 15: surrogates not allowed

Tested on v0.6.0 on Arch Linux. I was unable to reproduce it via GUI as I was not able to select files with invalid UTF-8 sequences in the filename in the file dialog (such files are not displayed - possible Qt issue?). Update: GUI is not affected if passing the filename when starting dangerzone.

Since the filenames are printed at the end after all conversions, IIUC it does not impact the conversions, but the exception prevents the filename and and any subsequent filenames from being printed, and in case all documents were converted successfully it also impacts the exit code.

naglis added a commit to naglis/dangerzone that referenced this issue Apr 8, 2024
On Unix systems a filename can be a sequence of bytes that is not valid
UTF-8. Python uses[1] surrogate escapes to allow to decode such
filenames to Unicode (bytes that cannot be decoded are replaced by a
surrogate; upon encoding the surrogate is converted to the original
byte).

From `click` docs[2]:

> Invalid bytes or surrogate escapes will raise an error when written
> to a stream with `errors="strict"`. This will typically happen with
> `stdout` when the locale is something like `en_GB.UTF-8`.

To fix that, we use `click.format_filename`[2] before printing the
filenames to `stdout` so that surrogate escapes are replaced by �.

Fixes freedomofpress#768

[1]: https://peps.python.org/pep-0383/
[2]: https://click.palletsprojects.com/en/8.1.x/api/#click.format_filename
naglis added a commit to naglis/dangerzone that referenced this issue Apr 8, 2024
On Unix systems a filename can be a sequence of bytes that is not valid
UTF-8. Python uses[1] surrogate escapes to allow to decode such
filenames to Unicode (bytes that cannot be decoded are replaced by a
surrogate; upon encoding the surrogate is converted to the original
byte).

From `click` docs[2]:

> Invalid bytes or surrogate escapes will raise an error when written
> to a stream with `errors="strict"`. This will typically happen with
> `stdout` when the locale is something like `en_GB.UTF-8`.

To fix that, we use `click.format_filename`[2] before printing the
filenames to `stdout` so that surrogate escapes are replaced by �.

Fixes freedomofpress#768

[1]: https://peps.python.org/pep-0383/
[2]: https://click.palletsprojects.com/en/8.1.x/api/#click.format_filename
@apyrgio
Copy link
Contributor

apyrgio commented Apr 8, 2024

(such files are not displayed - possible Qt issue?)

Ok, that's unexpected.

Since the filenames are printed at the end after all conversions, IIUC it does not impact the conversions, but the exception prevents the filename and and any subsequent filenames from being printed, and in case all documents were converted successfully it also impacts the exit code.

Nice catch! Thanks a lot for reporting this. That was actually a security concern of ours, and we had added a special function to handle any character in a filename that is not printable (see cfa0c01). So, it seems that we didn't catch everything.

@naglis
Copy link
Contributor Author

naglis commented Apr 8, 2024

and we had added a special function to handle any character in a filename that is not printable (see cfa0c01). So, it seems that we didn't catch everything.

Indeed, I saw replace_control_chars and was wondering why it was not used in all cases where the filenames are printed, just in announce_id and was meaning to ask it in the PR, but ran into a few test failures. IIUC the issue here is writing surrogate escapes to a stream that does not support it, the circumstances are quite nicely summed up in click.format_filename documentation.

naglis added a commit to naglis/dangerzone that referenced this issue Apr 9, 2024
On Unix systems a filename can be a sequence of bytes that is not valid
UTF-8. Python uses[1] surrogate escapes to allow to decode such
filenames to Unicode (bytes that cannot be decoded are replaced by a
surrogate; upon encoding the surrogate is converted to the original
byte).

From `click` docs[2]:

> Invalid bytes or surrogate escapes will raise an error when written
> to a stream with `errors="strict"`. This will typically happen with
> `stdout` when the locale is something like `en_GB.UTF-8`.

To fix that, we use `click.format_filename`[2] before printing the
filenames to `stdout` so that surrogate escapes are replaced by �.

Fixes freedomofpress#768

[1]: https://peps.python.org/pep-0383/
[2]: https://click.palletsprojects.com/en/8.1.x/api/#click.format_filename
apyrgio added a commit to naglis/dangerzone that referenced this issue Apr 25, 2024
On Unix systems a filename can be a sequence of bytes that is not valid
UTF-8. Python uses[1] surrogate escapes to allow to decode such
filenames to Unicode (bytes that cannot be decoded are replaced by a
surrogate; upon encoding the surrogate is converted to the original
byte).

From `click` docs[2]:

> Invalid bytes or surrogate escapes will raise an error when written
> to a stream with `errors="strict"`. This will typically happen with
> `stdout` when the locale is something like `en_GB.UTF-8`.

To fix that, we use `utils.replace_control_chars()` before printing the
filenames to `stdout` so that surrogate escapes are replaced by �.

Fixes freedomofpress#768
apyrgio added a commit to naglis/dangerzone that referenced this issue Apr 25, 2024
On Unix systems a filename can be a sequence of bytes that is not valid
UTF-8. Python uses[1] surrogate escapes to allow to decode such
filenames to Unicode (bytes that cannot be decoded are replaced by a
surrogate; upon encoding the surrogate is converted to the original
byte).

From `click` docs[2]:

> Invalid bytes or surrogate escapes will raise an error when written
> to a stream with `errors="strict"`. This will typically happen with
> `stdout` when the locale is something like `en_GB.UTF-8`.

To fix that, we use `utils.replace_control_chars()` before printing the
filenames to `stdout` so that surrogate escapes are replaced by �.

Fixes freedomofpress#768
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants