UnicodeEncodeError when printing a filename with invalid UTF-8 after conversion #768
Comments
On Unix systems a filename can be a sequence of bytes that is not valid UTF-8. Python uses[1] surrogate escapes to allow decoding such filenames to Unicode (bytes that cannot be decoded are replaced by a surrogate; upon encoding the surrogate is converted to the original byte).

From `click` docs[2]:

> Invalid bytes or surrogate escapes will raise an error when written
> to a stream with `errors="strict"`. This will typically happen with
> `stdout` when the locale is something like `en_GB.UTF-8`.

To fix that, we use `click.format_filename`[2] before printing the filenames to `stdout` so that surrogate escapes are replaced by �.

Fixes freedomofpress#768

[1]: https://peps.python.org/pep-0383/
[2]: https://click.palletsprojects.com/en/8.1.x/api/#click.format_filename
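The surrogate-escape behaviour described in the commit message can be illustrated with the standard library alone. This is a minimal sketch, not code from the project: the filename `bad\xff.pdf` is a made-up example, and the final step only approximates the effect of replacing surrogate escapes with �, without claiming to be how `click.format_filename` is implemented.

```python
# A filename as raw bytes: "bad", one invalid UTF-8 byte, then ".pdf" (made-up example).
raw_name = b"bad\xff.pdf"

# Decoding with surrogateescape maps the undecodable byte 0xFF to the lone surrogate U+DCFF.
decoded = raw_name.decode("utf-8", errors="surrogateescape")
assert decoded == "bad\udcff.pdf"

# Encoding with the same handler restores the original byte.
assert decoded.encode("utf-8", errors="surrogateescape") == raw_name

# A strict encoder, like the one behind a typical UTF-8 stdout, rejects the surrogate.
try:
    decoded.encode("utf-8", errors="strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Going back to bytes and decoding with "replace" gives a string that is safe to print.
printable = decoded.encode("utf-8", errors="surrogateescape").decode(
    "utf-8", errors="replace"
)
```

Here `strict_failed` ends up `True`, and `printable` contains U+FFFD in place of the invalid byte.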
Ok, that's unexpected.
Nice catch! Thanks a lot for reporting this. That was actually a security concern of ours, and we had added a special function to handle any character in a filename that is not printable (see cfa0c01). So, it seems that we didn't catch everything.
Indeed, I saw
On Unix systems a filename can be a sequence of bytes that is not valid UTF-8. Python uses[1] surrogate escapes to allow decoding such filenames to Unicode (bytes that cannot be decoded are replaced by a surrogate; upon encoding the surrogate is converted to the original byte).

From `click` docs[2]:

> Invalid bytes or surrogate escapes will raise an error when written
> to a stream with `errors="strict"`. This will typically happen with
> `stdout` when the locale is something like `en_GB.UTF-8`.

To fix that, we use `utils.replace_control_chars()` before printing the filenames to `stdout` so that surrogate escapes are replaced by �.

Fixes freedomofpress#768
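As a rough illustration of the idea in this revision of the commit, the sketch below replaces surrogates and control characters with � before a name is printed. The helper `sanitize_for_display` is hypothetical and is not the project's `utils.replace_control_chars()`; the choice of Unicode categories (`Cs` for surrogates, `Cc` for control characters) is an assumption made for the example.

```python
import unicodedata


def sanitize_for_display(name: str, placeholder: str = "\ufffd") -> str:
    """Hypothetical helper: swap surrogates and control characters for a placeholder."""
    return "".join(
        # "Cs" covers lone surrogates, "Cc" covers control characters such as ESC.
        placeholder if unicodedata.category(ch) in ("Cs", "Cc") else ch
        for ch in name
    )


# One escaped invalid byte (U+DCFF) followed by an ANSI escape sequence.
sanitized = sanitize_for_display("bad\udcff\x1b[31m.pdf")

# The sanitized name no longer trips a strict UTF-8 encoder.
encoded = sanitized.encode("utf-8", errors="strict")
```

Filtering by category handles the surrogate case and the unprintable-character case in one pass, which matches the security concern raised in the thread, though the real function may draw the line differently.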
If a filename contains invalid UTF-8 sequences, `dangerzone-cli` fails to print the filename of the successfully converted/failed document due to a `UnicodeEncodeError`.

Steps to reproduce (use bash for ANSI-C quoting):
Printing filenames of successfully converted documents:
Output
Printing filenames of documents that failed to convert:
Output
Tested on v0.6.0 on Arch Linux. I was unable to reproduce it via the GUI, as I was not able to select files with invalid UTF-8 sequences in the filename in the file dialog (such files are not displayed; possibly a Qt issue?). Update: the GUI is not affected if the filename is passed when starting `dangerzone`.

Since the filenames are printed at the end, after all conversions, IIUC this does not impact the conversions, but the exception prevents the filename and any subsequent filenames from being printed, and in case all documents were converted successfully it also impacts the exit code.
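The effect on the summary output can be modelled without running `dangerzone-cli`. The sketch below is an assumption-laden stand-in: the list of names is invented, an in-memory `io.TextIOWrapper` with `errors="strict"` plays the role of `stdout`, and `exit_code = 1` merely models the non-zero status Python returns when an exception goes uncaught.

```python
import io

# Invented list of converted documents; the second name carries an escaped invalid byte.
names = ["first.pdf", "bad\udcff.pdf", "third.pdf"]

# Stand-in for a UTF-8 stdout that uses errors="strict".
buffer = io.BytesIO()
stream = io.TextIOWrapper(buffer, encoding="utf-8", errors="strict")

exit_code = 0
try:
    for name in names:
        stream.write(name + "\n")
except UnicodeEncodeError:
    # The loop is abandoned here, so "third.pdf" is never written.
    exit_code = 1

stream.flush()
printed = buffer.getvalue().decode("utf-8").splitlines()
```

Only the names written before the failing one reach the stream, and the run ends with a non-zero status even though, in this model, every conversion had already succeeded.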