Support for non-valid UTF-8 strings #6

Open
ismaelgv opened this issue Jul 12, 2018 · 2 comments

@ismaelgv
Owner

Extend the code to support strings that are not valid UTF-8 in filenames, paths, and arguments:

  • Use OsStr and OsString.
  • Follow the OsStr pattern API extension work in the Rust repository.
  • Check for issues with the crates currently in use: clap, regex, walkdir, and ansi_term.
@ismaelgv ismaelgv added the enhancement New feature or request label Jul 12, 2018
@ismaelgv ismaelgv added this to the long-term milestone Jul 12, 2018
@ismaelgv ismaelgv modified the milestones: long-term, 0.2 Aug 1, 2018
@ismaelgv
Owner Author

ismaelgv commented Aug 1, 2018

Right now it is not possible to convert an OsStr or OsString to &[u8] on Windows, for use in regex::bytes::Regex::replace, without losing information. For example, ripgrep uses a to_string_lossy conversion to obtain a &[u8] on Windows.
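To illustrate the information loss, here is a minimal, platform-independent sketch. It models a Windows file name as a slice of UTF-16 code units and uses `String::from_utf16_lossy` as a stand-in for what a lossy conversion does to unpaired surrogates. The helper name `lossy_bytes_from_wide` is hypothetical and is not taken from ripgrep or from this project.

```rust
/// Hypothetical helper: turn (possibly ill-formed) UTF-16 code units into
/// the UTF-8 bytes a lossy conversion would produce. Every unpaired
/// surrogate is replaced by U+FFFD, so the original value is not recoverable.
pub fn lossy_bytes_from_wide(units: &[u16]) -> Vec<u8> {
    String::from_utf16_lossy(units).into_bytes()
}
```

Two different names, such as `[0x66, 0xD800]` and `[0x66, 0xDC00]`, map to the same bytes (`f` followed by U+FFFD), so a rename computed from the lossy form cannot tell them apart.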

@BurntSushi

Yeah, this is something I've always wondered about. So far, I haven't had anyone complain about cases where information is lost, i.e., when there's an invalid UTF-16 file path on Windows. One presumes that this might be so infrequent that it may not be a blocking problem in practice.

Getting a real fix for this is tricky. One possibility is to use the underlying representation of an OsStr (which is WTF-8), but this is not part of the public API. Another possibility is to re-create WTF-8 decoding outside of std using the Windows version of the OsStrExt trait. This incurs a second WTF-8 decoding step; however, it's no worse than the lossy UTF-8 decoding that I'm already doing.
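As a rough sketch of the second option, the function below re-encodes UTF-16 code units as WTF-8 bytes outside of std. The name `wtf8_from_wide` is hypothetical. It assumes the input comes from something like `encode_wide()` in `std::os::windows::ffi::OsStrExt`, but it takes any iterator of `u16` so it can be tried on any platform. Well-formed pairs become ordinary UTF-8, and an unpaired surrogate is kept as a three-byte sequence instead of being replaced.

```rust
use std::char::decode_utf16;

/// Hypothetical helper: encode potentially ill-formed UTF-16 as WTF-8.
/// The output is valid UTF-8 unless the input contains unpaired surrogates.
pub fn wtf8_from_wide<I: IntoIterator<Item = u16>>(units: I) -> Vec<u8> {
    let mut out = Vec::new();
    let mut buf = [0u8; 4];
    for decoded in decode_utf16(units) {
        match decoded {
            // A scalar value: emit its normal UTF-8 encoding.
            Ok(ch) => out.extend_from_slice(ch.encode_utf8(&mut buf).as_bytes()),
            // An unpaired surrogate: emit the three-byte generalized
            // UTF-8 form (0xED, 0xA0..=0xBF, 0x80..=0xBF).
            Err(err) => {
                let cp = err.unpaired_surrogate();
                out.push(0xE0 | (cp >> 12) as u8);
                out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
                out.push(0x80 | (cp & 0x3F) as u8);
            }
        }
    }
    out
}
```

On Windows this would presumably be called as `wtf8_from_wide(name.encode_wide())`. The resulting bytes are not guaranteed to be valid UTF-8, and writing a result back would need the inverse conversion to UTF-16, which is not shown here.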
