-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
path: Windows paths may contain non-utf8-representable sequences #12056
Comments
@kballard What is this based on? Trying to use extern crate libc;
use std::ptr;
use std::io::IoError;
fn main() {
for name in [&['a' as u16, 0xD83D, 0xDCA9, 'b' as u16, 0],
&['a' as u16, 0xD83D, 'b' as u16, 0]].iter() {
let handle = unsafe {
libc::CreateFileW(name.as_ptr(),
libc::FILE_GENERIC_WRITE,
0,
ptr::mut_null(),
libc::CREATE_ALWAYS,
libc::FILE_ATTRIBUTE_NORMAL,
ptr::mut_null())
};
let is_invalid = handle == libc::INVALID_HANDLE_VALUE as libc::HANDLE;
println!("{} {} {}", name, is_invalid, IoError::last_error())
}
} Output:
|
@SimonSapin I don't know the precise details, but there exist portions of Windows in which paths are UCS2 rather than UTF-16. I ignored it because I thought it wasn't going to be an issue but at some point someone (and I wish I could remember who) showed me some output that showed that they were actually getting a UCS2 path from some Windows call and |
Maybe |
My recollection is the bad path came from iterating a temporary directory on this person's filesystem. But I don't remember for certain. |
I’d be interested to see a test case, because this seems like it would affect every other language or library that tries to use filenames as Unicode. |
I'd love a test case too. Not being a Windows user is making it hard for me to actually test this stuff. |
FWIW I used a virtual machine image from http://www.modern.ie/ and a nightly build to run the test above. |
@kballard @SimonSapin #13338?
|
@klutzy |
Oh, wait, "success" here is not good news as it means you managed to create a file with a broken name. Does |
Yes and yes. :'( I'm curious if it only occurs on recent OSes. (maybe >= 7?) Could you confirm OS name of VM you use? |
Windows 7 Enterprise SP1, 32-bit, en-US locale |
Ugh. I have no idea now. Anyway, I'm ok to just ignore non-unicode filenames: I don't think there will be many use cases regarding it. |
Perhaps. If it really is that rare, then maybe it's ok. It certainly simplifies things. And the programmer always has the option of dropping down and doing the system calls directly if they need to handle this. But it makes me a bit uncomfortable regardless. However, as I'm not a Windows user, I'll defer to Windows users on this subject. |
Non-UTF-16 file names are certainly rare, but when you come across one... it's very annoying when you cannot use any standard tools to delete such a file, for example. |
Yes, this would be terrible. Let’s not do that in the |
Wouldn't that just be called |
Maybe it wouldn’t be so terrible if it’s an internal detail of |
Using UCS-2 means none of the *_str() methods would ever work. Encoding UCS-2 as an invalid UTF-8 as an internal implementation detail is something I considered a while back and is probably the best approach.
|
For comparison, MSVC C++ when converting an |
I suggest calling this WTF-8: an encoding that uses the same algorithm as UTF-8, but has the "value space" of UCS-2. (That is, bigger than Unicode since unpaired surrogates are allowed.) We can decide later what the name an acronym for. Note: I call UCS-2 here any sequence of 16 bit units. Surrogate pairs have a special meaning, but unpaired surrogates are allowed. To convert UCS-2 to WTF-8:
To convert WTF-8 to UCS-2:
Consecutive 3-byte sequences for a lead surrogate followed by a trail surrogate are invalid in WTF-8. A 4-byte sequence should be used instead. (This ensures that the WTF-8 encoding of any UCS-2 string is unique.) WTF-8 has the same "value space" as UCS-2, which is bigger than Unicode (since they include unpaired surrogates.) Any valid UTF-8 string is also valid WTF-8, with the same byte representation. A WTF-8 string is also valid UTF-8 if and only if its UCS-2 conversion is valid UTF-16. To convert a WTF-8 to UTF-8, either:
To concatenate two WTF-8 strings: if the earlier one ends with a lead surrogate and the latter one starts with a trail surrogate, both surrogate need to be removed and replaced with a 4-byte sequence. Note: WTF-8 is different from CESU-8. In terms of Rust implementation, WTF-8 data should be kept in a dedicated type that wraps a private |
I’m gonna work of WTF-8 anyway, for Servo to interact with JavaScript. ECMAScript clearly says that strings are sequences of 16-bit integers. For Windows however, it’s not so clear to me that this is actually needed. MSDN claims that the encoding used in Windows is UTF-16. The documentation for functions converting to and from UTF-16 says:
I believe (by opposition to the following sentence) that "fully conforms" here means replaces unpaired surrogates with U+FFFD. (I haven’t tested it, though.) Now, we’ve seen that it’s possible to create a file with an invalid UTF-16 name. But it’s not easy, we’re sometimes prevented from doing it. It may be considered a bug that it was possible. The question is, and I couldn’t find an answer on MSDN, are Windows applications expected to handle such files correctly? |
According to MSDN
This isn't exactly UTF-16. So while the rest of WinAPI uses UTF-16, the file API doesn't. I've tested creating filenames with invalid surrogates in their name and I was able to create them and manipulate them easily through both code, and numerous applications including cmd.exe Notepad and Windows explorer. So far Windows 8.1 hasn't tried to stop me from using invalid surrogates for files, so I definitely think Windows applications are expected to handle such files correctly. |
"Unicode normalization" refers to something else, but "opaque sequence of WCHARs" does indeed imply that unpaired surrogates can occur. Alright then, WTF-8 for Path it is. |
So why use WTF-8 for internal representation, and not UCS-2 then? If WTFString is a distinct type from String, you might as well save on encoding/decoding them when interacting with Windows APIs. |
As noted by @kballard above, none of the |
This PR implements [path reform](rust-lang/rfcs#474), and motivation and details for the change can be found there. For convenience, the old path API is being kept as `old_path` for the time being. Updating after this PR is just a matter of changing imports to `old_path` (which is likely not needed, since the prelude entries still export the old path API). This initial PR does not include additional normalization or platform-specific path extensions. These will be done in follow up commits or PRs. [breaking-change] Closes rust-lang#20034 Closes rust-lang#12056 Closes rust-lang#11594 Closes rust-lang#14028 Closes rust-lang#14049 Closes rust-lang#10035
828: Avoid using display on paths. r=Emilgardis a=Alexhuszagh Adds a helper trait `ToUtf8` which converts the path to UTF-8 if possible, and if not, creates a pretty debugging representation of the path. We can then change `display()` to `to_utf8()?` and be completely correct in all cases. On POSIX systems, this doesn't matter since paths must be UTF-8 anyway, but on Windows some paths can be WTF-8 (see rust-lang/rust#12056). Either way, this can avoid creating a copy in cases, is always correct, and is more idiomatic about what we're doing. We might not be able to handle non-UTF-8 paths currently (like ISO-8859-1 paths, which are historically very common). So, this doesn't change ergonomics: the resulting code is as compact and correct. It also requires less copies in most cases, which should be a good thing. But most importantly, if we're mounting data we can't silently fail or produce incorrect data. If something was lossily generated, we probably shouldn't try to mount with a replacement character, and also print more information than there was just an invalid Unicode character. Co-authored-by: Alex Huszagh <[email protected]>
828: Avoid using display on paths. r=Emilgardis a=Alexhuszagh Adds a helper trait `ToUtf8` which converts the path to UTF-8 if possible, and if not, creates a pretty debugging representation of the path. We can then change `display()` to `to_utf8()?` and be completely correct in all cases. On POSIX systems, this doesn't matter since paths must be UTF-8 anyway, but on Windows some paths can be WTF-8 (see rust-lang/rust#12056). Either way, this can avoid creating a copy in cases, is always correct, and is more idiomatic about what we're doing. We might not be able to handle non-UTF-8 paths currently (like ISO-8859-1 paths, which are historically very common). So, this doesn't change ergonomics: the resulting code is as compact and correct. It also requires less copies in most cases, which should be a good thing. But most importantly, if we're mounting data we can't silently fail or produce incorrect data. If something was lossily generated, we probably shouldn't try to mount with a replacement character, and also print more information than there was just an invalid Unicode character. Co-authored-by: Alex Huszagh <[email protected]>
Fixes: rust-lang#12050 - `identity_op` correctly suggests a deference for coerced references When `identity_op` identifies a `no_op`, provides a suggestion, it also checks the type of the type of the variable. If the variable is a reference that's been coerced into a value, e.g. ``` let x = &0i32; let _ = x + 0; ``` the suggestion will now use a derefence. This is done by identifying whether the variable is a reference to an integral value, and then whether it gets dereferenced. changelog: false positive: [`identity_op`]: corrected suggestion for reference coerced to value. fixes: rust-lang#12050
Before this change, Base_directory was a char array. In general, it’s better to use an std::filesystem::path to represent paths. Here’s why: • char arrays have a limited size. If a user creates a file or directory that has a very long path, then it might not be possible to store that path in a given char array. You can try to make the char array long enough to store the longest possible path, but future operating systems may increase that limit. With std::filesystem::paths, you don’t have to worry about path length limits. • char arrays cannot necessarily represent all paths. We try our best to use UTF-8 for all char arrays [1], but unfortunately, it’s possible to create a path that cannot be represent using UTF-8 [2]. With an std::filesystem::paths, stuff like that is less likely to happen because std::filesystem::path::value_type is platform-specific. • char arrays don’t have any built-in mechanisms for manipulating paths. At the moment, we work around this problem with things like the ddio module. If we make sure that we always use std::filesystem::paths, then we’ll be able to get rid of or simplify a lot of the ddio module. [1]: 7cd7902 (Merge pull request DescentDevelopers#494 from Jayman2000/encoding-improvements, 2024-07-20) [2]: <rust-lang/rust#12056>
Before this change, Base_directory was a char array. In general, it’s better to use an std::filesystem::path to represent paths. Here’s why: • char arrays have a limited size. If a user creates a file or directory that has a very long path, then it might not be possible to store that path in a given char array. You can try to make the char array long enough to store the longest possible path, but future operating systems may increase that limit. With std::filesystem::paths, you don’t have to worry about path length limits. • char arrays cannot necessarily represent all paths. We try our best to use UTF-8 for all char arrays [1], but unfortunately, it’s possible to create a path that cannot be represent using UTF-8 [2]. With an std::filesystem::paths, stuff like that is less likely to happen because std::filesystem::path::value_type is platform-specific. • char arrays don’t have any built-in mechanisms for manipulating paths. Before this commit, the code would work around this problem in two different ways: 1. It would use Descent 3–specific functions that implement path operations on char arrays/pointers (for example, ddio_MakePath()). Once all paths use std::filesystem::path, we’ll be able to simplify the code by removing those functions. 2. It would temporarily convert Base_directory into an std::filesystem::path. This commit simplifies the code by removing those temporary conversions because they are no longer necessary. [1]: 7cd7902 (Merge pull request DescentDevelopers#494 from Jayman2000/encoding-improvements, 2024-07-20) [2]: <rust-lang/rust#12056>
Before this change, Base_directory was a char array. In general, it’s better to use an std::filesystem::path to represent paths. Here’s why: • char arrays have a limited size. If a user creates a file or directory that has a very long path, then it might not be possible to store that path in a given char array. You can try to make the char array long enough to store the longest possible path, but future operating systems may increase that limit. With std::filesystem::paths, you don’t have to worry about path length limits. • char arrays cannot necessarily represent all paths. We try our best to use UTF-8 for all char arrays [1], but unfortunately, it’s possible to create a path that cannot be represent using UTF-8 [2]. With an std::filesystem::paths, stuff like that is less likely to happen because std::filesystem::path::value_type is platform-specific. • char arrays don’t have any built-in mechanisms for manipulating paths. Before this commit, the code would work around this problem in two different ways: 1. It would use Descent 3–specific functions that implement path operations on char arrays/pointers (for example, ddio_MakePath()). Once all paths use std::filesystem::path, we’ll be able to simplify the code by removing those functions. 2. It would temporarily convert Base_directory into an std::filesystem::path. This commit simplifies the code by removing those temporary conversions because they are no longer necessary. [1]: 7cd7902 (Merge pull request DescentDevelopers#494 from Jayman2000/encoding-improvements, 2024-07-20) [2]: <rust-lang/rust#12056>
Before this change, Base_directory was a char array. In general, it’s better to use an std::filesystem::path to represent paths. Here’s why: • char arrays have a limited size. If a user creates a file or directory that has a very long path, then it might not be possible to store that path in a given char array. You can try to make the char array long enough to store the longest possible path, but future operating systems may increase that limit. With std::filesystem::paths, you don’t have to worry about path length limits. • char arrays cannot necessarily represent all paths. We try our best to use UTF-8 for all char arrays [1], but unfortunately, it’s possible to create a path that cannot be represent using UTF-8 [2]. With an std::filesystem::paths, stuff like that is less likely to happen because std::filesystem::path::value_type is platform-specific. • char arrays don’t have any built-in mechanisms for manipulating paths. Before this commit, the code would work around this problem in two different ways: 1. It would use Descent 3–specific functions that implement path operations on char arrays/pointers (for example, ddio_MakePath()). Once all paths use std::filesystem::path, we’ll be able to simplify the code by removing those functions. 2. It would temporarily convert Base_directory into an std::filesystem::path. This commit simplifies the code by removing those temporary conversions because they are no longer necessary. [1]: 7cd7902 (Merge pull request DescentDevelopers#494 from Jayman2000/encoding-improvements, 2024-07-20) [2]: <rust-lang/rust#12056>
Before this change, Base_directory was a char array. In general, it’s better to use an std::filesystem::path to represent paths. Here’s why: • char arrays have a limited size. If a user creates a file or directory that has a very long path, then it might not be possible to store that path in a given char array. You can try to make the char array long enough to store the longest possible path, but future operating systems may increase that limit. With std::filesystem::paths, you don’t have to worry about path length limits. • char arrays cannot necessarily represent all paths. We try our best to use UTF-8 for all char arrays [1], but unfortunately, it’s possible to create a path that cannot be represent using UTF-8 [2]. With an std::filesystem::paths, stuff like that is less likely to happen because std::filesystem::path::value_type is platform-specific. • char arrays don’t have any built-in mechanisms for manipulating paths. Before this commit, the code would work around this problem in two different ways: 1. It would use Descent 3–specific functions that implement path operations on char arrays/pointers (for example, ddio_MakePath()). Once all paths use std::filesystem::path, we’ll be able to simplify the code by removing those functions. 2. It would temporarily convert Base_directory into an std::filesystem::path. This commit simplifies the code by removing those temporary conversions because they are no longer necessary. [1]: 7cd7902 (Merge pull request DescentDevelopers#494 from Jayman2000/encoding-improvements, 2024-07-20) [2]: <rust-lang/rust#12056>
Before this change, Base_directory was a char array. In general, it’s better to use an std::filesystem::path to represent paths. Here’s why: • char arrays have a limited size. If a user creates a file or directory that has a very long path, then it might not be possible to store that path in a given char array. You can try to make the char array long enough to store the longest possible path, but future operating systems may increase that limit. With std::filesystem::paths, you don’t have to worry about path length limits. • char arrays cannot necessarily represent all paths. We try our best to use UTF-8 for all char arrays [1], but unfortunately, it’s possible to create a path that cannot be represent using UTF-8 [2]. With an std::filesystem::paths, stuff like that is less likely to happen because std::filesystem::path::value_type is platform-specific. • char arrays don’t have any built-in mechanisms for manipulating paths. Before this commit, the code would work around this problem in two different ways: 1. It would use Descent 3–specific functions that implement path operations on char arrays/pointers (for example, ddio_MakePath()). Once all paths use std::filesystem::path, we’ll be able to simplify the code by removing those functions. 2. It would temporarily convert Base_directory into an std::filesystem::path. This commit simplifies the code by removing those temporary conversions because they are no longer necessary. [1]: 7cd7902 (Merge pull request DescentDevelopers#494 from Jayman2000/encoding-improvements, 2024-07-20) [2]: <rust-lang/rust#12056> TODO: • Test on Windows.
Before this change, Base_directory was a char array. In general, it’s better to use an std::filesystem::path to represent paths. Here’s why: • char arrays have a limited size. If a user creates a file or directory that has a very long path, then it might not be possible to store that path in a given char array. You can try to make the char array long enough to store the longest possible path, but future operating systems may increase that limit. With std::filesystem::paths, you don’t have to worry about path length limits. • char arrays cannot necessarily represent all paths. We try our best to use UTF-8 for all char arrays [1], but unfortunately, it’s possible to create a path that cannot be represent using UTF-8 [2]. With an std::filesystem::paths, stuff like that is less likely to happen because std::filesystem::path::value_type is platform-specific. • char arrays don’t have any built-in mechanisms for manipulating paths. Before this commit, the code would work around this problem in two different ways: 1. It would use Descent 3–specific functions that implement path operations on char arrays/pointers (for example, ddio_MakePath()). Once all paths use std::filesystem::path, we’ll be able to simplify the code by removing those functions. 2. It would temporarily convert Base_directory into an std::filesystem::path. This commit simplifies the code by removing those temporary conversions because they are no longer necessary. [1]: 7cd7902 (Merge pull request DescentDevelopers#494 from Jayman2000/encoding-improvements, 2024-07-20) [2]: <rust-lang/rust#12056>
Before this change, Base_directory was a char array. In general, it’s better to use an std::filesystem::path to represent paths. Here’s why: • char arrays have a limited size. If a user creates a file or directory that has a very long path, then it might not be possible to store that path in a given char array. You can try to make the char array long enough to store the longest possible path, but future operating systems may increase that limit. With std::filesystem::paths, you don’t have to worry about path length limits. • char arrays cannot necessarily represent all paths. We try our best to use UTF-8 for all char arrays [1], but unfortunately, it’s possible to create a path that cannot be represent using UTF-8 [2]. With an std::filesystem::paths, stuff like that is less likely to happen because std::filesystem::path::value_type is platform-specific. • char arrays don’t have any built-in mechanisms for manipulating paths. Before this commit, the code would work around this problem in two different ways: 1. It would use Descent 3–specific functions that implement path operations on char arrays/pointers (for example, ddio_MakePath()). Once all paths use std::filesystem::path, we’ll be able to simplify the code by removing those functions. 2. It would temporarily convert Base_directory into an std::filesystem::path. This commit simplifies the code by removing those temporary conversions because they are no longer necessary. [1]: 7cd7902 (Merge pull request DescentDevelopers#494 from Jayman2000/encoding-improvements, 2024-07-20) [2]: <rust-lang/rust#12056>
Before this change, Base_directory was a char array. In general, it’s better to use an std::filesystem::path to represent paths. Here’s why: • char arrays have a limited size. If a user creates a file or directory that has a very long path, then it might not be possible to store that path in a given char array. You can try to make the char array long enough to store the longest possible path, but future operating systems may increase that limit. With std::filesystem::paths, you don’t have to worry about path length limits. • char arrays cannot necessarily represent all paths. We try our best to use UTF-8 for all char arrays [1], but unfortunately, it’s possible to create a path that cannot be represent using UTF-8 [2]. With an std::filesystem::paths, stuff like that is less likely to happen because std::filesystem::path::value_type is platform-specific. • char arrays don’t have any built-in mechanisms for manipulating paths. Before this commit, the code would work around this problem in two different ways: 1. It would use Descent 3–specific functions that implement path operations on char arrays/pointers (for example, ddio_MakePath()). Once all paths use std::filesystem::path, we’ll be able to simplify the code by removing those functions. 2. It would temporarily convert Base_directory into an std::filesystem::path. This commit simplifies the code by removing those temporary conversions because they are no longer necessary. [1]: 7cd7902 (Merge pull request DescentDevelopers#494 from Jayman2000/encoding-improvements, 2024-07-20) [2]: <rust-lang/rust#12056>
It turns out that Windows paths may contain unpaired UTF-16 surrogates, which is not legally representable in UTF-8. This means that
WindowsPath
is not capable of representing such paths, as it uses~str
internally.The text was updated successfully, but these errors were encountered: