Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

path: Windows paths may contain non-utf8-representable sequences #12056

Closed
lilyball opened this issue Feb 6, 2014 · 29 comments · Fixed by #21759
Closed

path: Windows paths may contain non-utf8-representable sequences #12056

lilyball opened this issue Feb 6, 2014 · 29 comments · Fixed by #21759
Labels
O-windows Operating system: Windows

Comments

@lilyball
Copy link
Contributor

lilyball commented Feb 6, 2014

It turns out that Windows paths may contain unpaired UTF-16 surrogates, which is not legally representable in UTF-8. This means that WindowsPath is not capable of representing such paths, as it uses ~str internally.

@SimonSapin
Copy link
Contributor

It turns out that Windows paths may contain unpaired UTF-16 surrogates

@kballard What is this based on? Trying to use CreateFileW directly to create such a file failed for me. Test case:

extern crate libc;

use std::ptr;
use std::io::IoError;

fn main() {
    for name in [&['a' as u16, 0xD83D, 0xDCA9, 'b' as u16, 0],
                 &['a' as u16, 0xD83D,         'b' as u16, 0]].iter() {
        let handle = unsafe {
            libc::CreateFileW(name.as_ptr(),
                              libc::FILE_GENERIC_WRITE,
                              0,
                              ptr::mut_null(),
                              libc::CREATE_ALWAYS,
                              libc::FILE_ATTRIBUTE_NORMAL,
                              ptr::mut_null())
        };
        let is_invalid = handle == libc::INVALID_HANDLE_VALUE as libc::HANDLE;
        println!("{} {} {}", name, is_invalid, IoError::last_error())
    }
}

Output:

[97, 55357, 56489, 98, 0] false unknown error (OS Error 0: The operation completed successfully.
)
[97, 55357, 98, 0] true unknown error (OS Error 87: The parameter is incorrect.
)

@lilyball
Copy link
Contributor Author

@SimonSapin I don't know the precise details, but there exist portions of Windows in which paths are UCS2 rather than UTF-16. I ignored it because I thought it wasn't going to be an issue but at some point someone (and I wish I could remember who) showed me some output that showed that they were actually getting a UCS2 path from some Windows call and Path was unable to parse it.

@SimonSapin
Copy link
Contributor

Maybe CreateFileW is doing an explicit check that not all parts of the API are doing. sigh

@lilyball
Copy link
Contributor Author

My recollection is the bad path came from iterating a temporary directory on this person's filesystem. But I don't remember for certain.

@SimonSapin
Copy link
Contributor

I’d be interested to see a test case, because this seems like it would affect every other language or library that tries to use filenames as Unicode.

@lilyball
Copy link
Contributor Author

I'd love a test case too. Not being a Windows user is making it hard for me to actually test this stuff.

@SimonSapin
Copy link
Contributor

FWIW I used a virtual machine image from http://www.modern.ie/ and a nightly build to run the test above.

@klutzy
Copy link
Contributor

klutzy commented May 13, 2014

@kballard @SimonSapin #13338?
I think it depends on OS version and possibly even on locale. I ran your code on win7/kr and got:

[97, 55357, 56489, 98, 0] false unknown error (OS Error 0 (FormatMessageW() returned error 15100))
[97, 55357, 98, 0] false unknown error (OS Error 0 (FormatMessageW() returned error 15105))

@SimonSapin
Copy link
Contributor

@klutzy false for is_invalid and OS Error 0 looks like CreateFileW was successful. Do you see the two files created in the current directory?

@SimonSapin
Copy link
Contributor

Oh, wait, "success" here is not good news as it means you managed to create a file with a broken name. Does io::fs::readdir() trigger this failure, then?

@klutzy
Copy link
Contributor

klutzy commented May 13, 2014

@SimonSapin

Do you see the two files created in the current directory?
Does io::fs::readdir() trigger this failure, then?

Yes and yes. :'( I'm curious if it only occurs on recent OSes. (maybe >= 7?) Could you confirm OS name of VM you use?

@SimonSapin
Copy link
Contributor

Windows 7 Enterprise SP1, 32-bit, en-US locale

@klutzy
Copy link
Contributor

klutzy commented May 13, 2014

Ugh. I have no idea now.

Anyway, I'm ok to just ignore non-unicode filenames: I don't think there will be many use cases regarding it.

@lilyball
Copy link
Contributor Author

Perhaps. If it really is that rare, then maybe it's ok. It certainly simplifies things. And the programmer always has the option of dropping down and doing the system calls directly if they need to handle this.

But it makes me a bit uncomfortable regardless.

However, as I'm not a Windows user, I'll defer to Windows users on this subject.

@vadimcn
Copy link
Contributor

vadimcn commented Aug 30, 2014

Non-UTF-16 file names are certainly rare, but when you come across one... it's very annoying when you cannot use any standard tools to delete such a file, for example.
Will I be a terrible person if I suggest that we bend rfc3629 a bit, and allow encoding unpaired UTF-16 surrogates "as themselves" in UTF-8 ?

@SimonSapin
Copy link
Contributor

Will I be a terrible person if I suggest that we bend rfc3629 a bit, and allow encoding unpaired UTF-16 surrogates "as themselves" in UTF-8 ?

Yes, this would be terrible. Let’s not do that in the String and str types, it would not be UTF-8. We could have a NotReallyUtf8String type separately, but it would be of limited use.

@retep998
Copy link
Member

We could have a NotReallyUtf8String type separately

Wouldn't that just be called Path?

@SimonSapin
Copy link
Contributor

Maybe it wouldn’t be so terrible if it’s an internal detail of std::path::windows::Path, with the display() method converting to real UTF-8. But then we might as well just use UCS-2 in Vec<u16> internally. (Not sure how that would interact with BytesContainer, though. Would windows::Path::new convert from UTF-8?)

@lilyball
Copy link
Contributor Author

lilyball commented Aug 30, 2014

Using UCS-2 means none of the *_str() methods would ever work. Encoding UCS-2 as an invalid UTF-8 as an internal implementation detail is something I considered a while back and is probably the best approach.

On Aug 30, 2014, at 2:13 AM, Simon Sapin [email protected] wrote:

Maybe it wouldn’t be so terrible if it’s an internal detail of std::path::windows::Path, with the display() method converting to real UTF-8. But then we might as well just use UCS-2 in Vec internally. (Not sure how that would interact with BytesContainer, though. Would windows::Path::new convert from UTF-8?)


Reply to this email directly or view it on GitHub.

@retep998
Copy link
Member

retep998 commented Sep 6, 2014

For comparison, MSVC C++ when converting an std::path to a utf-8 string will encode UCS-2 as invalid UTF-8, so there is precedent for us to follow this route.

@SimonSapin
Copy link
Contributor

I suggest calling this WTF-8: an encoding that uses the same algorithm as UTF-8, but has the "value space" of UCS-2. (That is, bigger than Unicode since unpaired surrogates are allowed.) We can decide later what the name an acronym for.

Note: I call UCS-2 here any sequence of 16 bit units. Surrogate pairs have a special meaning, but unpaired surrogates are allowed.

To convert UCS-2 to WTF-8:

  • Valid surrogate pairs are interpreted as a non-BMP code point and encoded as a 4-byte UTF-8 sequence
  • Any other 16 bit code unit, including lone surrogates, are encoded as sequence of 1 to 3 bytes using UTF-8’s algorithm. This is invalid UTF-8 for lone surrogates, but valid WTF-8.

To convert WTF-8 to UCS-2:

  • 4-byte sequences are interpreted as a non-BMP code point and encoded as a surrogate pair of 16 bit units
  • 1 to 3 bytes sequences in UTF-8’s algorithm, including surrogates, are encoded as a one 16 bit unit

Consecutive 3-byte sequences for a lead surrogate followed by a trail surrogate are invalid in WTF-8. A 4-byte sequence should be used instead. (This ensures that the WTF-8 encoding of any UCS-2 string is unique.)

WTF-8 has the same "value space" as UCS-2, which is bigger than Unicode (since they include unpaired surrogates.)

Any valid UTF-8 string is also valid WTF-8, with the same byte representation. A WTF-8 string is also valid UTF-8 if and only if its UCS-2 conversion is valid UTF-16.

To convert a WTF-8 to UTF-8, either:

  • Strictly: return an error if the string contains a 3-byte sequence for a surrogate code point, otherwise return the string unchanged
  • Lossily: replace any 3-byte sequence for a surrogate code point by 0xEF 0xBF 0xBD, the UTF-8 representation of the replacement character U+FFFD.

To concatenate two WTF-8 strings: if the earlier one ends with a lead surrogate and the latter one starts with a trail surrogate, both surrogate need to be removed and replaced with a 4-byte sequence.

Note: WTF-8 is different from CESU-8.

In terms of Rust implementation, WTF-8 data should be kept in a dedicated type that wraps a private Vec<u8> field, with APIs that maintain the encoding invariants, like String does for UTF-8.

@SimonSapin
Copy link
Contributor

I’m gonna work of WTF-8 anyway, for Servo to interact with JavaScript. ECMAScript clearly says that strings are sequences of 16-bit integers.

For Windows however, it’s not so clear to me that this is actually needed. MSDN claims that the encoding used in Windows is UTF-16. The documentation for functions converting to and from UTF-16 says:

Starting with Windows Vista, this function fully conforms with the Unicode 4.1 specification for UTF-8 and UTF-16. The function used on earlier operating systems encodes or decodes lone surrogate halves or mismatched surrogate pairs.

I believe (by opposition to the following sentence) that "fully conforms" here means replaces unpaired surrogates with U+FFFD. (I haven’t tested it, though.)

Now, we’ve seen that it’s possible to create a file with an invalid UTF-16 name. But it’s not easy, we’re sometimes prevented from doing it. It may be considered a bug that it was possible. The question is, and I couldn’t find an answer on MSDN, are Windows applications expected to handle such files correctly?

@retep998
Copy link
Member

According to MSDN

There is no need to perform any Unicode normalization on path and file name strings for use by the Windows file I/O API functions because the file system treats path and file names as an opaque sequence of WCHARs.

This isn't exactly UTF-16. So while the rest of WinAPI uses UTF-16, the file API doesn't. I've tested creating filenames with invalid surrogates in their name and I was able to create them and manipulate them easily through both code, and numerous applications including cmd.exe Notepad and Windows explorer. So far Windows 8.1 hasn't tried to stop me from using invalid surrogates for files, so I definitely think Windows applications are expected to handle such files correctly.

@SimonSapin
Copy link
Contributor

"Unicode normalization" refers to something else, but "opaque sequence of WCHARs" does indeed imply that unpaired surrogates can occur. Alright then, WTF-8 for Path it is.

@vadimcn
Copy link
Contributor

vadimcn commented Sep 27, 2014

So why use WTF-8 for internal representation, and not UCS-2 then? If WTFString is a distinct type from String, you might as well save on encoding/decoding them when interacting with Windows APIs.

@SimonSapin
Copy link
Contributor

As noted by @kballard above, none of the *_str methods would work with UCS-2 internally.

alexcrichton added a commit to alexcrichton/rust that referenced this issue Feb 4, 2015
This PR implements [path reform](rust-lang/rfcs#474), and motivation and details for the change can be found there.

For convenience, the old path API is being kept as `old_path` for the time being. Updating after this PR is just a matter of changing imports to `old_path` (which is likely not needed, since the prelude entries still export the old path API).

This initial PR does not include additional normalization or platform-specific path extensions. These will be done in follow up commits or PRs.

[breaking-change]

Closes rust-lang#20034
Closes rust-lang#12056
Closes rust-lang#11594
Closes rust-lang#14028
Closes rust-lang#14049
Closes rust-lang#10035
bors bot added a commit to cross-rs/cross that referenced this issue Jun 21, 2022
828: Avoid using display on paths. r=Emilgardis a=Alexhuszagh

Adds a helper trait `ToUtf8` which converts the path to UTF-8 if possible, and if not, creates a pretty debugging representation of the path. We can then change `display()` to `to_utf8()?` and be completely correct in all cases.

On POSIX systems, this doesn't matter since paths must be UTF-8 anyway, but on Windows some paths can be WTF-8 (see rust-lang/rust#12056). Either way, this can avoid creating a copy in cases, is always correct, and is more idiomatic about what we're doing. We might not be able to handle non-UTF-8 paths currently (like ISO-8859-1 paths, which are historically very common).

So, this doesn't change ergonomics: the resulting code is as compact and correct. It also requires less copies in most cases, which should be a good thing. But most importantly, if we're mounting data we can't silently fail or produce incorrect data. If something was lossily generated, we probably shouldn't try to mount with a replacement character, and also print more information than there was just an invalid Unicode character.

Co-authored-by: Alex Huszagh <[email protected]>
bors bot added a commit to cross-rs/cross that referenced this issue Jun 21, 2022
828: Avoid using display on paths. r=Emilgardis a=Alexhuszagh

Adds a helper trait `ToUtf8` which converts the path to UTF-8 if possible, and if not, creates a pretty debugging representation of the path. We can then change `display()` to `to_utf8()?` and be completely correct in all cases.

On POSIX systems, this doesn't matter since paths must be UTF-8 anyway, but on Windows some paths can be WTF-8 (see rust-lang/rust#12056). Either way, this can avoid creating a copy in cases, is always correct, and is more idiomatic about what we're doing. We might not be able to handle non-UTF-8 paths currently (like ISO-8859-1 paths, which are historically very common).

So, this doesn't change ergonomics: the resulting code is as compact and correct. It also requires less copies in most cases, which should be a good thing. But most importantly, if we're mounting data we can't silently fail or produce incorrect data. If something was lossily generated, we probably shouldn't try to mount with a replacement character, and also print more information than there was just an invalid Unicode character.

Co-authored-by: Alex Huszagh <[email protected]>
flip1995 pushed a commit to flip1995/rust that referenced this issue Jan 11, 2024
Fixes: rust-lang#12050 - `identity_op` correctly suggests a deference for coerced references

When `identity_op` identifies a `no_op`, provides a suggestion, it also checks the type of the type of the variable. If the variable is a reference that's been coerced into a value, e.g.

```
let x = &0i32;
let _ = x + 0;
```

the suggestion will now use a derefence. This is done by identifying whether the variable is a reference to an integral value, and then whether it gets dereferenced.

changelog: false positive: [`identity_op`]: corrected suggestion for reference coerced to value.

fixes: rust-lang#12050
Jayman2000 added a commit to Jayman2000/Descent3-pr that referenced this issue Aug 19, 2024
Before this change, Base_directory was a char array. In general, it’s
better to use an std::filesystem::path to represent paths. Here’s why:

• char arrays have a limited size. If a user creates a file or directory
that has a very long path, then it might not be possible to store that
path in a given char array. You can try to make the char array long
enough to store the longest possible path, but future operating systems
may increase that limit. With std::filesystem::paths, you don’t have
to worry about path length limits.

• char arrays cannot necessarily represent all paths. We try our best to
use UTF-8 for all char arrays [1], but unfortunately, it’s possible to
create a path that cannot be represent using UTF-8 [2]. With an
std::filesystem::paths, stuff like that is less likely to happen because
std::filesystem::path::value_type is platform-specific.

• char arrays don’t have any built-in mechanisms for manipulating paths.
At the moment, we work around this problem with things like the ddio
module. If we make sure that we always use std::filesystem::paths, then
we’ll be able to get rid of or simplify a lot of the ddio module.

[1]: 7cd7902 (Merge pull request DescentDevelopers#494 from Jayman2000/encoding-improvements, 2024-07-20)
[2]: <rust-lang/rust#12056>
Jayman2000 added a commit to Jayman2000/Descent3-pr that referenced this issue Aug 20, 2024
Before this change, Base_directory was a char array. In general, it’s
better to use an std::filesystem::path to represent paths. Here’s why:

• char arrays have a limited size. If a user creates a file or directory
that has a very long path, then it might not be possible to store that
path in a given char array. You can try to make the char array long
enough to store the longest possible path, but future operating systems
may increase that limit. With std::filesystem::paths, you don’t have
to worry about path length limits.

• char arrays cannot necessarily represent all paths. We try our best to
use UTF-8 for all char arrays [1], but unfortunately, it’s possible to
create a path that cannot be represent using UTF-8 [2]. With an
std::filesystem::paths, stuff like that is less likely to happen because
std::filesystem::path::value_type is platform-specific.

• char arrays don’t have any built-in mechanisms for manipulating paths.
Before this commit, the code would work around this problem in two
different ways:

  1. It would use Descent 3–specific functions that implement path
  operations on char arrays/pointers (for example, ddio_MakePath()).
  Once all paths use std::filesystem::path, we’ll be able to simplify
  the code by removing those functions.

  2. It would temporarily convert Base_directory into an
  std::filesystem::path. This commit simplifies the code by removing
  those temporary conversions because they are no longer necessary.

[1]: 7cd7902 (Merge pull request DescentDevelopers#494 from Jayman2000/encoding-improvements, 2024-07-20)
[2]: <rust-lang/rust#12056>
Jayman2000 added a commit to Jayman2000/Descent3-pr that referenced this issue Aug 20, 2024
Before this change, Base_directory was a char array. In general, it’s
better to use an std::filesystem::path to represent paths. Here’s why:

• char arrays have a limited size. If a user creates a file or directory
that has a very long path, then it might not be possible to store that
path in a given char array. You can try to make the char array long
enough to store the longest possible path, but future operating systems
may increase that limit. With std::filesystem::paths, you don’t have
to worry about path length limits.

• char arrays cannot necessarily represent all paths. We try our best to
use UTF-8 for all char arrays [1], but unfortunately, it’s possible to
create a path that cannot be represent using UTF-8 [2]. With an
std::filesystem::paths, stuff like that is less likely to happen because
std::filesystem::path::value_type is platform-specific.

• char arrays don’t have any built-in mechanisms for manipulating paths.
Before this commit, the code would work around this problem in two
different ways:

  1. It would use Descent 3–specific functions that implement path
  operations on char arrays/pointers (for example, ddio_MakePath()).
  Once all paths use std::filesystem::path, we’ll be able to simplify
  the code by removing those functions.

  2. It would temporarily convert Base_directory into an
  std::filesystem::path. This commit simplifies the code by removing
  those temporary conversions because they are no longer necessary.

[1]: 7cd7902 (Merge pull request DescentDevelopers#494 from Jayman2000/encoding-improvements, 2024-07-20)
[2]: <rust-lang/rust#12056>
Jayman2000 added a commit to Jayman2000/Descent3-pr that referenced this issue Aug 20, 2024
Before this change, Base_directory was a char array. In general, it’s
better to use an std::filesystem::path to represent paths. Here’s why:

• char arrays have a limited size. If a user creates a file or directory
that has a very long path, then it might not be possible to store that
path in a given char array. You can try to make the char array long
enough to store the longest possible path, but future operating systems
may increase that limit. With std::filesystem::paths, you don’t have
to worry about path length limits.

• char arrays cannot necessarily represent all paths. We try our best to
use UTF-8 for all char arrays [1], but unfortunately, it’s possible to
create a path that cannot be represent using UTF-8 [2]. With an
std::filesystem::paths, stuff like that is less likely to happen because
std::filesystem::path::value_type is platform-specific.

• char arrays don’t have any built-in mechanisms for manipulating paths.
Before this commit, the code would work around this problem in two
different ways:

  1. It would use Descent 3–specific functions that implement path
  operations on char arrays/pointers (for example, ddio_MakePath()).
  Once all paths use std::filesystem::path, we’ll be able to simplify
  the code by removing those functions.

  2. It would temporarily convert Base_directory into an
  std::filesystem::path. This commit simplifies the code by removing
  those temporary conversions because they are no longer necessary.

[1]: 7cd7902 (Merge pull request DescentDevelopers#494 from Jayman2000/encoding-improvements, 2024-07-20)
[2]: <rust-lang/rust#12056>
Jayman2000 added a commit to Jayman2000/Descent3-pr that referenced this issue Aug 22, 2024
Before this change, Base_directory was a char array. In general, it’s
better to use an std::filesystem::path to represent paths. Here’s why:

• char arrays have a limited size. If a user creates a file or directory
that has a very long path, then it might not be possible to store that
path in a given char array. You can try to make the char array long
enough to store the longest possible path, but future operating systems
may increase that limit. With std::filesystem::paths, you don’t have
to worry about path length limits.

• char arrays cannot necessarily represent all paths. We try our best to
use UTF-8 for all char arrays [1], but unfortunately, it’s possible to
create a path that cannot be represent using UTF-8 [2]. With an
std::filesystem::paths, stuff like that is less likely to happen because
std::filesystem::path::value_type is platform-specific.

• char arrays don’t have any built-in mechanisms for manipulating paths.
Before this commit, the code would work around this problem in two
different ways:

  1. It would use Descent 3–specific functions that implement path
  operations on char arrays/pointers (for example, ddio_MakePath()).
  Once all paths use std::filesystem::path, we’ll be able to simplify
  the code by removing those functions.

  2. It would temporarily convert Base_directory into an
  std::filesystem::path. This commit simplifies the code by removing
  those temporary conversions because they are no longer necessary.

[1]: 7cd7902 (Merge pull request DescentDevelopers#494 from Jayman2000/encoding-improvements, 2024-07-20)
[2]: <rust-lang/rust#12056>
Jayman2000 added a commit to Jayman2000/Descent3-pr that referenced this issue Sep 12, 2024
Before this change, Base_directory was a char array. In general, it’s
better to use an std::filesystem::path to represent paths. Here’s why:

• char arrays have a limited size. If a user creates a file or directory
that has a very long path, then it might not be possible to store that
path in a given char array. You can try to make the char array long
enough to store the longest possible path, but future operating systems
may increase that limit. With std::filesystem::paths, you don’t have
to worry about path length limits.

• char arrays cannot necessarily represent all paths. We try our best to
use UTF-8 for all char arrays [1], but unfortunately, it’s possible to
create a path that cannot be represent using UTF-8 [2]. With an
std::filesystem::paths, stuff like that is less likely to happen because
std::filesystem::path::value_type is platform-specific.

• char arrays don’t have any built-in mechanisms for manipulating paths.
Before this commit, the code would work around this problem in two
different ways:

  1. It would use Descent 3–specific functions that implement path
  operations on char arrays/pointers (for example, ddio_MakePath()).
  Once all paths use std::filesystem::path, we’ll be able to simplify
  the code by removing those functions.

  2. It would temporarily convert Base_directory into an
  std::filesystem::path. This commit simplifies the code by removing
  those temporary conversions because they are no longer necessary.

[1]: 7cd7902 (Merge pull request DescentDevelopers#494 from Jayman2000/encoding-improvements, 2024-07-20)
[2]: <rust-lang/rust#12056>

TODO:
• Test on Windows.
Jayman2000 added a commit to Jayman2000/Descent3-pr that referenced this issue Sep 12, 2024
Before this change, Base_directory was a char array. In general, it’s
better to use an std::filesystem::path to represent paths. Here’s why:

• char arrays have a limited size. If a user creates a file or directory
that has a very long path, then it might not be possible to store that
path in a given char array. You can try to make the char array long
enough to store the longest possible path, but future operating systems
may increase that limit. With std::filesystem::paths, you don’t have
to worry about path length limits.

• char arrays cannot necessarily represent all paths. We try our best to
use UTF-8 for all char arrays [1], but unfortunately, it’s possible to
create a path that cannot be represent using UTF-8 [2]. With an
std::filesystem::paths, stuff like that is less likely to happen because
std::filesystem::path::value_type is platform-specific.

• char arrays don’t have any built-in mechanisms for manipulating paths.
Before this commit, the code would work around this problem in two
different ways:

  1. It would use Descent 3–specific functions that implement path
  operations on char arrays/pointers (for example, ddio_MakePath()).
  Once all paths use std::filesystem::path, we’ll be able to simplify
  the code by removing those functions.

  2. It would temporarily convert Base_directory into an
  std::filesystem::path. This commit simplifies the code by removing
  those temporary conversions because they are no longer necessary.

[1]: 7cd7902 (Merge pull request DescentDevelopers#494 from Jayman2000/encoding-improvements, 2024-07-20)
[2]: <rust-lang/rust#12056>
Jayman2000 added a commit to Jayman2000/Descent3-pr that referenced this issue Sep 22, 2024
Before this change, Base_directory was a char array. In general, it’s
better to use an std::filesystem::path to represent paths. Here’s why:

• char arrays have a limited size. If a user creates a file or directory
that has a very long path, then it might not be possible to store that
path in a given char array. You can try to make the char array long
enough to store the longest possible path, but future operating systems
may increase that limit. With std::filesystem::paths, you don’t have
to worry about path length limits.

• char arrays cannot necessarily represent all paths. We try our best to
use UTF-8 for all char arrays [1], but unfortunately, it’s possible to
create a path that cannot be represent using UTF-8 [2]. With an
std::filesystem::paths, stuff like that is less likely to happen because
std::filesystem::path::value_type is platform-specific.

• char arrays don’t have any built-in mechanisms for manipulating paths.
Before this commit, the code would work around this problem in two
different ways:

  1. It would use Descent 3–specific functions that implement path
  operations on char arrays/pointers (for example, ddio_MakePath()).
  Once all paths use std::filesystem::path, we’ll be able to simplify
  the code by removing those functions.

  2. It would temporarily convert Base_directory into an
  std::filesystem::path. This commit simplifies the code by removing
  those temporary conversions because they are no longer necessary.

[1]: 7cd7902 (Merge pull request DescentDevelopers#494 from Jayman2000/encoding-improvements, 2024-07-20)
[2]: <rust-lang/rust#12056>
Jayman2000 added a commit to Jayman2000/Descent3-pr that referenced this issue Sep 29, 2024
Before this change, Base_directory was a char array. In general, it’s
better to use an std::filesystem::path to represent paths. Here’s why:

• char arrays have a limited size. If a user creates a file or directory
that has a very long path, then it might not be possible to store that
path in a given char array. You can try to make the char array long
enough to store the longest possible path, but future operating systems
may increase that limit. With std::filesystem::paths, you don’t have
to worry about path length limits.

• char arrays cannot necessarily represent all paths. We try our best to
use UTF-8 for all char arrays [1], but unfortunately, it’s possible to
create a path that cannot be represent using UTF-8 [2]. With an
std::filesystem::paths, stuff like that is less likely to happen because
std::filesystem::path::value_type is platform-specific.

• char arrays don’t have any built-in mechanisms for manipulating paths.
Before this commit, the code would work around this problem in two
different ways:

  1. It would use Descent 3–specific functions that implement path
  operations on char arrays/pointers (for example, ddio_MakePath()).
  Once all paths use std::filesystem::path, we’ll be able to simplify
  the code by removing those functions.

  2. It would temporarily convert Base_directory into an
  std::filesystem::path. This commit simplifies the code by removing
  those temporary conversions because they are no longer necessary.

[1]: 7cd7902 (Merge pull request DescentDevelopers#494 from Jayman2000/encoding-improvements, 2024-07-20)
[2]: <rust-lang/rust#12056>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
O-windows Operating system: Windows
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants