Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OsStrExt3 transmutes from an &[u8] to a OsStr #1524

Closed
BurntSushi opened this issue Jul 22, 2019 · 11 comments
Closed

OsStrExt3 transmutes from an &[u8] to a OsStr #1524

BurntSushi opened this issue Jul 22, 2019 · 11 comments

Comments

@BurntSushi
Copy link
Contributor

BurntSushi commented Jul 22, 2019

In this code:

clap/src/osstringext.rs

Lines 23 to 32 in 784524f

#[cfg(any(target_os = "windows", target_arch = "wasm32"))]
impl OsStrExt3 for OsStr {
fn from_bytes(b: &[u8]) -> &Self {
use std::mem;
unsafe { mem::transmute(b) }
}
fn as_bytes(&self) -> &[u8] {
self.to_str().map(|s| s.as_bytes()).expect(INVALID_UTF8)
}
}

the transmute casts the &[u8] to a &OsStr. There are a couple problems with this:

  1. This is not actually a safe thing to do, since &[u8] can be an arbitrary sequence of bytes, where as &OsStr cannot on Windows. On Windows, it internally is WTF-8 and it's not clear what, if anything, goes wrong when it isn't WTF-8. (But if it isn't WTF-8, then it could very well break a perfectly valid internal invariant that leads to UB.) A plausible alternative is to make from_bytes unsafe.
  2. The fact that an &OsStr is internally a &[u8] on Windows that is WTF-8 is an implementation detail, and could actually change, leading to an incorrect transmute.

Is this code still present in clap 3? If so, could someone explain the motivation for this? I'd be happy to try to help brainstorm ways of removing it.

@Freaky
Copy link

Freaky commented Jul 22, 2019

It's still present in v3-master. It's used to handle things like --path=bla and -ofoo.txt, hence trimming -'s and splitting at ='s or at specific positions.

These need redoing to use encode_wide()/decode_wide() I guess.

Out of interest, I just threw together this, which still assumes UTF/WTF-8, but at least tries to uphold the invariants. My editor crashed three times writing it but I'm sure it's fiOH GOD RAPTORS.

@BurntSushi
Copy link
Contributor Author

Yeah, I just don't think transmuting like this is a good idea. WTF-8 is an internal detail, and having an important crate in the ecosystem rely it is just not a good idea. The simplest way around this isn't to use encode_wide/decode_wide (because then you'd have to do everything in UCS-2 space), but rather, to just lossily decode the OsStr to UTF-8. In the vast majority of cases, this would only entail a UTF-8 check on Windows, and in the rare case where you have a OsStr with invalid UTF-16, you incur the allocation.

@Freaky
Copy link

Freaky commented Jul 22, 2019

Without encode/decode_wide I don't see any way to handle it both correctly and safely. Lossy decoding should be right out, because it implies corrupting some valid program arguments, so we're just left with panicking on invalid UTF-8 on Windows.

@BurntSushi
Copy link
Contributor Author

I am operating at a loss here, because I don't understand the context in which these routines are being used. Certainly, what you're saying is not generally true. For example, starts_with can be implemented on Windows by lossily decoded the OsStr to UTF-8. The only case you'd miss out on is when the starts_with parameter contains a WTF-8 encoding of invalid UTF-16. So the question then becomes, in what circumstances does the parameter to starts_with actually contain meaningful WTF-8? Ideally, the answer to that is "never."

Taking a step back and re-reading your comment above, maybe now I'm starting to realize here. Specifically, I guess the OsStr is the thing itself that you need to parse, so even if you did a lossy decode, that wouldn't help you, because you wouldn't be able to easily move back to OsStr when it comes time to hand the arguments back to the caller.

I think I now see the predicament, I think you're right. I see three possible choices:

  1. Push on OsStr to get string-like methods. I think there was an RFC for it, but I'm not sure if there has been any movement on it.
  2. Use encode_wide to get UCS-2 code units and parse those directly. Then use decode_wide to convert the arguments back to OsString.
  3. Use encode_wide, re-implement WTF-8 and translate the UCS-2 code units to WTF-8, and then treat it as you are today. To get back, you then need to re-apply decode_wide to get OsString for every argument.

The latter two imply quite a bit of work, and at least some additional performance overhead. However, the performance overhead would only occur when the OsStr couldn't be translated to UTF-8. When it can be converted to UTF-8, then you'd just do that.

cc @SimonSapin - As the architect of WTF-8, what do you think the suggest path here should be? (I still continue to think the internal representation should just be exposed, despite the strong principles against doing so. The fact that we have people transmuting to the internal representation is going to make it de facto exposed eventually anyway.)

@Freaky
Copy link

Freaky commented Jul 23, 2019

As an exercise I tried implementing the trait methods safely and came up with this.

On Unix it just deals with byte slices, on Windows it tries converting to a str and falls back to a Vec<u16>, with the methods returning Cow<OsStr>.

@BurntSushi
Copy link
Contributor Author

That's a lot more code, but on an initial skim, it looks good? It does seem unfortunate that everyone has to pay for the Cow though, although the cost should be pretty small.

@SimonSapin
Copy link

We have an RFC that changes the memory representation of OsStr on Windows: https://rust-lang.github.io/rfcs/2295-os-str-pattern.html / rust-lang/rust#49802. This RFC is accepted, but not implemented yet.

Officially exposing the byte representation would likely make this kind of change a breaking change. I would much prefer adding string-like methods to OsStr. In fact this is exactly what this RFC does. But indeed there hasn’t been much activity to implement since it was accepted a while ago.

@Freaky
Copy link

Freaky commented Aug 1, 2019

Had a go at integrating OsStrOps and came up with that. Can surely be improved, but passes the test suite on Windows and FreeBSD.

@Freaky
Copy link

Freaky commented Aug 1, 2019

Sorry for the force-pushes, clearing a few leftovers I missed.

@dylni
Copy link
Contributor

dylni commented Dec 1, 2019

If you're interested, I created OsStr Bytes to solve this problem. It allows accessing the bytes of OsStr and OsString safely, without making assumptions about how they're represented.

@CreepySkeleton
Copy link
Contributor

Closing in favor of #1594

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants