OsStrExt3 transmutes from an &[u8] to a OsStr #1524

BurntSushi · 2019-07-22T14:33:02Z

In this code:

Lines 23 to 32 in 784524f

    
           #[cfg(any(target_os = "windows", target_arch = "wasm32"))] 
        
           impl OsStrExt3 for OsStr { 
        
               fn from_bytes(b: &[u8]) -> &Self { 
        
                   use std::mem; 
        
                   unsafe { mem::transmute(b) } 
        
               } 
        
               fn as_bytes(&self) -> &[u8] { 
        
                   self.to_str().map(|s| s.as_bytes()).expect(INVALID_UTF8) 
        
               } 
        
           }

the transmute casts the &[u8] to a &OsStr. There are a couple problems with this:

This is not actually a safe thing to do, since &[u8] can be an arbitrary sequence of bytes, where as &OsStr cannot on Windows. On Windows, it internally is WTF-8 and it's not clear what, if anything, goes wrong when it isn't WTF-8. (But if it isn't WTF-8, then it could very well break a perfectly valid internal invariant that leads to UB.) A plausible alternative is to make from_bytes unsafe.
The fact that an &OsStr is internally a &[u8] on Windows that is WTF-8 is an implementation detail, and could actually change, leading to an incorrect transmute.

Is this code still present in clap 3? If so, could someone explain the motivation for this? I'd be happy to try to help brainstorm ways of removing it.

The text was updated successfully, but these errors were encountered:

Freaky · 2019-07-22T16:46:04Z

It's still present in v3-master. It's used to handle things like --path=bla and -ofoo.txt, hence trimming -'s and splitting at ='s or at specific positions.

These need redoing to use encode_wide()/decode_wide() I guess.

Out of interest, I just threw together this, which still assumes UTF/WTF-8, but at least tries to uphold the invariants. My editor crashed three times writing it but I'm sure it's fiOH GOD RAPTORS.

BurntSushi · 2019-07-22T17:12:32Z

Yeah, I just don't think transmuting like this is a good idea. WTF-8 is an internal detail, and having an important crate in the ecosystem rely it is just not a good idea. The simplest way around this isn't to use encode_wide/decode_wide (because then you'd have to do everything in UCS-2 space), but rather, to just lossily decode the OsStr to UTF-8. In the vast majority of cases, this would only entail a UTF-8 check on Windows, and in the rare case where you have a OsStr with invalid UTF-16, you incur the allocation.

Freaky · 2019-07-22T17:51:03Z

Without encode/decode_wide I don't see any way to handle it both correctly and safely. Lossy decoding should be right out, because it implies corrupting some valid program arguments, so we're just left with panicking on invalid UTF-8 on Windows.

BurntSushi · 2019-07-22T18:08:57Z

I am operating at a loss here, because I don't understand the context in which these routines are being used. Certainly, what you're saying is not generally true. For example, starts_with can be implemented on Windows by lossily decoded the OsStr to UTF-8. The only case you'd miss out on is when the starts_with parameter contains a WTF-8 encoding of invalid UTF-16. So the question then becomes, in what circumstances does the parameter to starts_with actually contain meaningful WTF-8? Ideally, the answer to that is "never."

Taking a step back and re-reading your comment above, maybe now I'm starting to realize here. Specifically, I guess the OsStr is the thing itself that you need to parse, so even if you did a lossy decode, that wouldn't help you, because you wouldn't be able to easily move back to OsStr when it comes time to hand the arguments back to the caller.

I think I now see the predicament, I think you're right. I see three possible choices:

Push on OsStr to get string-like methods. I think there was an RFC for it, but I'm not sure if there has been any movement on it.
Use encode_wide to get UCS-2 code units and parse those directly. Then use decode_wide to convert the arguments back to OsString.
Use encode_wide, re-implement WTF-8 and translate the UCS-2 code units to WTF-8, and then treat it as you are today. To get back, you then need to re-apply decode_wide to get OsString for every argument.

The latter two imply quite a bit of work, and at least some additional performance overhead. However, the performance overhead would only occur when the OsStr couldn't be translated to UTF-8. When it can be converted to UTF-8, then you'd just do that.

cc @SimonSapin - As the architect of WTF-8, what do you think the suggest path here should be? (I still continue to think the internal representation should just be exposed, despite the strong principles against doing so. The fact that we have people transmuting to the internal representation is going to make it de facto exposed eventually anyway.)

Freaky · 2019-07-23T16:28:34Z

As an exercise I tried implementing the trait methods safely and came up with this.

On Unix it just deals with byte slices, on Windows it tries converting to a str and falls back to a Vec<u16>, with the methods returning Cow<OsStr>.

BurntSushi · 2019-07-23T16:38:10Z

That's a lot more code, but on an initial skim, it looks good? It does seem unfortunate that everyone has to pay for the Cow though, although the cost should be pretty small.

SimonSapin · 2019-07-23T16:50:22Z

We have an RFC that changes the memory representation of OsStr on Windows: https://rust-lang.github.io/rfcs/2295-os-str-pattern.html / rust-lang/rust#49802. This RFC is accepted, but not implemented yet.

Officially exposing the byte representation would likely make this kind of change a breaking change. I would much prefer adding string-like methods to OsStr. In fact this is exactly what this RFC does. But indeed there hasn’t been much activity to implement since it was accepted a while ago.

…rs#1524)

Freaky · 2019-08-01T15:58:18Z

Had a go at integrating OsStrOps and came up with that. Can surely be improved, but passes the test suite on Windows and FreeBSD.

…rs#1524)

Freaky · 2019-08-01T16:13:53Z

Sorry for the force-pushes, clearing a few leftovers I missed.

dylni · 2019-12-01T23:07:19Z

If you're interested, I created OsStr Bytes to solve this problem. It allows accessing the bytes of OsStr and OsString safely, without making assumptions about how they're represented.

CreepySkeleton · 2020-02-01T13:04:05Z

Closing in favor of #1594

Freaky added a commit to Freaky/clap that referenced this issue Aug 1, 2019

First pass at replacing the unsafe transmuting OsStrExt3 trait (clap-…

36981aa

…rs#1524)

Freaky added a commit to Freaky/clap that referenced this issue Aug 1, 2019

First pass at replacing the unsafe transmuting OsStrExt3 trait (clap-…

2ffa008

…rs#1524)

Freaky added a commit to Freaky/clap that referenced this issue Aug 1, 2019

First pass at replacing the unsafe transmuting OsStrExt3 trait (clap-…

b246960

…rs#1524)

Freaky added a commit to Freaky/clap that referenced this issue Aug 1, 2019

First pass at replacing the unsafe transmuting OsStrExt3 trait (clap-…

8ed2b48

…rs#1524)

Freaky mentioned this issue Aug 1, 2019

Documentation should mention how to accept filenames as arguments. murarth/gumdrop#15

Open

CreepySkeleton closed this as completed Feb 1, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OsStrExt3 transmutes from an &[u8] to a OsStr #1524

OsStrExt3 transmutes from an &[u8] to a OsStr #1524

BurntSushi commented Jul 22, 2019 •

edited

Loading

Freaky commented Jul 22, 2019

BurntSushi commented Jul 22, 2019

Freaky commented Jul 22, 2019

BurntSushi commented Jul 22, 2019

Freaky commented Jul 23, 2019

BurntSushi commented Jul 23, 2019

SimonSapin commented Jul 23, 2019

Freaky commented Aug 1, 2019

Freaky commented Aug 1, 2019

dylni commented Dec 1, 2019

CreepySkeleton commented Feb 1, 2020

OsStrExt3 transmutes from an &[u8] to a OsStr #1524

OsStrExt3 transmutes from an &[u8] to a OsStr #1524

Comments

BurntSushi commented Jul 22, 2019 • edited Loading

Freaky commented Jul 22, 2019

BurntSushi commented Jul 22, 2019

Freaky commented Jul 22, 2019

BurntSushi commented Jul 22, 2019

Freaky commented Jul 23, 2019

BurntSushi commented Jul 23, 2019

SimonSapin commented Jul 23, 2019

Freaky commented Aug 1, 2019

Freaky commented Aug 1, 2019

dylni commented Dec 1, 2019

CreepySkeleton commented Feb 1, 2020

BurntSushi commented Jul 22, 2019 •

edited

Loading