Allow limited access to `OsStr` bytes #109698

epage · 2023-03-28T13:34:51Z

OsStr has historically kept its implementation details private out of
concern for locking us into a specific encoding on Windows.

This is an alternative to #95290 which proposed specifying the encoding on Windows. Instead, this
only specifies that for cross-platform code, OsStr's encoding is a superset of UTF-8 and defines
rules for safely interacting with it

At minimum, this can greatly simplify the os_str_bytes crate and every
arg parser that interacts with OsStr directly (which is most of those
that support invalid UTF-8).

Tracking issue: #111544

`OsStr` has historically kept its implementation details private out of concern for locking us into a specific encoding on Windows. This is an alternative to rust-lang#95290 which proposed specifying the encoding on Windows. Instead, this only specifies that for cross-platform code, `OsStr`'s encoding is a superset of UTF-8 and defines rules for safely interacting with it At minimum, this can greatly simplify the `os_str_bytes` crate and every arg parser that interacts with `OsStr` directly (which is most of those that support invalid UTF-8).

rustbot · 2023-03-28T13:35:05Z

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @cuviper (or someone else) soon.

Please see the contribution instructions for more information. Namely, in order to ensure the minimum review times lag, PR authors and assigned reviewers should ensure that the review label (S-waiting-on-review and S-waiting-on-author) stays updated, invoking these commands when appropriate:

@rustbot author: the review is finished, PR author should check the comments and take action accordingly
@rustbot review: the author is ready for a review, this PR will be queued again in the reviewer's queue

rustbot · 2023-03-28T13:35:09Z

Hey! It looks like you've submitted a new PR for the library teams!

If this PR contains changes to any rust-lang/rust public library APIs then please comment with @rustbot label +T-libs-api -T-libs to tag it appropriately. If this PR contains changes to any unstable APIs please edit the PR description to add a link to the relevant API Change Proposal or create one if you haven't already. If you're unsure where your change falls no worries, just leave it as is and the reviewer will take a look and make a decision to forward on if necessary.

Examples of T-libs-api changes:

Stabilizing library features
Introducing insta-stable changes such as new implementations of existing stable traits on existing stable types
Introducing new or changing existing unstable library APIs (excluding permanently unstable features / features without a tracking issue)
Changing public documentation in ways that create new stability guarantees
Changing observable runtime behavior of library APIs

cuviper · 2023-03-28T14:27:47Z

I think this counts as new guarantee for API purposes.

@rustbot label +T-libs-api -T-libs
r? libs-api

ChrisDenton · 2023-03-28T14:42:10Z

Out of interest, was there anything in particular that motivated you to propose this now?

the8472 · 2023-03-28T21:04:02Z

I think this should come with an example since it's a wide pointer. Conversion from C to Rust has to jump through extra hoops.

joshtriplett · 2023-03-29T08:03:26Z

I'm personally amenable to adding this guarantee, but I do agree with @the8472 that we need to confirm that this will actually work as intended.

epage · 2023-03-31T02:17:33Z

Sorry for the lack of details and confusion on this. I was under the impression that working towards my final goal would best be done in smaller steps of defining OsStr but that meant I didn't have a use case for it which I recognize can be frustrating to deal with when evaluating a major change like this. I've since updated the documentation and PR to reflect the minimum of what I need for my use case: clap and other CLI tools that need to deal with parsing and splitting up OsStrs for however much valid UTF-8 is in them.

My hope that this more limited alternative to #95290 will have a chance to move forward.

ChrisDenton · 2023-03-31T15:00:39Z

If we do this, I wonder if it makes sense to also offer a split_at method on OsStr that panics if used incorrectly. I.e.:

// Similar to `str::split_at` and `[T]::split_at`
// Panics if `mid` is not on a UTF-8 code point boundary.
fn split_at(&self, mid: usize) -> (&OsStr, &OsStr);

Splitting and joining are the two hazard areas when dealing with known valid OsStrs (whatever form they're in). We have OsString::push for the joining case but nothing to help with splitting. I think it would be unfortunate if we don't have some safe way to check you're doing it right.

epage · 2023-03-31T15:16:04Z

but nothing to help with splitting. I think it would be unfortunate if we don't have some safe way to check you're doing it right.

A part of me finds it weird to offer a safe function like split_at that takes a parameter (mid) that can only be calculated from another type derived from with unsafe code. Documenting use of split_at would then further raise visibility of this feature, for both good (people getting stuff done) and bad (risk of misuse). However, I do see the benefit as that does greatly reduce the risk of something going wrong. In my experiments with transmuting &OsStr, I wrapped all byte operations in functions that only allowed deriving mid from &str parameters (e.g. OsStrExt::strip_prefix(&self, prefix: &str)) so I knew I was splitting at a safe boundary.

EDIT: To make clear, I'm open to adding that function if that is the direction we want to take this.

ChrisDenton · 2023-03-31T15:46:53Z

Hm, we could offer some safe way to get an index according to these rules. E.g.

// Returns the byte index where `searcher` first returns true.
// `searcher` can inspect `char`s decoded from UTF-8.
// Non UTF-8 encoded characters are skipped.
fn find_char(&self, searcher: FnMut(char) -> bool) -> Option<usize>;

But at this point I'll stop because I'm practically designing yet another alternative approach.

chorman0773 · 2023-04-11T16:11:02Z

Does this confirm that OsStr is a "bag of bytes" on any platform, in that you can put and leave any (init, non-pointer) bytes in the slice, or am I misinterpreting the proposal?

epage · 2023-04-11T16:16:47Z

Does this confirm that OsStr is a "bag of bytes" on any platform, in that you can put and leave any (init, non-pointer) bytes in the slice, or am I misinterpreting the proposal?

For now, this proposals restricts what is safe to transmute from &[u8] to &OsStr, so this does not work as an arbitrary bag of bytes

//! - When [transmuting] from `&[u8]` to `&OsStr`,
//!   - the slice may only include content from comparable `&OsStr` (see above) or be valid UTF-8
//!   - any splits of the `&OsStr` must be along char boundaries (the first byte of a UTF-8 code
//!     point sequence)

BurntSushi · 2023-04-11T16:59:38Z

I think I buy the rules as written, but I would like to see a doc example using unsafe, along with a SAFETY comment justifying why it's OK. I think that would also help others write correct code.

Also, cc @SimonSapin to see what you think about this. (Let me know if you want me to stop pinging you about this, but I always think of you as the champion against a change like this.)

Popping up a level, this is kind of an interesting use of unsafe, isn't it? It's carving out a strictly more conservative set of things the caller needs to uphold than what is actually true. I wonder, for example, how Miri might detect UB in a case like this.

blyxxyz · 2023-04-11T17:21:43Z

Does this confirm that OsStr is a "bag of bytes" on any platform, in that you can put and leave any (init, non-pointer) bytes in the slice, or am I misinterpreting the proposal?

It actually isn't a bag of bytes on Windows, it must be WTF-8-encoded or you can start reading out of bounds:

fn main() {
    let b: &[u8] = b"\xC2";
    let s: &std::ffi::OsStr = unsafe { std::mem::transmute(b) };
    dbg!(s);
}

$ cargo run -q --target x86_64-pc-windows-gnu
[src/main.rs:4] s = "\u{9b}:]  = \n\0\0\0\0\0\0\01\u{10}L[...]thread 'main' panicked at 'failed printing to stderr: Windows stdio in console mode does not support writing non-UTF-8 byte sequences', library\std\src\io\stdio.rs:1008:9

The proposal is phrased to never produce invalid WTF-8.

I think that as written now it allows for any OsStr encoding that's a self-synchronizing superset of UTF-8.

asquared31415 · 2023-04-11T18:02:47Z

When [transmuting] from &[u8] to &OsStr,

This wording would provide an additional guarantee that &[u8] and &OsStr have the same layout in terms of their (ptr, len) pairs. As far as I am aware, this is something that is not currently guaranteed, and I don't recall the current stance on guaranteeing layout compatibility between any two fat pointer types that have different pointees.

epage · 2023-05-30T00:12:12Z

I've addressed what caused the wasm build failure and it should be good to go again.

Amanieu · 2023-05-30T16:23:06Z

@bors r+

bors · 2023-05-30T16:23:08Z

📌 Commit e6a35c4 has been approved by Amanieu

It is now in the queue for this repository.

Allow limited access to `OsStr` bytes `OsStr` has historically kept its implementation details private out of concern for locking us into a specific encoding on Windows. This is an alternative to rust-lang#95290 which proposed specifying the encoding on Windows. Instead, this only specifies that for cross-platform code, `OsStr`'s encoding is a superset of UTF-8 and defines rules for safely interacting with it At minimum, this can greatly simplify the `os_str_bytes` crate and every arg parser that interacts with `OsStr` directly (which is most of those that support invalid UTF-8). Tracking issue: rust-lang#111544

bors · 2023-05-30T21:26:25Z

⌛ Testing commit e6a35c4 with merge 9610dfe...

bors · 2023-05-31T00:24:33Z

☀️ Test successful - checks-actions
Approved by: Amanieu
Pushing 9610dfe to master...

rust-timer · 2023-05-31T01:41:03Z

Finished benchmarking commit (9610dfe): comparison URL.

Overall result: no relevant changes - no action needed

@rustbot label: -perf-regression

Instruction count

This benchmark run did not return any relevant results for this metric.

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-3.5%	[-4.6%, -2.5%]	2
All ❌✅ (primary)	-	-	0

Cycles

This benchmark run did not return any relevant results for this metric.

Binary size

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	0.0%	[0.0%, 0.1%]	5
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-0.1%	[-0.1%, -0.1%]	3
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	-0.0%	[-0.1%, 0.1%]	8

Bootstrap: 642.846s -> 643.715s (0.14%)

This extends rust-lang#109698 to allow no-cost conversion between `Vec<u8>` and `OsString` as suggested in feedback from `os_str_bytes` crate in rust-lang#111544.

Allow limited access to `OsString` bytes This extends rust-lang#109698 to allow no-cost conversion between `Vec<u8>` and `OsString` as suggested in feedback from `os_str_bytes` crate in rust-lang#111544.

rustbot assigned cuviper Mar 28, 2023

rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Mar 28, 2023

epage force-pushed the wtf branch 2 times, most recently from 01f8d93 to 0d87d66 Compare March 28, 2023 14:26

rustbot added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. and removed T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Mar 28, 2023

rustbot assigned Amanieu and unassigned cuviper Mar 28, 2023

epage force-pushed the wtf branch from 0d87d66 to 7f59b50 Compare March 31, 2023 02:11

epage changed the title ~~Allow FFI support for OsStr~~ Allow limited access to OsStr bytes in unsafe blocks Mar 31, 2023

epage force-pushed the wtf branch from 7f59b50 to bd862e8 Compare March 31, 2023 02:18

epage mentioned this pull request Apr 7, 2023

feat: Add --file-list <PATH> to choose files to check crate-ci/typos#708

Merged

Amanieu added the I-libs-api-nominated Nominated for discussion during a libs-api team meeting. label Apr 8, 2023

bors added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels May 28, 2023

epage force-pushed the wtf branch 2 times, most recently from e2d912b to e6a35c4 Compare May 29, 2023 12:57

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 30, 2023

Dylan-DPC mentioned this pull request May 30, 2023

Rollup of 6 pull requests #112111

Closed

bors added the merged-by-bors This PR was explicitly merged by bors. label May 31, 2023

bors merged commit 9610dfe into rust-lang:master May 31, 2023

rustbot added this to the 1.72.0 milestone May 31, 2023

dtolnay mentioned this pull request May 31, 2023

ACP: std::ffi::c and std::ffi::os submodules rust-lang/libs-team#134

Closed

apiraino removed the to-announce Announce this issue on triage meeting label Jun 15, 2023

epage mentioned this pull request Jul 7, 2023

Allow limited access to OsString bytes #113442

Merged

epage added a commit to epage/rust that referenced this pull request Jul 7, 2023

Allow limited access to OsString bytes

ee604fc

This extends rust-lang#109698 to allow no-cost conversion between `Vec<u8>` and `OsString` as suggested in feedback from `os_str_bytes` crate in rust-lang#111544.

epage deleted the wtf branch July 22, 2023 14:48

This was referenced Nov 12, 2023

Unsafe code - Rust native support available now (?) clap-rs/clap#5209

Closed

[CLI] Refactor all CLI entrypoints to i.e. Clap stacks-network/stacks-core#4045

Open

epage mentioned this pull request Dec 7, 2023

ACP: Pattern methods for OsStr without OsStr patterns rust-lang/libs-team#311

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow limited access to `OsStr` bytes #109698

Allow limited access to `OsStr` bytes #109698

epage commented Mar 28, 2023 •

edited

Loading

rustbot commented Mar 28, 2023

rustbot commented Mar 28, 2023

cuviper commented Mar 28, 2023

ChrisDenton commented Mar 28, 2023

the8472 commented Mar 28, 2023

joshtriplett commented Mar 29, 2023

epage commented Mar 31, 2023 •

edited

Loading

ChrisDenton commented Mar 31, 2023

epage commented Mar 31, 2023 •

edited

Loading

ChrisDenton commented Mar 31, 2023

chorman0773 commented Apr 11, 2023

epage commented Apr 11, 2023 •

edited

Loading

BurntSushi commented Apr 11, 2023

blyxxyz commented Apr 11, 2023

asquared31415 commented Apr 11, 2023 •

edited

Loading

epage commented May 30, 2023

Amanieu commented May 30, 2023

bors commented May 30, 2023

bors commented May 30, 2023

bors commented May 31, 2023

rust-timer commented May 31, 2023

Allow limited access to OsStr bytes #109698

Allow limited access to OsStr bytes #109698

Conversation

epage commented Mar 28, 2023 • edited Loading

rustbot commented Mar 28, 2023

rustbot commented Mar 28, 2023

cuviper commented Mar 28, 2023

ChrisDenton commented Mar 28, 2023

the8472 commented Mar 28, 2023

joshtriplett commented Mar 29, 2023

epage commented Mar 31, 2023 • edited Loading

ChrisDenton commented Mar 31, 2023

epage commented Mar 31, 2023 • edited Loading

ChrisDenton commented Mar 31, 2023

chorman0773 commented Apr 11, 2023

epage commented Apr 11, 2023 • edited Loading

BurntSushi commented Apr 11, 2023

blyxxyz commented Apr 11, 2023

asquared31415 commented Apr 11, 2023 • edited Loading

epage commented May 30, 2023

Amanieu commented May 30, 2023

bors commented May 30, 2023

bors commented May 30, 2023

bors commented May 31, 2023

rust-timer commented May 31, 2023

Overall result: no relevant changes - no action needed

Instruction count

Max RSS (memory usage)

Cycles

Binary size

Allow limited access to `OsStr` bytes #109698

Allow limited access to `OsStr` bytes #109698

epage commented Mar 28, 2023 •

edited

Loading

epage commented Mar 31, 2023 •

edited

Loading

epage commented Mar 31, 2023 •

edited

Loading

epage commented Apr 11, 2023 •

edited

Loading

asquared31415 commented Apr 11, 2023 •

edited

Loading