Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ACP: Add FromByteStr trait with blanket impl FromStr #287

Open
tgross35 opened this issue Oct 26, 2023 · 18 comments
Open

ACP: Add FromByteStr trait with blanket impl FromStr #287

tgross35 opened this issue Oct 26, 2023 · 18 comments
Labels
api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api

Comments

@tgross35
Copy link

tgross35 commented Oct 26, 2023

Proposal

Problem statement

Many data forms that can be parsed from a string representation do not need UTF-8. Here, FromStr is unnecessarily restrictive because a byte slice &[u8] cannot be parsed directly. Instead, a UTF-8 &str must be constructed to use .parse().

This is inconvenient when working with any raw buffers where one cannot assume that str::from_utf8 will be successful, nor is there any reason to incur the UTF-8 verification overhead. An example is IP addresses, for which there is an unstable from_bytes function: rust-lang/rust#101035

Motivating examples or use cases

Any input data where UTF-8 cannot be guaranteed: stdin, file paths, data from Read, network packets, no_std without UTF tables, any data read one byte at a time, etc.

Any output data that doesn't require specific knowledge of UTF-8: integers, floating point, IP/socket addresses, MAC addresses, UUIDs, etc.

Solution sketch

Add a trait that mirrors FromStr but works with &[u8] byte slices:

// Likely located in core::slice
pub trait FromByteStr: Sized {
    type Err;

    // Required method
    fn from_byte_str(bytes: &[u8]) -> Result<Self, Self::Err>;
}

This will get a corresponding parse on &[u8]

impl<[u8]> {
    pub fn parse<F>(&self) -> Result<F, <F as FromByteStr>::Err>
        where F: FromByteStr
    { /* ... */ }
}

Since &str is necessarily represented as &[u8], we can provide a blanket impl so no types need to be duplicated:

impl<T> FromStr for T where T: FromByteStr {
    type Err = T::Err;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        s.as_bytes().parse()
    }
}

If this is done, almost all types in std that implement FromStr will be able to switch to a FromByteStr implementation:

Alternatives

Open Questions

  • Should the return type and possibly Err be able to reference the source bytes? (ACP: Add FromByteStr trait with blanket impl FromStr #287 (comment) and its followup)
  • Should we name the associated type Error rather than Err? This isn't consistent with FromStr but is more consistent with TryFrom and TryInto, as well as the rest of the ecosystem (where Err is usually only Result::Err, Error is an error type)

Links and related work

What happens now?

This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.

Possible responses

The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):

  • We think this problem seems worth solving, and the standard library might be the right place to solve it.
  • We think that this probably doesn't belong in the standard library.

Second, if there's a concrete solution:

  • We think this specific solution looks roughly right, approved, you or someone else should implement this. (Further review will still happen on the subsequent implementation PR.)
  • We're not sure this is the right solution, and the alternatives or other materials don't give us enough information to be sure about that. Here are some questions we have that aren't answered, or rough ideas about alternatives we'd want to see discussed.
@tgross35 tgross35 added api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api labels Oct 26, 2023
@tgross35
Copy link
Author

@thomcc
Copy link
Member

thomcc commented Oct 26, 2023

FromBytes feels like it might be too general of a name, it implies binary deserialization rather than textual parsing to me.

@tgross35
Copy link
Author

Agreed after rereading it. I had FromByteStr in alternatives and just added FromAscii, do either of those sound like a better fit?

@tgross35 tgross35 changed the title ACP: Add FromBytes trait with default impl to FromStr ACP: Add FromBytes/FromByteStr trait with blanket impl FromStr Oct 26, 2023
@BurntSushi
Copy link
Member

The lack of parsing working on &[u8] has annoyed me on several occasions over the years. This design looks plausible to me.

My inclination is to approve this, but since it was just opened, I figure I'll give it a little time to bake to give others a chance to chime in.

@shepmaster
Copy link
Member

One complaint I’ve heard about FromStr is that the result cannot reference the input. That is, you can’t do &'a str -> MyType<'a>. Is that worth changing here?

@tgross35
Copy link
Author

tgross35 commented Oct 26, 2023

That is an interesting thought. Do you mean a signature like this?

trait FromByteStr<'a>: Sized {
    type Err;
    fn from_byte_str(bytes: &'a [u8]) -> Result<Self, Self::Err>;
}

struct Foo<'a>(&'a [u8]);
impl<'a> FromByteStr<'a> for Foo<'a> {
    type Err = ();
    fn from_byte_str(bytes: &'a [u8]) -> Result<Self, Self::Err> {
        Ok(Self(bytes))
    }
}

impl FromByteStr<'_> for i32 {
    type Err = std::num::ParseIntError;
    fn from_byte_str(bytes: &[u8]) -> Result<Self, Self::Err> {
        std::str::from_utf8(bytes).unwrap().parse()
    }
}

// ...I think static is correct here?
impl<T: FromByteStr<'static>> FromStr for T {
    type Err = T::Err;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        todo!()
    }
}

That seems reasonable to me if we are okay with it not being a 1:1 matchup with FromStr. Which I think is probably okay, the biggest thing IMO is we need to be able to provide a FromStr blanket impl (I think we can with the above). Maybe also nice to use Error instead of Err to be consistent with everything else, but that's a bit of a bike shed....

I think that Err could also be made to optionally depend on 'a but I'm not sure what that signature would look like

@scottmcm
Copy link
Member

I like the naming direction of "from byte str", since it helps emphasize it's like b"127.0.0.1", rather than like [0x7f_u8, 0, 0, 1]. Anything that says "bytes" makes me think from_ne_bytes and friends instead.

@tgross35 tgross35 changed the title ACP: Add FromBytes/FromByteStr trait with blanket impl FromStr ACP: Add FromByteStr trait with blanket impl FromStr Nov 1, 2023
@tgross35
Copy link
Author

tgross35 commented Nov 1, 2023

I updated the proposal to FromByteStr::from_byte_str, I agree that seems like the most unambiguous option.

It seems like everyone is either neutral or in favor, so I'll make a PR for this in the next week just to see how well everything fits together. @shepmaster's question is probably the biggest thing to answer.

@joshtriplett
Copy link
Member

cc @epage ; would this work for clap, as a basis for parsing user types without going through strings?

@joshtriplett
Copy link
Member

We discussed this in today's @rust-lang/libs-api meeting. We think it's important to distinguish between two possible interpretations of this:

  1. As purely an optimization, where the string is expected to be "unvalidated but expected to be valid UTF-8", and this trait just avoids doing the work of validating UTF-8 when that isn't needed.
  2. As a trait for parsing things from byte strings that aren't necessarily UTF-8. Case study: impl FromByteStr for std::os::unix::net::SocketAddr, which could accept an arbitrary file path.

We do still need to make it clear that this is parsing text, not binary; IpAddr is a good example of that, as @scottmcm noted. Parsing file paths or similar would be acceptable in case 2 above but not in case 1, but parsing binary would always be a misuse.

Please consider those two cases, document the methods accordingly, and we'll re-evaluate it accordingly. Thank you.

@joshtriplett
Copy link
Member

Speaking for myself here (not the team consensus): I think case 2 is something that there's pent-up demand for in the ecosystem, and it seems unlikely that we'd be able to enforce exclusively case 1. I think there's value in providing and embracing case 2.

@epage
Copy link

epage commented Nov 14, 2023

@joshtriplett maybe?

clap would be working with &OsStr. We'd used OsStr::as_encoded_bytes with FromByteStr. If the FromBytesStr impl wants to interpret the non-UTF8 parts, the only tools they have are

  • OsStr::from_encoded_bytes_unchecked: except they won't be able to guarantee that they are only being called with a &str or &OsStr to meet with safety requirements
  • Platform specific code (acting as if OsStrExt::as_bytes was used instead of OsStr::as_encoded_bytes)

If someone made impl FromByteStr for &OsStr (or added &OsStr::from_encoded_bytes), then it would likely be more usable for clap. When we added the encoded_bytes API, that was deferred out.

@clarfonthey
Copy link

I'm mostly against this because I think that something better should be designed than FromStr. Basically:

  1. TryFrom covers a lot of the cases where people just want a falliable conversion from a string, and it works with lifetimes in rather nice ways (both the error and the result can depend on the lifetime of the input).
  2. Some things would like to distinguish "conversion from string" and "parsing from string" in a way that would affect things. For example, a JSON value could be converted from a string (think: From or TryFrom) by creating a JSON string, but it could be parsed from a string which would yield a different result.
  3. Single-string parsing is a relatively inflexible case that falls apart even in simple cases. For example, without the internals of the parser for IpAddr, for example, it's difficult to implement a parser for SocketAddr without hacks.

Overall, this motivates me to conclude that a more generic parsing trait would be a fit replacement for FromStr, rather than simply adding in extra versions of TryFrom that perform particular parsing concerns.

At one point I was trying to experiment with what a useful API would look like for this, and this is about as far as I got. It's not complete or battle-tested, which is why I haven't submitted any ACPs or RFCs yet, but it's probably a decent example of why I think this approach as-is isn't very desirable.

@BurntSushi
Copy link
Member

@clarfonthey I'm not sure that direction is the right path for std. That's a ton of infrastructure and surface area. (I realize what you have is a draft, so I'm speaking in broad strokes here.) I think there's room for a simpler solution that gets us 80% of the way there in std.

@clarfonthey
Copy link

clarfonthey commented Nov 16, 2023

What I have is 100% a draft, and that's the main reason why I haven't shared it until now. ;)

Honestly, the only reason I share it at all is because I want to demonstrate that an alternative is possible, it just needs some extra work.

I mostly share what I have to demonstrate the kind of direction I was going in. You're right that the goal should be simple, and that's why I think the current version I have is insufficient.

However, I think that there are a two simple operations which should be possible under a theoretical final version:

  1. Continuing on insufficient input by providing more input
  2. Stopping instead of erroring on extraneous input (and letting you continue where you left off)

Neither of these is a particularly big ask for any particular implementation, and they don't provide the ability to create complex parser combinators on their own. However, they're both vital to a proper parser implementation, and I don't think that anyone writing their own parsers should have to reimplement libstd parsers (for things like IP addresses) or hack in weird cases just because the API isn't really compatible.

To counter the API surface: I wouldn't consider these much more complex than iterators: the potential API surface is large, but it doesn't have to be. I think that an ideal API would be one that offers a relatively simple trait that's implemented by libstd types, but leaves out all the extra adapters and fancy methods to ecosystem crates.

@clarfonthey
Copy link

I decided to actually write up #296 to explain my design process behind the example I shared, particularly what features I felt would be useful to have from a potential design and why it was designed the way it was. If you follow the motivation, you'll get something very similar to what I designed, with a few potential changes.

It mostly runs counter to this ACP since it explicitly explains why I don't think that adding differently-named copies of TryFrom is a good solution, and what the solution should look like. But, it's worth knowing that said solution will inevitably be much more complicated than this solution, and that might be a reason to look in this direction after all.

Ultimately, hopefully it'll help us decide what we finally want to do with FromStr, since I truly believe that it should be deprecated either way. (Unless you can add GATs retroactively, which doesn't feel like a thing.)

@epage
Copy link

epage commented Nov 17, 2023

While I'm unsure if I personally have a use case for FromByteStr, I feel like it has a distinct role from TryFrom, just like FromStr and it offers a small, well confined API that is fairly easy to stabilize.

With #296, you are effectively proposing taking on the core of a full, general purpose parser into the standard library. I feel like a case would need to be made for why this is important vs third-party, why a specific design is picked (especially a new design) compared to the different third-party parser trait designs, etc. As an outsider to the libs team, I assume this would need to be vetted in a third-party package and likely collaborated on with parser maintainers. This feels like a completely different beast than FromByteStr in what it is trying to solve and has a long road to stabilization that I don't see it justifiable to block smaller forms of progress on it.

@clarfonthey
Copy link

clarfonthey commented Nov 18, 2023

The thing is, what exactly does FromByteStr gain exactly by being added to the standard library? It's very similar to TryFrom, and it's unclear why this trait particularly helps the standard library over something offered by another crate.

Speaking of which… I believe that most things that implement FromStr don't even have standalone methods that allow parsing from byte strings, which probably should be rectified. For example, from_str_radix exists but from_byte_str_radix doesn't.

Standalone methods for IP addresses and such would also be nice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api
Projects
None yet
Development

No branches or pull requests

8 participants