RFC: Add byte and byte string literals #69

Merged
merged 3 commits into from
Jun 4, 2014

SimonSapin
Contributor

No description provided.

@SimonSapin
Contributor Author

Previously: rust-lang/rust#4334

Apparently, GitHub’s auto-linking does not apply when rendering in-repo Markdown files.
byte string literals of type `&'static [u8]` (or `[u8]`, post-DST).
They are identical to the existing character and string literals, except that:

* They are prefixed with a `b` (for "binary"), to distinguish them
Member

-1 for b as a prefix - I don't see anything more or less binary about these chars/strs than regular ones

Contributor Author

b is taken from Python, but I’m not especially attached to it. I’d be fine with another syntax. How about one of these? a'\t' (a for ASCII), '\t'u8 (the latter doesn’t really work for strings, though)

Member

I prefer the last one. Why wouldn't it work for strings?

@chris-morgan
Member

+1 all round.

+1 for raw strings, though I would use br"" rather than rb"" for consistency with Python. (The arbitrary decision is made in Python that the order is br and not rb; similarly for Unicode string literals, ur and not ru.)

+1 for removing bytes!. It’s become fairly useless anyway with that 'static lifetime issue.

@nick29581 with raw strings having come since the discussion in rust-lang/rust#4334, b"…" is now more consistent rather than less as it was at the time. For 't'u8 vs. b't', there’s still precedent either way.
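For readers comparing `b't'` with `'t'u8`, here is a minimal sketch of the `b` prefix as it works in present-day Rust. The helper names `is_get_request` and `count_tabs` are made up for illustration and appear nowhere in the RFC.

```rust
/// Returns true when `input` starts with the ASCII bytes "GET ".
/// The byte string literal `b"GET "` is a reference to a fixed-size
/// array of `u8` that coerces to a byte slice.
fn is_get_request(input: &[u8]) -> bool {
    input.starts_with(b"GET ")
}

/// Counts tab bytes. The byte literal `b'\t'` is a plain `u8` (value 9).
fn count_tabs(input: &[u8]) -> usize {
    input.iter().filter(|&&byte| byte == b'\t').count()
}
```

Both helpers take `&[u8]`, so they accept byte string literals directly, for example `is_get_request(b"GET / HTTP/1.1")`.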

@nrc
Member

nrc commented May 6, 2014

I didn't know we had support for raw strings, so I feel a bit better about a b prefix now.

# Unresolved questions

Should there be "raw byte string" literals?
E.g. `pdf_file.write(rb"<< /Title (FizzBuzz \(Part one\)) >>")`
Member

Python precedent is for allowing br and forbidding rb (syntax error). Also: yes.

@Aatch
Contributor

Aatch commented May 6, 2014

I strongly support this RFC. I was actually planning on writing almost exactly the same RFC myself, so thanks @SimonSapin.

@Valloric

Valloric commented May 6, 2014

Very strong +1. Every day I spend writing Rust I wish it had byte string literals.

@jsanders

jsanders commented May 6, 2014

👍 This seems really nice, regardless of the specific syntax it ends up being.

@bstrie
Contributor

bstrie commented May 6, 2014

I too was going to argue about syntax, but the precedent from Python is good enough for me. +1 on all fronts.

An extra +1 to enforcing br"foo" and disallowing rb"foo". This also makes raw strings look nicer in their extended form: br###"foo"### rather than r###b"foo"###. Please include this in the RFC.
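A short sketch of the `br` order discussed above, written against present-day Rust, where raw byte strings use the `br` prefix. The constant names are illustrative only.

```rust
/// The PDF fragment from the RFC as a raw byte string:
/// backslashes are kept as literal bytes.
const PDF_TITLE_RAW: &[u8] = br"<< /Title (FizzBuzz \(Part one\)) >>";

/// The same bytes as a regular byte string, with each backslash escaped.
const PDF_TITLE_ESCAPED: &[u8] = b"<< /Title (FizzBuzz \\(Part one\\)) >>";

/// The extended form with `#` delimiters allows an embedded double quote.
const QUOTED: &[u8] = br#"say "hi""#;
```

The two PDF constants hold identical bytes. The raw form only saves the escaping.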

@ben0x539

ben0x539 commented May 6, 2014

Do we really want to 'overload' \x for this? Can we use another escape sequence? If so, we could allow \x, \u and \U in byte strings...

@edwardw

edwardw commented May 9, 2014

Any chance to borrow some binary pattern matching stuff from Erlang? I find it very powerful and pleasant to use at the same time, e.g. Erlang bit syntax.

@SimonSapin
Contributor Author

@edwardw That sounds like a separate RFC. Maybe #29?

@SimonSapin
Contributor Author

@ben0x539

Do we really want to 'overload' \x for this?

I do. It follows the precedent of other languages of \x meaning one byte in a byte context.

If so, we could allow \x, \u and \U in byte strings...

It was deliberate to exclude \u and \U in this RFC. What would they even mean?

@ben0x539

@SimonSapin

I do. It follows the precedent of other languages of \x meaning one byte in a byte context.

Yeah, but it means \x means rather different things in either flavor of string literal. It doesn't follow the precedent of this very same language. :(

It was deliberate to exclude \u and \U in this RFC. What would they even mean?

What they mean in regular string literals. I mean, really byte string literals are just regular string literals without the UTF-8 invariant and hence a different type, the syntax doesn't need to be completely different.

@SimonSapin
Contributor Author

Yeah, but it means \x means rather different things in either flavor of string literal.

I don’t see a problem here. This difference is precisely what makes byte literals different from Unicode literals in the first place…

What they mean in regular string literals

Meaning "Just assume UTF-8". I’m opposed to this. The point of working with bytes rather than Unicode is that you don’t necessarily know the encoding (other than it’s ASCII-compatible), so assuming a particular encoding is not appropriate. It could cause mojibake or other related bugs.

I suppose we have a different vision of what str is. You seem to think of it as a byte string that just happens to hold an invariant of being a valid UTF-8 sequence. I think of it as a sequence of Unicode scalar values (roughly: code points) that just happen to be represented in memory as UTF-8 bytes.
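The distinction between a byte and a code point can be made concrete. This sketch uses present-day Rust syntax (`\u{FF}` rather than the `\u00FF` form in use when this thread was written), and the constant names are illustrative.

```rust
/// One byte with value 0xFF: `\x` in a byte string denotes a single byte.
const BYTE_FF: &[u8] = b"\xFF";

/// The code point U+00FF in a Unicode string. It is one scalar value,
/// stored in memory as two UTF-8 bytes (0xC3 0xBF).
const CHAR_FF: &str = "\u{FF}";
```

`BYTE_FF` on its own is not valid UTF-8, which is why it can only exist as `[u8]` and not as `str`.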

@pcwalton
Contributor

I like this as a potential solution for paths.

@Valloric

\x should be removed in non-byte string literals. I've complained about this before in rust-lang/rust#12769 but now it makes even more sense since it will work as intended in byte string literals and will be actively harmful in utf8 string literals. The only thing \x will do there is confuse users and produce bugs since people will adapt algorithms from C, C++ etc and then forget to add the b prefix.

Removing \x from utf8 literals would prevent a whole series of possible bugs without removing a single shred of functionality because \xXX in utf8 literals is the same as \u00XX.

@SimonSapin
Contributor Author

@pcwalton what about paths? Filenames on Unix are fundamentally bytes that should only be interpreted in some encoding (nowadays often UTF-8, but not always, if you have an external hard drive from 1995). But on Windows they’re UTF-16. (Or maybe UCS-2.) I don’t see how byte literals would help std::path.

@SimonSapin
Contributor Author

\x should be removed in non-byte string literals

How about restricting it (for Unicode literals) to the ASCII range, where it maps to a single UTF-8 byte?

@Valloric

How about restricting it (for Unicode literals) to the ASCII range, where it maps to a single UTF-8 byte?

I can live with that.

My main concern is people writing something like \xFF in 20 different languages and getting one thing and then writing that in Rust and getting another. I've personally been bitten by this bug.

So if we can restrict the range allowed by \x in utf8 literals to produce byte values that the same \x sequence would produce in a byte string, that's fine. We should then also consider adding a nice compiler error message saying something like "prefix your literal with b to make it a byte string" when the user uses \x outside the allowed range in utf8 literals.

@Valloric

How about restricting it (for Unicode literals) to the ASCII range, where it maps to a single UTF-8 byte?

One thing though... what purpose would that serve? If we restrict it to the ASCII range, you might as well write a instead of \x61.

@SimonSapin
Contributor Author

One thing though... what purpose would that serve?

Same as removing it: avoid the debate of rust-lang/rust#2800

If we restrict it to the ASCII range, you might as well write a instead of \x61.

Yeah of course. But you may still want some of the "non-printable" code points of the ASCII range: U+0000 to U+001F and U+007F.
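A sketch of the ASCII-range restriction, assuming present-day Rust, where `\x` escapes in Unicode string literals are limited to `\x00` through `\x7F`. The constant names are illustrative.

```rust
/// ASCII control code points written with `\x` in a Unicode string literal.
/// Each one is a single code point and a single UTF-8 byte.
const CONTROLS: &str = "\x00\x1F\x7F";

/// The same escapes in a byte string literal produce the same three bytes.
const CONTROL_BYTES: &[u8] = b"\x00\x1F\x7F";
```

Within this range, `\x` produces the same bytes in both kinds of literal, which avoids the `\xFF` surprise described above.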

@brson
Contributor

brson commented Jun 4, 2014

Thank you for the contribution. Accepted as RFC 23, per https://github.com/mozilla/rust/wiki/Meeting-weekly-2014-06-03. cc rust-lang/rust#14646

@SimonSapin SimonSapin deleted the ascii-literals branch June 4, 2014 23:18
@SimonSapin
Contributor Author

For the record, I realized while implementing this that the combination of decisions in this RFC has two consequences I did not anticipate:

  • Byte literals can be used anywhere u8 can, even if it looks nonsensical. E.g.

    assert_eq!([42, ..b'\t'].as_slice(), &[42, 42, 42, 42, 42, 42, 42, 42, 42]);
  • Since unescaped characters in byte strings are limited to ASCII and raw byte strings do not have escapes, it is not possible to write a raw byte string containing non-ASCII bytes.
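The first consequence can be restated in present-day syntax, where the repeat expression is written `[value; count]` and the count must be a `usize`. This is a sketch, and the constant name is made up.

```rust
/// Nine copies of 42: the repeat count is the byte literal `b'\t'`
/// (value 9), cast to `usize` as the current array syntax requires.
const NINE_ANSWERS: [u8; b'\t' as usize] = [42; b'\t' as usize];
```

It compiles because `b'\t'` is an ordinary `u8` constant, however odd it looks as a length.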

@Valloric

Byte literals can be used anywhere u8 can, even if it looks nonsensical

Doesn't sound like a big deal.

Since unescaped characters in byte strings are limited to ASCII and raw byte strings do not have escapes, it is not possible to write a raw byte string containing non-ASCII bytes.

Seems reasonable enough; the limitation is there only for raw byte strings, not plain byte strings. And if you want to put unicode chars in a byte string, you are using escapes either way, so there obviously isn't any sensible reason to want to use a raw byte string.

In other words, the user can't have it both ways; you can't say "I want \uABCD to work but \t to be left alone." The compiler can't read their mind.

@ben0x539

with the bytes!() macro you could at least switch between raw and cooked strings and plain u8 numbers etc and it'd all get mashed together...

@SimonSapin
Contributor Author

@ben0x539 The plan is to remove bytes!(), since it’s redundant with byte string literals.

@ben0x539

Yeah, what I'm saying is that it isn't entirely because it lets you combine differently typed things into a single block of bytes.

bors added a commit to rust-lang/rust that referenced this pull request Jun 18, 2014
See #14646 (tracking issue) and rust-lang/rfcs#69.

This does not close the tracking issue, as the `bytes!()` macro still needs to be removed. It will be later, after a snapshot is made with the changes in this PR, so that the new syntax can be used when bootstrapping the compiler.
@netvl

netvl commented Jul 1, 2014

I noticed that current Rust nightly shows deprecation warning on bytes!() macro usage. But how do I write this literal construction without giving up readability?

        static DATA: &'static [u8] = bytes!(
            0, 0, 0, 0, 0, 0, 0, 3,  // # of paths
            0, 8, "/a/b/c/d",
            0, 0,   // theoretically possible
            0, 1, "/"
        );

With byte string literals it will look like this:

    static DATA: &'static [u8] = b"\0\0\0\0\0\0\0\x03\0\x08/a/b/c/d\0\0\0\x01/";

It looks awful compared to bytes!() variant. I think that bytes!() macro should be kept to allow things like these.

@emberian
Member

emberian commented Jul 1, 2014

Yeah, I'm not quite convinced that we should remove bytes!() either.

@SimonSapin
Contributor Author

@netvl Like in Unicode strings, you can use "escaped newlines", which resolve to nothing: (note the backslashes at the very end of lines.)

    static DATA: &'static [u8] = b"\0\0\0\0\0\0\0\x03\
                                   \0\x08/a/b/c/d\
                                   \0\0\
                                   \0\x01/";

This is only half of what you asked for, in that you can’t have comments in the middle of a literal, but I have to say this looks very unusual. Also, out of context I have no idea what this data represents, so I don’t know what syntax makes sense to you.

Perhaps we could use Python’s idea that consecutive (byte) string literals are concatenated:

    static DATA: &'static [u8] = b"\0\0\0\0\0\0\0\x03"  // # of paths
                                 b"\0\x08/a/b/c/d"
                                 b"\0\0"  // theoretically possible
                                 b"\0\x01/";

Removing bytes!() was one of the "Unresolved questions" since the first revision of this RFC, and the only feedback I got was "Yes". This RFC has since been accepted and implemented. I suggest filing separate issues or RFCs if you want further changes.

@netvl

netvl commented Jul 2, 2014

Escaped newlines certainly make things better, but still bytes!() version is far more readable :(

I'm not sure whether a suggestion to keep bytes!() should be filed as an RFC. As for creating an issue, do you mean rfcs repo or rust itself? As far as I remember, there is no defined process for such things yet.

@SimonSapin
Contributor Author

I meant RFCs on this repo or issues on the rust repo. I don’t know which is more appropriate in this case. Maybe chat on IRC with one of the core team to see what they prefer.

To recap:

  • bytes!() was originally created to express &'static [u8] values based on text rather than numbers.
  • Byte string literals solve the same problem, but IMO better.
  • Since they’re redundant, I believe the original, more hackish solution should be removed.
  • The only case I know of where bytes!() is better is when you want to have non-significant whitespace and comments between parts of the same &'static [u8] value. The other aspect of bytes!() (converting various data types to [u8]) is not so important here.
  • I can think of a number of ways to address this use case:
    • Keep bytes!(). I think this is overkill.
    • Python-style concatenated literals that I mentioned above. (Language change.)
    • Add a new concat_bytes!() macro (or named something else) similar to concat!(), but that only takes and returns &'static [u8] literals.
    • Change the existing concat!() macro to return &'static [u8] when all its arguments are &'static [u8], instead of always returning &'static str.

But as I said, this RFC is done as far as I’m concerned. It’ll be up to you to champion something else through the process. I think anything based on macros is more likely to get accepted than a language change.
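To show what a library-level answer to the whitespace-and-comments use case might look like, here is a rough sketch in present-day Rust. `concat_parts` is a hypothetical helper written for this illustration. It is not the `concat_bytes!()` macro suggested above and not a standard library API, and it relies on `const fn` features that did not exist when this thread was written.

```rust
/// Hypothetical helper: copies every part into one fixed-size array at
/// compile time. `N` must equal the total length of all parts.
const fn concat_parts<const N: usize>(parts: &[&[u8]]) -> [u8; N] {
    let mut out = [0u8; N];
    let mut written = 0;
    let mut p = 0;
    while p < parts.len() {
        let part = parts[p];
        let mut i = 0;
        while i < part.len() {
            out[written] = part[i];
            written += 1;
            i += 1;
        }
        p += 1;
    }
    // Fails the build if the declared length does not match the parts.
    assert!(written == N);
    out
}

static DATA: [u8; 23] = concat_parts(&[
    b"\0\0\0\0\0\0\0\x03", // # of paths
    b"\0\x08/a/b/c/d",
    b"\0\0", // theoretically possible
    b"\0\x01/",
]);
```

The cost of this approach is that the total length has to be spelled out in the type of `DATA`. A macro could compute it instead.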

@Valloric

Valloric commented Jul 5, 2014

There's one use case for bytes!() that I think isn't addressed by the current byte literals support, and that's writing non-ASCII chars as a &'static [u8] UTF-8 string. For instance, I have the following in my codebase: bytes!( "葉" ) and I don't see a way to write that as a byte literal while keeping the original character in the code. This becomes an even bigger problem when you have something like bytes!( "ελληνικά" ). Escaping all the chars by hand makes it extremely hard to comprehend the original text.

@SimonSapin
Contributor Author

If it’s not in a "static" context, you can use "ελληνικά".as_bytes().

May I ask why &[u8] is preferred over &str for text that is known to be UTF-8?

@Valloric

Valloric commented Jul 5, 2014

If it’s not in a "static" context, you can use "ελληνικά".as_bytes().

But I need it in a static context.

May I ask why &[u8] is preferred over &str for text that is known to be UTF-8?

Precisely because the text isn't known to be UTF-8. Many existing APIs just manipulate byte sequences without caring what's in them. Sometimes those APIs will get ASCII data, sometimes UTF-8, sometimes binary data. An existing networking API for example would be wrapped as accepting a &[u8], not a &str.

My current use case involves talking to an API that takes &[u8], and I have macros and other test code that needs a 'static lifetime. But that's beside the point; Rust needs a way to cover the general use-case of "non-ASCII text in the source code as a sequence of UTF-8 bytes with static lifetime".

There needs to be some way to handle that, otherwise there's a hole. You can get non-ASCII text as &[u8] but without the static bound, and you can get non-ASCII text as 'static or not but the type is &str. We have 3 out of 4 instead of 4 out of 4.

@SimonSapin
Contributor Author

I believe that "ελληνικά".as_bytes() also has a static lifetime. By static context I meant in the initializer for a static item rather than a static lifetime. This currently doesn’t work because there is no compile-time evaluation of functions.
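For later readers: the limitation described here no longer applies in present-day Rust, where `str::as_bytes` is a `const fn` and can be called in the initializer of a static item. A minimal sketch, with made-up names for the statics:

```rust
/// UTF-8 bytes of non-ASCII text in static items, with the original
/// characters kept readable in the source.
static GREEK: &'static [u8] = "ελληνικά".as_bytes();
static LEAF: &'static [u8] = "葉".as_bytes();
```

Both values have type `&'static [u8]` and can be passed to APIs that take `&[u8]`.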

That said, I won’t try to block the un-deprecation of bytes!() more than expressing the opinions I already have. But it’s not me you need to convince at this point. I suggest making a new RFC.

@brson
Contributor

brson commented Jul 9, 2014

@SimonSapin Is there a followup we need to do here to fix the issues you identified?

@Centril Centril added A-syntax Syntax related proposals & ideas A-expressions Term language related proposals & ideas A-string Proposals relating to strings. labels Nov 23, 2018