RFC: str methods that accept/return byte offsets should have byte in the name #10044

Closed
ben0x539 opened this issue Oct 24, 2013 · 14 comments

Comments

@ben0x539
Contributor

People shouldn't accidentally write code that works only for ASCII input and 'surprisingly' fails when users enter non-ASCII text, so I figure the byte-offset methods shouldn't have the shorter names or be the more obvious choice compared to the char-offset methods.

So, like, len => byte_len, slice => slice_bytes, but also find => find_byte_pos or something.

Of course all the docs already mention that offsets are given in bytes, but it doesn't always occur to me that I might even be passing the wrong kind of offset, so I don't think to look it up in the first place.
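
A minimal sketch of the pitfall being described (in today's syntax): find returns a byte offset, which coincides with a char offset only for ASCII input.

fn main() {
    let ascii = "abc";
    let accented = "äbc";
    // On ASCII, byte offsets and char offsets agree:
    assert_eq!(ascii.find('b'), Some(1));
    // 'ä' is two bytes in UTF-8, so the byte offset (2) no longer
    // matches the char offset (1):
    assert_eq!(accented.find('b'), Some(2));
}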

@ben0x539
Contributor Author

(Or maybe these methods could go away and we could require users to call .as_bytes().slice() and so on ;))

@huonw
Member

huonw commented Oct 24, 2013

(Or maybe these methods could go away and we could require users to call .as_bytes().slice() and so on ;))

The current string methods check that they're only operating on char boundaries, which (&[u8]).slice() does not (and, indeed, cannot).
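
A small illustration of the point (in today's syntax): slicing the str checks char boundaries and panics mid-codepoint, while slicing the byte slice cannot know about them.

fn main() {
    let s = "é";                      // two bytes in UTF-8
    let _bytes = &s.as_bytes()[..1];  // fine: a plain byte slice, no boundaries to respect
    let _str = &s[..1];               // panics: byte index 1 is not a char boundary
}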


From a safety point of view, what about defining a newtype wrapper around uint representing byte indices? E.g. (in non-method form):

struct ByteIndex {
    priv idx: uint
}
impl ByteIndex {
    /* unsafe? */ fn new(idx: uint) -> ByteIndex { ByteIndex { idx: idx } }
}

fn slice<'a>(s: &'a str, from: ByteIndex, to: ByteIndex) -> &'a str { ... }

fn find_char(s: &str, c: char) -> Option<ByteIndex> { ... }

struct CharRange {
    ch: char,
    next: ByteIndex
}

fn char_range_at(s: &str, idx: ByteIndex) -> CharRange { ... }

so one either gets a "blackbox" index from functions like find_char and char_range_at, or one is forced to explicitly create one by hand, with the big "ByteIndex" warning. (This wouldn't allow us to drop the asserts about char boundaries, since one can still create an index by hand, and ByteIndexes are not connected with the string from which they came.)

Cons: it makes manipulating strings by hand more complicated.
Pros: it makes manipulating strings by hand require more thought.
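
For reference, a self-contained version of the sketch above in modern syntax; the implementations here are illustrative, not from any real API:

#[derive(Copy, Clone)]
struct ByteIndex(usize);

impl ByteIndex {
    // The "loud" constructor: the caller promises the index lies on
    // a char boundary of whatever string it is used with.
    fn new(idx: usize) -> ByteIndex { ByteIndex(idx) }
}

fn slice(s: &str, from: ByteIndex, to: ByteIndex) -> &str {
    // Still asserts char boundaries, since a hand-made ByteIndex can
    // point mid-codepoint.
    &s[from.0..to.0]
}

fn find_char(s: &str, c: char) -> Option<ByteIndex> {
    s.find(c).map(ByteIndex)
}

fn main() {
    let s = "héllo";
    let i = find_char(s, 'l').unwrap();               // a "blackbox" index
    assert_eq!(slice(s, ByteIndex::new(0), i), "hé"); // explicit, hand-made index
}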

@jdm
Contributor

jdm commented Oct 24, 2013

Would there be merit to creating a StringIndex type which is either a ByteIndex or CharIndex and getting rid of all the public _byte/_char variants?

@huonw
Member

huonw commented Oct 24, 2013

That might be a nice simplification/unification, but it seems like it could be a source of hidden performance problems if everything returns/takes StringIndexes, as it would be very hard to predict when something will be O(n) (CharIndex) and when it will be O(1) (ByteIndex). Then again, it might not be so bad in practice.

(Another downside: using an enum would push it over one word in size, which would also be slower.)
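
For concreteness, one reading of the suggestion (hypothetical; no such type exists in std), including the size point:

#[allow(dead_code)]
enum StringIndex {
    Byte(usize), // O(1) to use directly
    Char(usize), // O(n): must scan the string to convert to a byte offset
}

fn main() {
    // The discriminant pushes the enum past one word: 16 bytes on a
    // typical 64-bit target, versus 8 for a bare ByteIndex/usize.
    println!("{}", std::mem::size_of::<StringIndex>());
}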

@bstrie
Contributor

bstrie commented Nov 27, 2013

I second the sentiment of putting byte in the name of string methods wherever appropriate. By Rust 1.0 I expect to be able to do the following:

let s = "ë";  // two combined code points
s.byte_len();  // length of the string in bytes (3)
s.char_len();  // length of the string in code points (2)
s.rune_len();  // length of the string in grapheme clusters (1)

All of these have their uses, and assuming that we're going to provide them all, it doesn't make sense to silently and misleadingly bless the byte versions as the default.
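
For reference, the modern spellings of those three lengths look like this; grapheme counting lives in the external unicode-segmentation crate rather than std:

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let s = "e\u{0308}";                       // "ë" as two code points
    assert_eq!(s.len(), 3);                    // bytes
    assert_eq!(s.chars().count(), 2);          // code points
    assert_eq!(s.graphemes(true).count(), 1);  // grapheme clusters
}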

@huonw
Member

huonw commented Nov 27, 2013

s.rune_len();  // length of the string in grapheme clusters (1)

FWIW, Go uses "rune" for what we call char (i.e. code points), so that should be s.grapheme_len() or s.glyph_len() (although we kinda don't actually need them, since s.chars().len() and s.graphemes().len() also work).

@huonw
Member

huonw commented Nov 27, 2013

(NB. s.bytes().len() isn't good, since byte length can be retrieved in O(1) (by reading the length of the backing vector as we do now) but the iterator version is O(n); the other forms are O(n) anyway so using iterators is perfect.)

@lilyball
Contributor

lilyball commented Dec 2, 2013

@huonw: s.chars().len() and s.graphemes().len() may both be O(n), but they're unnecessarily constructing a data structure that is then thrown away. It's both a waste of performance and of memory.

@thestinger
Contributor

@kballard: The chars iterator doesn't have to allocate anything, it just runs the UTF-8 decoder in steps. AFAIK it compiles to the same code as running a UTF-8 decoder directly.

@huonw
Member

huonw commented Dec 2, 2013

@kballard with optimisations LLVM will likely perform SROA (scalar replacement of aggregates, i.e. splitting a struct into its fields) and optimise each part separately, including the actual construction. (Also, the memory cost of .chars() is tiny, like a few tens of bytes; .graphemes is more, but still not more than 100 or so.)

(The only performance problem might be that .chars() actually decodes each char when it could just read the sequence length from the leading byte and skip that many bytes forward; no such optimisation is possible for .graphemes, though.)

@lilyball
Contributor

lilyball commented Dec 2, 2013

@thestinger Oh those are iterators? Then that's fine.

@thestinger
Contributor

@kballard: Yeah, there are still some allocating methods left around, but the iterator ones recently had the _iter suffix dropped.

@japaric
Member

japaric commented Oct 14, 2014

This issue should be addressed (if it hasn't been already) as part of the stabilization of std::str.

cc @aturon

@steveklabnik
Member

This thread isn't the place to track this kind of thing today. Many of these methods have already undergone stabilization, and for those that haven't, that process is the correct place to track it. So I'm giving this a close. If anyone still feels strongly about this, please pursue it in the RFCs repo, thanks.

flip1995 pushed a commit to flip1995/rust that referenced this issue Dec 29, 2022
Fix overflow ICE in large_stack/const_arrays

Change the `maximum_allowed_size` config variable's type from `u64` to `u128`, and convert total array sizes to `u128`, to avoid overflow during multiplication.

Fixes rust-lang#10044

changelog: Fix: [`large_const_arrays`] and [`large_stack_arrays`]: avoid integer overflow when calculating total array size
rust-lang#10103
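
A sketch of why the widening helps (illustrative only, not the actual clippy code): the per-element size times the element count can overflow u64 for pathological arrays, but the product of two u64 values always fits in u128.

fn total_array_size(elem_size: u64, len: u64) -> u128 {
    // Widen both operands before multiplying so the multiplication
    // itself happens in u128 and cannot overflow.
    u128::from(elem_size) * u128::from(len)
}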