RFC: str methods that accept/return byte offsets should have byte in the name #10044

Closed
ben0x539 opened this issue Oct 24, 2013 · 14 comments

Comments

@ben0x539
Contributor

People shouldn't accidentally write code that works only for ASCII input and 'surprisingly' fails when users enter non-ASCII text, so I figure the byte-offset methods shouldn't have the shorter names or be the more obvious choice compared to the char-offset methods.

So, like, len => byte_len, slice => slice_bytes, but also find => find_byte_pos or something.

Of course all the docs already mention that offsets are given in bytes, but it doesn't always occur to me that I might even be passing the wrong kind of offset, so I don't think to look it up in the first place.
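
A minimal sketch of the pitfall being described (in today's syntax): find returns a byte offset, which coincides with a char offset only for ASCII input.

fn main() {
    let ascii = "abc";
    let accented = "äbc";
    // On ASCII, byte offsets and char offsets agree:
    assert_eq!(ascii.find('b'), Some(1));
    // 'ä' is two bytes in UTF-8, so the byte offset (2) no longer
    // matches the char offset (1):
    assert_eq!(accented.find('b'), Some(2));
}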

@ben0x539
Contributor Author

(Or maybe these methods could go away and we could require users to call .as_bytes().slice() and so on ;))

@huonw
Member

huonw commented Oct 24, 2013

(Or maybe these methods could go away and we could require users to call .as_bytes().slice() and so on ;))

The current string methods check that they're only operating on char boundaries, which (&[u8]).slice() does not (and, indeed, cannot).
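
A small illustration of the point (in today's syntax): slicing the str checks char boundaries and panics mid-codepoint, while slicing the byte slice cannot know about them.

fn main() {
    let s = "é";                      // two bytes in UTF-8
    let _bytes = &s.as_bytes()[..1];  // fine: a plain byte slice, no boundaries to respect
    let _str = &s[..1];               // panics: byte index 1 is not a char boundary
}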


From a safety point of view, what about defining a newtype wrapper around uint representing byte indices? E.g. (in non-method form):

struct ByteIndex {
    priv idx: uint
}
impl ByteIndex {
    /* unsafe? */ fn new(idx: uint) -> ByteIndex { ByteIndex { idx: idx } }
}

fn slice<'a>(s: &'a str, from: ByteIndex, to: ByteIndex) -> &'a str { ... }

fn find_char(s: &str, c: char) -> Option<ByteIndex> { ... }

struct CharRange {
    ch: char,
    next: ByteIndex
}

fn char_range_at(s: &str, idx: ByteIndex) -> CharRange { ... }

so one either gets a "blackbox" index from functions like find_char and char_range_at, or one is forced to explicitly create one by hand, with the big "ByteIndex" warning. (This wouldn't allow us to drop the asserts about char boundaries, since one can still create an index by hand, and ByteIndexes are not connected with the string from which they came.)

Cons: it makes manipulating strings by hand more complicated.
Pros: it makes manipulating strings by hand require more thought.
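
For reference, a self-contained version of the sketch above in modern syntax; the implementations here are illustrative, not from any real API:

#[derive(Copy, Clone)]
struct ByteIndex(usize);

impl ByteIndex {
    // The "loud" constructor: the caller promises the index lies on
    // a char boundary of whatever string it is used with.
    fn new(idx: usize) -> ByteIndex { ByteIndex(idx) }
}

fn slice(s: &str, from: ByteIndex, to: ByteIndex) -> &str {
    // Still asserts char boundaries, since a hand-made ByteIndex can
    // point mid-codepoint.
    &s[from.0..to.0]
}

fn find_char(s: &str, c: char) -> Option<ByteIndex> {
    s.find(c).map(ByteIndex)
}

fn main() {
    let s = "héllo";
    let i = find_char(s, 'l').unwrap();               // a "blackbox" index
    assert_eq!(slice(s, ByteIndex::new(0), i), "hé"); // explicit, hand-made index
}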

@jdm
Contributor

jdm commented Oct 24, 2013

Would there be merit to creating a StringIndex type which is either a ByteIndex or CharIndex and getting rid of all the public _byte/_char variants?

@huonw
Member

huonw commented Oct 24, 2013

That might be a nice simplification/unification, but it seems like it could be a source of hidden performance problems if everything returns/takes StringIndexes, as it would be very hard to predict when something will be O(n) (CharIndex) and when it will be O(1) (ByteIndex). Then again, it might not be so bad in practice.

(Another downside: using an enum would push it over one word in size, which would also be slower.)
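
For concreteness, one reading of the suggestion (hypothetical; no such type exists in std), including the size point:

#[allow(dead_code)]
enum StringIndex {
    Byte(usize), // O(1) to use directly
    Char(usize), // O(n): must scan the string to convert to a byte offset
}

fn main() {
    // The discriminant pushes the enum past one word: 16 bytes on a
    // typical 64-bit target, versus 8 for a bare ByteIndex/usize.
    println!("{}", std::mem::size_of::<StringIndex>());
}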

@bstrie
Contributor

bstrie commented Nov 27, 2013

I second the sentiment of putting byte in the name of string methods wherever appropriate. By Rust 1.0 I expect to be able to do the following:

let s = "ë";  // two combined code points
s.byte_len();  // length of the string in bytes (3)
s.char_len();  // length of the string in code points (2)
s.rune_len();  // length of the string in grapheme clusters (1)

All of these have their uses, and assuming that we're going to provide them all, it doesn't make sense to silently and misleadingly bless the byte versions as the default.
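
For reference, the modern spellings of those three lengths look like this; grapheme counting lives in the external unicode-segmentation crate rather than std:

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let s = "e\u{0308}";                       // "ë" as two code points
    assert_eq!(s.len(), 3);                    // bytes
    assert_eq!(s.chars().count(), 2);          // code points
    assert_eq!(s.graphemes(true).count(), 1);  // grapheme clusters
}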

@huonw
Member

huonw commented Nov 27, 2013

s.rune_len();  // length of the string in grapheme clusters (1)

FWIW, Go uses "rune" for what we call char (i.e. code points), so that should be s.grapheme_len() or s.glyph_len() (although we kinda don't actually need them, since s.chars().len() and s.graphemes().len() also work).

@huonw
Member

huonw commented Nov 27, 2013

(NB. s.bytes().len() isn't good, since byte length can be retrieved in O(1) (by reading the length of the backing vector as we do now) but the iterator version is O(n); the other forms are O(n) anyway so using iterators is perfect.)

@lilyball
Contributor

lilyball commented Dec 2, 2013

@huonw: s.chars().len() and s.graphemes().len() may both be O(n), but they're unnecessarily constructing a data structure that is then thrown away. It's both a waste of performance and of memory.

@thestinger
Contributor

@kballard: The chars iterator doesn't have to allocate anything, it just runs the UTF-8 decoder in steps. AFAIK it compiles to the same code as running a UTF-8 decoder directly.

@huonw
Member

huonw commented Dec 2, 2013

@kballard with optimisations LLVM will likely perform SROA (scalar replacement of aggregates, i.e. splitting a struct into its fields) and optimise each part separately, including the actual construction. (Also, the memory cost of .chars() is tiny, like a few tens of bytes; .graphemes is more, but still not more than 100 or so.)

(The only performance problem might be that .chars() actually decodes each char when it could just read the sequence length from the leading byte and skip that many bytes forward; no such optimisation is possible for .graphemes, though.)

@lilyball
Contributor

lilyball commented Dec 2, 2013

@thestinger Oh those are iterators? Then that's fine.

@thestinger
Contributor

@kballard: Yeah, there are still some allocating methods left around, but the iterator ones recently had the _iter suffix dropped.

@japaric
Member

japaric commented Oct 14, 2014

This issue should be addressed (if it hasn't been already) as part of the stabilization of std::str.

cc @aturon

@steveklabnik
Member

This thread isn't the place to track this kind of thing today. Many of these methods have already undergone stabilization, and for those that haven't, that process is the correct place to track it. So I'm giving this a close. If anyone still feels strongly about this, please pursue it in the RFCs repo, thanks.

flip1995 pushed a commit to flip1995/rust that referenced this issue Dec 29, 2022
Fix overflow ICE in large_stack/const_arrays

Change the `maximum_allowed_size` config variable's type from `u64` to `u128`, and convert total array sizes to `u128`, to avoid overflow during multiplication.

Fixes rust-lang#10044

changelog: Fix: [`large_const_arrays`] and [`large_stack_arrays`]: avoid integer overflow when calculating total array size
rust-lang#10103
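
A sketch of why the widening helps (illustrative only, not the actual clippy code): the per-element size times the element count can overflow u64 for pathological arrays, but the product of two u64 values always fits in u128.

fn total_array_size(elem_size: u64, len: u64) -> u128 {
    // Widen both operands before multiplying so the multiplication
    // itself happens in u128 and cannot overflow.
    u128::from(elem_size) * u128::from(len)
}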