Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow escaping arbitrary characters #2360

Merged
merged 20 commits into from
Sep 15, 2024
Merged

Allow escaping arbitrary characters #2360

merged 20 commits into from
Sep 15, 2024

Conversation

laniakea64
Copy link
Contributor

Closes #2359

Not sure the error reporting is ideal, and not sure if the tests are covering enough cases?

@casey
Copy link
Owner

casey commented Sep 11, 2024

What do you think about introducing a State enum, which can be declared inside the parse string function, like so:

enum State {
  Initial, // initial state
  Backslash, // after `\`
  Unicode, // after `\u`, but not opening `{`
  UnicodeValue { value: String }, // after `\u{`
}

Then you would do:

let mut state = State::Initial;
while let Some(c) = chars.next() {
  match state {
  }
}

if state != State::Initial {
  // return error
}

I think this might make the code a little simpler.

Also, I think we should see if the error messages from u32::from_str_radix are good enough to use without needing to define our own error type. If they're reasonable, then we could accept any characters between \u{ and }, and let u32::from_str_radix complain if they're not hex, or if they overflow a u32.

@laniakea64
Copy link
Contributor Author

Done. It does simplify the code and looks better, thanks for the suggestions!

we could accept any characters between \u{ and }, and let u32::from_str_radix complain if ... they overflow a u32.

The syntax we're implementing only accepts up to 6 hex digits, while u32 is up to 8 hex digits, so the code needs to keep its own length check.

@casey
Copy link
Owner

casey commented Sep 14, 2024

Looking at the error message from std::num::ParseIntError, I think it's actually not great, so I'm coming around to doing more checking ourselves and not delegating the error message.

I think we should have these errors:

  • UnicodeEscapeEmpty for \u{}
  • UnicodeEscapeCharacter { character: char } for \u{X} where X is anything other than 0-9a-fA-F
  • UnicodeEscapeLength for when there are more than six characters inside an escape sequence
  • UnicodeEscapeRange when the u32 is not in range for a char, the error message should mention that 10FFFF is the maximum valid codepoint
  • UnicodeEscapeUnterminated when we don't encounter } before the end of the string

This should allow us to just .unwrap when calling from_str_radix, since it will be guaranteed to be a valid hex string that's in range for a u32.

Sorry for waffling!

@laniakea64
Copy link
Contributor Author

I think we should have these errors:

But when just encounters an invalid escape sequence, it points at the entire string, which means the errors need more information than that to pinpoint. To make the exact suggested errors awesome, could just please point only at the malformed escape sequence? 🙂

In the mean time, pushed a commit trying to find a middle ground, since digging into why just points at the whole string is more than I have capacity for atm 😦

@casey
Copy link
Owner

casey commented Sep 15, 2024

Compilation error message context is based on tokens, so since a string is one token, by default it points at the whole string. This could definitely be improved. The easiest solution would be to construct a new, fake token which only points at the sub-section of the string containing the invalid escape sequence, but that's probably best left to a follow-up PR.

@casey casey enabled auto-merge (squash) September 15, 2024 09:58
@casey casey merged commit d2c66e9 into casey:master Sep 15, 2024
5 checks passed
@casey
Copy link
Owner

casey commented Sep 15, 2024

Merged! This is great. Tweaked the code a little, plus error messages.

@laniakea64
Copy link
Contributor Author

Tweaked the code a little, plus error messages.

Nice, I like the tweaks, thanks! Shows a few Rust syntaxes I'm not sufficiently familiar with.

@laniakea64 laniakea64 mentioned this pull request Sep 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow escaping arbitrary characters in double-quoted strings
2 participants