
Lexer validate UTF-8 #93

Open
overlookmotel opened this issue Aug 22, 2024 · 1 comment

overlookmotel commented Aug 22, 2024

Currently the parser takes source text as a &str.

This is fine within Oxc, but it imposes a cost on the consumer in the typical case where they're reading the source text from a file. Typically one would use let source_text = std::fs::read_to_string(path);. This has a hidden cost: it performs UTF-8 validation, which is not cheap. read_to_string uses std::str::from_utf8 internally, which is not very efficient; it's not even SIMD-accelerated.
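To illustrate the point: read_to_string is essentially fs::read followed by a UTF-8 validation pass over the whole buffer. A minimal sketch (using a byte literal as a stand-in for fs::read(path), so it runs without touching the filesystem):

```rust
fn main() {
    // Stand-in for `std::fs::read(path)` - reading raw bytes does no validation.
    let bytes: Vec<u8> = b"let x = 1;".to_vec();

    // `read_to_string` performs this validation pass internally; a parser
    // that accepted `&[u8]` directly could fold it into lexing instead.
    let source_text = std::str::from_utf8(&bytes).expect("source was not valid UTF-8");

    assert_eq!(source_text, "let x = 1;");
}
```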

Thanks to @lucab's efforts in oxc-project/oxc#4298 and oxc-project/oxc#4304, the lexer now (mostly) processes input on a byte-by-byte basis, rather than char-by-char. So now it would not be difficult to perform UTF-8 validation at the same time as lexing.

We already have separate code paths for handling Unicode chars, and ASCII text needs no validation at all, so I imagine adding UTF-8 validation would cost nothing on the fast ASCII path, and very little on the Unicode paths (which are very rarely taken anyway). And if we add support for UTF-16 spans (oxc-project/oxc#959) we'd need logic to handle Unicode bytes anyway, so then UTF-8 validation on top of that would be almost entirely free.
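A hypothetical sketch of what validating a multi-byte sequence in the lexer's Unicode path might look like (names are illustrative, not Oxc's actual API). ASCII bytes (< 0x80) never reach this function, so the fast path pays nothing:

```rust
/// Hypothetical helper: given bytes starting at a non-ASCII position,
/// return the length of a structurally valid UTF-8 sequence, or None.
fn validate_multibyte(bytes: &[u8]) -> Option<usize> {
    let first = *bytes.first()?;
    let len = match first {
        0xC2..=0xDF => 2,
        0xE0..=0xEF => 3,
        0xF0..=0xF4 => 4,
        // Continuation bytes (0x80..=0xBF) and invalid start bytes
        // (0xC0, 0xC1, 0xF5..) cannot begin a character.
        _ => return None,
    };
    let seq = bytes.get(..len)?;
    // Every byte after the first must be a continuation byte (10xxxxxx).
    // NB: a complete validator must also reject overlong encodings and
    // surrogates (e.g. 0xE0 0x80.., 0xED 0xA0..), as std::str::from_utf8 does.
    if seq[1..].iter().all(|b| b & 0xC0 == 0x80) {
        Some(len)
    } else {
        None
    }
}

fn main() {
    assert_eq!(validate_multibyte("é".as_bytes()), Some(2));
    assert_eq!(validate_multibyte("€".as_bytes()), Some(3));
    assert_eq!(validate_multibyte(&[0xE2, 0x28, 0xA1]), None); // bad continuation byte
}
```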

The parser would take an AsRef<[u8]> instead of a &str. If the source text passes UTF-8 validation, ParserReturn could contain the source text cast to a &str.
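A rough sketch of that API shape (not Oxc's real signature; std::str::from_utf8 stands in for the validation the lexer would do in-line). On success the same byte slice is handed back cast to &str, zero-copy:

```rust
use std::str::Utf8Error;

// Hypothetical: the parser accepts raw bytes; lexing proves them valid
// UTF-8 as a side effect, so the caller gets the `&str` back for free.
fn parse(source: &[u8]) -> Result<&str, Utf8Error> {
    // Stand-in for validation performed during lexing.
    std::str::from_utf8(source)
}

fn main() {
    assert_eq!(parse(b"let x = 1;"), Ok("let x = 1;"));
    assert!(parse(&[0xFF]).is_err()); // invalid UTF-8 is reported, not panicked on
}
```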

We could have individual ByteHandlers for each group of Unicode start bytes (first byte of 2-byte char, 3-byte char, 4-byte char), rather than the single UNI handler we have for all of them now.
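Splitting the single UNI handler by start-byte class could look roughly like this (a sketch; the handler names and signature are illustrative, not Oxc's actual ByteHandler table):

```rust
// Hypothetical handler type; Oxc's real ByteHandler signature differs.
type ByteHandler = fn(&[u8]) -> &'static str;

fn ascii(_: &[u8]) -> &'static str { "ASCII" }
fn uni2(_: &[u8]) -> &'static str { "2-byte" }
fn uni3(_: &[u8]) -> &'static str { "3-byte" }
fn uni4(_: &[u8]) -> &'static str { "4-byte" }
fn invalid(_: &[u8]) -> &'static str { "invalid" }

// In practice this would be a 256-entry lookup table indexed by the byte;
// a match expresses the same grouping.
fn handler_for(byte: u8) -> ByteHandler {
    match byte {
        0x00..=0x7F => ascii, // fast path: no validation needed
        0xC2..=0xDF => uni2,
        0xE0..=0xEF => uni3,
        0xF0..=0xF4 => uni4,
        // Continuation bytes and 0xC0/0xC1/0xF5.. can never start a char,
        // so they go straight to an error handler.
        _ => invalid,
    }
}

fn main() {
    assert_eq!(handler_for(b'a')(&[]), "ASCII");
    assert_eq!(handler_for(0xC3)(&[]), "2-byte");
    assert_eq!(handler_for(0x80)(&[]), "invalid");
}
```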

NB: When UTF-8 validation fails, we'd need to sanitize the source text used in diagnostics, as the error printer relies on the source text being a valid &str.
