
Lexer validate UTF-8 #93

Open
overlookmotel opened this issue Aug 22, 2024 · 1 comment

overlookmotel commented Aug 22, 2024

Currently the parser takes source text as a &str.

This is fine within Oxc, but it imposes a cost on the consumer in the typical case where they're reading the source text from a file. Typically one would use let source_text = std::fs::read_to_string(path);. This has a hidden cost: it performs UTF-8 validation, which is not cheap. read_to_string uses std::str::from_utf8 internally, which is not very efficient; it's not even SIMD-accelerated.
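To illustrate the point: read_to_string is essentially fs::read followed by a UTF-8 validation pass over the whole buffer. A minimal sketch (using a byte literal as a stand-in for fs::read(path), so it runs without touching the filesystem):

```rust
fn main() {
    // Stand-in for `std::fs::read(path)` - reading raw bytes does no validation.
    let bytes: Vec<u8> = b"let x = 1;".to_vec();

    // `read_to_string` performs this validation pass internally; a parser
    // that accepted `&[u8]` directly could fold it into lexing instead.
    let source_text = std::str::from_utf8(&bytes).expect("source was not valid UTF-8");

    assert_eq!(source_text, "let x = 1;");
}
```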

Thanks to @lucab's efforts in oxc-project/oxc#4298 and oxc-project/oxc#4304, the lexer now (mostly) processes input on a byte-by-byte basis, rather than char-by-char. So now it would not be difficult to perform UTF-8 validation at the same time as lexing.

We already have separate code paths for handling Unicode chars, and ASCII text needs no validation at all, so I imagine adding UTF-8 validation would cost nothing on the fast ASCII path, and very little on the Unicode paths (which are very rarely taken anyway). And if we add support for UTF-16 spans (oxc-project/oxc#959) we'd need logic to handle Unicode bytes anyway, so then UTF-8 validation on top of that would be almost entirely free.
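A hypothetical sketch of what validating a multi-byte sequence in the lexer's Unicode path might look like (names are illustrative, not Oxc's actual API). ASCII bytes (< 0x80) never reach this function, so the fast path pays nothing:

```rust
/// Hypothetical helper: given bytes starting at a non-ASCII position,
/// return the length of a structurally valid UTF-8 sequence, or None.
fn validate_multibyte(bytes: &[u8]) -> Option<usize> {
    let first = *bytes.first()?;
    let len = match first {
        0xC2..=0xDF => 2,
        0xE0..=0xEF => 3,
        0xF0..=0xF4 => 4,
        // Continuation bytes (0x80..=0xBF) and invalid start bytes
        // (0xC0, 0xC1, 0xF5..) cannot begin a character.
        _ => return None,
    };
    let seq = bytes.get(..len)?;
    // Every byte after the first must be a continuation byte (10xxxxxx).
    // NB: a complete validator must also reject overlong encodings and
    // surrogates (e.g. 0xE0 0x80.., 0xED 0xA0..), as std::str::from_utf8 does.
    if seq[1..].iter().all(|b| b & 0xC0 == 0x80) {
        Some(len)
    } else {
        None
    }
}

fn main() {
    assert_eq!(validate_multibyte("é".as_bytes()), Some(2));
    assert_eq!(validate_multibyte("€".as_bytes()), Some(3));
    assert_eq!(validate_multibyte(&[0xE2, 0x28, 0xA1]), None); // bad continuation byte
}
```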

The parser would take an AsRef<[u8]> instead of a &str. If the source text passes UTF-8 validation, ParserReturn could contain the source text cast to a &str.
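A rough sketch of that API shape (not Oxc's real signature; std::str::from_utf8 stands in for the validation the lexer would do in-line). On success the same byte slice is handed back cast to &str, zero-copy:

```rust
use std::str::Utf8Error;

// Hypothetical: the parser accepts raw bytes; lexing proves them valid
// UTF-8 as a side effect, so the caller gets the `&str` back for free.
fn parse(source: &[u8]) -> Result<&str, Utf8Error> {
    // Stand-in for validation performed during lexing.
    std::str::from_utf8(source)
}

fn main() {
    assert_eq!(parse(b"let x = 1;"), Ok("let x = 1;"));
    assert!(parse(&[0xFF]).is_err()); // invalid UTF-8 is reported, not panicked on
}
```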

We could have individual ByteHandlers for each group of Unicode start bytes (first byte of 2-byte char, 3-byte char, 4-byte char), rather than the single UNI handler we have for all of them now.
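Splitting the single UNI handler by start-byte class could look roughly like this (a sketch; the handler names and signature are illustrative, not Oxc's actual ByteHandler table):

```rust
// Hypothetical handler type; Oxc's real ByteHandler signature differs.
type ByteHandler = fn(&[u8]) -> &'static str;

fn ascii(_: &[u8]) -> &'static str { "ASCII" }
fn uni2(_: &[u8]) -> &'static str { "2-byte" }
fn uni3(_: &[u8]) -> &'static str { "3-byte" }
fn uni4(_: &[u8]) -> &'static str { "4-byte" }
fn invalid(_: &[u8]) -> &'static str { "invalid" }

// In practice this would be a 256-entry lookup table indexed by the byte;
// a match expresses the same grouping.
fn handler_for(byte: u8) -> ByteHandler {
    match byte {
        0x00..=0x7F => ascii, // fast path: no validation needed
        0xC2..=0xDF => uni2,
        0xE0..=0xEF => uni3,
        0xF0..=0xF4 => uni4,
        // Continuation bytes and 0xC0/0xC1/0xF5.. can never start a char,
        // so they go straight to an error handler.
        _ => invalid,
    }
}

fn main() {
    assert_eq!(handler_for(b'a')(&[]), "ASCII");
    assert_eq!(handler_for(0xC3)(&[]), "2-byte");
    assert_eq!(handler_for(0x80)(&[]), "invalid");
}
```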

NB: When UTF-8 validation fails, we'd need to sanitize the source text used in diagnostics, as the error printer relies on the source text being a valid &str.
