Currently the parser takes source text as a `&str`. This is fine within Oxc, but it imposes a cost on the consumer in the typical case where they're reading the source text from a file. Typically one would use `let source_text = std::fs::read_to_string(path);`. This has a hidden cost, because it performs UTF-8 validation, which is not so cheap. `read_to_string` uses `std::str::from_utf8` internally, and it's not very efficient - it's not even SIMD-accelerated.
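For illustration, a rough consumer-side sketch (the file path is made up) comparing the two ways of reading the file:

```rust
use std::fs;

fn main() -> std::io::Result<()> {
    // Current API: the parser wants a &str, so the file has to go through
    // read_to_string, which performs a full UTF-8 validation pass up front.
    let source_text: String = fs::read_to_string("app.js")?;

    // With a byte-based API, the consumer could read raw bytes and let the
    // lexer validate UTF-8 as it lexes, skipping the up-front pass entirely.
    let source_bytes: Vec<u8> = fs::read("app.js")?;

    let _ = (source_text, source_bytes);
    Ok(())
}
```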
Thanks to @lucab's efforts in oxc-project/oxc#4298 and oxc-project/oxc#4304, the lexer now (mostly) processes input on a byte-by-byte basis, rather than char-by-char. So now it would not be difficult to perform UTF-8 validation at the same time as lexing.
We already have separate code paths for handling Unicode chars, and ASCII text needs no validation at all, so I imagine adding UTF-8 validation would cost nothing on the fast ASCII path, and very little on the Unicode paths (which are very rarely taken anyway). And if we add support for UTF-16 spans (oxc-project/oxc#959) we'd need logic to handle Unicode bytes anyway, so then UTF-8 validation on top of that would be almost entirely free.
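To sketch the idea (illustrative only, not the actual lexer code): on the Unicode path, validating a sequence amounts to checking its continuation bytes, and the same branch on the lead byte also yields the char's UTF-16 width, so the two features could share the work:

```rust
/// Illustrative only: consume and validate one multi-byte UTF-8 sequence
/// starting at `pos`, while also tracking the UTF-16 offset (for #959).
/// Returns the index just past the sequence, or Err on invalid UTF-8.
fn consume_unicode_char(bytes: &[u8], pos: usize, utf16_offset: &mut u32) -> Result<usize, ()> {
    // Expected continuation bytes and UTF-16 width, from the lead byte alone.
    let (cont, utf16_len) = match bytes[pos] {
        0xC2..=0xDF => (1, 1), // 2-byte UTF-8 char -> 1 UTF-16 code unit
        0xE0..=0xEF => (2, 1), // 3-byte UTF-8 char -> 1 UTF-16 code unit
        0xF0..=0xF4 => (3, 2), // 4-byte UTF-8 char -> surrogate pair
        _ => return Err(()),   // byte can never start a valid UTF-8 char
    };
    let end = pos + 1 + cont;
    if end > bytes.len() || !bytes[pos + 1..end].iter().all(|&b| b & 0xC0 == 0x80) {
        return Err(()); // truncated sequence or bad continuation byte
    }
    // (A strict validator also rejects overlong encodings and UTF-16
    // surrogates here, which needs extra range checks on the second byte.)
    *utf16_offset += utf16_len;
    Ok(end)
}
```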
The parser would take an `AsRef<[u8]>` instead of a `&str`. If the source text passes UTF-8 validation, `ParserReturn` could contain the source text cast to a `&str`.
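Something like this (signatures are hypothetical, just to show the shape):

```rust
// Hypothetical shape of the byte-based entry point; not oxc's actual API.
pub struct ParserReturn<'a> {
    /// `Some` only if the input passed UTF-8 validation during lexing.
    pub source_text: Option<&'a str>,
    // ... program, errors, etc.
}

pub fn parse_bytes<'a, S: AsRef<[u8]> + ?Sized>(source: &'a S) -> ParserReturn<'a> {
    let bytes = source.as_ref();
    // Stand-in for lexing + inline validation; assume the lexer reports
    // whether every byte it consumed was valid UTF-8.
    let valid_utf8 = std::str::from_utf8(bytes).is_ok();
    ParserReturn {
        // Casting back to &str is only sound once every byte has been validated.
        source_text: valid_utf8.then(|| unsafe { std::str::from_utf8_unchecked(bytes) }),
    }
}
```

Since `String`, `Vec<u8>`, `str`, and `[u8]` all implement `AsRef<[u8]>`, callers could pass whatever they already have without an extra conversion.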
We could have individual `ByteHandler`s for each group of Unicode start bytes (first byte of 2-byte char, 3-byte char, 4-byte char), rather than the single `UNI` handler we have for all of them now.
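Roughly (handler names here are made up, not the actual `ByteHandler` identifiers):

```rust
// Illustrative only: one handler per lead-byte group instead of a single UNI.
struct Lexer; // stand-in for the real lexer state
type ByteHandler = fn(&mut Lexer);

fn uni_2byte(_: &mut Lexer) { /* read and validate 1 continuation byte */ }
fn uni_3byte(_: &mut Lexer) { /* read and validate 2 continuation bytes */ }
fn uni_4byte(_: &mut Lexer) { /* read and validate 3 continuation bytes */ }
fn uni_invalid(_: &mut Lexer) { /* byte can never start a UTF-8 char: error */ }

/// In the real lexer this grouping would be baked into the 256-entry dispatch
/// table (ASCII bytes keep their existing handlers); a match shows the split.
fn unicode_handler(lead: u8) -> ByteHandler {
    match lead {
        0xC2..=0xDF => uni_2byte,
        0xE0..=0xEF => uni_3byte,
        0xF0..=0xF4 => uni_4byte,
        _ => uni_invalid, // 0x80..=0xC1 and 0xF5..=0xFF
    }
}
```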
NB: When UTF-8 validation fails, we would need to sanitize the source text used in diagnostics, as the error printer relies on the source text being a valid `&str`.