-
Notifications
You must be signed in to change notification settings - Fork 167
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor lexer to treat all input characters as UTF-8 #2307
Conversation
// Check if the input source is valid as utf-8 and copy all characters to | ||
// `chars`. | ||
void init () | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
modified InputSource
to check the input string is valid utf-8 and push utf-8 characters to its buffer (field
) immidiately after an instance of this class is created. (i.e. this method is a post-constructor)
By this, we do not have to decode each Unicode character more than once.
Codepoint | ||
Lexer::peek_codepoint_input () | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
peek_codepoint_input
and skip_codepoint_input
are no longer needed.
They are just wrappers of peek_input
and skip_input
respetively for now.
358ffbe
to
57f3b06
Compare
gcc/rust/lex/rust-lex.cc
Outdated
void | ||
rust_input_source_test () | ||
{ | ||
std::string src = u8"_abcde\tXYZ\v\f"; | ||
std::vector<uint32_t> expected | ||
= {'_', 'a', 'b', 'c', 'd', 'e', '\t', 'X', 'Y', 'Z', '\v', '\f'}; | ||
test_buffer_input_source (src, expected); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no idea how to convert(?) std::string
into FILE
so only BufferInputSource is tested now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added unit tests for BufferInputSource
. See #2307 (comment)
gcc/rust/ChangeLog: * lex/rust-lex.cc (is_float_digit):Change types of param to `uint32_t` (is_x_digit):Likewise (is_octal_digit):Likewise (is_bin_digit):Likewise (check_valid_float_dot_end):Likewise (is_whitespace):Likewise (is_non_decimal_int_literal_separator):Likewise (is_identifier_start):Likewise (is_identifier_continue):Likewise (Lexer::skip_broken_string_input):Likewise (Lexer::build_token):Remove handling BOM (Lexer::parse_in_type_suffix):Modify use of `current_char` (Lexer::parse_in_decimal):Likewise (Lexer::parse_escape):Likewise (Lexer::parse_utf8_escape):Likewise (Lexer::parse_partial_string_continue):Likewise (Lexer::parse_partial_hex_escape):Likewise (Lexer::parse_partial_unicode_escape):Likewise (Lexer::parse_byte_char):Likewise (Lexer::parse_byte_string):Likewise (Lexer::parse_raw_byte_string):Likewise (Lexer::parse_raw_identifier):Likewise (Lexer::parse_non_decimal_int_literal):Likewise (Lexer::parse_decimal_int_or_float):Likewise (Lexer::peek_input):Change return type to `Codepoint` (Lexer::get_input_codepoint_length):Change to return 1 (Lexer::peek_codepoint_input):Change to be wrapper of `peek_input` (Lexer::skip_codepoint_input):Change to be wrapper of `skip_input` (Lexer::test_get_input_codepoint_n_length):Deleted (Lexer::split_current_token):Deleted (Lexer::test_peek_codepoint_input):Deleted (Lexer::start_line):Move backwards (assert_source_content):New helper function for selftest (test_buffer_input_source):New helper function for selftest (test_file_input_source):Likewise (rust_input_source_test):New test * lex/rust-lex.h (rust_input_source_test):New test * rust-lang.cc (run_rust_tests):Add selftest Signed-off-by: Raiki Tamura <[email protected]>
static const int max_column_hint = 80; | ||
|
||
Optional<std::ofstream &> dump_lex_out; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These lines are just moved backwards, not changed.
@philberty @CohenArthur |
// return 0xFFFE; | ||
return 0; | ||
/* TODO: assert that this TokenId is a "simple token" like punctuation and not | ||
* like "IDENTIFIER"? */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a token enum that you can implement a switch satement on to figure that out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me nothing to add here. Great work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Amazing work, thank you!
@@ -2504,338 +2499,133 @@ Lexer::parse_char_or_lifetime (Location loc) | |||
} | |||
} | |||
|
|||
// TODO remove this function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please open an issue so we don't forget :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Already raised in #2309
} | ||
|
||
// TODO remove this function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mention this function in the issue as well
} | ||
|
||
// TODO remove this function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here
Addresses #2287, #2309
In this PR, I have modified
peek_input(int n)
, andskip_input(int n)
to handle UTF-8 characters.To do so, I also dramatically modified
InputSource
to decode utf-8 and buffer its characters.