Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor lexer to treat all input characters as UTF-8 #2307

Merged
merged 1 commit into from
Jun 28, 2023

Conversation

tamaroning
Copy link
Contributor

@tamaroning tamaroning commented Jun 18, 2023

Addresses #2287, #2309

In this PR, I have modified peek_input(int n), and skip_input(int n) to handle UTF-8 characters.
To do so, I also dramatically modified InputSource to decode utf-8 and buffer its characters.

Comment on lines +327 to +306
// Check if the input source is valid as utf-8 and copy all characters to
// `chars`.
void init ()
{
Copy link
Contributor Author

@tamaroning tamaroning Jun 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

modified InputSource to check the input string is valid utf-8 and push utf-8 characters to its buffer (field) immidiately after an instance of this class is created. (i.e. this method is a post-constructor)
By this, we do not have to decode each Unicode character more than once.

Comment on lines 2511 to 2514
Codepoint
Lexer::peek_codepoint_input ()
{
Copy link
Contributor Author

@tamaroning tamaroning Jun 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

peek_codepoint_input and skip_codepoint_input are no longer needed.
They are just wrappers of peek_input and skip_input respetively for now.

@tamaroning tamaroning force-pushed the uc-refactor branch 5 times, most recently from 358ffbe to 57f3b06 Compare June 23, 2023 03:05
Comment on lines 2577 to 2594
void
rust_input_source_test ()
{
std::string src = u8"_abcde\tXYZ\v\f";
std::vector<uint32_t> expected
= {'_', 'a', 'b', 'c', 'd', 'e', '\t', 'X', 'Y', 'Z', '\v', '\f'};
test_buffer_input_source (src, expected);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea how to convert(?) std::string into FILE so only BufferInputSource is tested now.

Copy link
Contributor Author

@tamaroning tamaroning Jun 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added unit tests for BufferInputSource. See #2307 (comment)

gcc/rust/ChangeLog:

	* lex/rust-lex.cc (is_float_digit):Change types of param to `uint32_t`
	(is_x_digit):Likewise
	(is_octal_digit):Likewise
	(is_bin_digit):Likewise
	(check_valid_float_dot_end):Likewise
	(is_whitespace):Likewise
	(is_non_decimal_int_literal_separator):Likewise
	(is_identifier_start):Likewise
	(is_identifier_continue):Likewise
	(Lexer::skip_broken_string_input):Likewise
	(Lexer::build_token):Remove handling BOM
	(Lexer::parse_in_type_suffix):Modify use of `current_char`
	(Lexer::parse_in_decimal):Likewise
	(Lexer::parse_escape):Likewise
	(Lexer::parse_utf8_escape):Likewise
	(Lexer::parse_partial_string_continue):Likewise
	(Lexer::parse_partial_hex_escape):Likewise
	(Lexer::parse_partial_unicode_escape):Likewise
	(Lexer::parse_byte_char):Likewise
	(Lexer::parse_byte_string):Likewise
	(Lexer::parse_raw_byte_string):Likewise
	(Lexer::parse_raw_identifier):Likewise
	(Lexer::parse_non_decimal_int_literal):Likewise
	(Lexer::parse_decimal_int_or_float):Likewise
	(Lexer::peek_input):Change return type to `Codepoint`
	(Lexer::get_input_codepoint_length):Change to return 1
	(Lexer::peek_codepoint_input):Change to be wrapper of `peek_input`
	(Lexer::skip_codepoint_input):Change to be wrapper of `skip_input`
	(Lexer::test_get_input_codepoint_n_length):Deleted
	(Lexer::split_current_token):Deleted
	(Lexer::test_peek_codepoint_input):Deleted
	(Lexer::start_line):Move backwards
	(assert_source_content):New helper function for selftest
	(test_buffer_input_source):New helper function for selftest
	(test_file_input_source):Likewise
	(rust_input_source_test):New test
	* lex/rust-lex.h (rust_input_source_test):New test
	* rust-lang.cc (run_rust_tests):Add selftest

Signed-off-by: Raiki Tamura <[email protected]>
static const int max_column_hint = 80;

Optional<std::ofstream &> dump_lex_out;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These lines are just moved backwards, not changed.

@tamaroning
Copy link
Contributor Author

@philberty @CohenArthur
I have added unit tests and this pr is ready for review! I left several comments to describe my changes.

@tamaroning tamaroning changed the title (WIP) Refactor lexer to treat all input characters as UTF-8 Refactor lexer to treat all input characters as UTF-8 Jun 24, 2023
// return 0xFFFE;
return 0;
/* TODO: assert that this TokenId is a "simple token" like punctuation and not
* like "IDENTIFIER"? */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a token enum that you can implement a switch satement on to figure that out.

Copy link
Member

@philberty philberty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me nothing to add here. Great work

Copy link
Member

@CohenArthur CohenArthur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Amazing work, thank you!

@@ -2504,338 +2499,133 @@ Lexer::parse_char_or_lifetime (Location loc)
}
}

// TODO remove this function
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please open an issue so we don't forget :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already raised in #2309

}

// TODO remove this function
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mention this function in the issue as well

}

// TODO remove this function
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

@CohenArthur CohenArthur added this pull request to the merge queue Jun 28, 2023
Merged via the queue into Rust-GCC:master with commit 3d2a0c0 Jun 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants