Refactor lexer to treat all input characters as UTF-8 #2307

tamaroning · 2023-06-18T14:35:16Z

In this PR, I have modified peek_input(int n) and skip_input(int n) to handle UTF-8 characters.
To do so, I also substantially reworked InputSource so that it decodes UTF-8 and buffers the decoded characters.

tamaroning · 2023-06-18T17:15:01Z

gcc/rust/lex/rust-lex.h

+    // Check if the input source is valid as utf-8 and copy all characters to
+    // `chars`.
+    void init ()
+    {


I modified InputSource to check that the input string is valid UTF-8 and to push the decoded characters into its buffer (a field) immediately after an instance of this class is created (i.e. this method is a post-constructor).
This way, we do not have to decode each Unicode character more than once.
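
For readers following along, the "validate and decode once, then buffer" idea can be sketched roughly as below. This is only an illustration under assumptions, not the code in this PR: `decode_utf8` and `decode_utf8_selfcheck` are hypothetical names, and the real `InputSource::init ()` may report errors and store its buffer differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper, not the code from this PR: validate `src` as UTF-8 and
// decode it into code points in one pass. Returns false on malformed input and
// leaves `chars` untouched in that case.
bool
decode_utf8 (const std::string &src, std::vector<uint32_t> &chars)
{
  // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
  static const uint32_t min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};
  std::vector<uint32_t> out;
  size_t i = 0;
  while (i < src.size ())
    {
      unsigned char lead = static_cast<unsigned char> (src[i]);
      uint32_t cp;
      size_t len;
      if (lead < 0x80)
        { cp = lead; len = 1; }
      else if ((lead & 0xE0) == 0xC0)
        { cp = lead & 0x1F; len = 2; }
      else if ((lead & 0xF0) == 0xE0)
        { cp = lead & 0x0F; len = 3; }
      else if ((lead & 0xF8) == 0xF0)
        { cp = lead & 0x07; len = 4; }
      else
        return false; // stray continuation byte or invalid lead byte

      if (i + len > src.size ())
        return false; // sequence is truncated
      for (size_t k = 1; k < len; k++)
        {
          unsigned char c = static_cast<unsigned char> (src[i + k]);
          if ((c & 0xC0) != 0x80)
            return false; // not a continuation byte
          cp = (cp << 6) | (c & 0x3F);
        }
      // Reject overlong encodings, surrogates and out-of-range values.
      if (cp < min_for_len[len] || cp > 0x10FFFF
          || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;
      out.push_back (cp);
      i += len;
    }
  chars.swap (out);
  return true;
}

// Small self-check in the spirit of the selftests added by this PR.
void
decode_utf8_selfcheck ()
{
  std::vector<uint32_t> chars;
  assert (decode_utf8 ("_a\t", chars) && chars.size () == 3);
  assert (!decode_utf8 ("\xC0\x80", chars)); // overlong encoding of U+0000
}
```

Once the buffer holds whole code points, peeking or skipping `n` characters becomes plain indexing, so no multi-byte sequence has to be decoded a second time.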

tamaroning · 2023-06-18T17:18:16Z

gcc/rust/lex/rust-lex.cc

 Codepoint
 Lexer::peek_codepoint_input ()
 {


peek_codepoint_input and skip_codepoint_input are no longer needed.
For now, they are just wrappers around peek_input and skip_input, respectively.
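
A rough sketch of why the codepoint variants become trivial once the input is a buffer of decoded code points. The class `CodepointCursor`, the sentinel `kEndOfInput`, and the exact meaning of `n` are assumptions made for illustration; the real `Lexer` uses its own queue and may count `n` differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sentinel for "no more input"; not a valid Unicode code point.
const uint32_t kEndOfInput = 0xFFFFFFFF;

// Hypothetical stand-in for the lexer's view of an already-decoded buffer.
class CodepointCursor
{
public:
  explicit CodepointCursor (std::vector<uint32_t> decoded)
    : chars (std::move (decoded)), pos (0)
  {}

  // Look `n` code points ahead without consuming anything.
  uint32_t peek_input (size_t n = 0) const
  {
    return pos + n < chars.size () ? chars[pos + n] : kEndOfInput;
  }

  // Consume `n` code points (assumption: exactly n, clamped at the end).
  void skip_input (size_t n = 1)
  {
    pos = (n > chars.size () - pos) ? chars.size () : pos + n;
  }

  // With one buffer slot per code point, these are plain wrappers.
  uint32_t peek_codepoint_input () const { return peek_input (0); }
  void skip_codepoint_input () { skip_input (1); }

private:
  std::vector<uint32_t> chars;
  size_t pos;
};

// Small self-check showing that a multi-byte character occupies one slot.
void
codepoint_cursor_selfcheck ()
{
  std::vector<uint32_t> decoded = {0x3042, 'b'}; // U+3042, then 'b'
  CodepointCursor cursor (decoded);
  assert (cursor.peek_codepoint_input () == 0x3042);
  cursor.skip_codepoint_input ();
  assert (cursor.peek_input () == 'b');
}
```

Because a non-ASCII character takes exactly one slot, "codepoint length" is always 1 in this model, which matches the ChangeLog note that `get_input_codepoint_length` now returns 1.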

tamaroning · 2023-06-23T04:25:45Z

gcc/rust/lex/rust-lex.cc

+void
+rust_input_source_test ()
+{
+  std::string src = u8"_abcde\tXYZ\v\f";
+  std::vector<uint32_t> expected
+    = {'_', 'a', 'b', 'c', 'd', 'e', '\t', 'X', 'Y', 'Z', '\v', '\f'};
+  test_buffer_input_source (src, expected);


I have no idea how to convert(?) a std::string into a FILE, so only BufferInputSource is tested for now.
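
One possible way to get a `FILE *` from a string in a test, sketched with the standard `std::tmpfile`. The helper names `string_to_file` and `string_to_file_selfcheck` are hypothetical, and this is not necessarily the approach the PR ended up using for `test_file_input_source`.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical test helper (not from this PR): write `src` to an anonymous
// temporary file and rewind it, so a FILE*-based reader sees the same bytes.
// Returns nullptr on failure; the caller closes the stream with fclose.
FILE *
string_to_file (const std::string &src)
{
  FILE *f = std::tmpfile ();
  if (f == nullptr)
    return nullptr;
  if (std::fwrite (src.data (), 1, src.size (), f) != src.size ())
    {
      std::fclose (f);
      return nullptr;
    }
  std::rewind (f); // reading starts from the first byte again
  return f;
}

// Small self-check: the bytes read back match the bytes written.
void
string_to_file_selfcheck ()
{
  FILE *f = string_to_file ("_a");
  assert (f != nullptr);
  assert (std::fgetc (f) == '_');
  assert (std::fgetc (f) == 'a');
  assert (std::fgetc (f) == EOF);
  std::fclose (f);
}
```

The temporary file is removed automatically when the stream is closed, so the test does not need cleanup code of its own.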

Added unit tests for BufferInputSource. See #2307 (comment)

gcc/rust/ChangeLog:

	* lex/rust-lex.cc (is_float_digit): Change types of param to `uint32_t`
	(is_x_digit): Likewise
	(is_octal_digit): Likewise
	(is_bin_digit): Likewise
	(check_valid_float_dot_end): Likewise
	(is_whitespace): Likewise
	(is_non_decimal_int_literal_separator): Likewise
	(is_identifier_start): Likewise
	(is_identifier_continue): Likewise
	(Lexer::skip_broken_string_input): Likewise
	(Lexer::build_token): Remove handling BOM
	(Lexer::parse_in_type_suffix): Modify use of `current_char`
	(Lexer::parse_in_decimal): Likewise
	(Lexer::parse_escape): Likewise
	(Lexer::parse_utf8_escape): Likewise
	(Lexer::parse_partial_string_continue): Likewise
	(Lexer::parse_partial_hex_escape): Likewise
	(Lexer::parse_partial_unicode_escape): Likewise
	(Lexer::parse_byte_char): Likewise
	(Lexer::parse_byte_string): Likewise
	(Lexer::parse_raw_byte_string): Likewise
	(Lexer::parse_raw_identifier): Likewise
	(Lexer::parse_non_decimal_int_literal): Likewise
	(Lexer::parse_decimal_int_or_float): Likewise
	(Lexer::peek_input): Change return type to `Codepoint`
	(Lexer::get_input_codepoint_length): Change to return 1
	(Lexer::peek_codepoint_input): Change to be wrapper of `peek_input`
	(Lexer::skip_codepoint_input): Change to be wrapper of `skip_input`
	(Lexer::test_get_input_codepoint_n_length): Deleted
	(Lexer::split_current_token): Deleted
	(Lexer::test_peek_codepoint_input): Deleted
	(Lexer::start_line): Move backwards
	(assert_source_content): New helper function for selftest
	(test_buffer_input_source): New helper function for selftest
	(test_file_input_source): Likewise
	(rust_input_source_test): New test
	* lex/rust-lex.h (rust_input_source_test): New test
	* rust-lang.cc (run_rust_tests): Add selftest

Signed-off-by: Raiki Tamura <[email protected]>

tamaroning · 2023-06-24T15:42:30Z

gcc/rust/lex/rust-lex.h

+  static const int max_column_hint = 80;
+
+  Optional<std::ofstream &> dump_lex_out;
+


These lines are just moved backwards, not changed.

tamaroning · 2023-06-24T15:49:56Z

@philberty @CohenArthur
I have added unit tests and this PR is ready for review! I left several comments to describe my changes.

philberty · 2023-06-25T20:45:18Z

gcc/rust/lex/rust-lex.cc

-	// return 0xFFFE;
-	return 0;
+  /* TODO: assert that this TokenId is a "simple token" like punctuation and not
+   * like "IDENTIFIER"? */


We have a token enum that you can implement a switch statement on to figure that out.
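
A minimal sketch of what such a switch could look like. The enum below is a hypothetical, trimmed-down stand-in (the real `TokenId` has many more entries), and `is_simple_token` is an illustrative name, not an existing function.

```cpp
#include <cassert>

// Hypothetical subset of token kinds, for illustration only.
enum TokenId
{
  COMMA,
  SEMICOLON,
  LEFT_PAREN,
  RIGHT_PAREN,
  IDENTIFIER,
  INT_LITERAL,
  STRING_LITERAL
};

// Returns true for tokens fully described by their id (punctuation), and
// false for tokens that also carry text such as a name or literal value.
bool
is_simple_token (TokenId id)
{
  switch (id)
    {
    case COMMA:
    case SEMICOLON:
    case LEFT_PAREN:
    case RIGHT_PAREN:
      return true;
    case IDENTIFIER:
    case INT_LITERAL:
    case STRING_LITERAL:
      return false;
    }
  return false; // unreachable for valid ids; keeps all paths returning
}

// Small self-check for the two categories.
void
is_simple_token_selfcheck ()
{
  assert (is_simple_token (SEMICOLON));
  assert (!is_simple_token (IDENTIFIER));
}
```

Leaving out a `default:` label is deliberate in this sketch: with warnings enabled, the compiler can then flag any enumerator added later that the switch does not handle.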

philberty

This looks good to me; nothing to add here. Great work!

CohenArthur

Looks great! Amazing work, thank you!

CohenArthur · 2023-06-28T15:40:57Z

gcc/rust/lex/rust-lex.cc

@@ -2504,338 +2499,133 @@ Lexer::parse_char_or_lifetime (Location loc)
    }
 }

+// TODO remove this function


Please open an issue so we don't forget :)

Already raised in #2309

CohenArthur · 2023-06-28T15:41:09Z

gcc/rust/lex/rust-lex.cc

 }

+// TODO remove this function


Mention this function in the issue as well

CohenArthur · 2023-06-28T15:41:18Z

gcc/rust/lex/rust-lex.cc

 }

+// TODO remove this function


This was referenced Jun 19, 2023

Unicode support #2287

Open

Refactoring lexer to treat all characters as UTF-8 #2309

Closed

tamaroning force-pushed the uc-refactor branch 5 times, most recently from 358ffbe to 57f3b06 Compare June 23, 2023 03:05

tamaroning force-pushed the uc-refactor branch from c6d7805 to 8ba8ced Compare June 23, 2023 04:35

tamaroning mentioned this pull request Jun 24, 2023

Fix lexing byte literal #2320

Merged

tamaroning force-pushed the uc-refactor branch from 8ba8ced to 760ed46 Compare June 24, 2023 15:32

tamaroning changed the title ~~(WIP) Refactor lexer to treat all input characters as UTF-8~~ Refactor lexer to treat all input characters as UTF-8 Jun 24, 2023

philberty reviewed Jun 25, 2023

philberty approved these changes Jun 25, 2023

philberty added the enhancement label Jun 25, 2023

philberty requested review from CohenArthur and P-E-P June 25, 2023 20:48

philberty added this to the AST Pipeline for libcore 1.49 Complete milestone Jun 25, 2023

philberty assigned CohenArthur Jun 25, 2023

This was referenced Jun 28, 2023

Missing tests for utf-8 identifiers #2338

Merged

Fix lexer to skip utf-8 whitespaces #2339

Merged

CohenArthur approved these changes Jun 28, 2023

CohenArthur added this pull request to the merge queue Jun 28, 2023

Merged via the queue into Rust-GCC:master with commit 3d2a0c0 Jun 28, 2023

tamaroning mentioned this pull request Jun 29, 2023

Remove unnecessary methods/fields of Rust::Lexer #2347

Merged
