Lexer number parsing refactor #11211
Please let's keep the discussion in this PR focused on the actual implementation.
Now that #11447 is merged, the behaviour on overly long hex/oct/bin integer literals is inconsistent with this branch. Should I change the behaviour in this branch?
@BlobCodes Oh sorry, I didn't realize how much my work interfered with this. Luckily, as per your refactor, the code I changed will be fully thrown out in any case, so the merge isn't difficult.
But yes, all the specs I added there were intentionally decided and should not be violated. You can consider these as real spec failures to fix :/ In any case, I will now closely follow this pull request, hope to not keep it stalled.
Alright, but before resolving the conflict, I'll wait a bit until the discussions about intended integer parsing behaviour are settled. Currently, the master and PR |
Also, I think the format that the current specs demand is inconsistent.
0o12345671234567_12345671234567_i8
# => 0o12345671234567_12345671234567_i8 doesn't fit in an Int8
12345671234567_12345671234567_i8
# => 1234567123456712345671234567 doesn't fit in an Int8
Both the underscores and the suffix are inconsistent between base-10 and all other bases.
@BlobCodes You are correct that it's inconsistent and should be made consistent. The most recent discussion was in #11447 (comment) and, although I argued to remove underscores, we ended up keeping them. By that logic, they should probably be kept in this example too. Or maybe it's worth opening yet another issue about it.
There doesn't seem to be a spec that requires the removal of underscores in base-10 numbers. We could maybe remove the suffix in the base-2^n error message, but add the underscores in base-10 representation, as it makes the number more readable.
I have now made these changes:
Before:
1.0_u32 # => unexpected token: u32
1__2 # => trailing '_' in number
0xFF_i8 # => 255 doesn't fit in an Int8
0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF # => integer literal too large
0x00000000000000000000000000000000000000 # => integer literal too large
3_000_i8 # => 3000 doesn't fit in an Int8
0xF_F_i8 # => 255 doesn't fit in an Int8
After:
1.0_u32 # => Invalid suffix u32 for decimal number
1__2 # => consecutive underscores in numbers aren't allowed
0xFF_i8 # => 0xFF doesn't fit in an Int8
0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF # => 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF doesn't fit in an UInt64
0x00000000000000000000000000000000000000 # => 0
3_000_i8 # => 3_000 doesn't fit in an Int8
0xF_F_i8 # => 0xF_F doesn't fit in an Int8
I think this should be a good compromise (no suffix for bin-/oct-/hex-numbers, underscores in base-10 numbers).
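The rule behind the "After" messages for out-of-range integers (drop the type suffix, keep the base prefix and the underscores) can be modelled in a few lines. This is only a Python sketch for illustration, not the PR's Crystal code: `overflow_message`, `RANGES` and `BASES` are hypothetical names, and checking an unsuffixed literal against UInt64 is an assumption taken from the long 0xFF example above.

```python
import re

# Hypothetical table: integer suffix -> (type name, min, max).
RANGES = {}
for bits in (8, 16, 32, 64):
    RANGES[f"i{bits}"] = (f"Int{bits}", -(1 << (bits - 1)), (1 << (bits - 1)) - 1)
    RANGES[f"u{bits}"] = (f"UInt{bits}", 0, (1 << bits) - 1)

# Base prefixes; anything else is treated as base 10.
BASES = {"0x": 16, "0o": 8, "0b": 2}


def overflow_message(literal):
    """Return the overflow message for an integer literal, or None if it fits."""
    # Split "<body>[_]<suffix>"; the suffix (and its underscore) is optional.
    match = re.fullmatch(r"(.+?)(?:_?([iu](?:8|16|32|64)))?", literal)
    body, suffix = match.group(1), match.group(2)

    # Assumption: without a suffix, UInt64 is the widest type to check against.
    type_name, low, high = RANGES[suffix or "u64"]

    base = BASES.get(body[:2], 10)
    digits = body[2:] if base != 10 else body
    value = int(digits.replace("_", ""), base)

    if low <= value <= high:
        return None
    # The message repeats the literal as written, minus the suffix.
    return f"{body} doesn't fit in an {type_name}"


print(overflow_message("0xFF_i8"))
print(overflow_message("3_000_i8"))
```

The message echoes the body of the literal exactly as the user wrote it, so the same formatting path serves every base.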
I tested a bit (VERY basic test, only tests number
We should probably test with more numbers, but the performance loss should be negligible, if any. Code: (as zip because of GitHub)
Co-authored-by: Sijawusz Pur Rahnama <[email protected]>
I found out that hexadecimal integers that are too big to fit in a u64, but have few enough digits that they don't reach or exceed the maximum digit count, trip a compiler error. I'll push a fix soon.
@BlobCodes That sounds like it really needs a spec?
BTW I only noticed this because I locally implemented #10154, but unsurprisingly e-notation literals like 10e100u64 are easily under the 19 digit limit while exceeding a u64's capacity.
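The failure mode in the comments above, a literal that passes a digit-count limit yet still overflows, can be shown with a small model. This is a Python sketch with hypothetical helpers (`max_digits`, `fits_u64`), not the lexer's actual logic, and the exact limits the lexer uses may differ. The point is that a digit count can only reject the clearly-too-long and accept the clearly-short cases; a literal with exactly the maximum number of digits still needs a real value comparison.

```python
U64_MAX = (1 << 64) - 1


def max_digits(base, limit=U64_MAX):
    """Number of digits needed to write `limit` in the given base."""
    count = 0
    while limit > 0:
        limit //= base
        count += 1
    return count


def fits_u64(digits, base):
    """Check whether a digit string (no prefix, no suffix) fits in 64 unsigned bits."""
    # Underscores and leading zeros do not affect the value.
    digits = digits.replace("_", "").lstrip("0") or "0"
    longest = max_digits(base)
    if len(digits) > longest:
        return False  # too many digits: cannot fit
    if len(digits) < longest:
        return True   # too few digits to overflow
    # Boundary case: same length as the maximum, so compare the values.
    return int(digits, base) <= U64_MAX


print(fits_u64("18446744073709551615", 10))  # largest u64 value
print(fits_u64("99999999999999999999", 10))  # same digit count, larger value
```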
This is a refactor to the lexer number parsing.
It was separated from #11196 to make maintainers' lives easier by completely separating final int128 support from the refactor needed for this goal.
This PR fixes some bugs in current integer parsing and creates new rules.
This PR condenses the methods scan_number, scan_zero_number, check_integer_literal_fits_in_size, deduce_integer_kind, scan_bin_number, scan_octal_number, scan_hex_number into one method - scan_number
Lots of methods have been removed as they are unneeded now.
Overall, this PR removes 318 LOC from the lexer (number parsing methods now only require 197 LOC total).
Overall the code is now much easier to understand.
This PR is required for int128 literal support, because it removes the lexer's dependency on a fixed-size number type (no static to_u64 anymore, any bit size can now be added without much effort).
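To illustrate why dropping the fixed-size type makes new widths cheap, here is a minimal Python sketch in which the range check is derived from the suffix's bit width instead of being tied to a 64-bit conversion. The names `int_range` and `fits` are hypothetical, and Python's arbitrary-precision integers stand in for whatever wide representation the lexer actually uses.

```python
def int_range(signed, bits):
    """Inclusive (min, max) of an integer type with the given width."""
    if signed:
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1


def fits(value, suffix):
    """Check a parsed value against a suffix such as "i8", "u64" or "i128"."""
    signed = suffix[0] == "i"
    bits = int(suffix[1:])
    low, high = int_range(signed, bits)
    return low <= value <= high


print(fits(255, "i8"))
print(fits(1 << 64, "u128"))
```

Because nothing here mentions 64 bits, a 128-bit suffix needs no extra code path in this model.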
Fixes #11191
Closes #11203
Closes #11214