Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lexer number parsing refactor #11211

Merged

Conversation

BlobCodes
Copy link
Contributor

@BlobCodes BlobCodes commented Sep 14, 2021

This is a refactor to the lexer number parsing.
It was seperated from #11196 to ease maintainers life's by completely seperating final int128 support from the refactor needed for this goal.

This PR fixes some bugs in current integer parsing and creates new rules.

# Before
1_.1   # => 1.1
1_e2   # => 100.0
-0u64  # => 0_u64
-0_u64 # Error: Invalid UInt32: -0 (ArgumentError); you've found a bug in the Crystal compiler.
1__2   # => 12
0x_2   # => 2
0_12   # => 12
-0x8000000000000000 # Error: Arithmetic overflow (OverflowError); you've found a bug in the Crystal compiler.
0e40   # Error: unexpected token: e40
0x     # => 0

# After
1_.1   # Error: trailing '_' in number
1_e2   # Error: trailing '_' in number
-0u64  # Error: Invalid negative value -0 for UInt64
-0_u64 # Error: Invalid negative value -0 for UInt64
1__2   # Error: trailing '_' in number
0x_2   # Error: numeric literal without digits
0_12   # Error: octal constants should be prefixed with 0o
-0x8000000000000000 # => -9223372036854775808
0e40   # => 0.0
0x     # Error: numeric literal without digits

This PR condenses the methods scan_number, scan_zero_number, check_integer_literal_fits_in_size, deduce_integer_kind, scan_bin_number, scan_octal_number, scan_hex_number into one method - scan_number

Lots of methods have been removed as they are unneeded now.
Overall, this PR removes 318 LOC from the lexer (number parsing methods now only require 197 LOC total).
Overall the code is now much easier to understand.

This PR is required for int128 literal support, because it removes the dependencies on a fixed size number type from the lexer (no static to_u64 anymore, any bit size can now be added without much effort)

Fixes #11191
Closes #11203
Closes #11214

@BlobCodes BlobCodes mentioned this pull request Sep 14, 2021
1 task
@straight-shoota
Copy link
Member

Please let's keep the discussion in this PR focused on the actual implementation.
I've created #11214 to decide about the expected behaviour.

@HertzDevil HertzDevil added kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:compiler:parser topic:stdlib:numeric labels Sep 19, 2021
@BlobCodes
Copy link
Contributor Author

Now that #11447 is merged, the behaviour on too long hex/oct/bin integer literals is inconsistent with this branch.
This branch currently raises a "integer literal too large" error if the number is too large to even convert to base 10, and then seperately checks the base 10 whether it is too large or not.

Should I change the behavior in this branch?

@oprypin oprypin assigned oprypin and unassigned oprypin Nov 30, 2021
@oprypin oprypin self-requested a review November 30, 2021 19:17
@oprypin
Copy link
Member

oprypin commented Nov 30, 2021

@BlobCodes Oh sorry, I didn't realize how much my work interfered with this. Luckily, as per your refactor, the code I changed will be fully thrown out in any case, so the merge isn't difficult.
So basically the git merge should be as follows:

  • lexer.cr: resolve all conflicts in favor of your current code
  • lexer_spec.cr: keep both!

But yes, all the specs I added there were intentionally decided and should not be violated.

You can consider these as real spec failures to fix :/

In any case, I will now closely follow this pull request, hope to not keep it stalled.

Failures:

  1) Lexer says syntax error on "0o1000000000000000000000i64"
     Failure/Error: assert_syntax_error "0o1000000000000000000000i64", "9223372036854775808 doesn't fit in an Int64"

       Expected message to include "9223372036854775808 doesn't fit in an Int64" but got "integer literal too large"

     # spec/compiler/lexer/lexer_spec.cr:349

  2) Lexer says syntax error on "0x10000_0000_0000_0000"
     Failure/Error: assert_syntax_error "0x10000_0000_0000_0000", "0x10000_0000_0000_0000 doesn't fit in an UInt64"

       Expected message to include "0x10000_0000_0000_0000 doesn't fit in an UInt64" but got "integer literal too large"

     # spec/compiler/lexer/lexer_spec.cr:355

  3) Lexer says syntax error on "0o200000_00000000_00000000"
     Failure/Error: assert_syntax_error "0o200000_00000000_00000000", "0o200000_00000000_00000000 doesn't fit in an UInt64"

       Expected message to include "0o200000_00000000_00000000 doesn't fit in an UInt64" but got "integer literal too large"

     # spec/compiler/lexer/lexer_spec.cr:356

  4) Lexer says syntax error on "0b100000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000"
     Failure/Error: assert_syntax_error "0b100000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000", "0b100000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000 doesn't fit in an UInt64"

       Expected message to include "0b100000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000 doesn't fit in an UInt64" but got "integer literal too large"

     # spec/compiler/lexer/lexer_spec.cr:357

  5) Lexer says syntax error on "0x1afafafafafafafafafafaf"
     Failure/Error: assert_syntax_error "0x1afafafafafafafafafafaf", "0x1afafafafafafafafafafaf doesn't fit in an UInt64"

       Expected message to include "0x1afafafafafafafafafafaf doesn't fit in an UInt64" but got "integer literal too large"

     # spec/compiler/lexer/lexer_spec.cr:359

  6) Lexer says syntax error on "0x1afafafafafafafafafafafi32"
     Failure/Error: assert_syntax_error "0x1afafafafafafafafafafafi32", "0x1afafafafafafafafafafafi32 doesn't fit in an Int32"

       Expected message to include "0x1afafafafafafafafafafafi32 doesn't fit in an Int32" but got "integer literal too large"

     # spec/compiler/lexer/lexer_spec.cr:360

  7) Lexer says syntax error on "0o1234567123456712345671234567"
     Failure/Error: assert_syntax_error "0o1234567123456712345671234567", "0o1234567123456712345671234567 doesn't fit in an UInt64"

       Expected message to include "0o1234567123456712345671234567 doesn't fit in an UInt64" but got "integer literal too large"

     # spec/compiler/lexer/lexer_spec.cr:361

  8) Lexer says syntax error on "0o12345671234567_12345671234567_i8"
     Failure/Error: assert_syntax_error "0o12345671234567_12345671234567_i8", "0o12345671234567_12345671234567_i8 doesn't fit in an Int8"

       Expected message to include "0o12345671234567_12345671234567_i8 doesn't fit in an Int8" but got "integer literal too large"

     # spec/compiler/lexer/lexer_spec.cr:362

  9) Lexer says syntax error on "0b100000000000000000000000000000000000000000000000000000000000000000"
     Failure/Error: assert_syntax_error "0b100000000000000000000000000000000000000000000000000000000000000000", "0b100000000000000000000000000000000000000000000000000000000000000000 doesn't fit in an UInt64"

       Expected message to include "0b100000000000000000000000000000000000000000000000000000000000000000 doesn't fit in an UInt64" but got "integer literal too large"

     # spec/compiler/lexer/lexer_spec.cr:363

 10) Lexer lexes "0o1000000000000000000000"

       integer literal too large (Crystal::SyntaxException)
         from src/compiler/crystal/syntax/lexer.cr:2916:9 in 'raise'

 11) Lexer lexes "0o177777_77777777_77777777"

       integer literal too large (Crystal::SyntaxException)
         from src/compiler/crystal/syntax/lexer.cr:2916:9 in 'raise'

 12) Lexer lexes "0x00ffffffffffffffff"

       integer literal too large (Crystal::SyntaxException)
         from src/compiler/crystal/syntax/lexer.cr:2916:9 in 'raise'

 13) Lexer lexes "0o001777777777777777777777"

       integer literal too large (Crystal::SyntaxException)
         from src/compiler/crystal/syntax/lexer.cr:2916:9 in 'raise'

 14) Lexer lexes "0b001111111111111111111111111111111111111111111111111111111111111111"

       integer literal too large (Crystal::SyntaxException)
         from src/compiler/crystal/syntax/lexer.cr:2916:9 in 'raise'

 15) Lexer lexes "0o1000000000000000000000"

       integer literal too large (Crystal::SyntaxException)
         from src/compiler/crystal/syntax/lexer.cr:2916:9 in 'raise'

@BlobCodes
Copy link
Contributor Author

Alright, but before resolving the conflict, I'll wait a bit until the discussions about intended integer parsing behaviour are settled.

Currently, the master and PR lexer_spec.cr have checks for the same rules, but require different error messages.

@BlobCodes
Copy link
Contributor Author

Also, I think the format that the current specs demand is inconsistent.
Example:

0o12345671234567_12345671234567_i8
# => 0o12345671234567_12345671234567_i8 doesn't fit in an Int8

12345671234567_12345671234567_i8
# => 1234567123456712345671234567 doesn't fit in an Int8

Both the underscores and the suffix are inconsistent between base-10 and all other bases.
I think both the suffix and underscores should be removed in other bases.

@oprypin
Copy link
Member

oprypin commented Nov 30, 2021

@BlobCodes You are correct that it's inconsistent and should be made consistent.

The most recent discussion was in #11447 (comment) and, although I argued to remove underscores, we ended up keeping them. By that logic, they should probably be kept in this example too. Or maybe it's worth opening yet another issue about it

@BlobCodes
Copy link
Contributor Author

BlobCodes commented Dec 1, 2021

@BlobCodes You are correct that it's inconsistent and should be made consistent.

The most recent discussion was in #11447 (comment) and, although I argued to remove underscores, we ended up keeping them. By that logic, they should probably be kept in this example too. Or maybe it's worth opening yet another issue about it

There doesn't seem to be a spec that requires the removal of underscores in base-10 numbers. We could maybe remove the suffix in the base-2^n error message, but add the underscores in base-10 representation, as it makes the number more readable.

@BlobCodes
Copy link
Contributor Author

I have now made these changes:

Before:
1.0_u32 # => unexpected token: u32
1__2 # => trailing '_' in number
0xFF_i8 # => 255 doesn't fit in an Int8
0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF # => integer literal too large
0x00000000000000000000000000000000000000 # => integer literal too large
3_000_i8 # => 3000 doesn't fit in an Int8
0xF_F_i8 # => 255 doesn't fit in an Int8

After:
1.0_u32 # => Invalid suffix u32 for decimal number
1__2 # => consecutive underscores in numbers aren't allowed
0xFF_i8 # => 0xFF doesn't fit in an Int8
0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF # => 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF doesn't fit in an UInt64
0x00000000000000000000000000000000000000 # => 0
3_000_i8 # => 3_000 doesn't fit in an Int8
0xF_F_i8 # => 0xF_F doesn't fit in an Int8

I think this should be a good compromise (no suffix for bin-/oct-/hex-numbers, underscores in base-10 numbers)
Feedback appreciated

@BlobCodes
Copy link
Contributor Author

BlobCodes commented Dec 4, 2021

The code is very high quality. I have a few comments, probably won't have many more.

I also would like to measure if, by chance, performance got worse from this, I'll do that later.

I tested a bit (VERY basic test, only tests number "-0.127389e3"):

⬢[blobcodes@toolbox crystal]$ crystal ./script.cr
Warning: benchmarking without the `--release` flag won't yield useful results
new    579.90k (  1.72µs) (± 2.60%)  976B/op   1.16× slower
old    674.69k (  1.48µs) (± 3.49%)  976B/op        fastest
⬢[blobcodes@toolbox crystal]$ crystal run ./script.cr --release
new   1.78M (560.51ns) (± 3.12%)  976B/op        fastest
old   1.76M (566.58ns) (± 2.65%)  976B/op   1.01× slower

We should probably test with more numbers, but performance loss should be negligable if any.

Code: (as zip because of github)
script.zip

spec/compiler/lexer/lexer_spec.cr Outdated Show resolved Hide resolved
spec/compiler/lexer/lexer_spec.cr Outdated Show resolved Hide resolved
spec/compiler/lexer/lexer_spec.cr Outdated Show resolved Hide resolved
src/compiler/crystal/syntax/lexer.cr Show resolved Hide resolved
@straight-shoota straight-shoota added this to the 1.3.0 milestone Dec 8, 2021
@oprypin
Copy link
Member

oprypin commented Dec 8, 2021

FYI I edited this pull request description to contain

Closes #11203
Closes #11214

because those discussions have no more unresolved decisions and this PR makes all of them happen

@BlobCodes
Copy link
Contributor Author

I found out that hexadecimal integers, which are too big to fit in a u64, but have few enough digits that they don't match or exceed the maximum digit count, trip a compiler error.
(for example crystal eval 0xfffffffffffffffff_u64 raises a compiler error)

I'll push a fix soon

@oprypin
Copy link
Member

oprypin commented Dec 8, 2021

@BlobCodes That sounds like it really needs a spec?

@straight-shoota straight-shoota removed this from the 1.3.0 milestone Dec 8, 2021
@BlobCodes
Copy link
Contributor Author

BTW I only noticed this because I locally implemented #10154, but unsurprisingly e-notation literals like 10e100u64 are easily under the 19 digit limit while exceeding a u64's capacity.

@oprypin oprypin added this to the 1.3.0 milestone Dec 9, 2021
@straight-shoota straight-shoota merged commit c0e49de into crystal-lang:master Dec 10, 2021
@straight-shoota straight-shoota changed the title Lexer number parsing refactor (int128 literals part 2) Lexer number parsing refactor Jan 3, 2022
@BlobCodes BlobCodes deleted the refactor/lexer-number-parsing branch January 29, 2022 22:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:compiler:parser topic:stdlib:numeric
Projects
None yet
6 participants