Lexer number parsing refactor #11211

BlobCodes · 2021-09-14T21:59:35Z

This is a refactor to the lexer number parsing.
It was seperated from #11196 to ease maintainers life's by completely seperating final int128 support from the refactor needed for this goal.

This PR fixes some bugs in current integer parsing and creates new rules.

# Before
1_.1   # => 1.1
1_e2   # => 100.0
-0u64  # => 0_u64
-0_u64 # Error: Invalid UInt32: -0 (ArgumentError); you've found a bug in the Crystal compiler.
1__2   # => 12
0x_2   # => 2
0_12   # => 12
-0x8000000000000000 # Error: Arithmetic overflow (OverflowError); you've found a bug in the Crystal compiler.
0e40   # Error: unexpected token: e40
0x     # => 0

# After
1_.1   # Error: trailing '_' in number
1_e2   # Error: trailing '_' in number
-0u64  # Error: Invalid negative value -0 for UInt64
-0_u64 # Error: Invalid negative value -0 for UInt64
1__2   # Error: trailing '_' in number
0x_2   # Error: numeric literal without digits
0_12   # Error: octal constants should be prefixed with 0o
-0x8000000000000000 # => -9223372036854775808
0e40   # => 0.0
0x     # Error: numeric literal without digits

This PR condenses the methods scan_number, scan_zero_number, check_integer_literal_fits_in_size, deduce_integer_kind, scan_bin_number, scan_octal_number, scan_hex_number into one method - scan_number

Lots of methods have been removed as they are unneeded now.
Overall, this PR removes 318 LOC from the lexer (number parsing methods now only require 197 LOC total).
Overall the code is now much easier to understand.

This PR is required for int128 literal support, because it removes the dependencies on a fixed size number type from the lexer (no static to_u64 anymore, any bit size can now be added without much effort)

Fixes #11191
Closes #11203
Closes #11214

straight-shoota · 2021-09-15T09:58:21Z

Please let's keep the discussion in this PR focused on the actual implementation.
I've created #11214 to decide about the expected behaviour.

BlobCodes · 2021-11-30T10:21:53Z

Now that #11447 is merged, the behaviour on too long hex/oct/bin integer literals is inconsistent with this branch.
This branch currently raises a "integer literal too large" error if the number is too large to even convert to base 10, and then seperately checks the base 10 whether it is too large or not.

Should I change the behavior in this branch?

oprypin · 2021-11-30T19:42:22Z

@BlobCodes Oh sorry, I didn't realize how much my work interfered with this. Luckily, as per your refactor, the code I changed will be fully thrown out in any case, so the merge isn't difficult.
So basically the git merge should be as follows:

lexer.cr: resolve all conflicts in favor of your current code
lexer_spec.cr: keep both!

But yes, all the specs I added there were intentionally decided and should not be violated.

You can consider these as real spec failures to fix :/

In any case, I will now closely follow this pull request, hope to not keep it stalled.

Failures:

  1) Lexer says syntax error on "0o1000000000000000000000i64"
     Failure/Error: assert_syntax_error "0o1000000000000000000000i64", "9223372036854775808 doesn't fit in an Int64"

       Expected message to include "9223372036854775808 doesn't fit in an Int64" but got "integer literal too large"

     # spec/compiler/lexer/lexer_spec.cr:349

  2) Lexer says syntax error on "0x10000_0000_0000_0000"
     Failure/Error: assert_syntax_error "0x10000_0000_0000_0000", "0x10000_0000_0000_0000 doesn't fit in an UInt64"

       Expected message to include "0x10000_0000_0000_0000 doesn't fit in an UInt64" but got "integer literal too large"

     # spec/compiler/lexer/lexer_spec.cr:355

  3) Lexer says syntax error on "0o200000_00000000_00000000"
     Failure/Error: assert_syntax_error "0o200000_00000000_00000000", "0o200000_00000000_00000000 doesn't fit in an UInt64"

       Expected message to include "0o200000_00000000_00000000 doesn't fit in an UInt64" but got "integer literal too large"

     # spec/compiler/lexer/lexer_spec.cr:356

  4) Lexer says syntax error on "0b100000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000"
     Failure/Error: assert_syntax_error "0b100000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000", "0b100000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000 doesn't fit in an UInt64"

       Expected message to include "0b100000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000 doesn't fit in an UInt64" but got "integer literal too large"

     # spec/compiler/lexer/lexer_spec.cr:357

  5) Lexer says syntax error on "0x1afafafafafafafafafafaf"
     Failure/Error: assert_syntax_error "0x1afafafafafafafafafafaf", "0x1afafafafafafafafafafaf doesn't fit in an UInt64"

       Expected message to include "0x1afafafafafafafafafafaf doesn't fit in an UInt64" but got "integer literal too large"

     # spec/compiler/lexer/lexer_spec.cr:359

  6) Lexer says syntax error on "0x1afafafafafafafafafafafi32"
     Failure/Error: assert_syntax_error "0x1afafafafafafafafafafafi32", "0x1afafafafafafafafafafafi32 doesn't fit in an Int32"

       Expected message to include "0x1afafafafafafafafafafafi32 doesn't fit in an Int32" but got "integer literal too large"

     # spec/compiler/lexer/lexer_spec.cr:360

  7) Lexer says syntax error on "0o1234567123456712345671234567"
     Failure/Error: assert_syntax_error "0o1234567123456712345671234567", "0o1234567123456712345671234567 doesn't fit in an UInt64"

       Expected message to include "0o1234567123456712345671234567 doesn't fit in an UInt64" but got "integer literal too large"

     # spec/compiler/lexer/lexer_spec.cr:361

  8) Lexer says syntax error on "0o12345671234567_12345671234567_i8"
     Failure/Error: assert_syntax_error "0o12345671234567_12345671234567_i8", "0o12345671234567_12345671234567_i8 doesn't fit in an Int8"

       Expected message to include "0o12345671234567_12345671234567_i8 doesn't fit in an Int8" but got "integer literal too large"

     # spec/compiler/lexer/lexer_spec.cr:362

  9) Lexer says syntax error on "0b100000000000000000000000000000000000000000000000000000000000000000"
     Failure/Error: assert_syntax_error "0b100000000000000000000000000000000000000000000000000000000000000000", "0b100000000000000000000000000000000000000000000000000000000000000000 doesn't fit in an UInt64"

       Expected message to include "0b100000000000000000000000000000000000000000000000000000000000000000 doesn't fit in an UInt64" but got "integer literal too large"

     # spec/compiler/lexer/lexer_spec.cr:363

 10) Lexer lexes "0o1000000000000000000000"

       integer literal too large (Crystal::SyntaxException)
         from src/compiler/crystal/syntax/lexer.cr:2916:9 in 'raise'

 11) Lexer lexes "0o177777_77777777_77777777"

       integer literal too large (Crystal::SyntaxException)
         from src/compiler/crystal/syntax/lexer.cr:2916:9 in 'raise'

 12) Lexer lexes "0x00ffffffffffffffff"

       integer literal too large (Crystal::SyntaxException)
         from src/compiler/crystal/syntax/lexer.cr:2916:9 in 'raise'

 13) Lexer lexes "0o001777777777777777777777"

       integer literal too large (Crystal::SyntaxException)
         from src/compiler/crystal/syntax/lexer.cr:2916:9 in 'raise'

 14) Lexer lexes "0b001111111111111111111111111111111111111111111111111111111111111111"

       integer literal too large (Crystal::SyntaxException)
         from src/compiler/crystal/syntax/lexer.cr:2916:9 in 'raise'

 15) Lexer lexes "0o1000000000000000000000"

       integer literal too large (Crystal::SyntaxException)
         from src/compiler/crystal/syntax/lexer.cr:2916:9 in 'raise'

BlobCodes · 2021-11-30T21:57:10Z

Alright, but before resolving the conflict, I'll wait a bit until the discussions about intended integer parsing behaviour are settled.

Currently, the master and PR lexer_spec.cr have checks for the same rules, but require different error messages.

BlobCodes · 2021-11-30T22:19:07Z

Also, I think the format that the current specs demand is inconsistent.
Example:

0o12345671234567_12345671234567_i8
# => 0o12345671234567_12345671234567_i8 doesn't fit in an Int8

12345671234567_12345671234567_i8
# => 1234567123456712345671234567 doesn't fit in an Int8

Both the underscores and the suffix are inconsistent between base-10 and all other bases.
I think both the suffix and underscores should be removed in other bases.

oprypin · 2021-11-30T22:28:42Z

@BlobCodes You are correct that it's inconsistent and should be made consistent.

The most recent discussion was in #11447 (comment) and, although I argued to remove underscores, we ended up keeping them. By that logic, they should probably be kept in this example too. Or maybe it's worth opening yet another issue about it

BlobCodes · 2021-12-01T13:24:21Z

@BlobCodes You are correct that it's inconsistent and should be made consistent.

The most recent discussion was in #11447 (comment) and, although I argued to remove underscores, we ended up keeping them. By that logic, they should probably be kept in this example too. Or maybe it's worth opening yet another issue about it

There doesn't seem to be a spec that requires the removal of underscores in base-10 numbers. We could maybe remove the suffix in the base-2^n error message, but add the underscores in base-10 representation, as it makes the number more readable.

BlobCodes · 2021-12-01T14:47:40Z

I have now made these changes:

Before:
1.0_u32 # => unexpected token: u32
1__2 # => trailing '_' in number
0xFF_i8 # => 255 doesn't fit in an Int8
0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF # => integer literal too large
0x00000000000000000000000000000000000000 # => integer literal too large
3_000_i8 # => 3000 doesn't fit in an Int8
0xF_F_i8 # => 255 doesn't fit in an Int8

After:
1.0_u32 # => Invalid suffix u32 for decimal number
1__2 # => consecutive underscores in numbers aren't allowed
0xFF_i8 # => 0xFF doesn't fit in an Int8
0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF # => 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF doesn't fit in an UInt64
0x00000000000000000000000000000000000000 # => 0
3_000_i8 # => 3_000 doesn't fit in an Int8
0xF_F_i8 # => 0xF_F doesn't fit in an Int8

I think this should be a good compromise (no suffix for bin-/oct-/hex-numbers, underscores in base-10 numbers)
Feedback appreciated

spec/compiler/lexer/lexer_spec.cr

src/compiler/crystal/syntax/lexer.cr

BlobCodes · 2021-12-04T18:09:53Z

The code is very high quality. I have a few comments, probably won't have many more.

I also would like to measure if, by chance, performance got worse from this, I'll do that later.

I tested a bit (VERY basic test, only tests number "-0.127389e3"):

⬢[blobcodes@toolbox crystal]$ crystal ./script.cr
Warning: benchmarking without the `--release` flag won't yield useful results
new    579.90k (  1.72µs) (± 2.60%)  976B/op   1.16× slower
old    674.69k (  1.48µs) (± 3.49%)  976B/op        fastest
⬢[blobcodes@toolbox crystal]$ crystal run ./script.cr --release
new   1.78M (560.51ns) (± 3.12%)  976B/op        fastest
old   1.76M (566.58ns) (± 2.65%)  976B/op   1.01× slower

We should probably test with more numbers, but performance loss should be negligable if any.

Code: (as zip because of github)
script.zip

src/compiler/crystal/syntax/lexer.cr

Co-authored-by: Sijawusz Pur Rahnama <[email protected]>

spec/compiler/lexer/lexer_spec.cr

src/compiler/crystal/syntax/lexer.cr

…obCodes/crystal into refactor/lexer-number-parsing

oprypin · 2021-12-08T17:06:31Z

FYI I edited this pull request description to contain

Closes #11203
Closes #11214

because those discussions have no more unresolved decisions and this PR makes all of them happen

BlobCodes · 2021-12-08T21:37:23Z

I found out that hexadecimal integers, which are too big to fit in a u64, but have few enough digits that they don't match or exceed the maximum digit count, trip a compiler error.
(for example crystal eval 0xfffffffffffffffff_u64 raises a compiler error)

I'll push a fix soon

oprypin · 2021-12-08T21:46:54Z

@BlobCodes That sounds like it really needs a spec?

BlobCodes · 2021-12-08T21:53:50Z

BTW I only noticed this because I locally implemented #10154, but unsurprisingly e-notation literals like 10e100u64 are easily under the 19 digit limit while exceeding a u64's capacity.

Lexer number parsing refactor

de76dd9

BlobCodes mentioned this pull request Sep 14, 2021

Int128 Parsing support [PART 1] #11196

Closed

1 task

Fix numbers ending with e notation (ex. "12e")

e4688c6

HertzDevil added kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:compiler:parser topic:stdlib:numeric labels Sep 19, 2021

BlobCodes added 5 commits October 21, 2021 04:01

Merge branch 'crystal-lang:master' into refactor/lexer-number-parsing

83a00f8

Handle all "numeric literal without digits" errors in one place

241a35d

Improve error message when integer has leading '0'

e1db7da

Do not increment underscore counter on e-notation

006e06b

Fix specs

a75e352

oprypin assigned oprypin and unassigned oprypin Nov 30, 2021

oprypin self-requested a review November 30, 2021 19:17

Unify number suffix scanning

3b92b6d

beta-ziliani mentioned this pull request Dec 1, 2021

Underscores in number literals #11214

Closed

BlobCodes added 3 commits December 1, 2021 15:07

Change consecutive underscore error message

6b78d13

Refactor to apply rules in crystal-lang#11447 and fix bugs

724585c

Merge branch 'master' into refactor/lexer-number-parsing

14368ce

beta-ziliani reviewed Dec 1, 2021

View reviewed changes

spec/compiler/lexer/lexer_spec.cr Outdated Show resolved Hide resolved

Sija reviewed Dec 1, 2021

View reviewed changes

src/compiler/crystal/syntax/lexer.cr Outdated Show resolved Hide resolved

Make gen_check_int_fits_in_size less magic

13af8b7

Add "unreachable" comment

7ec5b1e

Sija reviewed Dec 4, 2021

View reviewed changes

src/compiler/crystal/syntax/lexer.cr Outdated Show resolved Hide resolved

BlobCodes and others added 3 commits December 4, 2021 21:45

Disallow number literals ending with e-notation

ea610b4

Add spec for uppercase E-notation

dabf5f3

Adapt "invalid suffix" error to compiler style

470b367

Co-authored-by: Sijawusz Pur Rahnama <[email protected]>

oprypin reviewed Dec 6, 2021

View reviewed changes

spec/compiler/lexer/lexer_spec.cr Outdated Show resolved Hide resolved

spec/compiler/lexer/lexer_spec.cr Outdated Show resolved Hide resolved

spec/compiler/lexer/lexer_spec.cr Outdated Show resolved Hide resolved

src/compiler/crystal/syntax/lexer.cr Show resolved Hide resolved

BlobCodes added 3 commits December 7, 2021 10:24

Add more invalid suffix specs

4ba109c

Use "unexpected '_' in number" error if underscore not actually trailing

4779582

Merge branch 'refactor/lexer-number-parsing' of https://github.com/Bl…

794d887

…obCodes/crystal into refactor/lexer-number-parsing

oprypin approved these changes Dec 7, 2021

View reviewed changes

straight-shoota approved these changes Dec 8, 2021

View reviewed changes

straight-shoota added this to the 1.3.0 milestone Dec 8, 2021

Remove error_number_string, rename before_prefix_pos

454c73b

Fix hex literals with 17-19 digits possibly throwing a compiler error

5be0eaa

Add specs for previous change

f092dd6

straight-shoota removed this from the 1.3.0 milestone Dec 8, 2021

oprypin approved these changes Dec 8, 2021

View reviewed changes

Remove suffix from previous spec

7ba5ddb

oprypin added this to the 1.3.0 milestone Dec 9, 2021

straight-shoota merged commit c0e49de into crystal-lang:master Dec 10, 2021

straight-shoota mentioned this pull request Dec 30, 2021

Regression in number literal parser #11671

Closed

straight-shoota changed the title ~~Lexer number parsing refactor (int128 literals part 2)~~ Lexer number parsing refactor Jan 3, 2022

BlobCodes deleted the refactor/lexer-number-parsing branch January 29, 2022 22:39

asterite mentioned this pull request Sep 14, 2022

Index out of bounds on compiling incorrect emoji symbol #12480

Closed

BlobCodes mentioned this pull request Oct 26, 2022

Remove octal and binary notation for floating point literals #12667

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lexer number parsing refactor #11211

Lexer number parsing refactor #11211

BlobCodes commented Sep 14, 2021 •

edited by straight-shoota

Loading

straight-shoota commented Sep 15, 2021

BlobCodes commented Nov 30, 2021

oprypin commented Nov 30, 2021

BlobCodes commented Nov 30, 2021

BlobCodes commented Nov 30, 2021

oprypin commented Nov 30, 2021

BlobCodes commented Dec 1, 2021 •

edited

Loading

BlobCodes commented Dec 1, 2021

BlobCodes commented Dec 4, 2021 •

edited

Loading

oprypin commented Dec 8, 2021 •

edited

Loading

BlobCodes commented Dec 8, 2021

oprypin commented Dec 8, 2021

BlobCodes commented Dec 8, 2021

Lexer number parsing refactor #11211

Lexer number parsing refactor #11211

Conversation

BlobCodes commented Sep 14, 2021 • edited by straight-shoota Loading

straight-shoota commented Sep 15, 2021

BlobCodes commented Nov 30, 2021

oprypin commented Nov 30, 2021

BlobCodes commented Nov 30, 2021

BlobCodes commented Nov 30, 2021

oprypin commented Nov 30, 2021

BlobCodes commented Dec 1, 2021 • edited Loading

BlobCodes commented Dec 1, 2021

BlobCodes commented Dec 4, 2021 • edited Loading

oprypin commented Dec 8, 2021 • edited Loading

BlobCodes commented Dec 8, 2021

oprypin commented Dec 8, 2021

BlobCodes commented Dec 8, 2021

BlobCodes commented Sep 14, 2021 •

edited by straight-shoota

Loading

BlobCodes commented Dec 1, 2021 •

edited

Loading

BlobCodes commented Dec 4, 2021 •

edited

Loading

oprypin commented Dec 8, 2021 •

edited

Loading