Use sedlex instead of ulex #2203
Sedlex uses a different module name with a slightly different API. To minimize changes, the Utf8 module from ulex is vendored as FStar_Parser_Utf8 for use in the FStar_Sedlexing module.
Compilation of FStar_Parser_LexFStar.ml hangs if op_token is represented as one regexp. Splitting op_token up into 5 regexp parts appears to fix the problem. Possibly related to: ocaml-community/sedlex#34
Unlike ulex and ocamllex, an underscore in sedlex can match an empty string. To only match single characters, the keyword "any" must be used instead. See ocaml-community/sedlex#51
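A minimal sketch of the `any` versus underscore distinction, assuming the sedlex ppx is enabled for the file (for example via `(preprocess (pps sedlex.ppx))` in dune). The rule `string_body` is hypothetical and is not the rule from this PR; it only illustrates that `any` consumes exactly one code point, while the trailing `_` is the default case and consumes nothing.

```ocaml
(* Requires the sedlex library and its ppx; [string_body] is a
   hypothetical rule written for illustration only. *)

(* Count the code points that precede a closing double quote. *)
let rec string_body buf n =
  match%sedlex buf with
  | '"' -> Ok n                          (* closing quote ends the string *)
  | any -> string_body buf (n + 1)       (* exactly one code point consumed *)
  | _ -> Error "unterminated string"     (* default case: nothing else matched,
                                            here only at end of input *)
```

With `_` in place of `any` as the second case, nothing would be consumed, so a recursive rule of this shape could not make progress through the input.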
This is awesome! Thank you! Sorry for not noticing it earlier, but will try to merge it soon.
let ignored_op_char = [%sedlex.regexp? Chars ".$"]

(* op_token must be split into separate regular expressions to prevent
   compilation from hanging *)
that's interesting ... can you comment about this?
How did you diagnose that this particular regexp was the source of the hanging problem?
I indeed see it hanging on
00:00:58 88 (87 ) ...parser/ml/FStar_Parser_LexFStar.ml.depends O-b-----
00:00:58 88 (87 ) ...parser/ml/FStar_Parser_LexFStar.ml.depends O-b-----
00:00:59 88 (87 ) ...parser/ml/FStar_Parser_LexFStar.ml.depends
if I revert your split into five tokens and use just a single one
> How did you diagnose that this particular regexp was the source of the hanging problem?
It was mostly just trial and error changing things until the code compiled :)
I am not familiar with sedlex internals so I don't know the exact reason for this problem. When I have some free time, I will try to create a minimal repro and ask the sedlex maintainers.
I opened an issue in the sedlex repository here: ocaml-community/sedlex#97
thank you! This is very helpful
| op_token_2
| op_token_3
| op_token_4
| op_token_5 -> L.lexeme lexbuf |> Hashtbl.find operators
Using a single large regexp is less efficient than the disjunction of five smaller ones?
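For context on the action in the snippet above: whichever of the split regexps matches, the lexeme is looked up in one shared table. A rough sketch of that pattern in plain OCaml follows. The `token` type and the table contents are hypothetical stand-ins, not F*'s actual definitions. Note that `Hashtbl.find` raises `Not_found` for a missing key, so the lookup in the diff presumably relies on every string the regexps accept being present in `operators`.

```ocaml
(* Hypothetical token type; F*'s real tokens are defined by its parser. *)
type token = PLUS | MINUS | ARROW

(* Table from operator lexemes to tokens, filled once at start-up. *)
let operators : (string, token) Hashtbl.t = Hashtbl.create 8

let () =
  List.iter
    (fun (lexeme, tok) -> Hashtbl.add operators lexeme tok)
    [ ("+", PLUS); ("-", MINUS); ("->", ARROW) ]

(* Hashtbl.find_opt returns None instead of raising Not_found. *)
let token_of_lexeme (lexeme : string) : token option =
  Hashtbl.find_opt operators lexeme
```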
Closes #1792
Based off of @fangyi-zhou's branch: https://github.com/fangyi-zhou/FStar/tree/sedlex
Moving to sedlex removes the transitive camlp4 dependency and allows F* to compile on recent OCaml versions (I tested on 4.11).
Most of the changes were straightforward syntax differences between ulex and sedlex, but there are a few noteworthy points:
- Sedlex doesn't provide an equivalent `Utf8` module, so I just copied in ulex's `utf8.ml` as `FStar_Parser_Utf8.ml` (license embedded in file). It could probably be rewritten differently, but I wanted to avoid changes in logic outside of the lexer.
- Compilation of `FStar_Parser_LexFStar.ml` hangs if `op_token` is represented as one regexp. I believe this is due to its large size. Splitting `op_token` up into 5 regexp parts appears to fix the problem. Possibly related to ocaml-community/sedlex#34 ("Compiler takes forever when compiling this code").
- An underscore in sedlex works differently than ulex/ocamllex as it can match an empty string (see ocaml-community/sedlex#51, "match default (_) doesn't capture anything"). It is important to use the `any` keyword instead to match a single character. I have done this in the `string` and `comment` lexer rules.
- Sedlex rules require a catch-all at the end of the rule. Rules that didn't have this case now include `| _ -> assert false`. I'm fairly certain that this case is impossible for the `string` and `comment` rules, but I am unsure about `one_line_comment` and `symbolchar_parser`. Is keeping the `assert false` fine or should it raise a different exception?

I ran `make -C tests/micro-benchmarks` and `make -C ulib -j6` successfully with the sedlex lexer. `make bench` has the following results:

This PR - sedlex, OCaml 4.11:
Master - ulex, OCaml 4.09:
If I'm understanding these results correctly, the F* compiler is slightly slower with sedlex but uses less memory. The difference is small so it probably has more to do with my computer and the different OCaml compiler version.
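On the catch-all question raised above, one possible alternative to `assert false` is a dedicated exception that records which rule fell through and where. The sketch below is plain OCaml with no sedlex dependency. The exception name, the `step` function, and the rule name `demo_rule` are all hypothetical; `step` only stands in for a rule body whose last case corresponds to the sedlex catch-all.

```ocaml
(* Hypothetical exception: lexer rule name and input offset. Not part of F*. *)
exception Unexpected_lexer_state of string * int

(* Stand-in for a rule body over an optional next character. The last case
   plays the role of the sedlex catch-all that is believed unreachable. *)
let step (pos : int) (next : char option) =
  match next with
  | Some '\n' -> `End_of_line
  | Some _ -> `Continue
  | None -> raise (Unexpected_lexer_state ("demo_rule", pos))

(* Make an uncaught occurrence print a readable message. *)
let () =
  Printexc.register_printer (function
    | Unexpected_lexer_state (rule, pos) ->
        Some (Printf.sprintf "lexer rule %s: unexpected input at offset %d"
                rule pos)
    | _ -> None)
```

Compared with `assert false`, which reports only a source location in the lexer, an exception of this shape can be caught by the caller and turned into an ordinary error message if the case turns out to be reachable.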
Submission containing materials of a third party: