Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use sedlex instead of ulex #2203

Merged
merged 7 commits into from
Mar 11, 2021
Merged

Use sedlex instead of ulex #2203

merged 7 commits into from
Mar 11, 2021

Conversation

mnxn
Copy link
Contributor

@mnxn mnxn commented Jan 4, 2021

Closes #1792

Based off of @fangyi-zhou's branch: https://github.com/fangyi-zhou/FStar/tree/sedlex

Moving to sedlex removes the transitive camlp4 dependency and allows F* to compile on recent OCaml versions (I tested on 4.11).

Most of the changes were straightforward syntax differences between ulex and sedlex, but there are a few noteworthy points:

  • Sedlex doesn't provide an equivalent Utf8 module, so I just copied in ulex's utf8.ml as FStar_Parser_Utf8.ml (license embedded in file). It could probably be rewritten differently, but I wanted to avoid changes in logic outside of the lexer.

  • Compilation of FStar_Parser_LexFStar.ml hangs if op_token is represented as one regexp. I believe this is due to its large size. Splitting op_token up into 5 regexp parts appears to fix the problem. Possibly related to Compiler takes forever when compiling this code ocaml-community/sedlex#34

  • An underscore in sedlex works differently than ulex/ocamllex as it can match an empty string (see match default (_) doesn't capture anything ocaml-community/sedlex#51). It is important to use the any keyword instead to match a single character. I have done this in the string and comment lexer rules.

  • Sedlex rules require a catch-all at the end of the rule. Rules that didn't have this case now include | _ -> assert false. I'm fairly certain that this case is impossible for the string and comment rules, but I am unsure about one_line_comment and symbolchar_parser. Is keeping the assert false fine or should it raise a different exception?

I ran make -C tests/micro-benchmarks and make -C ulib -j6 successfully with the sedlex lexer. make bench has the following results:

This PR - sedlex, OCaml 4.11:

N_benches total user system mem(mb)
micro-benchmarks 229 57.6017 37.341 15.7662 23521214
ulib 288 1833.95 1764.36 48.22 73497874
ocaml_extract 95 140.748 125.184 13.6787 18332827

Master - ulex, OCaml 4.09:

N_benches total user system mem(mb)
micro-benchmarks 229 60.1398 39.9449 15.6857 23607410
ulib 288 1799.14 1714.82 43.023 74029871
ocaml_extract 95 135.334 121.343 12.6106 18295119

If I'm understanding these results correctly, the F* compiler is slightly slower with sedlex but uses less memory. The difference is small so it probably has more to do with my computer and the different OCaml compiler version.

Submission containing materials of a third party:

fangyi-zhou and others added 6 commits January 3, 2021 22:07
Sedlex uses a different module name with a slightly different API.
To minimize changes, the Utf8 module from ulex is vendored as
FStar_Parser_Utf8 for use in the FStar_Sedlexing module.
Compilation of FStar_Parser_LexFStar.ml hangs if op_token is represented
as one regexp. Splitting op_token up into 5 regexp parts appears to fix
the problem.

Possibly related to: ocaml-community/sedlex#34
Unlike ulex and ocamllex, an underscore in sedlex can match an empty
string. To only match single characters, the keyword "any" must be used
instead.

See ocaml-community/sedlex#51
@ghost
Copy link

ghost commented Jan 4, 2021

CLA assistant check
All CLA requirements met.

@nikswamy
Copy link
Collaborator

This is awesome! Thank you! Sorry for not noticing it earlier, but will try to merge it soon.

let ignored_op_char = [%sedlex.regexp? Chars ".$"]

(* op_token must be splt into seperate regular expressions to prevent
compliation from hanging *)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's interesting ... can you comment about this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you diagnose that this particular regexp was the source of the hanging problem?

Copy link
Collaborator

@nikswamy nikswamy Feb 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I indeed see it hanging on

00:00:58 88 (87 ) ...parser/ml/FStar_Parser_LexFStar.ml.depends O-b-----
00:00:58 88 (87 ) ...parser/ml/FStar_Parser_LexFStar.ml.depends O-b-----
00:00:59 88 (87 ) ...parser/ml/FStar_Parser_LexFStar.ml.depends

if I revert your split into five tokens and use just a single one

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you diagnose that this particular regexp was the source of the hanging problem?

It was mostly just trial and error changing things until the code compiled :)

I am not familiar with sedlex internals so I don't know the exact reason for this problem. When I have some free time, I will try to create a minimal repro and ask the sedlex maintainers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened an issue in the sedlex repository here: ocaml-community/sedlex#97

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you! This is very helpful

| op_token_2
| op_token_3
| op_token_4
| op_token_5 -> L.lexeme lexbuf |> Hashtbl.find operators
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a single large regexp is less efficient than the disjunction of five smaller ones?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Migrate from ulex to sedlex
3 participants