Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"The parser" #13

Open
8 tasks
nikomatsakis opened this issue Jan 18, 2018 · 12 comments
Open
8 tasks

"The parser" #13

nikomatsakis opened this issue Jan 18, 2018 · 12 comments
Labels
A-AST Area: abstract syntax tree A-lexing Area: lexical analysis A-parser Area: parsing C-enhancement Category: enhancement E-hard Difficulty: might require advanced knowledge E-help-wanted Call for participation: extra help is wanted T-compiler Relevant to compiler team

Comments

@nikomatsakis
Copy link
Contributor

nikomatsakis commented Jan 18, 2018

It'd be worth covering

  • where the lexer is
  • how the parser works
  • conventions around error recovery
  • - What is this eat function?
    • It consumes a token, and queues up the next token as lookahead
  • How do you handle identifiers and keywords?
  • What about constructing spans?
  • What is this crazy pretty printer?
    • It dumps out the AST in human readable form.
    • We have tests that will parse each file in run-pass, pretty print it, and then test that the resulting pretty printed file type checks, so you have to implement this.
  • Something like this: "The parser" #13 (comment) and "The parser" #13 (comment)
@nikomatsakis nikomatsakis mentioned this issue Jan 18, 2018
20 tasks
@Michael-F-Bryan
Copy link
Contributor

I'd be interested in helping out with the parser section. I haven't made any contributions to the Rust parser specifically, but I've played around with syn and proc macros and also done a fair amount of parsing for other projects (toy projects, custom file formats at work, etc).

Do you know if there's anyone with experience in the parser that I'd be able to bounce questions off?

@nikomatsakis
Copy link
Contributor Author

@Michael-F-Bryan you can certainly bounce questions off of me; I would say the main owner of the parser is @petrochenkov

@nikomatsakis
Copy link
Contributor Author

Copying some stuff I wrote in #21 that seems relevant:

What things would someone need to know when either working on the parser or trying to use the parser

Hmm, good question. I imagine it'd be good to look at a few PRs that modified the parser to see what they had to do. Actually, it might be worth just including a link to those PRs in the guide to help people get oriented.

rust-lang/rust#45047 is such a PR, and here is a link to some of the relevant content.

Some things immediately jump out as worth explaining:

  • What is this eat function?
    • It consumes a token, and queues up the next token as lookahead
  • How do you handle identifiers and keywords?
    • Actually, this PR didn't have to do anything around that. I'd have to find a good example (cc @petrochenkov -- thoughts?)
  • What about constructing spans?
  • What is this crazy pretty printer?
    • It dumps out the AST in human readable form.
    • We have tests that will parse each file in run-pass, pretty print it, and then test that the resulting pretty printed file type checks, so you have to implement this.

@Michael-F-Bryan
Copy link
Contributor

I'm surprised how familiar a lot of this is! Quite a lot of parsers for toy languages I've written in Rust are effectively a state machine which incrementally consumes tokens (eat()). Your parser keeps track of where it's at (last_span) and then uses that to create a Span so you can figure out where an AST node originally came from. The "union" operation for joining spans is then a logical extension of this for when you compose larger nodes (e.g. a function body) from a bunch of smaller ones (each statement in the function).

I'm interested to find out how libsyntax deals with errors and tries to typecheck bits of your program even if you've got a syntax error elsewhere. My understanding is you break the tokens up into token trees, allowing you to "checkpoint" your progress at known points. Then if one particular token tree fails you still try to pass the rest of the (incomplete) AST to later stages... Is that somewhat close to how rustc does things?

@nikomatsakis
Copy link
Contributor Author

@Michael-F-Bryan

I'm surprised how familiar a lot of this is!

Yeah, it's your basic recursive descent parser. Nothing too fancy. =)

I'm interested to find out how libsyntax deals with errors and tries to typecheck bits of your program even if you've got a syntax error elsewhere.

This is somewhat ad-hoc, but actually one of the things I would like documented. @nrc is the one who put in a lot of that, and they may be able to comment a bit, though I think @estebank has also put in some effort here. One particular pattern I recall that @estebank has used a few times is that they "checkpoint" the state of the parser and run "tentative" parses that are then discarded -- this is used to give better error messages, basically saying something like "it looks you were writing a fn here, you need to add this keyword".

In short, though, we make a 'best effort' to recover and build up some sort of tree. We wind up reporting an error but then passing this tree back up to the rest of the code, which continues as usual. I'm not sure if there are any kind of "error nodes" in the AST, I didn't see any in a quick search, but I would expect that to appear -- we certainly use that later on. e.g. the HIR (constructed from the AST, not directly from the parser) has a node TyErr that represents an erroneous type.

@Michael-F-Bryan
Copy link
Contributor

I'm moving across a conversation I was having with @mark-i-m about how much we want to explain parser basics.

Should the parsing chapter go into more detail about lexing and parsing or rely on external sources for the basics?

I was asking myself this exact question when I wrote the start of the parser chapter. Should we add a small note up the top saying we assume people know how a basic recursive descent parser works and what tokenizing/lexical analysis is? The idea being this is a book about rustc internals, not a book an introduction to parsing.

There is already loads of good quality material on basic parsers on the internet, a couple paragraphs at the top of the chapter probably wouldn't be able to do it justice.

I agree that we shouldn't try to teach parsing here, but given that I don't expect most people to know basic parsing, I worry that it would discourage contributions... Perhaps we can

  • Add a high level overview of the algorithm and point to a few solid resources for learning in detail
  • Give some key term definitions
  • Tie them all back to the code

What do you think?


Sounds like a good idea. I was thinking we could say something like this:

Rust syntax is specified by a grammar (link) which is essentially a list of rules where each rule specifies how a piece of the language is written (e.g. a crate contains multiple items, an item may be a function declaration, a function has a bunch of statements, a statement is a ...), with each rule being written in terms of other rules or terminals (the base case, typically tokens).

Generally speaking, for each grammar rule there will be one parser method. In this way we can translate a token stream into an AST by recursively calling the appropriate method.

For people who already know how parsers work it's essentially recursive descent 101, but something like that is probably complete enough to give people the general idea, but short enough to not be off topic.

You could then tie all of this back to rustc by inspecting a sample code snippet (e.g. an if statement) and then showing what would be called when parsing it. This probably also shows how you deal with ParseSess and CodeMaps.

@mark-i-m
Copy link
Member

Copying over Niko's comment from #26 :

I agree that we shouldn't try to teach parsing here, but given that I don't expect most people to know basic parsing, I worry that it would discourage contributions... Perhaps we can

I think I agree with both of you. I don't think we want a lot of introductory material; a few links don't hurt, but not too much. But I think there's a third way, though it may take some iteration to get there: To some extent, I think you can serve both audiences by doing a kind of "walk through" of the code.

In other words, e.g. to explain tokenizing, we might point to the token data structure and give some source showing how it would be divided into tokens (we can always link to wikipedia or something too). This way, if you know what a token is, you learn about the Rust-specific parts of it. If you don't know what a token is, you can just understand it as this Rsut data structure and later learn about the more general form.

Similarly I imagine we can say something like "Rust has a recursive-descent parser" (where we link to wikipedia) and then walk through how it would parse some small example, showing a few key functions (eg., the one that parses a type). If you're not familiar with recursive descent, this will basically give you the idea, but if you are, then you'll learn about the names of key concepts in the code.

@mark-i-m mark-i-m added the E-help-wanted Call for participation: extra help is wanted label Feb 2, 2018
@mark-i-m mark-i-m added the E-hard Difficulty: might require advanced knowledge label Mar 16, 2018
@mark-i-m mark-i-m added the E-medium Difficulty: might require some prior knowledge or code reading label Jul 8, 2018
@mark-i-m
Copy link
Member

Triaged: not a whole lot of change. There was some discussion earlier with @chrissimpkins, @matklad and @Centril (who is taking some time off atm). Perhaps Chris can give an update?

@chrissimpkins
Copy link
Member

chrissimpkins commented May 19, 2020

Triaged: not a whole lot of change. There was some discussion earlier with @chrissimpkins, @matklad and @Centril (who is taking some time off atm). Perhaps Chris can give an update?

Definitely still planned. I am digging into the parser source and just submitted my first parser-area PR this week. :)

Once I understand the source well enough I will get started on the chapter if someone else does not get to it first. The conversations that I held with Aleksey and Mazdak are in our Zulip channel for anyone interested in this information. They informed the rustc-dev-guide Overview section and there should be plenty of information in there to get a start on the main lexer/parser chapter.

@chrissimpkins
Copy link
Member

chrissimpkins commented May 20, 2020

It looks like the conversation with Aleksey was an impromptu discussion on Apr 7, 2020 that occurred in a private thread. I copied the text below.

Details

Hi Aleksey! Our WG is working on a rustc Overview document for the rustc-dev-guide and I am interested in updating the lexer documentation. Do you happen to have some time in the next week that we could use to get together for 10 mins or so to discuss the lexer?

matklad12:33 PM
Yup!

12:33 PM
I actually have time right now, if that's convenient

Chris Simpkins12:34 PM
Sure, that would be great!

12:34 PM
and thanks!

12:35 PM
One sec and I will link in where we are with the Overview draft. This will be a high level summary that lives at the beginning of the Guide. I'd like to dive into more details on the lexer (and parser, I met with @Centril last week to discuss parsing) and update the lexing/parsing chapter too

12:36 PM
This is the current state of my draft of the lexer/parser information: #671

12:36 PM
Sorry wrong PR. #669 (#669)

12:38 PM
I'd like to understand the two levels of the lexer, a bit about the error handling that happens at the lexer stage, and where the best source entry points are for the documentation.

matklad12:39 PM
Sure!

12:39 PM
So, the main confusing point I imagine is "WTF Y 2 lexers?"

12:40 PM
The rustc_lexer crate is a relatively new addition

12:40 PM
the lexer/mod historically was the single lexer

12:41 PM
the problem with lexer/mod is that it did two things simultaneously

Chris Simpkins12:41 PM
There are comments about the second stage lexer leading to a "richer" token stream. Can you expand on what that means, relevance to the parser?

matklad12:42 PM
breaking the input stream into chunks (like, chars from 10 to 200 are a string literal)
12:42 PM
turning string chunks into "lowered" tokens. Like turning a "100u32" string into a 100 number
12:43 PM
This second transformation turned out to be problmenatic, as it erased some information from the original source

12:44 PM
(like, 100 and 1_00 are indistinguishable after it)

12:44 PM
And we want to preseve full-fidelity info for both IDEs and proc macros

12:45 PM
So, rustc_lexer was extracted to deal solely with the task of breaking a string into tokens. This crate is now shared between rust-analyzer and rustc

12:45 PM
The lexer/mod is still there, mostly to maintain the old "rich tokens" interface and avoid rewriting the whole parser

12:46 PM
But, in the ideal world, we get rid of lexer/mod and make the parser work directly with the tokens produced by rustc_lexer.

Chris Simpkins12:46 PM
Got it. That history is very helpful!

12:47 PM
Is there any work in the direction of deprecating that source, desire to point those who might wander across the docs to get involved in work on making a transition like that happen?

matklad12:50 PM
Kinda, but it's not really well mentored :(

12:50 PM
In particular, rust-lang/rust#63689 is part of the long-term vision

Chris Simpkins12:50 PM
OK, np. Just wanted to see if we can help to push hands in that direction through docs if it is desirable

12:52 PM
As I understand the call path issue, rustc_interface calls into the parsing code and the instantiation of the Parser struct is where the lexer is instantiated and StringReader emits tokens. Is that a fair assessment/understanding?

12:52 PM
of the connection between lexer / parser and the rustc executable

matklad12:52 PM
I think that has been restructured since I've last looked at that code

Chris Simpkins12:53 PM
Is there a good lexer entry point that we can use for the documentation, and are there more than one that are relevant to document?

matklad12:54 PM
for rustc_lexer, the whole interface is more-or-less this function: https://github.com/rust-lang/rust/blob/39b62533c7f9d0581a6ea9b9fc2cc51f21c3b5b0/src/librustc_lexer/src/lib.rs#L246

Chris Simpkins12:54 PM
Perfect! that is very helpful :slight_smile:

12:55 PM
and my last question, anything special about error handling that you feel that we should mention in the Overview? I will work on the Guide chapter later and we can go into more detail. High level documentation needs for the lexer?

matklad12:57 PM
Not really. Maybe just that rustc_lexer tries to have a small interface (so that it's easier to share with rust-analyzer), so it doens't depend directly on the diagnostic infra in rustc, and instead just provides diagnostics as plain-old-data, which are then emmited in lexer/mod as real diagnostics

Chris Simpkins12:59 PM
This has been incredibly helpful! I really appreciate your time. Thank you very much.

david0u0 pushed a commit to david0u0/rustc-guide that referenced this issue Aug 3, 2020
@jieyouxu jieyouxu added A-parser Area: parsing T-compiler Relevant to compiler team A-lexing Area: lexical analysis A-AST Area: abstract syntax tree C-enhancement Category: enhancement and removed E-medium Difficulty: might require some prior knowledge or code reading labels Nov 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-AST Area: abstract syntax tree A-lexing Area: lexical analysis A-parser Area: parsing C-enhancement Category: enhancement E-hard Difficulty: might require advanced knowledge E-help-wanted Call for participation: extra help is wanted T-compiler Relevant to compiler team
Projects
None yet
Development

No branches or pull requests

5 participants