Sublexers using callbacks in documentation #61
Comments
This does not handle bumping the top-level lexer, but that is easy to do by tracking the farthest end of the current sublexer token.
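One way to read the comment above, as a minimal hand-written sketch in plain Rust rather than Logos: the top-level lexer hands the remainder of the input to a sublexer, the sublexer reports the farthest byte offset it consumed, and the top-level position is bumped by that amount. The names `Outer`, `sublex`, and `lex`, as well as the escape rule (a backslash plus one character), are assumptions made for illustration and are not part of Logos.

```rust
#[derive(Debug, PartialEq)]
enum Outer<'s> {
    Word(&'s str),
    // A whole string literal, carrying the pieces found by the sublexer.
    Str(Vec<&'s str>),
}

/// Hypothetical sublexer: splits the body of a string into pieces and returns
/// them together with the farthest byte offset consumed (closing quote included).
fn sublex(rest: &str) -> (Vec<&str>, usize) {
    let mut pieces = Vec::new();
    let mut end = 0; // farthest end of any sublexer token so far
    while end < rest.len() {
        let tail = &rest[end..];
        let len = if tail.starts_with('"') {
            return (pieces, end + 1); // the closing quote ends the sublexer
        } else if tail.starts_with('\\') {
            // Escape: the backslash plus one following character, if any.
            1 + tail[1..].chars().next().map_or(0, char::len_utf8)
        } else {
            // Plain text: everything up to the next backslash or quote.
            tail.find(|c: char| c == '\\' || c == '"').unwrap_or(tail.len())
        };
        pieces.push(&tail[..len]);
        end += len;
    }
    (pieces, end) // unterminated string: everything was consumed
}

/// Hypothetical top-level lexer that dispatches to `sublex` on an opening quote.
fn lex(input: &str) -> Vec<Outer<'_>> {
    let mut tokens = Vec::new();
    let mut pos = 0;
    while pos < input.len() {
        let rest = &input[pos..];
        let first = rest.chars().next().unwrap();
        if first == '"' {
            // "Callback": run the sublexer on the remainder, then bump the
            // top-level position past everything the sublexer consumed.
            let (pieces, consumed) = sublex(&rest[1..]);
            tokens.push(Outer::Str(pieces));
            pos += 1 + consumed;
        } else if first.is_whitespace() {
            pos += first.len_utf8(); // skip trivia
        } else {
            let len = rest
                .find(|c: char| c.is_whitespace() || c == '"')
                .unwrap_or(rest.len());
            tokens.push(Outer::Word(&rest[..len]));
            pos += len;
        }
    }
    tokens
}
```

In Logos itself the same bookkeeping would live in a token callback; recent Logos releases expose `Lexer::remainder` and `Lexer::bump` for this, though whether they fit the construct discussed in this thread is an assumption.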
I think it might be reasonable to add some API support to make this less hacky.
Yes, for sure. I was thinking about hacking
into Logos, so that Intricate would fire the ComplexToken lexer, but lately I have been short on time to get familiar enough with the codebase. This would require some substantial changes, since for now Token variants are supposed to be strictly without fields. And from a cursory look it is not clear to me which solution is best. Should we have sublexers

Once my lexer settles down a bit I will compile a list of the things I needed hacks for; perhaps that could give us some useful pointers. In terms of complexity and quirks I believe it is testing Logos quite extensively, as indicated by the number of bugs smoked out ;-)
I was thinking something along the lines of:
I don't have a use case for this, so I'm not sure if the sublexer should consume the lexer (so that if you want to go back to parsing higher level tokens you
Another possibility I was considering could layer on top of Logos if it provides something like

The rest of this post will now be a few flow-of-consciousness style thoughts on what Logos could do to support lexer modes / sublexers.

When talking about lexer modes, I always have to link "When Are Lexer Modes Useful". It discusses lexer modes and gives some very clear examples of when they're really useful. This author is actually the one that convinced me that modal pull-style lexers are still worth using rather than a fully lexerless style.
Modal lexers are great for one key language feature that's growing in popularity: language composition. To use my own language as an example, string interpolation with

The solution to this is to lex this as

So how might Logos support lexer modes? The properties we'd like to have are:
Note that a stack of modes is not required; this should instead be handled by your parser's stack, and the parser should always know what mode to query from the lexer.

So what possibilities are there for representing modes? (Based purely on API, ignoring implementation.) All examples use the simple grammar of "strings" (fake specification grammar):
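Since the grammar is only described informally here, the following is a minimal hand-written sketch (plain Rust, not Logos) of a lexer for such strings in which the caller passes the mode on every call, so the lexer itself keeps no mode stack. The `ModalLexer` type, the `lex_string` driver standing in for a parser, and the exact escape rules are assumptions for illustration; the token names mirror the enums used in the examples in this thread.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Token {
    End,
    Error,
    // Mode 0: outside a string
    StartString,
    WhiteSpace,
    // Mode 1: inside a string
    Text,
    EscapedNewline,
    EscapedCodepoint,
    EscapedQuote,
    EndString,
}

struct ModalLexer<'s> {
    source: &'s str,
    pos: usize, // byte offset just past the current token
    token: Token,
}

impl<'s> ModalLexer<'s> {
    fn new(source: &'s str) -> Self {
        // Nothing has been lexed yet, so the current token is Error.
        ModalLexer { source, pos: 0, token: Token::Error }
    }

    /// Lex one token using the rules of `mode`; the caller picks the mode.
    fn advance_mode(&mut self, mode: u8) {
        let rest = &self.source[self.pos..];
        let (token, len) = if mode == 0 { lex_outer(rest) } else { lex_inner(rest) };
        self.token = token;
        self.pos += len;
    }
}

fn lex_outer(rest: &str) -> (Token, usize) {
    match rest.chars().next() {
        None => (Token::End, 0),
        Some('"') => (Token::StartString, 1),
        Some(c) if c.is_whitespace() => (Token::WhiteSpace, c.len_utf8()),
        Some(c) => (Token::Error, c.len_utf8()),
    }
}

fn lex_inner(rest: &str) -> (Token, usize) {
    if rest.is_empty() {
        (Token::End, 0)
    } else if rest.starts_with("\\n") {
        (Token::EscapedNewline, 2)
    } else if rest.starts_with("\\\"") {
        (Token::EscapedQuote, 2)
    } else if rest.starts_with("\\u{") {
        match rest.find('}') {
            Some(close) => (Token::EscapedCodepoint, close + 1),
            None => (Token::Error, rest.len()),
        }
    } else if rest.starts_with('"') {
        (Token::EndString, 1)
    } else if rest.starts_with('\\') {
        (Token::Error, 1) // unknown escape
    } else {
        // Text: everything up to the next backslash or quote.
        let len = rest.find(|c: char| c == '\\' || c == '"').unwrap_or(rest.len());
        (Token::Text, len)
    }
}

/// Stand-in for a parser: it alone decides which mode to request next.
fn lex_string(input: &str) -> Vec<Token> {
    let mut lexer = ModalLexer::new(input);
    let mut tokens = Vec::new();
    let mut mode = 0;
    loop {
        lexer.advance_mode(mode);
        tokens.push(lexer.token);
        match lexer.token {
            Token::StartString => mode = 1,
            Token::EndString => mode = 0,
            Token::End | Token::Error => break,
            _ => {}
        }
    }
    tokens
}
```

The point of the sketch is the shape of `advance_mode`: the mode is an argument, not lexer state, so nesting is tracked entirely by whoever drives the lexer.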
Single token enum

The simplest form, and how internal lexer modes (like the ones ANTLR provides) are handled. Gives [1], [2], [4], [5].

enum Token {
    // Shared between modes; these must be the only variants before a #[mode] marker, if any is present
    #[end]
    End,
    #[error]
    Error,

    #[mode = "0"]
    #[token = "\""]
    StartString,
    #[regex = r"\p{White_Space}"]
    WhiteSpace,

    #[mode = "1"]
    #[regex = r#"[^\\"]+"#]
    Text,
    #[token = r"\n"]
    EscapedNewline,
    #[regex = r"\\u\{[^}]*\}"]
    EscapedCodepoint,
    #[token = r#"\""#]
    EscapedQuote,
    #[token = "\""]
    EndString,
}

let s = r#""Hello W\u{00f4}rld\n""#;
let mut lexer = Token::lexer(s);
// we start without lexing anything
assert_eq!(lexer.token, Token::Error);
assert!(ptr::eq(lexer.slice(), &s[0..0])); // Note: ptr::eq checks fat ptr metadata
lexer.advance_mode(0); // Lexer::advance(&mut self) => self.advance_mode(0)
                       // (convenience for modeless lexers, with modeless => mode=0)
assert_eq!(lexer.token, Token::StartString);
// We've entered a string, parser starts requesting mode=1
lexer.advance_mode(1);
assert_eq!(lexer.token, Token::Text);
lexer.advance_mode(1);
assert_eq!(lexer.token, Token::EscapedCodepoint);
lexer.advance_mode(1);
assert_eq!(lexer.token, Token::Text);
lexer.advance_mode(1);
assert_eq!(lexer.token, Token::EscapedNewline);
lexer.advance_mode(1);
assert_eq!(lexer.token, Token::EndString);
// We've exited the string, parser starts requesting mode=0
lexer.advance_mode(0);
assert_eq!(lexer.token, Token::End);

Sublexer borrows main lexer, updates it in Drop

Gives [1], [3], also [2], [4] if careful. Enforces stacking of modes. Can't be stored in a (safe) pull iterator; would require self borrowing. (Works fine for a push parser doing the whole thing in one pass.) Advancing the base lexer before creating the sublexer incorrectly skips trivia in the base lexer and lexes the start of the subgrammar in the base lexer. Can be fixed by creating the sublexer after the current base lexer token and updating the base lexer to after the current sublexer token in Drop.

enum Outer {
    #[end]
    End,
    #[error]
    Error,
    #[token = "\""]
    StartString,
    #[regex = r"\p{White_Space}"]
    WhiteSpace,
}

enum Inner {
    #[end]
    End,
    #[error]
    Error,
    #[regex = r#"[^\\"]+"#]
    Text,
    #[token = r"\n"]
    EscapedNewline,
    #[regex = r"\\u\{[^}]*\}"]
    EscapedCodepoint,
    #[token = r#"\""#]
    EscapedQuote,
    #[token = "\""]
    EndString,
}

let s = r#""Hello W\u{00f4}rld\n""#;
let mut outer = Outer::lexer(s);
// we start as today, having lexed the first token
assert_eq!(outer.token, Outer::StartString);
// We've entered a string, parser creates sublexer
let mut inner = Inner::sublexer(&mut outer);
assert_eq!(inner.token, Inner::Text);
inner.advance();
assert_eq!(inner.token, Inner::EscapedCodepoint);
inner.advance();
assert_eq!(inner.token, Inner::Text);
inner.advance();
assert_eq!(inner.token, Inner::EscapedNewline);
inner.advance();
assert_eq!(inner.token, Inner::EndString);
// We've exited the string, parser returns to outer lexer
drop(inner);
assert_eq!(outer.token, Outer::End);

Lexer can change its token type

Gives [1], [3], [5], and with care, [2], [4]. My personal favorite. Rough implementation sketch:

impl<'s, T, S> Lexer<T, S>
where
    T: Logos,
    S: Source<'s>,
{
    // disclaimer: untested, pulled from out of nowhere
    // Takes by-value as a lint; could be by-ref
    pub fn morph<T2: Logos>(self) -> Lexer<T2, S> {
        Lexer {
            source: self.source,
            token: T2::ERROR,
            extras: Default::default(),
            token_start: self.token_end,
            token_end: self.token_end,
        }
    }

    pub fn advance_as<T2: Logos>(self) -> Lexer<T2, S> {
        let mut lex = self.morph();
        lex.advance();
        lex
    }
}

Example:

enum Outer {
    #[end]
    End,
    #[error]
    Error,
    #[token = "\""]
    StartString,
    #[regex = r"\p{White_Space}"]
    WhiteSpace,
}

enum Inner {
    #[end]
    End,
    #[error]
    Error,
    #[regex = r#"[^\\"]+"#]
    Text,
    #[token = r"\n"]
    EscapedNewline,
    #[regex = r"\\u\{[^}]*\}"]
    EscapedCodepoint,
    #[token = r#"\""#]
    EscapedQuote,
    #[token = "\""]
    EndString,
}

let s = r#""Hello W\u{00f4}rld\n""#;
let mut outer = Outer::lexer(s);
// we start as today, having lexed the first token
assert_eq!(outer.token, Outer::StartString);
// We've entered a string, parser morphs the lexer into the inner token type
let mut inner = outer.advance_as::<Inner>();
assert_eq!(inner.token, Inner::Text);
inner.advance();
assert_eq!(inner.token, Inner::EscapedCodepoint);
inner.advance();
assert_eq!(inner.token, Inner::Text);
inner.advance();
assert_eq!(inner.token, Inner::EscapedNewline);
inner.advance();
assert_eq!(inner.token, Inner::EndString);
// We've exited the string, parser returns to outer lexer
outer = inner.advance_as();
assert_eq!(outer.token, Outer::End);

I think the third option meets all the desired requirements (except for requiring a
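To check that the type-changing approach hangs together outside of Logos, here is a small self-contained sketch. The `Tok` trait is a hypothetical stand-in for `Logos` (one hand-written `lex` function per token type instead of a derive), the escape set is reduced to `\n` and `\"`, and `demo` is an illustrative driver; only `morph` and `advance_as` follow the sketch above.

```rust
/// Hypothetical stand-in for the `Logos` trait: lex one token from the start
/// of `rest`, returning the token and its length in bytes.
trait Tok: Sized + Copy {
    const ERROR: Self;
    fn lex(rest: &str) -> (Self, usize);
}

struct Lexer<'s, T: Tok> {
    source: &'s str,
    token: T,
    token_start: usize,
    token_end: usize,
}

impl<'s, T: Tok> Lexer<'s, T> {
    /// Creates the lexer and lexes the first token.
    fn new(source: &'s str) -> Self {
        let mut lex = Lexer { source, token: T::ERROR, token_start: 0, token_end: 0 };
        lex.advance();
        lex
    }

    fn advance(&mut self) {
        let (token, len) = T::lex(&self.source[self.token_end..]);
        self.token = token;
        self.token_start = self.token_end;
        self.token_end += len;
    }

    fn slice(&self) -> &'s str {
        &self.source[self.token_start..self.token_end]
    }

    /// Same source and position, different token type; no token lexed yet.
    fn morph<T2: Tok>(self) -> Lexer<'s, T2> {
        Lexer {
            source: self.source,
            token: T2::ERROR,
            token_start: self.token_end,
            token_end: self.token_end,
        }
    }

    /// Morphs and immediately lexes the next token as `T2`.
    fn advance_as<T2: Tok>(self) -> Lexer<'s, T2> {
        let mut lex = self.morph::<T2>();
        lex.advance();
        lex
    }
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Outer { End, Error, StartString, WhiteSpace }

#[derive(Debug, Clone, Copy, PartialEq)]
enum Inner { End, Error, Text, EscapedNewline, EscapedQuote, EndString }

impl Tok for Outer {
    const ERROR: Self = Outer::Error;
    fn lex(rest: &str) -> (Self, usize) {
        match rest.chars().next() {
            None => (Outer::End, 0),
            Some('"') => (Outer::StartString, 1),
            Some(c) if c.is_whitespace() => (Outer::WhiteSpace, c.len_utf8()),
            Some(c) => (Outer::Error, c.len_utf8()),
        }
    }
}

impl Tok for Inner {
    const ERROR: Self = Inner::Error;
    fn lex(rest: &str) -> (Self, usize) {
        if rest.is_empty() { (Inner::End, 0) }
        else if rest.starts_with("\\n") { (Inner::EscapedNewline, 2) }
        else if rest.starts_with("\\\"") { (Inner::EscapedQuote, 2) }
        else if rest.starts_with('"') { (Inner::EndString, 1) }
        else if rest.starts_with('\\') { (Inner::Error, 1) }
        else {
            let len = rest.find(|c: char| c == '\\' || c == '"').unwrap_or(rest.len());
            (Inner::Text, len)
        }
    }
}

/// Illustrative driver: first outer token, the string body, next outer token.
fn demo(input: &str) -> (Outer, Vec<(Inner, &str)>, Outer) {
    let outer = Lexer::<Outer>::new(input);
    let first = outer.token;
    let mut inner = outer.advance_as::<Inner>();
    let mut body = Vec::new();
    while !matches!(inner.token, Inner::EndString | Inner::End | Inner::Error) {
        body.push((inner.token, inner.slice()));
        inner.advance();
    }
    let outer = inner.advance_as::<Outer>();
    (first, body, outer.token)
}
```

Because `morph` takes `self` by value, the outer lexer cannot be advanced by accident while the inner one is live, which is the property the by-value signature in the sketch above is aiming for.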
In the end I am using the following construct for dispatching to a sublexer.
If you think it is worth it, we might want to put a short writeup about this in the docs.