Sublexers using callbacks in documentation #61

Closed
mikolajpp opened this issue Dec 15, 2018 · 5 comments · Fixed by #319

@mikolajpp
Contributor

In the end I am using the following construct for dispatching a sublexer.

fn sublexer<'source, Src: Source<'source>, Subtok>(skip: usize, lex: &mut Lexer<Token, Src>)
    -> Vec<Subtok>
where
    Subtok: Logos + PartialEq + Clone + fmt::Debug,
{
    // Start the sublexer `skip` bytes into the current token and let it
    // run to the end of the source.
    let range = (lex.range().start + skip)..lex.source.len();
    let slice = lex.source.slice(range).unwrap();

    // The lexer has already produced its first token on construction,
    // so there is no need to call advance() before the loop.
    let mut sublex = Subtok::lexer(str::from_utf8(slice.as_bytes()).unwrap());
    let mut tokens = vec![];

    while sublex.token != Subtok::END {
        // Stop on a lexing error, returning what we have so far.
        if sublex.token == Subtok::ERROR {
            return tokens;
        }

        tokens.push(sublex.token.clone());
        sublex.advance();
    }
    tokens
}
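For context, a call site might look something like this (the SubTok token type and the one-byte skip past an opening delimiter are made-up placeholders):

// Hypothetical call site, e.g. inside a callback for the token that
// introduces the embedded language: skip its 1-byte opening delimiter,
// then collect subtokens until END or ERROR.
let subtokens: Vec<SubTok> = sublexer(1, &mut lex);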

If you think it is worth it, we might want to put a short write-up about this in the docs.

@mikolajpp
Contributor Author

This does not handle bumping the top-level lexer, but that is easy to do by tracking the farthest end of the current sublexer token, as in the sketch below.
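A minimal sketch of that tracking, under the same assumptions as the function above (the returned offset is an absolute byte position in the original source, which the caller can then use to reposition the top-level lexer):

// Variant of `sublexer` that also reports how far the sublexing got.
// `base` is the absolute start of the sublexed region; `sublex.range()`
// is relative to that region.
fn sublexer_tracked<'source, Src: Source<'source>, Subtok>(
    skip: usize,
    lex: &mut Lexer<Token, Src>,
) -> (Vec<Subtok>, usize)
where
    Subtok: Logos + PartialEq + Clone,
{
    let base = lex.range().start + skip;
    let slice = lex.source.slice(base..lex.source.len()).unwrap();

    let mut sublex = Subtok::lexer(str::from_utf8(slice.as_bytes()).unwrap());
    let mut tokens = vec![];
    let mut end = base;

    while sublex.token != Subtok::END && sublex.token != Subtok::ERROR {
        end = base + sublex.range().end; // farthest end seen so far
        tokens.push(sublex.token.clone());
        sublex.advance();
    }
    (tokens, end)
}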

@maciejhirsz added the enhancement label on Dec 15, 2018
@maciejhirsz
Owner

I think it might be reasonable to add some API support to make this less hacky.

@mikolajpp
Contributor Author

Yes, for sure. I was thinking about hacking

Intricate(ComplexToken)

into Logos, so that Intricate would fire the ComplexToken lexer, but recently I have been short on time to build enough rapport with the codebase. This would require some substantial changes, since for now Token variants are required to have no fields.

From cursory thinking about this, it is not clear to me which solution is best. Should sublexers parse for as long as they can? That is how I have it working now, since some sublexers have no natural ending token (think of an embedded templating language). Alternatively, parsing could be driven with advance, as for the top-level lexer. A third way would be an API that returns ready-made sublexers to be handled by the user.

Once my lexer settles down a bit I will compile a list of the things I needed hacks to get done; perhaps that could give us some useful pointers. In terms of complexity and quirks, I believe it is testing Logos quite extensively, as indicated by the number of bugs smoked out ;-)

@maciejhirsz
Owner

I was thinking of something along the lines of:

let sublexer = lexer.context_switch::<SubToken>();
// or
let sublexer = SubToken::sublexer(lexer);

I don't have a use case for this myself, so I'm not sure whether the sublexer should consume the lexer (so that when you want to go back to parsing higher-level tokens, you context_switch back to the original token type), or whether it should just take the original Lexer by reference and copy over the source and current indices.

@CAD97
Contributor

CAD97 commented Dec 17, 2018

Another possibility I was considering could layer on top of Logos if it provided something like Lexer::jump(usize): you keep multiple lexers around and jump each one to the correct spot (found with Lexer::range()) whenever you need to switch between them.
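A rough sketch of that usage, where jump(usize) is the proposed method rather than existing Logos API, and Outer/Inner stand for the two token types of the string grammar defined later in this comment:

// Hypothetical: `jump` repositions a lexer within the source; it does
// not exist yet. Two independent lexers share one source, and the driver
// repositions whichever one it is switching to before reading from it.
let source = r#"Hello W\u{00f4}rld\n""#;
let mut outer = Outer::lexer(source);
let mut inner = Inner::lexer(source);

// ... `outer` lexes until the parser decides to enter the subgrammar ...
inner.jump(outer.range().end); // position `inner` right after outer's token
inner.advance();
// ... `inner` lexes the subgrammar tokens ...
outer.jump(inner.range().end); // hand control back to `outer`
outer.advance();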

The rest of this post is a few stream-of-consciousness thoughts on what Logos could do to support lexer modes / sublexers.

When talking about lexer modes, I always have to link When Are Lexer Modes Useful. It discusses lexer modes and gives some very clear examples of when they're really useful. Its author is actually the one who convinced me that modal pull-style lexers are still worth using over a fully lexerless style.

Excerpt

Consider string literals with \ escapes:

>>> print("Hello W\u00f4rld\n")
Hello Wôrld

Although there is lexical structure within the double quotes, a lexer mode is not required here. C-style string literals are usually scanned with hand-written code and treated as a single token. I prefer to treat \n and \u00f4 as separate tokens, but that's not the way it's usually done.

Lexer modes become useful when you have a recursive language within double quotes, as shell does:

$ echo "$((1 + 2 * 3)) equals 7"
7 equals 7

This shell-like syntax appears in JavaScript as of version 6:

> console.log(`${1 + 2 * 3} equals 7`)
7 equals 7

as well as in Apple's Swift language:

> print("\(1 + 2 * 3) equals 7")
7 equals 7

This is the most compelling use case for lexer modes because the language inside strings is not only recursive, but it's mutually recursive with the "main" language. The same expressions can appear both inside and outside the string literal.


Modal lexers are great for one key language feature that's growing in popularity: language composition. To use my own language as an example, consider string interpolation with \{ and }. The expression "The sum of the list [1, 2, 3] is \{ List::of(1, 2, 3).fold(0) { acc, i -> acc + i } }" cannot be lexed by a simple lexer. The reason is that the grammar is mutually recursive: any valid expression can appear between \{ and } in the string, including both {} blocks and "" strings.

The solution is to lex STRING_START in the expression grammar, switch to the string grammar, lex TEXT and START_INTERPOLATION, then continue again with the expression grammar; see the sketch below. This is all driven by the parser asking the modal lexer for tokens in the correct "mode".
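Here is a self-contained toy sketch of that driving loop (all types hypothetical, not the Logos API): the parser picks the mode for each next call, and nested interpolations would be handled by the parser's call stack rather than a mode stack inside the lexer.

#[derive(Clone, Copy)]
enum Mode {
    Expr, // outside string literals
    Str,  // inside a string literal
}

struct Toy<'s> {
    src: &'s str,
    pos: usize,
}

impl<'s> Toy<'s> {
    // Return the next token's text in the requested mode, or None at EOF.
    fn next(&mut self, mode: Mode) -> Option<&'s str> {
        let rest = &self.src[self.pos..];
        if rest.is_empty() {
            return None;
        }
        let len = match mode {
            // Expr mode: one char per token, just to keep the sketch small.
            Mode::Expr => 1,
            // Str mode: a delimiter, or a run of text up to the next one.
            Mode::Str if rest.starts_with("\\{") => 2,
            Mode::Str if rest.starts_with('"') => 1,
            Mode::Str => rest
                .char_indices()
                .map(|(i, _)| i)
                .find(|&i| rest[i..].starts_with("\\{") || rest[i..].starts_with('"'))
                .unwrap_or(rest.len()),
        };
        self.pos += len;
        Some(&rest[..len])
    }
}

let mut lex = Toy { src: r#""a\{b}c""#, pos: 0 };
assert_eq!(lex.next(Mode::Expr), Some("\""));  // STRING_START, expr mode
assert_eq!(lex.next(Mode::Str), Some("a"));    // TEXT, string mode
assert_eq!(lex.next(Mode::Str), Some("\\{"));  // START_INTERPOLATION
assert_eq!(lex.next(Mode::Expr), Some("b"));   // back in expr mode
assert_eq!(lex.next(Mode::Expr), Some("}"));   // end of interpolation
assert_eq!(lex.next(Mode::Str), Some("c"));    // TEXT again
assert_eq!(lex.next(Mode::Str), Some("\""));   // STRING_END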

So how might Logos support lexer modes? The properties we'd like to have are:

  1. Support choosing lexer mode from the driver [CRITICAL]
  2. Not advancing past trivia (whitespace) from the wrong mode [CRITICAL]
  3. Separate token types for different modes for maximum type safety [NICE]
  4. Avoiding doing speculative lexing in the wrong mode [NICE]
  5. Ability to keep a modal lexer within a larger external iterator [IDEAL]

Note that a stack of modes is not required; this should instead be handled by your parser's stack, and the parser should always know what mode to query from the lexer. So what possibilities are there for representing modes? (Based purely on API, ignoring implementation.)

All examples use the simple grammar of "strings" (fake specification grammar):

mode outer:
"\"" => StartString;
r"\p{White_Space}" => WhiteSpace;

mode inner:
r"[^\\"]+" => Text;
"\\n" => EscapedNewline;
r"\\u{[^}]*}" => EscapedCodepoint;
"\\\"" => EscapedQuote;
"\"" => EndString;
Single token enum

The simplest form, and how internal lexer modes (like the ones ANTLR provides) are handled. Gives [1], [2], [4], [5].

enum Token {
    // Shared between modes; these must be the only variants before the
    // first #[mode] marker, if any is present
    #[end]
    End,
    #[error]
    Error,

    #[mode = "0"]
    #[token = "\""]
    StartString,
    #[regex = r"\p{White_Space}"]
    WhiteSpace,

    #[mode = "1"]
    #[regex = r"[^\\"]+"]
    Text,
    #[token = r"\n"]
    EscapedNewline,
    #[regex = r"\\u{[^}]*}"]
    EscapedCodepoint,
    #[token = r#"\""#]
    EscapedQuote,
    #[token = "\""]
    EndString,
}

let s = r#"Hello W\u{00f4}rld\n""#;
let mut lexer = Token::lexer(s);

// we start without lexing anything
assert_eq!(lexer.token, Token::Error);
assert!(ptr::eq(lexer.slice(), &s[0..0])); // Note: ptr::eq checks fat ptr metadata

lexer.advance_mode(0); // Lexer::advance(&mut self) => self.advance_mode(0)
                       // (convenience for modeless lexers, with modeless => mode = 0)

assert_eq!(lexer.token, Token::StartString);

// We've entered a string, parser starts requesting mode=1
lexer.advance_mode(1);
assert_eq!(lexer.token, Token::Text);
lexer.advance_mode(1);
assert_eq!(lexer.token, Token::EscapedCodepoint);
lexer.advance_mode(1);
assert_eq!(lexer.token, Token::Text);
lexer.advance_mode(1);
assert_eq!(lexer.token, Token::EscapedNewline);
lexer.advance_mode(1);
assert_eq!(lexer.token, Token::EndString);

// We've exited the string, parser starts requesting mode=0
lexer.advance_mode(0);
assert_eq!(lexer.token, Token::End);

Sublexer borrows main lexer, updates it in Drop

Gives [1], [3], also [2], [4] if careful. Enforces stacking of modes.

Can't be stored in a (safe) pull iterator; that would require self-borrowing. (Works fine for a push parser doing the whole thing in one pass.) Advancing the base lexer before creating the sublexer would incorrectly skip trivia in the base lexer and lex the start of the subgrammar in the base lexer. This can be fixed by creating the sublexer after the current base lexer token and updating the base lexer to after the current sublexer token in Drop (which reacquires [2], [4]). This would mean doing let mut sublexer = Token::sublexer(lexer) instead of lexer.advance(); a sketch of the Drop-based update follows.
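A minimal sketch of that Drop-based update (the struct shape and the direct field access are hypothetical; the fields mirror the morph sketch further down, written as if crate-internal):

struct SubLexer<'a, Outer: Logos, Inner: Logos, S> {
    base: &'a mut Lexer<Outer, S>,
    inner: Lexer<Inner, S>,
}

impl<'a, Outer: Logos, Inner: Logos, S> Drop for SubLexer<'a, Outer, Inner, S> {
    fn drop(&mut self) {
        // Reposition the base lexer so that its next advance() starts
        // right after the final sublexer token.
        self.base.token_start = self.inner.token_end;
        self.base.token_end = self.inner.token_end;
    }
}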

enum Outer {
    #[end]
    End,
    #[error]
    Error,

    #[token = "\""]
    StartString,
    #[regex = r"\p{White_Space}"]
    WhiteSpace,
}

enum Inner {
    #[end]
    End,
    #[error]
    Error,

    #[regex = r"[^\\"]+"]
    Text,
    #[token = r"\n"]
    EscapedNewline,
    #[regex = r"\\u{[^}]*}"]
    EscapedCodepoint,
    #[token = r#"\""#]
    EscapedQuote,
    #[token = "\""]
    EndString,
}

let s = r#"Hello W\u{00f4}rld\n""#;
let mut outer = Outer::lexer(s);

// we start as today, having lexed the first token
assert_eq!(outer.token, Outer::StartString);

// We've entered a string, parser creates sublexer
let mut inner = Inner::sublexer(&mut outer);
assert_eq!(inner.token, Inner::Text);
inner.advance();
assert_eq!(inner.token, Inner::EscapedCodepoint);
inner.advance();
assert_eq!(inner.token, Inner::Text);
inner.advance();
assert_eq!(inner.token, Inner::EscapedNewline);
inner.advance();
assert_eq!(inner.token, Inner::EndString);

// We've exited the string, parser returns to outer lexer
drop(inner);
assert_eq!(outer.token, Outer::End);

Lexer can change its token type

Gives [1], [3], [5], and with care, [2], [4]. My personal favorite.

Rough implementation sketch:

impl<'s, T, S> Lexer<T, S>
where
    T: Logos,
    S: Source<'s>,
{
    // disclaimer: untested, pulled from out of nowhere
    // Takes by-value as a lint; could be by-ref
    pub fn morph<T2: Logos>(self) -> Lexer<T2, S> {
        Lexer {
            source: self.source,
            token: T2::ERROR,
            extras: Default::default(),
            token_start: self.token_end,
            token_end: self.token_end,
        }
    }

    pub fn advance_as<T2: Logos>(self) -> Lexer<T2, S> {
        let mut lex = self.morph();
        lex.advance();
        lex
    }
}

Example:

enum Outer {
    #[end]
    End,
    #[error]
    Error,

    #[token = "\""]
    StartString,
    #[regex = r"\p{White_Space}"]
    WhiteSpace,
}

enum Inner {
    #[end]
    End,
    #[error]
    Error,

    #[regex = r"[^\\"]+"]
    Text,
    #[token = r"\n"]
    EscapedNewline,
    #[regex = r"\\u{[^}]*}"]
    EscapedCodepoint,
    #[token = r#"\""#]
    EscapedQuote,
    #[token = "\""]
    EndString,
}

let s = r#"Hello W\u{00f4}rld\n""#;
let mut outer = Outer::lexer(s);

// we start as today, having lexed the first token
assert_eq!(outer.token, Outer::StartString);

// We've entered a string, parser creates sublexer
let mut inner = outer.advance_as::<Inner>();
assert_eq!(inner.token, Inner::Text);
inner.advance();
assert_eq!(inner.token, Inner::EscapedCodepoint);
inner.advance();
assert_eq!(inner.token, Inner::Text);
inner.advance();
assert_eq!(inner.token, Inner::EscapedNewline);
inner.advance();
assert_eq!(inner.token, Inner::EndString);

// We've exited the string, parser returns to outer lexer
outer = inner.advance_as();
assert_eq!(outer.token, Outer::End);

I think the third option meets all the desired requirements (except that it requires a .clone() if you've received a &mut Lexer to work with, but I think this syntactic salt is a good thing, as it will help you remember to update the old lexer, since that isn't done automatically; see the sketch below). I'm actually convinced enough that I'm going to go make a PR.
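A sketch of that .clone() case (assuming Lexer and its source are Clone, the advance_as proposed above, and the Outer/Inner token types from the example; the function name is a placeholder):

fn lex_string<'s, S: Source<'s> + Clone>(outer: &mut Lexer<Outer, S>) {
    // Clone the borrowed lexer before morphing it into the inner mode.
    let mut inner: Lexer<Inner, S> = outer.clone().advance_as();

    while inner.token != Inner::EndString && inner.token != Inner::Error {
        // ... hand string tokens to the parser ...
        inner.advance();
    }

    // Resync the borrowed outer lexer; this is the step the syntactic
    // salt is meant to make you remember.
    *outer = inner.advance_as();
}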
