Lexer actions and semantic predicates are executed out of order #3611

kaby76 · 2022-03-30T15:16:59Z

This concerns lexer rules that contain a mix of actions and semantic predicates within one rule. It's somewhat related to #3606 in so far as when I debugged that, I then found out this problem.

Suppose we have the following grammars:

lexer:

lexer grammar TestLexer;
@lexer::members {
	void initA()
	{
		int i = 1;
		while (true)
	{
			var text = this.InputStream.LA(-i);
			i++;
			if (text == 97) count++;
			else break;
		}
	}
	int count = 0;
}
Stuff : ( 'a'+ {initA(); } | 'b') 'c' 'd' { count == 3 }? ;

parser:

parser grammar TestParser;
options { tokenVocab=TestLexer; }
start : Stuff+ EOF ;

input:

aaacd

The expectation here is that the lexer counts the number of 'a's in the input and allows a valid token with only three 'a's. Yes, it is contrived, but it illustrates something that is a deep-order assumption that I did not know, even having used Antlr for many many years.

Unfortunately, the parser does not work. The lexer ExecATN() evaluates the semantic action first, before the action is evaluated. The reason it does this is because the semantic predicates are evaluated "on the fly", while actions are queued up and evaluated at the end of the function. I don't understand why actions are queued up anyways, and people often refer to "semantic predicates" as "actions" but of a special type.

This is not "referentially transparent" because the order of evaluating the actions is not interleaved with the semantic predicates even though the action is listed in the rule RHS before the semantic predicate. The expectation normal "users" would have is that the actions and semantic predicates are evaluated in the order as they occur on the RHS of the rule.

In this example, the rule never matches, and it is impossible to parse anything.

I did a quick search in the grammars-v4 repository to see if there are rules like this. It's not an extensive search, but there is one here.

The text was updated successfully, but these errors were encountered:

tianshuang · 2022-03-31T02:48:57Z

I've had this problem before, when I suspected I was using it incorrectly, and now it looks like it might be a bug in ANTLR4. I used action to track the currently matched question number and calculate the next question number as the boundary of the current question content, but it didn't work. I once thought that my usage was wrong.

TestParser.g4:

parser grammar TestParser;

options { tokenVocab=TestLexer; }

root
    : QUESTION+ EOF
    ;

TestLexer.g4:

lexer grammar TestLexer;

@lexer::members {
    private int startIndex = 0;
    private String currentQuestionNumberStr;
    private String nextQuestionNumberStr;

    private void updateStartIndex() {
        startIndex = getCharIndex();
    }

    private void updateQuestionNumber() {
        currentQuestionNumberStr = getText();

        System.out.println("current question number: " + currentQuestionNumberStr);

        int currentQuestionNumber = Integer.parseInt(currentQuestionNumberStr);
        int nextQuestionNumber = currentQuestionNumber + 1;

        nextQuestionNumberStr = String.valueOf(nextQuestionNumber);
    }

    private boolean reachTheNextQuestionStart() {
        if (_input.LA(1) != '\n') {
            return false;
        }

        boolean matchTarget = _input.LA(nextQuestionNumberStr.length() + 2) == '.';
        if (matchTarget) {
            for (int i = 0; i < nextQuestionNumberStr.length(); i++) {
                if (_input.LA(i + 2) != nextQuestionNumberStr.charAt(i)) {
                    return false;
                }
            }
        } else {
            return false;
        }
        return true;
    }
}

QUESTION:                      {getCharPositionInLine() == 0}? {updateStartIndex();} NUMBER {updateQuestionNumber();} DOT QUESTION_CONTENT;
OTHER:                         . -> skip;

fragment NUMBER:               [0-9]+;
fragment QUESTION_CONTENT:     .+? {reachTheNextQuestionStart()}?;
fragment SPACE:                ' ';
fragment NEWLINE:              '\n';
fragment DOT:                  '.';

TestParseTest.java:

import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Lexer;
import org.antlr.v4.runtime.tree.ParseTree;

public class TestParseTest {

    public static void main(String[] args) {
        CharStream charStream = CharStreams.fromString("1. hahaha\n" +
                "2. hahaha\n");
        Lexer lexer = new TestLexer(charStream);

        CommonTokenStream tokens = new CommonTokenStream(lexer);
        TestParser parser = new TestParser(tokens);
        ParseTree parseTree = parser.root();

        System.out.println(parseTree.toStringTree(parser));
    }

}

The output is as follows:

Exception in thread "main" java.lang.NullPointerException
	at TestLexer.reachTheNextQuestionStart(TestLexer.java:106)
	at TestLexer.QUESTION_CONTENT_sempred(TestLexer.java:181)
	at TestLexer.sempred(TestLexer.java:167)
	at org.antlr.v4.runtime.atn.LexerATNSimulator.evaluatePredicate(LexerATNSimulator.java:593)
	at org.antlr.v4.runtime.atn.LexerATNSimulator.getEpsilonTarget(LexerATNSimulator.java:507)
	at org.antlr.v4.runtime.atn.LexerATNSimulator.closure(LexerATNSimulator.java:453)
	at org.antlr.v4.runtime.atn.LexerATNSimulator.closure(LexerATNSimulator.java:455)
	at org.antlr.v4.runtime.atn.LexerATNSimulator.closure(LexerATNSimulator.java:455)
	at org.antlr.v4.runtime.atn.LexerATNSimulator.closure(LexerATNSimulator.java:455)
	at org.antlr.v4.runtime.atn.LexerATNSimulator.getReachableConfigSet(LexerATNSimulator.java:342)
	at org.antlr.v4.runtime.atn.LexerATNSimulator.computeTargetState(LexerATNSimulator.java:277)
	at org.antlr.v4.runtime.atn.LexerATNSimulator.execATN(LexerATNSimulator.java:204)
	at org.antlr.v4.runtime.atn.LexerATNSimulator.matchATN(LexerATNSimulator.java:156)
	at org.antlr.v4.runtime.atn.LexerATNSimulator.match(LexerATNSimulator.java:111)
	at org.antlr.v4.runtime.Lexer.nextToken(Lexer.java:141)
	at org.antlr.v4.runtime.BufferedTokenStream.fetch(BufferedTokenStream.java:169)
	at org.antlr.v4.runtime.BufferedTokenStream.sync(BufferedTokenStream.java:152)
	at org.antlr.v4.runtime.BufferedTokenStream.setup(BufferedTokenStream.java:254)
	at org.antlr.v4.runtime.BufferedTokenStream.lazyInit(BufferedTokenStream.java:249)
	at org.antlr.v4.runtime.CommonTokenStream.LT(CommonTokenStream.java:92)
	at org.antlr.v4.runtime.Parser.enterRule(Parser.java:628)
	at TestParser.root(TestParser.java:118)
	at TestParseTest.main(TestParseTest.java:18)

Process finished with exit code 1

The above exception indicates that the semantic prediction was executed before the action, but this is not what I want. I want the action to be executed before the final semantic prediction.

kaby76 · 2022-03-31T11:37:13Z

The problem exists over multiple lexer rules too. A fold transformation of "Stuff" gives the same issue. (Note, in this example, "Other" is not supposed to be recognized as a separate token because the input "aaacd" should match "Stuff".)

Stuff : Other 'c' 'd' { count == 3 }? ;
Other : ( 'a'+ {initA(); } | 'b');
The error cannot really occur for fragment rules because actions within fragment lexer rules cause a warning. Presumably a (good) grammar writer would set warnings as errors on the build and then correct the error.
The "out of order" error does not happen for parser rules. The order of execution of actions and predicates are what you would expect. The reason for why it works for parsers is because the ExecATN() method for a parser does not have an "afterburner" to execute queued-up actions, whereas for a lexer it does.

parser grammar TestParser;
options { tokenVocab=TestLexer; }
@members {
	void initA()
	{
		System.Console.WriteLine("exec initA()");
		int i = 1;
		while (true)
		{
			try {
				var text = this.InputStream.LA(-i);
				i++;
				if (text == 1) count++;
				else break;
			} catch (Exception e)
			{
				break;
			}
		}
		System.Console.WriteLine("Count = " + count);
	}
	int count = 0;
}
start : stuff+ EOF ;
stuff : ( 'a'+ {initA(); } | 'b') 'c' 'd' { count == 3 }? ;

lexer grammar TestLexer;
A: 'a';
B: 'b';
C: 'c';
D: 'd';

@parrt Even if we correct #3606, actions and predicates are going to be executed out of order because actions are queued whereas semantic predicates are not. This should also be addressed (documented, checked+flagged as error, or changed to be more like a parser).

parrt · 2022-04-02T20:03:25Z

Ok, the book makes it clear but online doc does not. (@kaby76 can I send you a copy?) I'll update online doc. seems actions in lexers should only appear at the right edge of a rule so I should add an error message or something. The general rule is that predicates cannot be a function of actions that are visible when making parsing decision. Actions are never executed during parsing decision, just during parsing or lexing.

parrt · 2022-04-02T20:12:09Z

Actually it looks like some of the book has been copied in doc already:

In parser rules, predicates must appear on the left edge of alternatives to aid in alternative prediction. Lexers, on the other hand, prefer predicates on the right edge of lexer rules because they choose rules after seeing a token's entire text. Predicates in lexer rules can technically be anywhere within the rule. Some positions might be more or less efficient than others; ANTLR makes no guarantees about the optimal spot. A predicate in a lexer rule might be executed multiple times even during a single token match. You can embed multiple predicates per lexer rule and they are evaluated as the lexer reaches them during matching.

parrt · 2022-04-02T20:28:56Z

Hmm...On the other hand I see that we fixed a bug where we had actions in the middle of a lexer rule:

#469

…tions and semantic predicates; Fixes antlr#3611. Fixes antlr#3606. Signed-off-by: Terence Parr <[email protected]>

KvanTTT · 2022-04-11T09:16:29Z

It should be reopened since introducing of warning ACTION_SHOULD_BE_PLACED_AFTER_PREDICATES is ongoing.

parrt mentioned this issue Apr 2, 2022

Lexer.getCharIndex() return value not behaving as expected #3606

Closed

4 tasks

parrt added a commit to parrt/antlr4 that referenced this issue Apr 2, 2022

update doc to be more explicit about the interaction between lexer ac…

8e5f2ff

…tions and semantic predicates; Fixes antlr#3611. Fixes antlr#3606. Signed-off-by: Terence Parr <[email protected]>

KvanTTT added a commit to KvanTTT/antlr4 that referenced this issue Apr 5, 2022

New warning ACTION_SHOULD_BE_PLACED_AFTER_PREDICATES, fix antlr#3611

cbd3bc9

KvanTTT added a commit to KvanTTT/antlr4 that referenced this issue Apr 9, 2022

New warning ACTION_SHOULD_BE_PLACED_AFTER_PREDICATES, fix antlr#3611

0b093bd

KvanTTT added a commit to KvanTTT/antlr4 that referenced this issue Apr 9, 2022

New warning ACTION_SHOULD_BE_PLACED_AFTER_PREDICATES, fix antlr#3611

9251718

KvanTTT added a commit to KvanTTT/antlr4 that referenced this issue Apr 9, 2022

New warning ACTION_SHOULD_BE_PLACED_AFTER_PREDICATES, fix antlr#3611

b24b6e5

KvanTTT added a commit to KvanTTT/antlr4 that referenced this issue Apr 10, 2022

New warning ACTION_SHOULD_BE_PLACED_AFTER_PREDICATES, fix antlr#3611

939113f

parrt closed this as completed in e1455c0 Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lexer actions and semantic predicates are executed out of order #3611

Lexer actions and semantic predicates are executed out of order #3611

kaby76 commented Mar 30, 2022 •

edited

Loading

tianshuang commented Mar 31, 2022

kaby76 commented Mar 31, 2022 •

edited

Loading

parrt commented Apr 2, 2022

parrt commented Apr 2, 2022

parrt commented Apr 2, 2022

KvanTTT commented Apr 11, 2022 •

edited

Loading

Lexer actions and semantic predicates are executed out of order #3611

Lexer actions and semantic predicates are executed out of order #3611

Comments

kaby76 commented Mar 30, 2022 • edited Loading

tianshuang commented Mar 31, 2022

kaby76 commented Mar 31, 2022 • edited Loading

parrt commented Apr 2, 2022

parrt commented Apr 2, 2022

parrt commented Apr 2, 2022

KvanTTT commented Apr 11, 2022 • edited Loading

kaby76 commented Mar 30, 2022 •

edited

Loading

kaby76 commented Mar 31, 2022 •

edited

Loading

KvanTTT commented Apr 11, 2022 •

edited

Loading