Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lexer actions and semantic predicates are executed out of order #3611

Closed
kaby76 opened this issue Mar 30, 2022 · 6 comments
Closed

Lexer actions and semantic predicates are executed out of order #3611

kaby76 opened this issue Mar 30, 2022 · 6 comments

Comments

@kaby76
Copy link
Contributor

kaby76 commented Mar 30, 2022

This concerns lexer rules that contain a mix of actions and semantic predicates within one rule. It's somewhat related to #3606 in so far as when I debugged that, I then found out this problem.

Suppose we have the following grammars:

lexer:

lexer grammar TestLexer;
@lexer::members {
	void initA()
	{
		int i = 1;
		while (true)
	{
			var text = this.InputStream.LA(-i);
			i++;
			if (text == 97) count++;
			else break;
		}
	}
	int count = 0;
}
Stuff : ( 'a'+ {initA(); } | 'b') 'c' 'd' { count == 3 }? ;

parser:

parser grammar TestParser;
options { tokenVocab=TestLexer; }
start : Stuff+ EOF ;

input:

aaacd

The expectation here is that the lexer counts the number of 'a's in the input and allows a valid token with only three 'a's. Yes, it is contrived, but it illustrates something that is a deep-order assumption that I did not know, even having used Antlr for many many years.

Unfortunately, the parser does not work. The lexer ExecATN() evaluates the semantic action first, before the action is evaluated. The reason it does this is because the semantic predicates are evaluated "on the fly", while actions are queued up and evaluated at the end of the function. I don't understand why actions are queued up anyways, and people often refer to "semantic predicates" as "actions" but of a special type.

This is not "referentially transparent" because the order of evaluating the actions is not interleaved with the semantic predicates even though the action is listed in the rule RHS before the semantic predicate. The expectation normal "users" would have is that the actions and semantic predicates are evaluated in the order as they occur on the RHS of the rule.

In this example, the rule never matches, and it is impossible to parse anything.

I did a quick search in the grammars-v4 repository to see if there are rules like this. It's not an extensive search, but there is one here.

@tianshuang
Copy link
Contributor

I've had this problem before, when I suspected I was using it incorrectly, and now it looks like it might be a bug in ANTLR4. I used action to track the currently matched question number and calculate the next question number as the boundary of the current question content, but it didn't work. I once thought that my usage was wrong.

TestParser.g4:

parser grammar TestParser;

options { tokenVocab=TestLexer; }

root
    : QUESTION+ EOF
    ;

TestLexer.g4:

lexer grammar TestLexer;

@lexer::members {
    private int startIndex = 0;
    private String currentQuestionNumberStr;
    private String nextQuestionNumberStr;

    private void updateStartIndex() {
        startIndex = getCharIndex();
    }

    private void updateQuestionNumber() {
        currentQuestionNumberStr = getText();

        System.out.println("current question number: " + currentQuestionNumberStr);

        int currentQuestionNumber = Integer.parseInt(currentQuestionNumberStr);
        int nextQuestionNumber = currentQuestionNumber + 1;

        nextQuestionNumberStr = String.valueOf(nextQuestionNumber);
    }

    private boolean reachTheNextQuestionStart() {
        if (_input.LA(1) != '\n') {
            return false;
        }

        boolean matchTarget = _input.LA(nextQuestionNumberStr.length() + 2) == '.';
        if (matchTarget) {
            for (int i = 0; i < nextQuestionNumberStr.length(); i++) {
                if (_input.LA(i + 2) != nextQuestionNumberStr.charAt(i)) {
                    return false;
                }
            }
        } else {
            return false;
        }
        return true;
    }
}

QUESTION:                      {getCharPositionInLine() == 0}? {updateStartIndex();} NUMBER {updateQuestionNumber();} DOT QUESTION_CONTENT;
OTHER:                         . -> skip;

fragment NUMBER:               [0-9]+;
fragment QUESTION_CONTENT:     .+? {reachTheNextQuestionStart()}?;
fragment SPACE:                ' ';
fragment NEWLINE:              '\n';
fragment DOT:                  '.';

TestParseTest.java:

import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Lexer;
import org.antlr.v4.runtime.tree.ParseTree;

public class TestParseTest {

    public static void main(String[] args) {
        CharStream charStream = CharStreams.fromString("1. hahaha\n" +
                "2. hahaha\n");
        Lexer lexer = new TestLexer(charStream);

        CommonTokenStream tokens = new CommonTokenStream(lexer);
        TestParser parser = new TestParser(tokens);
        ParseTree parseTree = parser.root();

        System.out.println(parseTree.toStringTree(parser));
    }

}

The output is as follows:

Exception in thread "main" java.lang.NullPointerException
	at TestLexer.reachTheNextQuestionStart(TestLexer.java:106)
	at TestLexer.QUESTION_CONTENT_sempred(TestLexer.java:181)
	at TestLexer.sempred(TestLexer.java:167)
	at org.antlr.v4.runtime.atn.LexerATNSimulator.evaluatePredicate(LexerATNSimulator.java:593)
	at org.antlr.v4.runtime.atn.LexerATNSimulator.getEpsilonTarget(LexerATNSimulator.java:507)
	at org.antlr.v4.runtime.atn.LexerATNSimulator.closure(LexerATNSimulator.java:453)
	at org.antlr.v4.runtime.atn.LexerATNSimulator.closure(LexerATNSimulator.java:455)
	at org.antlr.v4.runtime.atn.LexerATNSimulator.closure(LexerATNSimulator.java:455)
	at org.antlr.v4.runtime.atn.LexerATNSimulator.closure(LexerATNSimulator.java:455)
	at org.antlr.v4.runtime.atn.LexerATNSimulator.getReachableConfigSet(LexerATNSimulator.java:342)
	at org.antlr.v4.runtime.atn.LexerATNSimulator.computeTargetState(LexerATNSimulator.java:277)
	at org.antlr.v4.runtime.atn.LexerATNSimulator.execATN(LexerATNSimulator.java:204)
	at org.antlr.v4.runtime.atn.LexerATNSimulator.matchATN(LexerATNSimulator.java:156)
	at org.antlr.v4.runtime.atn.LexerATNSimulator.match(LexerATNSimulator.java:111)
	at org.antlr.v4.runtime.Lexer.nextToken(Lexer.java:141)
	at org.antlr.v4.runtime.BufferedTokenStream.fetch(BufferedTokenStream.java:169)
	at org.antlr.v4.runtime.BufferedTokenStream.sync(BufferedTokenStream.java:152)
	at org.antlr.v4.runtime.BufferedTokenStream.setup(BufferedTokenStream.java:254)
	at org.antlr.v4.runtime.BufferedTokenStream.lazyInit(BufferedTokenStream.java:249)
	at org.antlr.v4.runtime.CommonTokenStream.LT(CommonTokenStream.java:92)
	at org.antlr.v4.runtime.Parser.enterRule(Parser.java:628)
	at TestParser.root(TestParser.java:118)
	at TestParseTest.main(TestParseTest.java:18)

Process finished with exit code 1

The above exception indicates that the semantic prediction was executed before the action, but this is not what I want. I want the action to be executed before the final semantic prediction.

@kaby76
Copy link
Contributor Author

kaby76 commented Mar 31, 2022

  • The problem exists over multiple lexer rules too. A fold transformation of "Stuff" gives the same issue. (Note, in this example, "Other" is not supposed to be recognized as a separate token because the input "aaacd" should match "Stuff".)

    Stuff : Other 'c' 'd' { count == 3 }? ;
    Other : ( 'a'+ {initA(); } | 'b');

  • The error cannot really occur for fragment rules because actions within fragment lexer rules cause a warning. Presumably a (good) grammar writer would set warnings as errors on the build and then correct the error.

  • The "out of order" error does not happen for parser rules. The order of execution of actions and predicates are what you would expect. The reason for why it works for parsers is because the ExecATN() method for a parser does not have an "afterburner" to execute queued-up actions, whereas for a lexer it does.

parser grammar TestParser;
options { tokenVocab=TestLexer; }
@members {
	void initA()
	{
		System.Console.WriteLine("exec initA()");
		int i = 1;
		while (true)
		{
			try {
				var text = this.InputStream.LA(-i);
				i++;
				if (text == 1) count++;
				else break;
			} catch (Exception e)
			{
				break;
			}
		}
		System.Console.WriteLine("Count = " + count);
	}
	int count = 0;
}
start : stuff+ EOF ;
stuff : ( 'a'+ {initA(); } | 'b') 'c' 'd' { count == 3 }? ;
lexer grammar TestLexer;
A: 'a';
B: 'b';
C: 'c';
D: 'd';

@parrt Even if we correct #3606, actions and predicates are going to be executed out of order because actions are queued whereas semantic predicates are not. This should also be addressed (documented, checked+flagged as error, or changed to be more like a parser).

@parrt
Copy link
Member

parrt commented Apr 2, 2022

Ok, the book makes it clear but online doc does not. (@kaby76 can I send you a copy?) I'll update online doc. seems actions in lexers should only appear at the right edge of a rule so I should add an error message or something. The general rule is that predicates cannot be a function of actions that are visible when making parsing decision. Actions are never executed during parsing decision, just during parsing or lexing.

@parrt
Copy link
Member

parrt commented Apr 2, 2022

Actually it looks like some of the book has been copied in doc already:

In parser rules, predicates must appear on the left edge of alternatives to aid in alternative prediction. Lexers, on the other hand, prefer predicates on the right edge of lexer rules because they choose rules after seeing a token's entire text. Predicates in lexer rules can technically be anywhere within the rule. Some positions might be more or less efficient than others; ANTLR makes no guarantees about the optimal spot. A predicate in a lexer rule might be executed multiple times even during a single token match. You can embed multiple predicates per lexer rule and they are evaluated as the lexer reaches them during matching.

@parrt
Copy link
Member

parrt commented Apr 2, 2022

Hmm...On the other hand I see that we fixed a bug where we had actions in the middle of a lexer rule:

#469

parrt added a commit to parrt/antlr4 that referenced this issue Apr 2, 2022
…tions and semantic predicates; Fixes antlr#3611. Fixes antlr#3606.

Signed-off-by: Terence Parr <[email protected]>
KvanTTT added a commit to KvanTTT/antlr4 that referenced this issue Apr 5, 2022
KvanTTT added a commit to KvanTTT/antlr4 that referenced this issue Apr 9, 2022
KvanTTT added a commit to KvanTTT/antlr4 that referenced this issue Apr 9, 2022
KvanTTT added a commit to KvanTTT/antlr4 that referenced this issue Apr 9, 2022
KvanTTT added a commit to KvanTTT/antlr4 that referenced this issue Apr 10, 2022
@parrt parrt closed this as completed in e1455c0 Apr 10, 2022
@KvanTTT
Copy link
Member

KvanTTT commented Apr 11, 2022

It should be reopened since introducing of warning ACTION_SHOULD_BE_PLACED_AFTER_PREDICATES is ongoing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants