
Grammar: only un-escape \\ in situations where escaping is required #708

Merged

odelalleau merged 4 commits into omry:master from lighter_escaping on May 7, 2021

Conversation

odelalleau
Collaborator

This simplifies situations where one wants to obtain a string containing multiple backslashes.

The situations where \\ needs to be un-escaped into \ are:

  1. In top-level strings: when they precede a (potentially escaped) interpolation, e.g. \\${foo} (=> \<value_of_foo>) or \\\${foo} (=> \${foo})
  2. In quoted strings: in the above situation or when they precede a (potentially escaped) quote, e.g.: "\\" (=> \), or "\\\"" (=> \")

In all other situations, \\ is left untouched, e.g.: "abc\\def" (quoted string => abc\\def), or \\foo_${bar} (top-level string => \\foo_<value_of_bar>)
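The rule above can be illustrated with a small sketch (illustrative only, not OmegaConf's actual implementation): a run of backslashes is halved exactly when it immediately precedes `${`; interpolation resolution itself is out of scope here.

```python
import re

# Illustrative sketch only (not OmegaConf's implementation): halve a run of
# backslashes when, and only when, it immediately precedes "${".
def unescape_backslashes(s: str) -> str:
    def repl(m):
        backslashes, inter_open = m.group(1), m.group(2)
        return backslashes[: len(backslashes) // 2] + inter_open
    return re.sub(r"(\\+)(\$\{)", repl, s)

unescape_backslashes(r"\\\${foo}")     # -> r"\${foo}"
unescape_backslashes(r"\\foo_${bar}")  # -> unchanged: r"\\foo_${bar}"
```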

Implementation note: I considered several alternatives and ended up doing most of the work at the lexer level. It makes the lexer more complex but has the advantage of leaving the un-escaping code the same (up to a minor change due to the way I extended the escaped interpolation token ESC_INTER). The other two alternatives I considered were:

  • Keeping the simpler lexer rules but making the un-escaping code in the visitor context-dependent (based on whether we are in a top-level, quoted or unquoted string, and what is the next token). This would essentially shift the complexity from the lexer to the visitor.
  • Including the \\ preceding an interpolation (or the closing quote in a quoted string) in the same token, e.g. TOP_INTER_OPEN: ESC_BACKSLASH* INTER_OPEN -> type(INTER_OPEN), pushMode(INTERPOLATION_MODE);. This would avoid having to use the lookahead predicate I am using right now. The main concern I have with this approach is that it makes the grammar confusing, because now an interpolation can be a regular interpolation ${foo}, or an interpolation preceded by backslashes \\${foo}, and this has a very different meaning at the top-level (the first one is a direct node interpolation while the second one is a string interpolation). I think that explicitly splitting the escaped \\ from the interpolation is cleaner.
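The lookahead-predicate approach can be illustrated with a toy tokenizer (hypothetical, not the real ANTLR lexer): a run of backslashes is emitted as an ESC token only when the two-character lookahead is `${`, mirroring the predicate's check on the next two input characters.

```python
import re

# Toy sketch of the lookahead predicate (not the actual ANTLR lexer):
# a run of backslashes becomes an ESC token only when the lookahead is "${".
def lex_backslash_run(s: str, i: int):
    m = re.match(r"\\+", s[i:])
    if m is None:
        raise ValueError("expected a backslash run at position %d" % i)
    run = m.group(0)
    j = i + len(run)
    token_type = "ESC" if s[j : j + 2] == "${" else "STR"
    return (token_type, run), j
```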


odelalleau commented May 6, 2021

Just a note that with this PR, the user-facing behavior is now almost the same as what I had in #621, with the exception that (i) here we can write a quoted string "${foo:"bar"}" without escaping the inner quotes, and (ii) the validation of quoted strings' content is now performed on assignment instead of on access. Compared to #621, in terms of implementation:

  • The lexer is more complex (extra modes for quoted strings, lookahead predicates to identify which \\ to un-escape)
  • The visitor is simpler (no more quoted string callback, no involved logic to figure out which \\ need un-escaping)

Personally I'd rather keep this PR than revert to #621, but I thought I'd mention it to make the trade-offs clear.

Owner

@omry omry left a comment


Keeping the simpler lexer rules but making the un-escaping code in the visitor context-dependent (based on whether we are in a top-level, quoted or unquoted string, and what is the next token). This would essentially shift the complexity from the lexer to the visitor.

I think I prefer this approach as it keeps the generated code portable (it could be used in the future to create OmegaConf for different languages).
Can you explain why this logic would be sensitive to all these conditions? (in a top-level, quoted or unquoted string, and what is the next token)

Including the \\ preceding an interpolation (or the closing quote in a quoted string) in the same token, e.g. TOP_INTER_OPEN: ESC_BACKSLASH* INTER_OPEN -> type(INTER_OPEN), pushMode(INTERPOLATION_MODE);. This would avoid having to use the lookahead predicate I am using right now. The main concern I have with this approach is that it makes the grammar confusing, because now an interpolation can be a regular interpolation ${foo}, or an interpolation preceded by backslashes \\${foo}, and this has a very different meaning at the top-level (the first one is a direct node interpolation while the second one is a string interpolation). I think that explicitly splitting the escaped \\ from the interpolation is cleaner.

This does have some appeal, but yeah - it's raising too much dust up the stack. we should probably avoid it.

omegaconf/grammar/OmegaConfGrammarLexer.g4

// Escaped backslashes: even number of \ followed by an interpolation.
// The interpolation must *not* be matched by this rule (this is why we use a predicate lookahead).
TOP_ESC: ESC_BACKSLASH+ { self._input.LA(1) == ord("$") and self._input.LA(2) == ord("{") }? -> type(ESC);
Owner


This is the first time we are adding custom code to ANTLR.
This means the lexer code is no longer portable and cannot be used to generate non-python code.
Is this strictly necessary or can we offload the logic to the visitor?

Collaborator Author


See discussion in the comment below: #708 (comment)

Collaborator Author


Following our chat, I moved that logic to the visitor in c5d3a07

@@ -255,8 +260,10 @@
("str_quoted_inter", "'${null}'", "None"),
("str_quoted_esc_single_1", r"'ab\'cd\'\'${str}'", "ab'cd''hi"),
("str_quoted_esc_single_2", r"""'\\\${foo}'""", r"\${foo}"),
("str_quoted_esc_single_3", r"""'\\a_${str}\\'""", r" \\a_hi\ ".strip()),
Owner


This is testing:

  1. that double backslash is not unescaped before string interpolation
  2. that double backslash is unescaped before a trailing quote.

I would prefer to split this into two different tests.

Collaborator Author

@odelalleau odelalleau May 6, 2021


Point 2 is actually tested already in the tests for quoted strings without interpolations:

("str_quoted_trailing_esc_1", r"'abc\\'", r" abc\ ".strip()),

but I wanted to also have it tested with interpolations.
Anyway, I split it in two in f94e3da

("str_quoted_esc_double_1", r'"ab\"cd\"\"${str}"', 'ab"cd""hi'),
("str_quoted_esc_double_2", r'''"\\\${foo}"''', r"\${foo}"),
("str_quoted_esc_double_3", r'''"\\a_${str}\\"''', r" \\a_hi\ ".strip()),
Owner


Suggested change
("str_quoted_esc_double_3", r'''"\\a_${str}\\"''', r" \\a_hi\ ".strip()),
("str_quoted_esc_double_3", r'''"\\a_${str}\\"''', r"\\a_hi\ ".strip()),

Collaborator Author


I had purposely added a leading whitespace to all strings where I use the strip() trick, as I personally find it easier to read. If I'm changing this one I should change all the others. Just checking first: do you indeed prefer to only have a trailing whitespace everywhere?

@@ -255,8 +260,10 @@
("str_quoted_inter", "'${null}'", "None"),
("str_quoted_esc_single_1", r"'ab\'cd\'\'${str}'", "ab'cd''hi"),
("str_quoted_esc_single_2", r"""'\\\${foo}'""", r"\${foo}"),
("str_quoted_esc_single_3", r"""'\\a_${str}\\'""", r" \\a_hi\ ".strip()),
Owner


Suggested change
("str_quoted_esc_single_3", r"""'\\a_${str}\\'""", r" \\a_hi\ ".strip()),
("str_quoted_esc_single_3", r"""'\\a_${str}\\'""", r"\\a_hi\ ".strip()),

("str_top_leading_escapes_2", r"\\\\ \${str}", r"\\\\ ${str}"),
("str_top_middle_escapes_1", r"abc\\\\\${str}", r"abc\\${str}"),
("str_top_middle_escapes_2", r"abc\\\\ \${str}", r"abc\\\\ ${str}"),
("str_top_trailing_escapes", "${str}" + "\\" * 5, "hi" + "\\" * 5),
Owner


I don't think we need 5 \\ here. this is easier to understand and testing the same thing as far as I can tell:

Suggested change
("str_top_trailing_escapes", "${str}" + "\\" * 5, "hi" + "\\" * 5),
("str_top_trailing_escapes", r"${str}\\ ".strip(), r"hi\\".strip()),

Collaborator Author


I made it 3 \\\ in 1eb0dcc

@odelalleau
Collaborator Author

Keeping the simpler lexer rules but making the un-escaping code in the visitor context-dependent (based on whether we are in a top-level, quoted or unquoted string, and what is the next token). This would essentially shift the complexity from the lexer to the visitor.

I think I prefer this approach as it keeps the generated code portable (it could be used in the future to create OmegaConf for different languages).
Can you explain why this logic would be sensitive to all these conditions? (in a top-level, quoted or unquoted string, and what is the next token)

If we go this route we may consider just reverting to #621, which simplifies the lexer even further (the quoted string modes may be overkill once custom escaping logic in the visitor solves the main ugly escaping situations that we originally introduced these modes for).

But to answer your question, right now the un-escaping phase in the visitor relies on the _unescape() method which simply unescapes two types of lexer tokens: ESC (for any escape sequence where un-escaping consists in keeping every other character), and ESC_INTER (for escaped interpolations). This method is called both from visitText() (used in both top-level and quoted strings) and _createPrimitive() (corresponding to unquoted strings).

If we remove the lookahead predicate from the lexer, it won't be able to tell the difference between a sequence of \\ that precedes an interpolation (or a quote in a quoted string) and one that doesn't: both will be mapped to the same token ESC. Which means that in _unescape() when we see an ESC token:

  • If it's in an unquoted string we can un-escape as usual
  • If it's in a top-level string we need to check the next child, and if it's an interpolation we should un-escape, otherwise we shouldn't
  • If it's in a quoted string we need to check the next child, and if it's an interpolation or the closing quote we should un-escape, otherwise we shouldn't
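The three cases could be sketched as a small predicate in the visitor (hypothetical names, not the actual OmegaConf code):

```python
# Hypothetical sketch of the context-dependent check (names invented here):
# decide whether a run of escaped backslashes should be un-escaped, given
# the kind of string we are in and the next child/token.
def should_unescape(context: str, next_token: str) -> bool:
    if context == "unquoted":
        return True  # un-escape as usual
    if context == "top_level":
        return next_token == "INTERPOLATION"
    if context == "quoted":
        return next_token in ("INTERPOLATION", "CLOSING_QUOTE")
    raise ValueError("unknown context: %s" % context)
```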

Actually, writing this, it'd probably be better to use multiple tokens instead of a single ESC, e.g. TOP_ESC, QSINGLE_ESC, QDOUBLE_ESC which would make it possible in _unescape() to check the token type to know what to do (instead of having to pass extra context to the function).

I don't mind doing it this way instead if you prefer, but first let me know if you'd still want to keep the lexer modes for quoted strings, or if I should just use the approach from #621 (which means going back to the quoted strings callback). The two implementations would be significantly different (with lexer modes at least I know which tokens I may need to escape, while in #621 I had to use regexes to deal with quoted strings, since the lexer provides just a QUOTED_VALUE token).

@odelalleau
Collaborator Author

This does have some appeal, but yeah - it's raising too much dust up the stack. we should probably avoid it.

Side note: this led me to check if it might be possible for the lexer to generate multiple tokens from a single rule, and apparently it's possible by overriding some of ANTLR's internals, but it's uncharted territory I'd rather stay away from ;)

Owner

@omry omry left a comment


Looking pretty clean.
Too hard to really be sure it's doing what it should looking at the code, but this is what tests are for.

See questions.

omegaconf/grammar/OmegaConfGrammarLexer.g4
omegaconf/grammar_visitor.py
@odelalleau
Collaborator Author

Too hard to really be sure it's doing what it should looking at the code, but this is what tests are for.

Exactly what I was thinking while working on it :)

Owner

@omry omry left a comment


Looking good.
See inline comments, no change is requested.

omegaconf/grammar/OmegaConfGrammarLexer.g4
omegaconf/grammar/OmegaConfGrammarParser.g4
omegaconf/grammar_visitor.py
@omry
Owner

omry commented May 6, 2021

Oh, can you run the Hydra core unit tests on this?

@odelalleau odelalleau merged commit e4681ef into omry:master May 7, 2021
@odelalleau odelalleau deleted the lighter_escaping branch May 7, 2021 02:25
@odelalleau
Collaborator Author

Oh, can you run the Hydra core unit tests on this?

Sure! They passed.
