
Grammar: only un-escape \\ in situations where escaping is required #708

Merged

odelalleau merged 4 commits into omry:master from lighter_escaping on May 7, 2021

Conversation

odelalleau
Collaborator

This simplifies situations where one wants to obtain a string containing multiple backslashes.

The situations where \\ needs to be un-escaped into \ are:

  1. In top-level strings: when they precede a (potentially escaped) interpolation, e.g. \\${foo} (=> \<value_of_foo>) or \\\${foo} (=> \${foo})
  2. In quoted strings: in the above situation or when they precede a (potentially escaped) quote, e.g.: "\\" (=> \), or "\\\"" (=> \")

In all other situations, \\ is left untouched, e.g.: "abc\\def" (quoted string => abc\\def), or \\foo_${bar} (top-level string => \\foo_<value_of_bar>)
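The rule above can be illustrated with a small sketch (illustrative only, not OmegaConf's actual implementation): a run of backslashes is halved exactly when it immediately precedes `${`; interpolation resolution itself is out of scope here.

```python
import re

# Illustrative sketch only (not OmegaConf's implementation): halve a run of
# backslashes when, and only when, it immediately precedes "${".
def unescape_backslashes(s: str) -> str:
    def repl(m):
        backslashes, inter_open = m.group(1), m.group(2)
        return backslashes[: len(backslashes) // 2] + inter_open
    return re.sub(r"(\\+)(\$\{)", repl, s)

unescape_backslashes(r"\\\${foo}")     # -> r"\${foo}"
unescape_backslashes(r"\\foo_${bar}")  # -> unchanged: r"\\foo_${bar}"
```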

Implementation note: I considered several alternatives and ended up doing most of the work at the lexer level. It makes the lexer more complex but has the advantage of leaving the un-escaping code the same (up to a minor change due to the way I extended the escaped interpolation token ESC_INTER). The other two alternatives I considered were:

  • Keeping the simpler lexer rules but making the un-escaping code in the visitor context-dependent (based on whether we are in a top-level, quoted or unquoted string, and what is the next token). This would essentially shift the complexity from the lexer to the visitor.
  • Including the \\ preceding an interpolation (or the closing quote in a quoted string) in the same token, e.g. TOP_INTER_OPEN: ESC_BACKSLASH* INTER_OPEN -> type(INTER_OPEN), pushMode(INTERPOLATION_MODE);. This would avoid having to use the lookahead predicate I am using right now. The main concern I have with this approach is that it makes the grammar confusing, because now an interpolation can be a regular interpolation ${foo}, or an interpolation preceded by backslashes \\${foo}, and this has a very different meaning at the top-level (the first one is a direct node interpolation while the second one is a string interpolation). I think that explicitly splitting the escaped \\ from the interpolation is cleaner.
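The lookahead-predicate approach can be illustrated with a toy tokenizer (hypothetical, not the real ANTLR lexer): a run of backslashes is emitted as an ESC token only when the two-character lookahead is `${`, mirroring the predicate's check on the next two input characters.

```python
import re

# Toy sketch of the lookahead predicate (not the actual ANTLR lexer):
# a run of backslashes becomes an ESC token only when the lookahead is "${".
def lex_backslash_run(s: str, i: int):
    m = re.match(r"\\+", s[i:])
    if m is None:
        raise ValueError("expected a backslash run at position %d" % i)
    run = m.group(0)
    j = i + len(run)
    token_type = "ESC" if s[j : j + 2] == "${" else "STR"
    return (token_type, run), j
```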


odelalleau commented May 6, 2021

Just a note that with this PR, the user-facing behavior is now almost the same as what I had in #621, with the exception that (i) here we can write a quoted string "${foo:"bar"}" without escaping the inner quotes, and (ii) the validation of quoted strings' content is now performed on assignment instead of on access. Compared to #621, in terms of implementation:

  • The lexer is more complex (extra modes for quoted strings, lookahead predicates to identify which \\ to un-escape)
  • The visitor is simpler (no more quoted string callback, no involved logic to figure out which \\ need un-escaping)

Personally I'd rather keep this PR than revert to #621, but I thought I'd mention it to make the trade-offs clear.

Owner

@omry omry left a comment


Keeping the simpler lexer rules but making the un-escaping code in the visitor context-dependent (based on whether we are in a top-level, quoted or unquoted string, and what is the next token). This would essentially shift the complexity from the lexer to the visitor.

I think I prefer this approach as it keeps the generated code portable (it could be used in the future to create OmegaConf for different languages).
Can you explain why this logic would be sensitive to all these conditions? (in a top-level, quoted or unquoted string, and what is the next token)

Including the \\ preceding an interpolation (or the closing quote in a quoted string) in the same token, e.g. TOP_INTER_OPEN: ESC_BACKSLASH* INTER_OPEN -> type(INTER_OPEN), pushMode(INTERPOLATION_MODE);. This would avoid having to use the lookahead predicate I am using right now. The main concern I have with this approach is that it makes the grammar confusing, because now an interpolation can be a regular interpolation ${foo}, or an interpolation preceded by backslashes \\${foo}, and this has a very different meaning at the top-level (the first one is a direct node interpolation while the second one is a string interpolation). I think that explicitly splitting the escaped \\ from the interpolation is cleaner.

This does have some appeal, but yeah - it's raising too much dust up the stack. we should probably avoid it.

omegaconf/grammar/OmegaConfGrammarLexer.g4

// Escaped backslashes: even number of \ followed by an interpolation.
// The interpolation must *not* be matched by this rule (this is why we use a predicate lookahead).
TOP_ESC: ESC_BACKSLASH+ { self._input.LA(1) == ord("$") and self._input.LA(2) == ord("{") }? -> type(ESC);
Owner


This is the first time we are adding custom code to ANTLR.
This means the lexer code is no longer portable and cannot be used to generate non-python code.
Is this strictly necessary or can we offload the logic to the visitor?

Collaborator Author


See discussion in the comment below: #708 (comment)

Collaborator Author


Following our chat, I moved that logic to the visitor in c5d3a07

@@ -255,8 +260,10 @@
("str_quoted_inter", "'${null}'", "None"),
("str_quoted_esc_single_1", r"'ab\'cd\'\'${str}'", "ab'cd''hi"),
("str_quoted_esc_single_2", r"""'\\\${foo}'""", r"\${foo}"),
("str_quoted_esc_single_3", r"""'\\a_${str}\\'""", r" \\a_hi\ ".strip()),
Owner


This is testing:

  1. that double backslash is not unescaped before string interpolation
  2. that double backslash is unescaped before a trailing quote.

I would prefer to split this into two different tests.

Collaborator Author

@odelalleau odelalleau May 6, 2021


Point 2 is actually tested already in the tests for quoted strings without interpolations:

("str_quoted_trailing_esc_1", r"'abc\\'", r" abc\ ".strip()),

but I wanted to also have it tested with interpolations.
Anyway, I split it in two in f94e3da

("str_quoted_esc_double_1", r'"ab\"cd\"\"${str}"', 'ab"cd""hi'),
("str_quoted_esc_double_2", r'''"\\\${foo}"''', r"\${foo}"),
("str_quoted_esc_double_3", r'''"\\a_${str}\\"''', r" \\a_hi\ ".strip()),
Owner


Suggested change
("str_quoted_esc_double_3", r'''"\\a_${str}\\"''', r" \\a_hi\ ".strip()),
("str_quoted_esc_double_3", r'''"\\a_${str}\\"''', r"\\a_hi\ ".strip()),

Collaborator Author


I had purposely added a leading whitespace to all strings where I use the strip() trick, as I personally find it easier to read. If I'm changing this one I should change all the others. Just checking first: do you indeed prefer to only have a trailing whitespace everywhere?

@@ -255,8 +260,10 @@
("str_quoted_inter", "'${null}'", "None"),
("str_quoted_esc_single_1", r"'ab\'cd\'\'${str}'", "ab'cd''hi"),
("str_quoted_esc_single_2", r"""'\\\${foo}'""", r"\${foo}"),
("str_quoted_esc_single_3", r"""'\\a_${str}\\'""", r" \\a_hi\ ".strip()),
Owner


Suggested change
("str_quoted_esc_single_3", r"""'\\a_${str}\\'""", r" \\a_hi\ ".strip()),
("str_quoted_esc_single_3", r"""'\\a_${str}\\'""", r"\\a_hi\ ".strip()),

("str_top_leading_escapes_2", r"\\\\ \${str}", r"\\\\ ${str}"),
("str_top_middle_escapes_1", r"abc\\\\\${str}", r"abc\\${str}"),
("str_top_middle_escapes_2", r"abc\\\\ \${str}", r"abc\\\\ ${str}"),
("str_top_trailing_escapes", "${str}" + "\\" * 5, "hi" + "\\" * 5),
Owner


I don't think we need 5 \\ here. this is easier to understand and testing the same thing as far as I can tell:

Suggested change
("str_top_trailing_escapes", "${str}" + "\\" * 5, "hi" + "\\" * 5),
("str_top_trailing_escapes", r"${str}\\ ".strip(), r"hi\\".strip()),

Collaborator Author


I made it 3 \\\ in 1eb0dcc

@odelalleau
Collaborator Author

Keeping the simpler lexer rules but making the un-escaping code in the visitor context-dependent (based on whether we are in a top-level, quoted or unquoted string, and what is the next token). This would essentially shift the complexity from the lexer to the visitor.

I think I prefer this approach as it keeps the generated code portable (it could be used in the future to create OmegaConf for different languages).
Can you explain why this logic would be sensitive to all these conditions? (in a top-level, quoted or unquoted string, and what is the next token)

If we go this route we may consider just reverting to #621, which simplifies the lexer even further (the quoted string modes may be overkill once custom escaping logic in the visitor solves the main ugly escaping situations that we originally introduced these modes for).

But to answer your question, right now the un-escaping phase in the visitor relies on the _unescape() method which simply unescapes two types of lexer tokens: ESC (for any escape sequence where un-escaping consists in keeping every other character), and ESC_INTER (for escaped interpolations). This method is called both from visitText() (used in both top-level and quoted strings) and _createPrimitive() (corresponding to unquoted strings).

If we remove the lookahead predicate from the lexer, it won't be able to tell the difference between a sequence of \\ that precedes an interpolation (or a quote in a quoted string) and one that doesn't: both will be mapped to the same token ESC. Which means that in _unescape() when we see an ESC token:

  • If it's in an unquoted string we can un-escape as usual
  • If it's in a top-level string we need to check the next child, and if it's an interpolation we should un-escape, otherwise we shouldn't
  • If it's in a quoted string we need to check the next child, and if it's an interpolation or the closing quote we should un-escape, otherwise we shouldn't
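The three cases could be sketched as a small predicate in the visitor (hypothetical names, not the actual OmegaConf code):

```python
# Hypothetical sketch of the context-dependent check (names invented here):
# decide whether a run of escaped backslashes should be un-escaped, given
# the kind of string we are in and the next child/token.
def should_unescape(context: str, next_token: str) -> bool:
    if context == "unquoted":
        return True  # un-escape as usual
    if context == "top_level":
        return next_token == "INTERPOLATION"
    if context == "quoted":
        return next_token in ("INTERPOLATION", "CLOSING_QUOTE")
    raise ValueError("unknown context: %s" % context)
```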

Actually, writing this, it'd probably be better to use multiple tokens instead of a single ESC, e.g. TOP_ESC, QSINGLE_ESC, QDOUBLE_ESC which would make it possible in _unescape() to check the token type to know what to do (instead of having to pass extra context to the function).

I don't mind doing it this way instead if you prefer, but first let me know if you'd still want to keep the lexer modes for quoted strings, or if I should just use the approach from #621 (which means going back to the quoted strings callback). The two implementations would be significantly different (with lexer modes at least I know which tokens I may need to escape, while in #621 I had to use regexes to deal with quoted strings, since the lexer provides just a QUOTED_VALUE token).

@odelalleau
Collaborator Author

This does have some appeal, but yeah - it's raising too much dust up the stack. we should probably avoid it.

Side note: this led me to check if it might be possible for the lexer to generate multiple tokens from a single rule, and apparently it's possible by overriding some of ANTLR's internals, but it's uncharted territory I'd rather stay away from ;)

Owner

@omry omry left a comment


Looking pretty clean.
Too hard to really be sure it's doing what it should looking at the code, but this is what tests are for.

See questions.

omegaconf/grammar/OmegaConfGrammarLexer.g4
omegaconf/grammar_visitor.py
@odelalleau
Collaborator Author

Too hard to really be sure it's doing what it should looking at the code, but this is what tests are for.

Exactly what I was thinking while working on it :)

Owner

@omry omry left a comment


Looking good.
See inline comments, no change is requested.

omegaconf/grammar/OmegaConfGrammarLexer.g4
omegaconf/grammar/OmegaConfGrammarParser.g4
omegaconf/grammar_visitor.py
@omry
Owner

omry commented May 6, 2021

Oh, can you run the Hydra core unit tests on this?

@odelalleau odelalleau merged commit e4681ef into omry:master May 7, 2021
@odelalleau odelalleau deleted the lighter_escaping branch May 7, 2021 02:25
@odelalleau
Collaborator Author

Oh, can you run the Hydra core unit tests on this?

Sure! They passed.
