-
Notifications
You must be signed in to change notification settings - Fork 116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Grammar: only un-escape \\ in situations where escaping is required #708
Conversation
This simplifies situations where one wants to obtain a string containing multiple backslashes.
Just a note that with this PR, the user-facing behavior is now almost the same as what I had in #621, with the exception that (i) here we can write a quoted string
Personally I'd rather keep this PR rather than revert back to #621, but I thought I'd mention it to make the trade-offs clear. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keeping the simpler lexer rules but making the un-escaping code in the visitor context-dependent (based on whether we are in a top-level, quoted or unquoted string, and what is the next token). This would essentially shift the complexity from the lexer to the visitor.
I think I prefer this approach as it keeps the generated code portable (it could be used in the future to create OmegaConf for different languages).
Can you explain why this logic would be sensitive to all these conditions? (in a top-level, quoted or unquoted string, and what is the next token)
Including the \ preceding an interpolation (or the closing quote in a quoted string) in the same token, e.g. TOP_INTER_OPEN: ESC_BACKSLASH* INTER_OPEN -> type(INTER_OPEN), pushMode(INTERPOLATION_MODE);. This would avoid having to use the lookahead predicate I am using right now. The main concern I have with this approach is that it makes the grammar confusing, because now an interpolation can be a regular interpolation ${foo}, or an interpolation preceded by backslashes \${foo}, and this has a very different meaning at the top-level (the first one is a direct node interpolation while the second one is a string interpolation). I think that explicitly splitting the escaped \ from the interpolation is cleaner.
This does have some appeal, but yeah - it's raising too much dust up the stack. we should probably avoid it.
|
||
// Escaped backslashes: even number of \ followed by an interpolation. | ||
// The interpolation must *not* be matched by this rule (this is why we use a predicate lookahead). | ||
TOP_ESC: ESC_BACKSLASH+ { self._input.LA(1) == ord("$") and self._input.LA(2) == ord("{") }? -> type(ESC); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the first time we are adding custom code to ANTLR.
This means the lexer code is no longer portable and cannot be used to generate non-python code.
Is this strictly necessary or can we offload the logic to the visitor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See discussion in the comment below: #708 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following our chat, I moved that logic to the visitor in c5d3a07
tests/test_grammar.py
Outdated
@@ -255,8 +260,10 @@ | |||
("str_quoted_inter", "'${null}'", "None"), | |||
("str_quoted_esc_single_1", r"'ab\'cd\'\'${str}'", "ab'cd''hi"), | |||
("str_quoted_esc_single_2", r"""'\\\${foo}'""", r"\${foo}"), | |||
("str_quoted_esc_single_3", r"""'\\a_${str}\\'""", r" \\a_hi\ ".strip()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is testing:
- that double backslash is not unescaped before string interpolation
- that double backslash is unescaped before a trailing quote.
I would prefer to split this into two different tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Point 2 is actually tested already in the tests for quoted strings without interpolations:
("str_quoted_trailing_esc_1", r"'abc\\'", r" abc\ ".strip()),
but I wanted to also have it tested with interpolations.
Anyway, I split it in two in f94e3da
tests/test_grammar.py
Outdated
("str_quoted_esc_double_1", r'"ab\"cd\"\"${str}"', 'ab"cd""hi'), | ||
("str_quoted_esc_double_2", r'''"\\\${foo}"''', r"\${foo}"), | ||
("str_quoted_esc_double_3", r'''"\\a_${str}\\"''', r" \\a_hi\ ".strip()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
("str_quoted_esc_double_3", r'''"\\a_${str}\\"''', r" \\a_hi\ ".strip()), | |
("str_quoted_esc_double_3", r'''"\\a_${str}\\"''', r"\\a_hi\ ".strip()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had purposedly added a leading whitespace to all strings where I use the strip() trick as I personally find it easier to read. If I'm changing this one I should change all others. Just checking first that you indeed prefer to only have a trailing whitespace everywhere?
tests/test_grammar.py
Outdated
@@ -255,8 +260,10 @@ | |||
("str_quoted_inter", "'${null}'", "None"), | |||
("str_quoted_esc_single_1", r"'ab\'cd\'\'${str}'", "ab'cd''hi"), | |||
("str_quoted_esc_single_2", r"""'\\\${foo}'""", r"\${foo}"), | |||
("str_quoted_esc_single_3", r"""'\\a_${str}\\'""", r" \\a_hi\ ".strip()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
("str_quoted_esc_single_3", r"""'\\a_${str}\\'""", r" \\a_hi\ ".strip()), | |
("str_quoted_esc_single_3", r"""'\\a_${str}\\'""", r"\\a_hi\ ".strip()), |
tests/test_grammar.py
Outdated
("str_top_leading_escapes_2", r"\\\\ \${str}", r"\\\\ ${str}"), | ||
("str_top_middle_escapes_1", r"abc\\\\\${str}", r"abc\\${str}"), | ||
("str_top_middle_escapes_2", r"abc\\\\ \${str}", r"abc\\\\ ${str}"), | ||
("str_top_trailing_escapes", "${str}" + "\\" * 5, "hi" + "\\" * 5), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need 5 \\
here. this is easier to understand and testing the same thing as far as I can tell:
("str_top_trailing_escapes", "${str}" + "\\" * 5, "hi" + "\\" * 5), | |
("str_top_trailing_escapes", r"${str}\\ ".strip(), r"hi\\".strip()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made it 3 \\\
in 1eb0dcc
If we go this route we may consider just reverting back to #621 which simplifies the lexer even further (the quoted string modes may be overkill once custom escaping logic in the visitor solves the main ugly escaping situations that we originally introduced these modes for). But to answer your question, right now the un-escaping phase in the visitor relies on the If we remove the lookahead predicate from the lexer, it won't be able to make the difference between a sequence of
Actually, writing this, it'd probably be better to use multiple tokens instead of a single I don't mind doing it this way instead if you prefer, but first let me know if you'd still want to keep the lexer modes for quoted strings, or if I should just use the approach from #621 (which means going back to the quoted strings callback). The two implementations would be significantly different (with lexer modes at least I know which tokens I may need to escape, while in #621 I had to use regexes to deal with quoted strings, since the lexer provides just a |
Side note: this led me to check if it might be possible for the lexer to generate multiple tokens from a single rule, and apparently it's possible by overriding some of ANTLR's internals, but it's uncharted territory I'd rather stay away from ;) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking pretty clean.
Too hard to really be sure it's doing what it should looking at the code, but this is what tests are for.
See questions.
Exactly what I was thinking while working on it :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good.
See inline comments, no change is requested.
Oh, can you run the Hydra core unit tests on this? |
Sure! They passed. |
This simplifies situations where one wants to obtain a string containing multiple backslashes.
The situations where
\\
needs to be un-escaped into\
are:\\${foo}
(=>\<value_of_foo>
) or\\\${foo}
(->\${foo}
)"\\"
(=>\
), or"\\\""
(=>\"
)In all other situations,
\\
is left untouched, e.g.:"abc\\def"
(quoted string =>abc\\def
), or\\foo_${bar}
(top-level string =>\\foo_<value_of_bar>
)Implementation note: I considered several alternatives and ended up doing most of the work at the lexer level. It makes the lexer more complex but has the advantage of leaving the un-escaping code the same (up to a minor change due to the way I extended the escaped interpolation token ESC_INTER). The other two alternatives I considered were:
\\
preceding an interpolation (or the closing quote in a quoted string) in the same token, e.g.TOP_INTER_OPEN: ESC_BACKSLASH* INTER_OPEN -> type(INTER_OPEN), pushMode(INTERPOLATION_MODE);
. This would avoid having to use the lookahead predicate I am using right now. The main concern I have with this approach is that it makes the grammar confusing, because now an interpolation can be a regular interpolation${foo}
, or an interpolation preceded by backslashes\\${foo}
, and this has a very different meaning at the top-level (the first one is a direct node interpolation while the second one is a string interpolation). I think that explicitly splitting the escaped\\
from the interpolation is cleaner.