Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quoted strings are not always properly parsed by the grammar #617

Closed
odelalleau opened this issue Mar 17, 2021 · 15 comments · Fixed by #630
Closed

Quoted strings are not always properly parsed by the grammar #617

odelalleau opened this issue Mar 17, 2021 · 15 comments · Fixed by #630
Labels
bug Something isn't working

Comments

@odelalleau
Copy link
Collaborator

odelalleau commented Mar 17, 2021

Describe the bug

There exist tricky edge cases where quoted strings are not parsed as expected. Here is an example:

OmegaConf.clear_resolvers()
OmegaConf.register_new_resolver("identity", lambda x: x)
cfg = OmegaConf.create({
    "a": r""" ${identity: "hello\\" }"} """
})
print(cfg.a)  # ' hello\\" } '

The output is hello\" } but the expected output is hello\"} (the difference is no space between the " and })

The reason for this lies in the regex for quoted strings:

QUOTED_VALUE:
      '\'' ('\\\''|.)*? '\'' // Single quotes, can contain escaped single quote : /'
    | '"' ('\\"'|.)*? '"' ;  // Double quotes, can contain escaped double quote : /"

Because of the \\" in the parsed string, it believes that this quote is an escaped quote, and thus keeps parsing until the next quote is found. However, the intent here is for \\" to represent a backslash followed by a quote, i.e., the string should end with this backslash.

(if you wonder why the test case looks so weird, it is so that it can run error-free with and without a potential fix)

Expected behavior

See expected output.

Additional context

There is a link to #615: if we didn't care about quoted strings ending with a backslash, we wouldn't need to escape backslashes, which would solve both issues at the same time. But I think we do care :)

@odelalleau odelalleau added the bug Something isn't working label Mar 17, 2021
@omry
Copy link
Owner

omry commented Mar 17, 2021

The output is hello\" } but the expected output is hello\"} (the difference is no space between the " and })
The output in the comment is not hello\" } but hello\\" }. Which one is it?

Looking at this:
r""" ${identity: "hello\\" }"} """

I really have no idea what you are going for here.
You are trying to go for a hello\ or are you trying to escape the double quote in the string so that it will appear in the resulting output? are you trying to do both?

@omry
Copy link
Owner

omry commented Mar 17, 2021

On master, the output is ' hello" } ':

In [57]: OmegaConf.clear_resolvers()
    ...: OmegaConf.register_new_resolver("identity", lambda x: x)
    ...: cfg = OmegaConf.create({
    ...:     "a": r""" ${identity: "hello\\" }"} """
    ...: })
    ...: print(f"|{cfg.a}|")
| hello\" } |

@odelalleau
Copy link
Collaborator Author

The output is hello\" } but the expected output is hello\"} (the difference is no space between the " and })
The output in the comment is not hello\" } but hello\\" }. Which one is it?

Looking at this:
r""" ${identity: "hello\\" }"} """

I really have no idea what you are going for here.
You are trying to go for a hello\ or are you trying to escape the double quote in the string so that it will appear in the resulting output? are you trying to do both?

So, the intention is for ${identity: "hello\\" } to be an interpolation block where we pass to identity the string containing the following characters: hello\

However, the current code instead considers that the interpolation block is ${identity: "hello\\" }"} and provides to identity the string containing the following characters: hello\" }

@odelalleau
Copy link
Collaborator Author

On master, the output is ' hello" } ':

No, your code shows there is a backslash too (which isn't the problem -- the problem is that there should be no space after the double quote)

@omry
Copy link
Owner

omry commented Mar 17, 2021

So, the intention is for ${identity: "hello\\" } to be an interpolation block where we pass to identity the string containing the following characters: hello\

Like the following failed attempts?

# not working:
OmegaConf.create({"a": r"${identity: hello\}"}).a
...
GrammarParseError: missing BRACE_CLOSE at '<EOF>'
    full_key: a
    object_type=dict

# not working:
OmegaConf.create({"a": r"${identity: hello\\}"}).a
'hello\\'

OmegaConf.create({"a": r"${identity: 'hello\'}"}).a
'hello\\'

OmegaConf.create({"a": r"${identity: 'hello\\'}"}).a
'hello\\'

@omry
Copy link
Owner

omry commented Mar 17, 2021

Can you show me how you get Python to print 'hello\' (hello with a tailing backslash).

@odelalleau
Copy link
Collaborator Author

Like the following failed attempts?

Your last 3 examples actually worked, but:

OmegaConf.create({"a": r"${identity: hello\\}"}).a

=> only works without a quoted string because hello only has standard characters allowed in unquoted strings, replace it with h^llo for instance and you can't do it without quotes anymore

OmegaConf.create({"a": r"${identity: 'hello\'}"}).a

=> works because the regex extracting the quoted string actually identifies 'hello\' and doesn't go further, since there is no single quote afterwards. This is why this problem hasn't surfaced before, you need a really specific setup to highlight it (like the one in my example)

OmegaConf.create({"a": r"${identity: 'hello\\'}"}).a

=> same thing, the quoted string is 'hello\\' and the double escape gets converted in a single one during the un-escape phase of quoted strings

Can you show me how you get Python to print 'hello\' (hello with a tailing backslash).

Easy: print('hello\\')

What happened is you haven't been using print() explicitly, and the Python interpreter prints the output of __repr__() by default. And:

print(repr(r"There is a single backslash \ in this string"))
'There is a single backslash \\ in this string'

=> always use print() when there are \ in the string, otherwise it's quite confusing :)

odelalleau added a commit to odelalleau/omegaconf that referenced this issue Mar 18, 2021
* In top-level strings, the only backslashes that can and should be
  escaped are those preceding an interpolation. For instance, if "x"
  is a node evaluating to "X":
        \${x}   -> ${x}
        \\${x}  -> \X
        \\\${x} -> \${x}
        ${x}\\  -> X\\

* In quoted strings, the only backslashes that can and should be escaped
  are those preceding a quote of the same type as the enclosing quotes.
  For instance:
        "abc\"def" -> abc"def
        "abc\\" -> abc\
        "abc\\\"def" -> abc\"def
        "abc\\\'def" -> abc\\\'def

This also fixes a bug with the parsing of quoted strings (see omry#617).

Fixes omry#615
Fixes omry#617
odelalleau added a commit to odelalleau/omegaconf that referenced this issue Mar 18, 2021
* In top-level strings, the only backslashes that can and should be
  escaped are those preceding an interpolation. For instance, if "x"
  is a node evaluating to "X":
        \${x}   -> ${x}
        \\${x}  -> \X
        \\\${x} -> \${x}
        ${x}\\  -> X\\

* In quoted strings, the only backslashes that can and should be escaped
  are those preceding a quote of the same type as the enclosing quotes.
  For instance:
        "abc\"def" -> abc"def
        "abc\\" -> abc\
        "abc\\\"def" -> abc\"def
        "abc\\\'def" -> abc\\\'def

This also fixes a bug with the parsing of quoted strings (see omry#617).

Fixes omry#615
Fixes omry#617
odelalleau added a commit to odelalleau/omegaconf that referenced this issue Mar 19, 2021
@omry
Copy link
Owner

omry commented Mar 19, 2021

This time I paid special attention to how GitHub is rendering this crap, hopefully it will be clearer than my past attempts:

Going back to the original example:

There exist tricky edge cases where quoted strings are not parsed as expected. Here is an example:

OmegaConf.clear_resolvers()
OmegaConf.register_new_resolver("identity", lambda x: x)
cfg = OmegaConf.create({
    "a": r""" ${identity: "hello\\" }"} """
})
print(cfg.a)  # ' hello\\" } '

The output is hello\" } but the expected output is hello\"} (the difference is no space between the " and })

It's not clear to me what you are after with

r""" ${identity: "hello\\" }"} """

This seems ambiguous to me:

Interpretation 1:
\\ followed by a " => \" (backslash and then the string ends), which should create a syntax error because of the characters following the end of the string.

Interpretation 2:
\ followed by a \" => \ followed by " not closing the quoted string.
We and up with a singleton \ after un-escaping which I think should trigger an error.

To get your desired output, this seems to work and makes sense to me:

In [45]: cfg = OmegaConf.create({"a": r'${identity:"hello\\\"} "}'}); print(f"'{cfg.a}'");cfg.a == r"hello\"} "
'hello\"} '
Out[45]: True

@odelalleau
Copy link
Collaborator Author

It's not clear to me what you are after with

r""" ${identity: "hello\\" }"} """

This seems ambiguous to me:

Interpretation 1:
\\ followed by a " => \" (backslash and then the string ends), which should create a syntax error because of the characters following the end of the string.

That's (almost) the correct interpretation: since there are two \\ before the ", the string should end.
There should be no syntax error though, because users are free to use any character they want outside of interpolations. The following examples should all be valid:

OmegaConf.create({
    "a": "Hello ${world}!",  # nothing special
    "b": "Hello '${world}'!",  # kind of a quoted interpolation, though nothing special is done with the quotes
    "c": "Hello ${world}'!",  # quote mismatch but we don't care
    "d": "Hello ${world}}'!",  # quote and brace mismatch but we still don't care
})

To get your desired output, this seems to work and makes sense to me:
(...)

Just to be clear, the purpose of this issue isn't that we can't get a specific output, but to show a situation where the current code isn't outputting what it should.

@omry
Copy link
Owner

omry commented Mar 19, 2021

It's not clear to me what you are after with

r""" ${identity: "hello\\" }"} """

This seems ambiguous to me:
Interpretation 1:
\\ followed by a " => \" (backslash and then the string ends), which should create a syntax error because of the characters following the end of the string.

That's (almost) the correct interpretation: since there are two \\ before the ", the string should end.
There should be no syntax error though, because users are free to use any character they want outside of interpolations.

But we are not outside of an interpolation.
We are inside

${identity: ...}.

the ... should be a sequence of valid elements. we have ended an element (the string). the next thing should be a comma or closing the custom resolver. (ignoring whitespace).

r""" ${identity: "hello\\", foo, bar} """

@odelalleau
Copy link
Collaborator Author

odelalleau commented Mar 19, 2021

the ... should be a sequence of valid elements. we have ended an element (the string). the next thing should be a comma or closing the custom resolver. (ignoring whitespace).

I agree, and that's actually the case in r""" ${identity: "hello\\" }"} """: the next thing after the string ends is closing the custom resolver with the }, and what comes after belongs to the top-level string and should just be copied "as is"

@omry
Copy link
Owner

omry commented Mar 19, 2021

Okay, I FINALLY understand what you did there.
I thought that this was your string: "hello\\" }" and this was the outer interpolation: ${identity: "hello\\" }"}.

I will take another look at this later.

@odelalleau
Copy link
Collaborator Author

Okay, I FINALLY understand what you did there.

Sorry, I know it's a confusing example, but that's the whole point: it is confusing to the current parser.

I thought that this was your string: "hello\\" }" and this was the outer interpolation: ${identity: "hello\\" }"}.

So did the grammar :)

@omry
Copy link
Owner

omry commented Mar 23, 2021

the ... should be a sequence of valid elements. we have ended an element (the string). the next thing should be a comma or closing the custom resolver. (ignoring whitespace).

I agree, and that's actually the case in r""" ${identity: "hello\\" }"} """: the next thing after the string ends is closing the custom resolver with the }, and what comes after belongs to the top-level string and should just be copied "as is"

r""" ${identity: "hello\\" }"} """

So your intention is that the identity is getting "hello\", and outside of the custom resolver you have }" (space, brace, double quote)?

Shouldn't the trailing double quote be an error then? It should be the beginning of a new quotes string that is never terminating.

At a high level, I agree that when we see "hello\\", we should interpret it as a double quoted hello\.

@odelalleau
Copy link
Collaborator Author

r""" ${identity: "hello\\" }"} """

So your intention is that the identity is getting "hello\", and outside of the custom resolver you have }" (space, brace, double quote)?

Correct.

Shouldn't the trailing double quote be an error then? It should be the beginning of a new quotes string that is never terminating.

No because what's outside of interpolations is free-form, you can use quotes however you want. But you can add an extra quote if you prefer -- it shouldn't change the underlying issue.

At a high level, I agree that when we see "hello\\", we should interpret it as a double quoted hello\.

Ok good, we're on the same page then :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
2 participants