Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Escaped interpolations trigger a GrammarParseError if they don't follow the expected interpolation grammar #666

Closed
odelalleau opened this issue Apr 7, 2021 · 14 comments · Fixed by #709
Assignees
Labels
bug Something isn't working
Milestone

Comments

@odelalleau
Copy link
Collaborator

Describe the bug
The following code triggers a GrammarParseError:

cfg = OmegaConf.create({"x": "\\${not an interpolation}"})
cfg.x

Expected behavior

There should be no grammar syntax constraint after a \${ since it is not a real interpolation.

Discussion

What happens is that when we dereference cfg.x, the resulting string ("${not an interpolation}") is wrapped within a node (AnyNode because here x is untyped), whose constructor calls ValueNode._set_value(), that checks if the the value is an interpolation... and this check raises an exception because it's a badly formed interpolation.

The bandaid fix is to set strict_interpolation_validation=False at

value, strict_interpolation_validation=True

It would break the tests that specifically test for this check, but overall things would still work fine.

This bandaid fix, however, wouldn't change the fact that cfg._get_node("x")._dereference_node()._is_interpolation() is True, i.e., when we dereference cfg.x we obtain a node that believes it is an interpolation (!). Right now this isn't really a problem because we don't do anything with this node except getting its value, but it may bite us later.

Right now I don't have a solution that I really like in mind, so posting this issue for discussion.
Here are options I've thought about so far:

  • The bandaid fix mentioned above (currently my preferred option for 2.1)
  • Creating a subclass of StringNode specifically when an interpolation result is a string with ${. This subclass would override what it takes to avoid a crash. One issue with this is that if cfg.x is actually typed, we'd like to return another type of node (ex: IntegerNode), so there should be an exception instead (it seems doable, just a bit clumsy)
  • Adding a flag to StringNode that says "I am not an interpolation" even if the value looks like one. This would be similar to the previous idea, but without adding an extra class.
  • Add a more general flag to all nodes, is_interpolation={True,False,None}, where None would be the current behavior (check if the value is a string looking like an interpolation), False would make it possible to "declare" the value as not being an interpolation (even if it looks like one), and True would declare it as being an interpolation (regardless of the value). Besides fixing this issue, this could enable more things like (i) caching the interpolation status in that flag for faster checks, or (ii) enabling new kinds of interpolations that aren't tied to the string content (e.g. to point to config nodes that aren't accessible by a path from the root config, as I once did in a draft PR). I kinda like this option but I expect it to be non trivial and uncover issues I haven't thought about yet, so I'd rather not go there in 2.1

Thoughts?

@odelalleau odelalleau added the bug Something isn't working label Apr 7, 2021
@odelalleau odelalleau added this to the OmegaConf 2.1 milestone Apr 7, 2021
@odelalleau odelalleau self-assigned this Apr 7, 2021
@omry
Copy link
Owner

omry commented Apr 8, 2021

The bandaid fix as you call it is also a performance concern.

The idea with having a fast is_interpolation that can return False positives was to improve performance by avoiding full grammar parse to detect interpolations.

The first solution should be acceptable if the performance is good in some real world scenarios.

The second solution sounds terrible.

The last two are going in a direction of caching the parsing results.
I think this is a good direction, we can cache the output of a strict get_value_kind and even the parse tree itself if it makes sense.
This should definitely not happen in 2.1 though, I think it's a pretty big change and it's not trivial to achieve that without potentially paying a significant cost during config initialization (because we are doing more real parses).

Right now the logic in get_value_kind is:

    # We identify potential interpolations by the presence of "${" in the string.
    # Note that escaped interpolations (ex: "esc: \${bar}") are identified as
    # interpolations: this is intended, since they must be processed as interpolations
    # for the string to be properly un-escaped.
    # Keep in mind that invalid interpolations will only be detected when
    # `strict_interpolation_validation` is True.
    if isinstance(value, str) and "${" in value:
        if strict_interpolation_validation:
            # First try the cheap regex matching that detects common interpolations.
            if SIMPLE_INTERPOLATION_PATTERN.match(value) is None:
                # If no match, do the more expensive grammar parsing to detect errors.
                parse(value)
        return ValueKind.INTERPOLATION

One thing that looks weird to me is: what is the purpose of calling parse(value)?
Is the intention if get_value_kind to raise an exception?
I think we have discussed it, but looking now it looks weird. My expectation is that if the string is not valid then ValueKind should be VALUE.

Second thing that jumps out:
why do we need the flag strict_interpolation_validation?
What would be the cost of just running the regex for any string that passes the initial ${ sniff test?

@odelalleau
Copy link
Collaborator Author

The bandaid fix as you call it is also a performance concern.

It's actually improving performance (strict_interpolation_validation=False is faster, it just won't detect grammar errors).

This should definitely not happen in 2.1 though

Agreed.

One thing that looks weird to me is: what is the purpose of calling parse(value)?
Is the intention if get_value_kind to raise an exception?

Yes. If I remember correctly personally I was fine with having strict_interpolation_validation=False everywhere and letting the code crash on access, but you preferred to have a check earlier (which led to the simple regex to avoid paying too much of a performance hit).

I think we have discussed it, but looking now it looks weird. My expectation is that if the string is not valid then ValueKind should be VALUE.

I think that would cause problems because badly formed interpolations wouldn't be detected anymore.

Second thing that jumps out:
why do we need the flag strict_interpolation_validation?

Same reason as above: to detect errors early.

What would be the cost of just running the regex for any string that passes the initial ${ sniff test?

Running the regex is pointless if we don't do anything when the regex doesn't return a match: the regex is only here to avoid the costly grammar parsing in common situations.

@omry
Copy link
Owner

omry commented Apr 8, 2021

You misunderstood me.
I meant this:

    if isinstance(value, str) and "${" in value:
        # First try the cheap regex matching that detects common interpolations.
        if SIMPLE_INTERPOLATION_PATTERN.match(value) is None:
            # If no match, do the more expensive grammar parsing to detect errors.
            parse(value)
        return ValueKind.INTERPOLATION

@omry
Copy link
Owner

omry commented Apr 8, 2021

The bandaid fix is to set strict_interpolation_validation=False at

Now I am confused.
I though the solution was to be more strict, not less strict.

If this is not an interpolation, why does the the strictest _get_value_kind says that it is?
You are already parsing the value. at this point you can check if it's an interpolation or not based on what it is, and not by the fact that it passed parsing.

@odelalleau
Copy link
Collaborator Author

You misunderstood me.
I meant this:

    if isinstance(value, str) and "${" in value:
        # First try the cheap regex matching that detects common interpolations.
        if SIMPLE_INTERPOLATION_PATTERN.match(value) is None:
            # If no match, do the more expensive grammar parsing to detect errors.
            parse(value)
        return ValueKind.INTERPOLATION

What's your objective here? If we remove the strict_interpolation_validation flag then we always validate interpolation syntax, which (i) doesn't help with this issue, and (ii) incurs a performance cost.

The bandaid fix is to set strict_interpolation_validation=False at

Now I am confused.
I though the solution was to be more strict, not less strict.

It is to be less strict so that we don't notice that the string is a badly formed interpolation.

If this is not an interpolation, why does the the strictest _get_value_kind says that it is?
You are already parsing the value. at this point you can check if it's an interpolation or not based on what it is, and not by the fact that it passed parsing.

I'm not sure I understand your last sentence, but I think the confusion comes from me not explaining well enough what is happening. Let me write in more details the steps that occur when we run this code:

cfg = OmegaConf.create({"x": r"\${not an interpolation}"})
cfg.x
  1. When the cfg.x statement is executed, the x node is fetched and its _maybe_resolve_interpolation() method is called
  2. In _maybe_resolve_interpolation() we detect that it is an interpolation (escaped interpolations are considered as a special kind of string interpolation, this is required so that they can be properly un-escaped)
  3. Then we obtain the parse tree with the grammar and call _resolve_interpolation_from_parse_tree()
  4. This calls resolve_parse_tree() which runs the grammar visitor on the string "\${not an interpolation}" and returns the resulting string resolved = "${not an interpolation}" (so far so good)
  5. The next step in _resolve_interpolation_from_parse_tree() is to call _validate_and_convert_interpolation_result(). In that function, we detect that resolved is a string and thus has to be wrapped into a node, so we call _wrap_interpolation_result() on it, which calls _node_wrap()
  6. Since the type passed to _node_wrap() is Any (the type of the x node), _node_wrap() attempts to create an AnyNode whose value is resolved, i.e. the string "${not an interpolation}"
  7. That's where we have a problem: the constructor of AnyNode calls the constructor of ValueNode, which calls ValueNode._set_value(), which checks if the input is an interpolation with strict_interpolation_validation=True... which raises an exception because "${not an interpolation}" is invalid syntax for an interpolation.

@omry
Copy link
Owner

omry commented Apr 8, 2021

What's your objective here? If we remove the strict_interpolation_validation flag then we always validate interpolation syntax, which (i) doesn't help with this issue, and (ii) incurs a performance cost.

The objective here was to to always be strict, which would solve the wrong problem if this failing by not being strict enough.
The performance cost should be negligible because very few strings would pass the ${ check without actually having an interpolation.
In any case, it's not important.

Going to look at the second part of your answer now.

@odelalleau
Copy link
Collaborator Author

The performance cost should be negligible because very few strings would pass the ${ check without actually having an interpolation.

Just to be sure it's clear, the performance cost would come from the various calls to _get_value_kind() that occur across the codebase after assignment, that now would all trigger the regex (and possibly full grammar) checks, which are slower than just checking for the presence of ${. With the current implementation we only run those checks on assignment.

Also all strings that contain ${ must be considered interpolations (i.e, "very few" is actually "zero").

@omry
Copy link
Owner

omry commented Apr 8, 2021

In _maybe_resolve_interpolation() we detect that it is an interpolation (escaped interpolations are considered as a special kind of string interpolation, this is required so that they can be properly un-escaped)

This could be one way to attack this issue.
Can you give more examples of what considering escaped interpolations as special kind of interpolations buys us?

Went through the rest and ran it in the debugger.
I think this boils down to get_value_kind raising an exception when confronted with a malformed interpolation instead of treating it as a value.

This is a change from the behavior of OmegaConf 2.0, which allows malformed interpolations as values:
(In fact, this is an undocumented breaking change).

In [1]: from omegaconf import *

In [2]: __version__
Out[2]: '2.0.6'

In [3]: cfg = OmegaConf.create({"x": r"${n"})

In [4]: cfg.x
Out[4]: '${n'

We have discussed it before, and I suggested that get_value_kind() does not raise exceptions.

Changing _get_value_kind interpolation detection to this:

    if isinstance(value, str) and "${" in value:
        if strict_interpolation_validation:
            # First try the cheap regex matching that detects common interpolations.
            if SIMPLE_INTERPOLATION_PATTERN.match(value) is None:
                # If no match, do the more expensive grammar parsing to detect errors.
                try:
                    parse(value)
                except GrammarParseError:
                    return ValueKind.VALUE

Fixes this particular issue and is breaking a handful of tests because as expected some things are not considered errors now.

My preference is actually to retain the old behavior and consider broken interpolations as values.

@omry
Copy link
Owner

omry commented Apr 8, 2021

The performance cost should be negligible because very few strings would pass the ${ check without actually having an interpolation.

Just to be sure it's clear, the performance cost would come from the various calls to _get_value_kind() that occur across the codebase after assignment, that now would all trigger the regex (and possibly full grammar) checks, which are slower than just checking for the presence of ${. With the current implementation we only run those checks on assignment.

"that now would all trigger the regex" if the string contains ${.

@odelalleau
Copy link
Collaborator Author

odelalleau commented Apr 8, 2021

Can you give more examples of what considering escaped interpolations as special kind of interpolations buys us?

It buys us the un-escaping of escaped interpolations. If we said "it's just a regular string" then the following code would return "\${y}" instead of "${y}":

cfg = OmegaConf.create({"x": r"\${y}"})
cfg.x

Note also that even if we make it so that escaped interpolations aren't considered interpolations anymore, it may help in the specific example I gave here, but the same problem would occur if for instance a resolver returns a string that looks like an interpolation:

OmegaConf.register_new_resolver("oops", lambda: "${trigger crash}")
cfg = OmegaConf.create({"x": "${oops:}"})
cfg.x  # GrammarParseError

I guess what might work is:

  • Delaying the un-escaping of \${ to the call of ._value() instead of doing it in the grammar visitor
  • Saying that a string that doesn't contain any regular (i.e, not escaped) interpolation is not an interpolation according to get_value_kind()
  • Saying that the resolver example I just wrote above is invalid, and one should instead return r"\${trigger crash}" in the resolver (in other words, if the result of resolving an interpolation is a string, it is not allowed to be an interpolation according to get_value_kind())

I'm not 100% sure I like it because it makes the nodes' internals a bit more complex and adds extra processing on each access, but at first glance it seems doable. What do you think?

(edit: actually the "extra processing on each access" isn't a problem I think: it's cheaper than resolving interpolations, the extra cost for regular strings should be minor, and we could cache the result => I'm starting to like this direction, though there remain design questions to be resolved)

(edit2: having a second look at it, one potential implementation concern is that currently get_value_kind() analyzes the string returned by ._value(), so if we un-escape in ._value() then it wouldn't be able to differentiate between escaped vs non-escaped interpolations. We can probably work around this e.g. by adding a flag to ._value() to disable the un-escaping, but it would start getting a big ugly...).

My preference is actually to retain the old behavior and consider broken interpolations as values.

Just to understand, are you saying that it should be the case only when strict_interpolation_validation is True? (as in your code snippet). I'm not comfortable having a different output depending on the value of strict_interpolation_validation -- it might happen to solve this specific problem but it could bite us later.

If we do it also when strict_interpolation_validation is False then I see at least the following issues:

  1. strict_interpolation_validation is actually pointless now because we always need to do this strict validation => there is a performance cost
  2. More importantly, there would be no explicit error raised anymore for badly formed interpolations, instead the program may behave in surprising ways because what was supposed to be an interpolated value suddenly turned into a weird string. Since in the large majority of cases such badly formed interpolations are user error, I think there is value in catching them (I think the OmegaConf 2.0 behavior was pretty bad in that regard -- I actually remember having a few "WTF" moments back when working on the new grammar and trying stuff with 2.0 because I wrote an interpolation that wasn't recognized as such and processed seamlessly as a string instead).
  3. In the context of this issue, the "under-the-hood" behavior would be different depending on whether one writes {"x": "\\${not an interpolation}"} (the node obtained when dereferencing x would be considered as not being an interpolation) vs {"x": "\\${not_an_interpolation}"} (now it would be considered as being an interpolation, since ${not_an_interpolation} would pass the grammar parse step). I'm not sure if there would be any user-facing consequence to that, but it's definitely the kind of subtle difference that can cause problems down the road. In other words, IMO it remains another kind of "bandaid" fix and not a viable long term solution to this specific issue.

"that now would all trigger the regex" if the string contains ${.

If your point is that we only pay a price for interpolations, then yes I agree. But if we don't care why did we bother with the regex version in the first place? (it's only there to speed-up the processing of interpolations).

@odelalleau
Copy link
Collaborator Author

@omry just realized we hadn't converged on this one, can you please have a look when you get a chance?

@omry
Copy link
Owner

omry commented Apr 25, 2021

Yes, I postponed this because having multiple parallel design discussions was too time consuming and I felt it's best to focus on one at a time.
I am planning to get back to this one next week.

@omry
Copy link
Owner

omry commented May 10, 2021

Note also that even if we make it so that escaped interpolations aren't considered interpolations anymore, it may help in the specific example I gave here, but the same problem would occur if for instance a resolver returns a string that looks like an interpolation:

OmegaConf.register_new_resolver("oops", lambda: "${trigger crash}")
cfg = OmegaConf.create({"x": "${oops:}"})
cfg.x  # GrammarParseError

What is the desired behavior here? that the interpolation is not resolved?
Generally this seems like an odd thing to do.

I guess what might work is:

  • Delaying the un-escaping of \${ to the call of ._value() instead of doing it in the grammar visitor
  • Saying that a string that doesn't contain any regular (i.e, not escaped) interpolation is not an interpolation according to get_value_kind()
  • Saying that the resolver example I just wrote above is invalid, and one should instead return r"\${trigger crash}" in the resolver (in other words, if the result of resolving an interpolation is a string, it is not allowed to be an interpolation according to get_value_kind())

I'm not 100% sure I like it because it makes the nodes' internals a bit more complex and adds extra processing on each access, but at first glance it seems doable. What do you think?

It seems like a reasonable solution.

(edit: actually the "extra processing on each access" isn't a problem I think: it's cheaper than resolving interpolations, the extra cost for regular strings should be minor, and we could cache the result => I'm starting to like this direction, though there remain design questions to be resolved)

(edit2: having a second look at it, one potential implementation concern is that currently get_value_kind() analyzes the string returned by ._value(), so if we un-escape in ._value() then it wouldn't be able to differentiate between escaped vs non-escaped interpolations. We can probably work around this e.g. by adding a flag to ._value() to disable the un-escaping, but it would start getting a big ugly...).

We can resolve it by providing an API to access the raw value to be used by lower level functions like get_value_kind() (it just have it access the field directly instead of through _value()).

My preference is actually to retain the old behavior and consider broken interpolations as values.

Just to understand, are you saying that it should be the case only when strict_interpolation_validation is True? (as in your code snippet). I'm not comfortable having a different output depending on the value of strict_interpolation_validation -- it might happen to solve this specific problem but it could bite us later.

If we do it also when strict_interpolation_validation is False then I see at least the following issues:

  1. strict_interpolation_validation is actually pointless now because we always need to do this strict validation => there is a performance cost

As we have discussed several times, the computation result can potentially be cached to minimize the damage.

  1. More importantly, there would be no explicit error raised anymore for badly formed interpolations, instead the program may behave in surprising ways because what was supposed to be an interpolated value suddenly turned into a weird string. Since in the large majority of cases such badly formed interpolations are user error, I think there is value in catching them (I think the OmegaConf 2.0 behavior was pretty bad in that regard -- I actually remember having a few "WTF" moments back when working on the new grammar and trying stuff with 2.0 because I wrote an interpolation that wasn't recognized as such and processed seamlessly as a string instead).

Yes, this is a bigger problem now that that grammar is significantly more complicated.

  1. In the context of this issue, the "under-the-hood" behavior would be different depending on whether one writes {"x": "\\${not an interpolation}"} (the node obtained when dereferencing x would be considered as not being an interpolation) vs {"x": "\\${not_an_interpolation}"} (now it would be considered as being an interpolation, since ${not_an_interpolation} would pass the grammar parse step). I'm not sure if there would be any user-facing consequence to that, but it's definitely the kind of subtle difference that can cause problems down the road. In other words, IMO it remains another kind of "bandaid" fix and not a viable long term solution to this specific issue.

I am not too worried about this one.

Going to check your PR now.

@odelalleau
Copy link
Collaborator Author

odelalleau commented May 11, 2021

OmegaConf.register_new_resolver("oops", lambda: "${trigger crash}")
cfg = OmegaConf.create({"x": "${oops:}"})
cfg.x  # GrammarParseError

What is the desired behavior here? that the interpolation is not resolved?

Correct,x should be set to the string returned by the resolver, whatever it is. We shouldn't care if it looks like an interpolation (if we cared about resolving interpolations we would probably use instead something like ${oc.decode:${oops:}})

We can resolve it by providing an API to access the raw value to be used by lower level functions like get_value_kind() (it just have it access the field directly instead of through _value()).

Yeah, whether it's a new flag, a new method, or just doing a direct access, all are pretty much equivalent (IMO), i.e., it should work but it looks a bit clumsy.

odelalleau added a commit to odelalleau/omegaconf that referenced this issue May 11, 2021
odelalleau added a commit that referenced this issue May 12, 2021
Fix crash with "interpolation-like" strings from interpolations

This commit introduces a new node type `InterpolationResultNode` that systematically wraps interpolation results that either (a) are not already nodes, or (b) need to be converted.

Fixes #666
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants