[Initial Feedback] New like and match keywords #29

Open
springcomp opened this issue Apr 2, 2023 · 16 comments

Comments

@springcomp
Contributor

springcomp commented Apr 2, 2023

I have been thinking that comparisons using new like and match keywords would be a natural extension.

The comparator rule would be extended like so:

comparator =/ "like" / "match" 

Are these features that would be of interest?

Like

items[? foo like '**/*.json' ]

The like comparator would match simple SQL-like, wildcard-like or glob-like patterns.

I have investigated adding such contextual keywords to the language and found that it would probably be easier with lex/yacc-based implementations. Nevertheless, using the top-down parser approach, one simply has to account for those keywords between the nud() and led() calls in the main parsing loop.

This makes the parsing algorithm less "pure" but I’m pretty sure more knowledgeable people might come up with more elegant designs.
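To make the '**/*.json' example concrete, here is a minimal sketch of one possible glob-style reading of like. The thread does not pin down the pattern semantics, so everything here is an assumption: the helper names glob_to_regex and like are hypothetical, "**/" is taken to mean "zero or more directories", and "*" and "?" are taken to stay within one path segment.

```python
import re


def glob_to_regex(pattern: str) -> str:
    """Translate an assumed glob dialect into a Python regular expression."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:[^/]+/)*")   # zero or more whole directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(".*")            # anything, including '/'
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")         # anything within one segment
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")          # one character within a segment
            i += 1
        else:
            out.append(re.escape(pattern[i]))  # literal character
            i += 1
    return "".join(out)


def like(subject: str, pattern: str) -> bool:
    """Hypothetical 'like' comparison: the whole subject must match."""
    return re.fullmatch(glob_to_regex(pattern), subject, re.DOTALL) is not None


print(like("a/b/c.json", "**/*.json"))
print(like("a/b/c.yaml", "**/*.json"))
```

A SQL-like or plain wildcard flavour would need a different translation table, which is part of why the choice of dialect matters for a spec.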

Match

The match comparator would match simple regular expressions.

items[? foo match 'ba[rz]' ]

I know that regular expressions have been a touchy subject in the past but I strongly believe we can make this work for JMESPath with:

  • A narrow focus on matching only, unanchored expressions. That is, the expression returns true or false.
  • Restricting the syntax to a tiny but useful subset of interoperable syntax.

I have come up with a prototype for a reference implementation of a simple push-down automaton-based checker and the implementation is reasonably tight and compact.

I believe most languages in which a JMESPath implementation exists do indeed support the interoperable subset.
The idea from the standardization document referred to above is to:

  1. First check that the syntax is valid.
  2. Map the interoperable regex to one compatible with the target implementation.
  3. Execute the target implementation expression.

The standardization document lists mappings for ECMAScript, PCRE, RE2, Go and Ruby dialects.

With such a compact library to rely on, the implementation is in fact really easy.
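As an illustration of the "map" step only, here is a minimal sketch that targets Python's re module. It roughly follows the kind of rewrite the I-Regexp document describes for PCRE-style engines, as I read it: replace bare dots outside character classes, then optionally wrap the result in anchors. The function name to_python_regex is hypothetical, the input is assumed to have already passed a syntax check, and constructs such as \p{...} that Python's re does not support are not handled.

```python
import re


def to_python_regex(iregexp: str, anchor: bool = True) -> str:
    """Map an (assumed valid) I-Regexp to a pattern for Python's re module."""
    out = []
    in_class = False
    i = 0
    while i < len(iregexp):
        ch = iregexp[i]
        if ch == "\\" and i + 1 < len(iregexp):
            out.append(iregexp[i:i + 2])  # copy escapes verbatim
            i += 2
            continue
        if ch == "[" and not in_class:
            in_class = True
        elif ch == "]" and in_class:
            in_class = False
        if ch == "." and not in_class:
            out.append("[^\\n\\r]")       # bare dot: anything but a line break
        else:
            out.append(ch)
        i += 1
    body = "".join(out)
    # \A and \Z anchor at the very start and very end of the string in Python.
    return "\\A(?:" + body + ")\\Z" if anchor else body


pattern = to_python_regex("ba[rz]", anchor=False)
print(re.search(pattern, "foobar") is not None)
```

With anchor=False the pattern may match anywhere in the subject, which corresponds to the unanchored matching described above; with anchor=True the whole subject has to match.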

@gibson042

I agree that having a static published specification of syntax and semantics to reference makes this feasible, although it won't preclude an endless stream of requests for extension. Regardless, https://www.ietf.org/archive/id/draft-bormann-jsonpath-iregexp-02.html is a draft, and expired at that (although its latest successor, https://www.ietf.org/archive/id/draft-ietf-jsonpath-iregexp-04.html , is still active). Relying on draft documents is generally considered bad form, so would the plan be to pursue this only if draft-ietf-jsonpath-iregexp advances to RFC?

@gibson042

Note also that extending comparator to include non-punctuation expansions will also introduce some complexity of its own, because surrounding whitespace will be required for proper tokenization (e.g., foo ==bar is valid but foo matchbar is not).
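The tokenization point can be shown with a toy lexer. This is only a sketch under simplifying assumptions: it knows identifiers and a handful of comparators, treats like and match as always being keywords (the proposal speaks of contextual keywords, which is more subtle), and the names TOKEN, KEYWORDS and tokenize are hypothetical.

```python
import re

# Punctuation comparators delimit themselves; word comparators do not.
TOKEN = re.compile(
    r"\s*(?:(?P<op>==|!=|<=|>=|<|>)|(?P<word>[A-Za-z_][A-Za-z0-9_]*))"
)
KEYWORDS = {"like", "match"}


def tokenize(text: str):
    """Split text into (kind, value) pairs using longest-match words."""
    tokens = []
    text = text.rstrip()
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if m.group("op"):
            tokens.append(("comparator", m.group("op")))
        else:
            word = m.group("word")
            kind = "comparator" if word in KEYWORDS else "identifier"
            tokens.append((kind, word))
        pos = m.end()
    return tokens


print(tokenize("foo ==bar"))
print(tokenize("foo match bar"))
print(tokenize("foo matchbar"))
```

In the last call, matchbar is lexed as a single identifier, so no comparator is found at all; the punctuation form has no such dependency on whitespace.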

@springcomp
Contributor Author

springcomp commented Apr 4, 2023

[...] so would the plan be to pursue this only if draft-ietf-jsonpath-iregexp advances to RFC?

The latest version is still a draft but appears more and more to be in line with a potential future accepted standard.

But yes, the value is to rely on a shared common agreed-upon specification, to avoid implementation-specific regex dialects.
Following the discussions on their mailing list, they want to promote a regex spec for matching only, within the context of JSONPath, which has an audience similar to JMESPath's.

I think this is the closest we can do short of creating a spec ourselves, which I would not have the hubris to even attempt.

@springcomp
Contributor Author

Note also that extending comparator to include non-punctuation expansions will also introduce some complexity of its own, because surrounding whitespace will be required for proper tokenization (e.g., foo ==bar is valid but foo matchbar is not).

I initially thought of an alternative that I think I would be comfortable with, but I did not include it in the initial feedback pitch.
The alternative is to create a full-blown match-expression and like-expression rather than extending comparator.

It would look like this:

expression =/ like-expression / match-expression
like-expression = expression "like" expression
match-expression = expression "match" expression

We could even – although I do not see why we should – restrict to:

expression =/ like-expression / match-expression
like-expression = expression "like" raw-string
match-expression = expression "match" raw-string

That would apply if we are worried about allowing dynamically created regular expressions.

@springcomp
Contributor Author

Maybe we could split this proposal into two.
One for like and one for match.

Maybe the like itself should be actually multiple keywords, one for SQL-like expression, one for wildcard, one for globs… etc.

@hell-racer

hell-racer commented Apr 5, 2023

Maybe it's better to implement these like other string manipulation functions? E.g. match(foo, 'ba[rz]')? This way we will not complicate existing syntax.

@hell-racer

hell-racer commented Apr 5, 2023

Also this way we can add regex replacement functionality, like regex_replace(foo, 'ba[rz]', 'replacement'). Maybe it's better to name match() function like regex_match() for consistency in this case.
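For illustration, the two suggested functions could look roughly like this on the host-language side. This is a sketch only: the names regex_match and regex_replace come from the comment above, but their semantics are assumptions (unanchored search, replace all occurrences), and the pattern is interpreted in Python's own regex dialect rather than in any shared one.

```python
import re


def regex_match(subject: str, pattern: str) -> bool:
    """True if pattern matches anywhere in subject (assumed semantics)."""
    return re.search(pattern, subject) is not None


def regex_replace(subject: str, pattern: str, replacement: str) -> str:
    """Replace every match of pattern in subject (assumed semantics)."""
    # Note: re.sub also interprets backslash sequences in the replacement.
    return re.sub(pattern, replacement, subject)


print(regex_match("bar", "ba[rz]"))
print(regex_replace("foo bar baz", "ba[rz]", "qux"))
```

Wiring these into a real JMESPath library would go through that library's own custom-function mechanism, which is not shown here.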

@springcomp
Contributor Author

@hell-racer functions are definitely possible and are currently the main way to extend JMESPath. Nothing precludes shipping a high quality library of functions. However, extending using functions poses its own challenges, for instance, when JMESPath is embedded into a third-party tool.

I think it makes sense to bring some key features into the core language that all library implementations must abide by. Hence my proposal using keywords instead. Although I’m certainly happy to listen to the pros and cons and hear about suggested alternatives.

At another level, although I would love JMESPath to have builtin ability to search, capture, extract and replace text using regular expressions, I realize that programming languages have somewhat incompatible dialects of regex. I’m not comfortable specifying a function whose behaviour depends on whatever programming language a particular library happens to be written in. For instance, this would make it impossible to share a common suite of compliance tests that all implementations can rely on to assess compliance.
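Two small examples of the kind of dialect difference meant here, shown from the Python side. In Python's re module both checks below succeed, whereas, to my knowledge, the equivalent ECMAScript tests without flags (/bar$/.test("bar\n") and /^\d$/.test("\u0663")) both return false. The variable names are only for illustration.

```python
import re

# In Python, "$" also matches just before a trailing newline.
trailing_newline = re.search(r"bar$", "bar\n") is not None

# In Python 3, "\d" in a str pattern matches any Unicode decimal digit,
# here U+0663 ARABIC-INDIC DIGIT THREE.
unicode_digit = re.fullmatch(r"\d", "\u0663") is not None

print(trailing_newline, unicode_digit)
```

A compliance test written against one of these engines would fail against the other, which is the problem a shared regex subset is meant to avoid.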

This proposal is deliberate and focuses on a narrow but universal use case, which is to match only. This happens to be in line with the I-Regexp initiative, which produces a spec that is unambiguous and easy to follow.

Maybe, with experience, we could foresee a future where that spec is extended to support capture groups and replacement.

So, to sum up, using functions rather than keywords is a matter of style.

Irrespective of the syntax we choose, I think having a defined common spec for what a valid regex syntax is in JMESPath is very important.

@hell-racer

hell-racer commented Apr 6, 2023

For instance, this would make it impossible to share a common suite of compliance tests that all implementations can rely on to assess compliance.

Regex is an entire language by itself, so including all its functionality into compliance tests is somewhat unnecessary. So, I think we'll have to include some lowest common denominator into compliance tests. Also, when a developer uses JmesPath in some project written in some language, he or she also almost certainly uses Regex in that same language too, so it would be strange if the Regex itself works differently than Regex in JmesPath - so it's understandable if Regex in JmesPath works the same way that language/framework does. Does it make sense?

@hell-racer

Another thing to keep in mind: if regex is implemented as a set of functions, developers can override the default implementation in their projects, whereas if it is part of the core functionality, overriding it would be impossible.

@springcomp
Contributor Author

springcomp commented Apr 6, 2023

Regex is an entire language by itself, so including all its functionality into compliance tests is somewhat unnecessary.
So, I think we'll have to include some lowest common denominator into compliance tests.

Indeed, that is the exact point of defining a spec for an interoperable subset of the Regex language.
I-Regexp is a least common denominator chosen by the authors of the draft spec.

Also, when a developer uses JmesPath in some project written in some language, he or she also almost certainly uses Regex in that same language too, so it would be strange if the Regex itself works differently than Regex in JmesPath - so it's understandable if Regex in JmesPath works the same way that language/framework does. Does it make sense?

It makes total sense but I beg to disagree.

The purpose of promoting a spec for JMESPath is so that all implementations – irrespective of their differences – can agree on some common semantics and behaviour for the language.

I think the proposal is "elegant" – in all modesty – because:

  • As part of a specification, it is unambiguous and not subject to implementation details.
  • The draft comes with specified mappings to target real concrete Regex flavors as implemented in programming languages.

This means that from an implementation perspective, the only thing that’s required is to:

  • Check that the regex syntax is valid (as per I-Regexp)
  • Map the I-Regexp syntax to the target implementation-specific expression
  • Execute the implementation-specific expression.

That’s what I included in my reference implementations in Python and TypeScript.

from iregexp import check
from iregexp import toPCRE

import re

# check that the syntax is valid I-Regexp
succeeded = check('[aeiouy]*')

# map to a PCRE-compatible expression
regex = toPCRE('.*', anchor = True)

# execute the implementation-specific expression
pattern = re.compile(regex)
pattern.match('aaaa')

@gibson042

So, to sum up, using functions rather than keywords is a matter of style.

I think I disagree with this... changing syntax is a much bigger deal that breaks forward compatibility of old implementations. There should generally be a bias towards functions, which have already-established syntax and can be retroactively added.

Irrespective of the syntax we choose, I think having a defined common spec for what a valid regex syntax is in JMESPath is very important.

💯 agree, and not just syntax but also semantics.

@springcomp
Contributor Author

@gibson042 understood.

Although, for the record, my proposal does not break any syntax. It enables syntax that was previously not valid.

As for breaking forward compatibility with old implementations I understand. But, this ship has sailed, I'm afraid, with the recent approval of the let-expression into the core language 😉

Although, to be fair, the previous design using a let() function was not just a function but included a core redefinition of identifier evaluations so would not have been possible to replicate as precisely by frozen implementations anyway.

@gibson042

I'm not saying syntax should never change, rather that it should change only when the alternative is impossible or impractical (as was the case with let). Generally speaking, adding a function to the standard library is the most preferable form of enhancement.

@springcomp
Contributor Author

I’ve come around to the shared understanding that match and like functions would be better, since they give all existing implementations a chance to be brought up to standard.

For the record, I would like to propose the following signatures:

bool match(string $subject, string $regexp)
bool like(string $subject, string $wildcard)

For the record, this post tracks the latest version of the spec:
https://www.ietf.org/archive/id/draft-ietf-jsonpath-iregexp-08.html

The assumption is that this proposal is contingent on the linked-to specification attaining RFC status.

@springcomp
Contributor Author

The proposal for an interoperable regex syntax is nearing RFC 9485 status. 😉
