Regular expressions in filters #70
Comments
So you are saying JSONpath should only support regular expression literals, not strings with regular expressions. |
Unfortunately, no, I'm saying that at the time I wrote that sentence, I had forgotten that Perl/Python/Ruby et al support variables inside the "/" delimiters. I wish I could interpret it the way you suggested. |
Do you mean as in (1) interpolating variables into the RE ( So far, I haven't seen any proposal for variables in the JSONPath expression language that could go into either place. |
RE2 syntax would have the advantage that none of us likes it :-) |
@cabo wrote:
Perhaps an expression e.g. |
I would prefer to restrict the right hand side of a regular expression match ( |
After giving this some thought, I agree. Introducing "/" delimiters for regular expressions suggests static "compilation" of the regular expression, as it does in perl/ruby/python. If in JSONPath we're applying a regular expression in a search of a million items, we wouldn't want the regular expression to depend on the data, as that would defeat the purpose of introducing the "/" delimiters in the first place. We wouldn't want to force implementations to "recompile" the regular expression every time the current node changed. Also, I don't think my speculation above about allowing These observations are consistent with the Jayway Java implementation of the "~=" operator. In terms of functional requirements, regular expressions are very useful for search and filtering, but such expressions are generally statically known.
I don't fully understand this sentence. JSONPath expressions don't produce character literals, character literals are a static concept. JSONPath expressions can produce strings, but that's not the same thing. The key feature of character literals is that they don't depend on the data, and in particular on the current node
I don't think anybody would propose that! |
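The "compile once, apply many times" argument in the comment above can be sketched as follows. This is only an illustration: it uses Python's built-in `re` module rather than RE2 or any syntax proposed in this thread, and the helper name `filter_by_regex` and the sample data are assumptions made up for the example.

```python
import re

def filter_by_regex(nodes, key, pattern):
    """Keep nodes whose string-valued `key` contains a match for `pattern`."""
    # The pattern is static, so it is compiled exactly once,
    # independently of whichever node is current.
    compiled = re.compile(pattern)
    return [
        node for node in nodes
        if isinstance(node.get(key), str) and compiled.search(node[key])
    ]

# Hypothetical input items; the last one has no "author" member at all.
books = [
    {"author": "Nigel Rees"},
    {"author": "Evelyn Waugh"},
    {"author": "Herman Melville"},
    {"price": 8.95},
]

matches = filter_by_regex(books, "author", r"^[EH]")
print([b["author"] for b in matches])
```

If the pattern were allowed to depend on the current node, the `re.compile` call would have to move inside the loop, which is the per-node "recompile" cost the comment argues against.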
On 2021-03-15, at 12:51, Daniel Parker ***@***.***> wrote:
> I really would like to avoid literal regular expressions that contain JSONPath expressions as that would mean not being able to reuse an existing regular expression syntax (such as RE2).
>
I don't think anybody would propose that!
Well, then let me propose that :-)
Building a RE out of data that has been found in the input is not that outlandish, I think. I’m not sure that we ultimately want to support that, but it is not something that is self-evidently excluded.
Grüße, Carsten
|
Pardon my confusion, but what I had in mind was to allow the likes of |
@glyn wrote:
One minor point, I believe it should say that it evaluates to a string Otherwise, I believe that that's technically sound, and permits an efficient implementation, in the sense of "compile" regular expression once and apply many times. And it's somewhat more likely that there is a regular expression to be found at an absolute location in the JSON document, as opposed to in all current nodes :-) I still think it's an unlikely user requirement. If you do become persuaded to support this syntax, it probably won't be because I've tried to persuade you. |
On 2021-03-15, at 17:17, Daniel Parker ***@***.***> wrote:
And it's somewhat more likely that there is a regular expression to be found
The expression language could also contain something like RegExp.escape, and it is likely to be helpful to compose the regular expression out of constant and variable (from input item) parts. What’s out there in today’s implementations?
Grüße, Carsten
|
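A minimal sketch of what composing a regular expression out of constant and variable parts could look like, using Python's `re.escape` as a stand-in for the `RegExp.escape` idea mentioned above. The function name `pattern_for_prefix` and the sample strings are hypothetical, and nothing here reflects an agreed JSONPath syntax.

```python
import re

def pattern_for_prefix(prefix_from_input):
    """Build a pattern from a data-derived prefix plus a constant suffix."""
    # re.escape makes every character of the data-derived part literal,
    # so a "." taken from the input cannot act as a wildcard.
    return re.compile(re.escape(prefix_from_input) + r"-\d+")

escaped = pattern_for_prefix("a.b")
# Same composition without escaping, for comparison only.
unescaped = re.compile("a.b" + r"-\d+")

print(bool(escaped.fullmatch("a.b-42")))    # literal prefix matches
print(bool(escaped.fullmatch("axb-42")))    # "." is not a wildcard here
print(bool(unescaped.fullmatch("axb-42")))  # unescaped "." matches any character
```

Note that a pattern built this way depends on the input, so it could not be compiled before the data is seen.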
@cabo wrote:
As it turns out, no implementation supports |
I could get behind this. But I definitely wouldn't want to "construct" a RegEx from input data, as others have also been opposed to. |
On 2021-03-15, at 18:01, Daniel Parker ***@***.***> wrote:
Do you have some candidate syntax in mind that you would like to have investigated?
It would be easy to come up with something, but I'd rather not invent something: if there is no current practice for this, we should not include it.
Grüße, Carsten
|
Agree. Reminder: Charter says we're supposed to specify something "based on the common semantics and other aspects of existing implementations". So there's a judgment call to be made as to whether enough of them support a common regex that we can reasonably claim it constitutes "common semantics". My impression had been No, but it's months since I looked at the overlap chart. |
I can tell you neither of mine support RegEx (one is obsolete now). But I'm open to adding it to the current one. |
I would really like regex support; honestly, I think it is nearly a must. We are working hard to make sure that any future implementation behaves consistently with any other that follows our work. I'm worried that all the complexity and differences behind regex will reintroduce the discrepancies we are trying hard to remove. I think that if we want any regex support, we need to go for a regex syntax that is well specified and commonly implemented. Honestly, I'm not happy with RE2, which has been mentioned before: as far as I know, it is available to most languages as a C++ library through bindings (and bindings might be really problematic). For instance, I wasn't happy with the RE2 library while working with Erlang and Elixir; among other reasons, using it would force my JSONPath implementation to depend on native code via a NIF, while the whole library is 100% Elixir compiled into BEAM bytecode. IMHO we might investigate Perl-like or POSIX-like regex support (both are quite popular). But again, we should avoid getting stuck in all the regex complexities. I also have some further points, questions and doubts:
|
My 2 ¢: Any RE syntax we decide to use will need to be thoroughly documented in JSONPath, except if it is one of the ECMAscript versions or W3C XSD syntax. (I do not really consider Posix a candidate, it is a bit anemic.) Is it (1) string =~ RE, or (2) RE =~ string, or (3) both or even (4) both but with subtly different semantics (horror experience from the Ruby language)?
That is a sensible approach. Unless there is a large installed base that depends on computed REs, I agree. We could make sure that a later extension can be added seamlessly.
We have the luxury of being able to provide strong typing, so a RE literal should be a (syntax?) error except near a
Yes. If we can check a query before applying it to input, we should do that with failing fast. |
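As an illustration of the fail-fast idea, a query parser could validate each regular expression literal before any input is read. The sketch below assumes Python's `re` module as the regex engine; `QueryError` and `check_regex_literal` are hypothetical names, not part of any existing JSONPath implementation.

```python
import re

class QueryError(ValueError):
    """Hypothetical error raised when a query is rejected before evaluation."""

def check_regex_literal(source):
    """Compile a regex literal at query-parse time, or reject the query."""
    try:
        return re.compile(source)
    except re.error as exc:
        # Report the problem before any input document is touched.
        raise QueryError(f"invalid regular expression {source!r}: {exc}") from exc

compiled = check_regex_literal(r"^0-\d{3}-")
print(bool(compiled.search("0-553-21311-3")))

try:
    check_regex_literal("(unclosed")
except QueryError as err:
    print("rejected:", err)
```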
I tried to query this example [1] using a JSONPath with an XSD regex I tried using Jayway: it fails to understand it and returns a "bogus" result (but it doesn't fail with the "regular" regex). Personal opinion about XSD regex: I have the impression that no implementation supports them (I'm not aware of any), and that every implementation supports whatever the language provides as its built-in regex implementation (which is likely far from the XSD regex syntax). Please let me know if any implementation supports that syntax. If none does, are we really sure we want to go for a syntax that nobody is using? Furthermore, I didn't find any online tool for testing XSD-style regex, whereas PCRE-style regex are easy to test (this is not really encouraging). I also ran a few tests with some online test tools; the following implementations just fail to understand any regex (I tried with "basic" syntax
Personal opinion: the results above make me skeptical about the whole regex feature. Do we know which implementations support regex? [1]: {
"store": {
"book": [
{
"category": "reference",
"author": "Nigel Rees",
"title": "a",
"price": 8.95
},
{
"category": "fiction",
"author": "Evelyn Waugh",
"title": "s",
"price": 12.99
},
{
"category": "fiction",
"author": "Herman Melville",
"title": "x",
"isbn": "0-553-21311-3",
"price": 8.99
},
{
"category": "fiction",
"author": "J. R. R. Tolkien",
"title": "The Lord of the Rings",
"isbn": "0-395-19395-8",
"price": 22.99
}
],
"bicycle": {
"color": "red",
"price": 19.95
}
},
"expensive": 10
} |
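For readers who want to reproduce this kind of check, here is a small sketch of a regex filter applied to a trimmed copy of the example document [1]. The patterns actually used in the tests above are not preserved in this thread, so the pattern below is an assumption, as is the helper `titles_matching`. It also shows how much the result depends on whether the engine matches the whole string or searches anywhere in it.

```python
import re

# Trimmed copy of example [1]; only the members used below are kept.
document = {
    "store": {
        "book": [
            {"author": "Nigel Rees", "title": "a"},
            {"author": "Evelyn Waugh", "title": "s"},
            {"author": "Herman Melville", "title": "x"},
            {"author": "J. R. R. Tolkien", "title": "The Lord of the Rings"},
        ]
    }
}

def titles_matching(doc, pattern, whole_string):
    """Return book titles selected by `pattern` under the chosen semantics."""
    compiled = re.compile(pattern)
    # fullmatch: the pattern must cover the entire title.
    # search: a match anywhere inside the title is enough.
    match = compiled.fullmatch if whole_string else compiled.search
    return [b["title"] for b in doc["store"]["book"] if match(b["title"])]

anchored = titles_matching(document, r"[a-z]", whole_string=True)
unanchored = titles_matching(document, r"[a-z]", whole_string=False)
print(anchored)
print(unanchored)
```

With whole-string semantics only the three single-letter titles are selected; with search semantics every title is, since each contains at least one lowercase letter.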
As far as I know XPath/XQuery regex are different than XML Schema ones:
I feel like there is a lack of modern tooling for testing XML Schema flavor regex (outside XML schemas), and this might be quite frustrating for end users. By the way, JSON Schema borrows its syntax from ECMA 262. |
I did some further investigation into regex support (using cburgmer's tool and his implementations list). I used a quite simple regex They are supported by 11 implementations in 7 different languages [1]; interestingly, the JavaScript implementations did not support it, and 3/4 of them were completely misbehaving [2]:
I'm worried that users might expect to use regex to reliably validate/filter out invalid inputs, but they get accepted anyway. Instead Also I tested a XSD-like syntax, [1] List of implementations supporting
Go (2/6 implementations):
Java (2/2 implementations):
Objective-C (1/1 implementation):
Ruby (1/1 implementation):
PHP (3/4 implementations):
dotNET (1/4 implementations):
[2] JavaScript misbehaving implementations:
|
Interesting. While it may be a bit early to actually decide anything about regular expressions, this also seems to show some differences in expression evaluation. Can you do a similar check with |
@cabo: it's fine. Let me know what kind of output are you looking for, so I can summarize it in a helpful way. |
@bettio, what is your opinion about ECMA 262? All other things equal, and when it doesn't matter too much, I am strongly in favour of staying consistent with other standards, whether IETF recommendation or de facto standards such as JSON Schema. Daniel |
... this might be a way to go. Sharing a small, well defined subset of tc39 with JSON Schema. |
I think the purpose of the RFC is to give advice to implementers and users
as to what works. It's very unlikely that any existing implementation will
change its semantics because of our work - but new ones will follow it if
we do a good job. So I'd like the RFC to say something like "Some
implementations support regular expression filters in JSONPaths; maximum
interoperability can be achieved by using subset $X."
In particular, since regexes exhibit an 80/20 rule - you get 80% of the
benefit with 20% of the features. Another reason a modest subset seems like
a good idea for the JSONPath RFC.
…On Thu, Mar 25, 2021 at 11:36 AM Stefan Goessner ***@***.***> wrote:
... this might be a way to go. Sharing a small, well defined subset of
tc39 <https://tc39.es/ecma262/#sec-regexp-regular-expression-objects>
with JSON Schema.
|
Some historical context: JSON Schema specifies 262 because that was what was available when the feature was added. It maintains 262 because that's what implementations now support. I don't think this is cause enough for us to adopt it here. We should look at which RegEx specification serves our needs and use that. If it also happens to be 262, that's fine. On a side note of implementation support, neither of my implementations support RegEx, but I'm happy to add it. I expect other maintainers wouldn't complain about it if it became a requirement. |
The modest subset is what actually works in a consistent manner today. I don't see any good reason to document just the "modest subset" (and I don't think it is worth it); I also think that would not be compatible with our WG charter. I will quote Barry Leiba, who asked to change the original charter. The original charter:
Barry Leiba:
And the following has been proposed:
Honestly, what I found hard was making decisions about quite commonly implemented "extensions" and a number of corner cases, and users love extensions (such as regex) because they find them useful. @timbray: however, if you are willing to work on it, we might document a "JSONPath Core" subset that can be safely used across implementations for portability and for transitional purposes, but I don't agree with limiting ourselves to just that. Sorry for the OT, let's continue the regex topic. Edit: I wrote a quick proposal that tries to reconcile the two points of view I see here (and in other discussions): #78 (comment) |
Talking about regex: I think we should start again from requirements. I'll start with some of them:
|
I think we all agree:
Fortunately, there's a clean syntactic and semantic separation between JSONPath and regular expression (de facto or de jure) standards. I suggest we then stipulate RE2 for its security properties and unicode support (with "SHOULD" language) and allow for other regular expression standards (with "MAY" language). JSONPath implementations that share a regular expression standard will interoperate on JSONPaths involving regular expressions. JSONPath implementations which don't share a regular expression standard may or may not interoperate on particular JSONPaths, depending on the regular expressions used. In terms of compliance testing, it may be sufficient to include a few simple regular expressions which will work "everywhere". |
I see your intent here, but I have to say that I'm not a fan of pure wrappers (especially as I work in .Net), and it seems that all of the implementations for RE2 are just wrappers for the C++ implementation. There are a lot of benefits to developers when a library is written in the language. I don't think we should ignore that. Let's make sure that this comparison is considered in this decision. |
Go has RE2 built in (https://golang.org/pkg/regexp/), but point taken. I agree that a wrapper is to be avoided, especially if it makes shipping a static binary impossible. Which other regex (de facto or de jure) standards support unicode? |
It's 2021, so all (notable) RE standards support Unicode. I think, for once, recent drafts from json-schema.org got this right: If nobody else wants, I'll do a PR with an ABNF definition of such a subset. |
Example StackOverflow question that could be solved using RegEx. |
Another question on the usage of RegEx in filters. |
I find it interesting that JMESPath does not support regular expressions in filters. See, for example, this issue. |
The discussion found in this issue pretty much is the reason why iregexp is the right approach. |
Is iregexp the right approach for JMESPath too? If so, I wonder if the JMESPath community would be interested in collaborating on iregexp? |
We could ask! We can use all the help we can get. |
Well, maybe I can get iregexp-02 out first... |
Sounds good! |
112 output:
|
Issue 17 proposed regular expressions in filters, which are supported by several implementations of JSONPath. This raises two questions:
One approach which addresses both these questions would be to adopt RE2 syntax as discussed in ReDOS attacks, although not all languages may yet have support for RE2.