-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Object validation torture (AOR proposal) #88
Comments
I agree with you. If you want coffee or tea, it doesn't mean you want one of each. "If this is coffee, bring me tea; but if this is tea, bring me coffee." |
I know this seems weird, but in the spirit of JSON (and sometimes XML depending on opinion) all members not expressly forbidden are allowed. Or put a different way, unspecified members of an object are ignored. If you want that to work, you have to explicitly "close off" or "restrict" the object. Here's what you are doing:
But here it is with
If you don't want to close off the object, can you can explicitly change the logic from foo or baz to (foo and not baz) or (baz and not foo). Here is the failure to validate with that.
And here is the validation working with that logic:
|
While I agree with the idea that unknown members are ignored, at the moment I don't agree that known members that are superfluous are ignored. I think it comes down to whether | is a choice, in which you can only pick one, of n items, or whether it's an inclusive OR. xs:choice in XML schema would be the former. I think it becomes more of an issue when the data types are objects, such as in:
But I'm happy to leave that aside for the time being and see how things develop. |
Hmmm... interesting point. |
I'm trying to tease out of this if this is an issue for arrays, either ordered and unordered, and I don't think so. But that is more a property of the array not allowing extra items than it is being inclusive or exclusive or. |
I suppose you could have an array spec something like:
You wouldn't expect to allow both ipv4 and ipv6 addresses then. I'll confess I haven't looked at your array code yet, but my feeling at the moment is that ordered arrays need to be modelled more like regular expressions (except rather than characters you're matching types). I recall that Kernighan and Pike came up with a simple, cut-down RE engine that may be sufficient for this (I think this is it http://www.cs.princeton.edu/courses/archive/spr09/cos333/beautiful.html). Unordered arrays are probably more like objects with members that have a null member name. |
I don't get the reference to the regex engine, but I've got a local branch going to work on XOR. So far I've modified object evaluation without too much trouble (minus new tests to prove it out, but all the old tests pass). I suspect that both ordered and unordered arrays would be the same. And I think you've hit the arrays from the right angle, and my gut feeling is that there won't be much change between ordered and unordered arrays. Let me play and see what I come up with. That said, it would be great if the choice combiner signifies XOR in all cases (arrays and groups and objects), not just for objects. |
After some thought and tinkering, I believe xor for objects and unordered arrays are similar. I've got it implemented for objects, and my guess is that it will take little effort for unordered arrays once I have the time to get to it. That being said, I'm unsure any special care needs to be taken for ordered (i.e. normal) arrays. Or said another way, there is no difference between inclusive or and exclusive or with arrays. And the same is true of primitives.
Since a primitive can be only one value, there is no opportunity to have a comparison of two values. And I think the same is true of arrays, since an ordered array is always comparing an item at a specific position. Take the following JSON array:
The item at position 2 is "foo", and there can only be one item at position 2. Now here is where things get interesting. Does the following rule match against the array above?
What about this rule?
From what I believe (and what my software is telling me), $a1 fails while $a2 does not. And the reason $a1 fails is because |
The idea that Maybe we need some sort of tri-state logic involving
A final result of Any comp sci whizzes out there that can comment?! |
On Thu, Sep 1, 2016 at 8:56 AM, Andrew Newton [email protected]
I think that's a very strong counterexample to the notion that | (choice) $a2 succeeds because at least one branch of the choice succeeds: after the But both sides can also succeed. If the test string is [13, "buzz"], then This is how string regular expressions behave, and we should not violate |
I agree that regular expressions are more like IOR. See this example: That said, I'm not sure regular expressions are a good ideal as lots of people hate them. Nor do I think following their guidance is particularly helpful here. Let me explain. If what I mean by rule:
In other words, IOR yields false positives and XOR yields false negatives. Unfortunately, that leaves us with two options: 1) delve into tri-state logic as @codalogic has hinted at, or 2) simply spell out the pitfall in the spec so people understand the issue. I'm not a fan of option 1 at present. Therefore I say we go with option 2, and that for the sake of the spec all choice combiners are XOR being that XOR makes sense for objects and unordered arrays and there is no good OR for ordered arrays other than to be consistent. |
BTW, not that I've thought this through but instead of 3 state logic, perhaps we can add a modifier to the repetition that indicates exclusivity. As in |
On Fri, Sep 2, 2016 at 11:18 AM, Andrew Newton [email protected]
So I do not understand what you mean by IOR and XOR, because it does not John Cowan http://www.ccil.org/~cowan [email protected] |
Sorry, I was fighting with the markdown syntax for table and made a mistake. I subsequently fixed it but that probably did not come through in email. Here it is with the empty case: rule:
I see your point, but I need to think about this because it appears that with IOR getting around the mixed case and the empty case take a lot more work than with XOR. |
I knocked up the following Ruby:
and got the following results:
Significantly (for me anyway :-) ), (Sorry must dash - people coming!!!) |
Thinking some more over the weekend, I think we can proceed by having two flags that record the result of evaluating a branch of a choice: In spec language we could say something like:
Effective XOR for content that is present, and IOR for content that is absent. |
Your example was an array. Does this apply to objects and unordered arrays as well? |
I think so. If we have:
I think the following should be valid:
But not:
Does that look reasonable? |
On Mon, Sep 5, 2016 at 6:49 AM, Pete Cordell [email protected]
That would mean that ["foo", "bar"] is not equivalent to ["foo", bar" | I do not understand the point of all this special pleading. Why not simply John Cowan http://vrici.lojban.org/~cowan [email protected] |
I disagree with that. If I have the regular expression Part of the problem is that REs are looking for a path across a sequence of token. So the expression above would be represented in a state machine as:
whereas at the moment we're trying to replicate that behaviour by looking for presence or absence on a token level. |
On Mon, Sep 5, 2016 at 1:28 PM, Pete Cordell [email protected]
In other words, the choice is satisfied if either the left or the right |
OK, I now understand what you mean by 'both match'. I was less bothered about patterns such as Where I think we have a problem is as follows. If we had:
should the following JSON be valid:
? I believe not. With IOR, a parser could see "f" present and "b" present and declare the choice valid. XOR was an attempt to say that if "f" was present, then "b" must not be. I think this is a key difference between ordered and unordered constructs. With ordered constructs (e.g. Unordered doesn't seem to map to regular expressions because there's no In the case of unordered, to get the result I want, I think when we get:
we have to modify it to:
Then we can use IOR logic. I would even go further and say, at least as a reference implementation (there maybe more efficient ways to do it), we flatten the entire expression into a single choice of sub-sequences. So we'd end up with:
Then if any branch is So to compute that, we could do the following steps:
The following would be considered valid:
The following would be considered invalid:
|
This may not help a lot, but this is how XML schema describes the validation of model groups (aka sequences, choice, all). JSON's unordered objects are more like |
I have to say, I think I'm lost. I thought we were all good with XOR logic on objects and unordered arrays. The issue, I think, is with arrays. Do we have a basic agreement at least on this? |
XOR was looking good until John pointed out that rules like:
would fail to validate the following JSON:
because both legs of the choice would evaluate to Granted, you could re-write the above to:
and have the input validate correctly, but it seems incorrect to have things that in Boolean logic terms are equivalent yield a different result depending on how you write it out. Hence the proposal of Plan C, to flatten out all rules to a choice of sub-sequences. One benefit of this is there's no chance of getting different behaviour depending on how you write the Boolean expression because it's always reduced to a common form first. |
So if I understand this, we're back where we started other than to explicitly state that the choice combiner means IOR. Am I right about this? The problem with Plan C, IMHO, is that it is too complicated to explain to JCR readers. Somebody troubleshooting why their JCR isn't working the way they want will probably be lost in the flattening rules. It seems to me that it is easier to say: 1) read the rules from left to right, 2) if one matches then "bingo!", 3) you must take into consideration zero-length matches (which, if they are familiar with regular expressions, they probably are already doing). |
On Tue, Sep 6, 2016 at 5:53 PM, Andrew Newton [email protected]
Agreed. I haven't thought enough about objects, still less unordered Are we trying to cater for JSON objects with non-unique keys? Original Let me know. John Cowan http://vrici.lojban.org/~cowan [email protected] |
On Tue, Sep 6, 2016 at 8:26 PM, johnwcowan [email protected] wrote:
As far as I can tell, the model is similar if not the same. We are not |
Yes, it would be useful to hear how RNG does it. The only non-unique keys like aspect in JCR is when a regular expression is used to specify member names (e.g. |
In principle the multiplying out / flattening is just the same as how you'd multiply out maths expressions. So (where
is analogous to:
which becomes:
or:
One problem I identified over night is how do you multiply out groups that have their own repetition count in addition to repetition counts on members within the group, e.g.:
I don't think it's insurmountable though. It might look like:
But if RNG delivers what we need I'm all for that. |
Relax NG's specification for choice is here: http://www.relaxng.org/spec-20011203.html#choice-pattern In rereading this thread, I can see that we have found cases where IOR and XOR have different utility with objects, and XOR is less useful with arrays. Here are some options:
|
I think it's saying that Given both of you are not happy with flattening the entire expression, I propose skipping that step, but still 'augmenting' the branches of the choice so that all of them contain the union of all members mentioned in the entire choice, annotating the ones that are added by the augmentation process as 'must be absent'. So:
is augmented and processed as:
Then do IOR. That allows:
But not:
Which I think is reasonable. |
It seems to me people would more instinctively write that as
Also when reading the first rule, are we telling people the |
While I suggest we treat the reduced for as:
and then do IOR of the branches.
If we tell people that
IMO objects and unordered arrays should be treated similarly. They're concerned by how often a member appears in the instance, and what follows what is not important. Ordered arrays need to capture what follows what, so we're following a path through the JSON instance and the above mechanism doesn't support that. |
I've been playing with your idea with examples on my whiteboard so that I can truly understand it, and going back to the examples you have given. That has been instructive for me. I think it is interesting that it acts like IOR in some cases and XOR in other cases, but ultimately I feel this is also its chief problem with respect to readability. As a user of JCR, to truly understand how the choice acts I have to rewrite the rules, and that is not a process I think most users will find natural or easy. By contrast, I think most people will find it to be easier to apply their existing knowledge of grouping and IOR and XOR to read and understand a rule. To me, this is similar to the regular expression test you provided in this thread: my reading of the regular expression did not match some of the results and it is not obvious to me why. I had to do some reading on zero-length matching and longest matches, etc... yesterday to try to understand it. I'm also somewhat concerned about the implementation. Supporting AND and IOR requires a simple loop through the rules with the ability to short circuit that loop. XOR requires a tad more work and requires at least two matches to short circuit, but all three can be done in the same loop (I'm sure there are other ways to implement it). To support the combined IOR/XOR logic, there needs to be a rule rewriting phase first then a run through the AND and IOR loop. This is more complex to implement and requires more runtime (I'm more concerned about the former than the latter to be honest). |
I think it's appropriate to differentiate between 'casual' users and implementers. The story you tell each one can be slightly different. So for users you can say something like: The
then the following JSON instances are valid:
The following is invalid:
For a rule where the solution space of the branches do overlap, e.g:
then the following JSON instances are valid:
Note that in the case of Then in Appendix ??? we go into the detail of augmentation etc. Regardless of what solution we adopt I think we should describe it as above. The casual user can then just rely on the examples, and draw on their experiences of regular expressions, XML schema, whatever. Only if they get a result they don't expect would they have to delve into the logic of the reference implementation. The less likely they are to get a surprising result, the less likely they are going to need to look into the detail. |
That's very good prose for describing the basics to a user. The actual specification I think is that each branch of a choice must contain the negation of the relative complement (set theory definition) of the rules of that branch with respect to the rules of each other branch, and the rules of these relative complements must have type That being said, I'm worried that we are creating a problem. Call it a gut instinct, but I worry because I cannot find any other schema language or the like that has solved this problem in a such a way, and I'm not a mathematician or familiar enough with set theory to prove that it will not be a problem. But I have been tinkering, and... Consider the following, where I'm using
The expectation being that these JSON objects are valid:
The first step is to reduce the inner most choices to IOR with relative complements:
Then repeat on the outer most choice:
From what I can tell, |
The other thing that gives me pause is the example from yesterday:
Another way to write this is:
Given |
I was proposing only augmenting with members that hadn't already been referred to in a branch, not sub-expressions. So I agree with your first expansion. For the second expansion I get:
So in that case I think all of the following are valid, as desired:
But the following would be invalid:
If we just did IOR, without augmentation, then the following would also be considered valid:
which just seems wrong to me :-) |
Regarding #88 (comment) , IMO |
WRT to the separate constructs, I agree but it is also in the class of things that should seem possible which is why we are discussing this new OR feature (we really need a name for it, like MERGE OR or MOR). It is to highlight this question: What is the behavior of WRT the subexpression exercise, I am now more confused. You said
Yet Additionally, if the operation only looks at the top OR, then there is no expansion of the sub-expressions, and so the resulting final expression is:
I don't think that's what we want as On the other hand, if we allowed for distinctions between the OR types so that XOR is allowed then the original expression could be written as (assuming
This would then get rewritten as
This would validate:
but not
|
When I look at:
there are 3
Starting with the first one, the union of all named members is:
If any of the union of members do not appear in both branches, they are inserted as
Similarly for the second
Substituting these into the third and final
The union of members is:
The left branch is missing $d, the right branch is missing $c. So the two branches become:
which leads to:
I'll call it But let's just go with IOR for the time being and I'll try to write some code to see how AOR works out in a bit more depth. (We should probably change the names of |
Oddly, I just worked that out on my whiteboard and was about to comment "nevermind". :) I believe the same works for unordered arrays. The only difference between the two algorithms is that for objects the augmented rules are always of type 'any' whereas with unordered arrays the type doesn't change (because the type is the item sought, whereas in objects the member name is being sought though a "match" also considers the type). Correct?
By this do you mean let's proceed with the current spec as is, and update it to AOR in the next revision? If so, I'm in agreement (though I would like to add a note to -07 hinting at this). And I too want to see if i can code this up now that I feel I understand the algorithm more completely. |
Hopefully it seems a lot simpler than you initially thought :)
That's exactly my thinking too. ORDERED arrays on the other hand need something else IMO.
Yes. Hopefully by then we'll be more confident with the algorithm (or not!), and maybe more community feedback on the expected behavior. |
Opening up my code editor this afternoon, some questions came to me on the AOR algorithm. I think I know the answers, but wanted to confirm them. Question 1: a
produces the following rewrite
correct? Question 2: like question 1 but with repeat max of 0
produces the following rewrite
correct? question 3: I think you answered this before, but what about possible repetition
produces the following rewrite
correct? For 2 and three, the algorithm is essentially the same: don't copy over the repetition into complement set. correct? question 4: with this last one I need to revert to straight JCR syntax
would NOT produce this
essentially, there would be no rewrite for this case because we determine what is in common based solely on the member name or regex. And if it is a regex, there is no regex normalization for the matching, right? In other words, there is no attempt to match question 5: like question 4 but for unordered arrays
produces the rewrite
correct? More to the point for unordered arrays, the type definitions in the complement are used "as-is". In other words, there's no attempt to say |
Both
Agreed.
Agreed.
Agreed
Agreed. Treat the regex as an opaque string. A case of "user beware". So the following would not match anything, but that's the coder's fault, not ours:
Agreed
I guess you mean if you have something like:
I need to think some more on choices of arrays and objects. It could be as simple as saying there's an array on both sides (don't care about the content), so no re-writing is required, and if any branch is true then its valid. Would be good to come up with some more use-cases, especially for objects.
|
Ok. I'll attack objects with the above first. And the see about unordered arrays after that. |
I'm trying to swap all this back in and start coding it. But here is the relevant part. I would appreciate verification that I'm at least on the right track. As of right now, AOR only applies to objects.
Here is a simple AOR to IOR example:
converts to
AOR acts as an Exclusive OR (XOR) in this simple case. Here is a more complicated example:
converts to
Here the valid JSON structures are:
but this is invalid
Now let's get more complicated by throwing in multiple levels of AOR:
coverts to
where the following is valid
but the following is not
Now for an even more complicated multi-level example of AORs
coverts to
where the following is valid
but the following is not
Given that the above is true, the following algorithm should be used in the rewrite:
This is rewriting the rules from the bottom to the top. The AOR to IOR rewrite is this:
Specifically unaccounted for in the original expression are member rules with @{not} and repetition max of 0. |
So it just occurred to me that my algorithm needs to take into account de-referencing rules when rule references are encountered, with a recursive dereference for group rules. |
AOR is ComplicatedAfter about 2 weeks of coding for AOR, I'm less comfortable with it than when I started. While I support the concept of AOR, the general algorithm needed to implement it is rather complex. Most of the implementation code for JCRValidator can be found here: https://github.com/arineng/jcrvalidator/blob/aor/lib/jcr/rewrite_aor.rb Around line 122 I outline the highlevel algorithm for implementing AOR. The code to implement is even much more complicated (though I admit about half of that complication is the use of the Parslet tree directly instead of an intermediate data model). Here are the things that make me uneasy. First, as mentioned in the algorithm steps, the need to dereference and flatten group structures I think can get rather complicated. I fear that we could easily have missed something in all this. Second, in working with AORs I feel that debugging JSON objects that don't validate requires quite a bit of work to determine what went wrong, because one must work with the rewritten rules. My code can print out the rewritten rules, but reading them can be a bit trying. In addition, since the rules are rewritten there is no way to point the end user to the exact spot where validation failed in the original rule. Third, you'll note that in the examples I put in the code comment it is my belief the algorithm implies that nested OR groups are to be treated like member rules in the rewriting, where the I do believe addressing the goals of AOR is beneficial, but after trying it I do not think AOR is the right approach. I do have an alternate proposal here. Moving forward, I'd like to set this issue aside for future consideration. Perhaps we can support AOR with an |
I've been thinking about validation of objects. If we have the following JCR:
should the following JSON be valid or not:
My feeling is not because the author is saying they want one or the other.
What do others expect?
The text was updated successfully, but these errors were encountered: