Determinism? #60

Closed
cabo opened this issue Mar 10, 2021 · 33 comments

Comments

@cabo
Member

cabo commented Mar 10, 2021

Does applying a JSONPath query to a JSON data item always generate the same result?

Some selectors do not have a natural order of the nodelists they return, e.g., $.* applied to a JSON object

We could prescribe an order (e.g., sorting keys lexicographically or in JavaScript order) if determinism is important

We could disallow .* on JSON objects

@goessner
Collaborator

goessner commented Mar 10, 2021 via email

@glyn
Collaborator

glyn commented Mar 10, 2021

> I can't see a benefit of always guaranteeing the same -whatever- order.

A minor benefit would be that the Compliance Test Suite would be simpler if the output always had a defined implementation-independent ordering.

> And disallowing iterating over objects by '$.*' might break existing implementations

Agreed.

@cabo
Member Author

cabo commented Mar 10, 2021

(Determinism in interop testing is a major benefit for me. But that can be added by reprocessing after the JSONpath processing, so it is not completely a killer argument.)

@glyn
Collaborator

glyn commented Mar 10, 2021

Another selector which requires careful definition for ordering is .., particularly when this recurses over an object or objects.

@goessner
Collaborator

hmm ... compliance testing is a point, sure. Simple first tests of $.obj.* and $..* regarding ordering should be possible with current test suite ... ?

@glyn
Collaborator

glyn commented Mar 10, 2021

> hmm ... compliance testing is a point, sure. Simple first tests of $.obj.* and $..* regarding ordering should be possible with current test suite ... ?

Anything is "possible". ;-)

Tests involving objects which have only one name/value pair are trivially easy to test because any non-determinism of object enumeration is irrelevant. But tests involving objects with two or more name/value pairs need to capture the ordering possibilities somehow. Note that the current test suite is a JSON document so there would need to be a declarative way of specifying the ordering possibilities.
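One declarative option would be for each test to enumerate its admissible outputs and for harnesses to accept any of them. A minimal sketch (the field names `selector`, `document`, and `results` are hypothetical, not taken from the actual test suite):

```python
import json

# Hypothetical CTS entry: since object member order is
# implementation-defined, the test lists every admissible output.
test_case = json.loads("""
{
  "selector": "$.*",
  "document": {"a": 1, "b": 2},
  "results": [[1, 2], [2, 1]]
}
""")

def passes(actual, case):
    # The test passes if the implementation's output matches
    # any one of the declared orderings exactly.
    return actual in case["results"]

print(passes([2, 1], test_case))  # True
print(passes([1, 3], test_case))  # False
```

A test framework then needs only an exact-match loop over the alternatives, keeping the harness simple at the cost of more verbose test generation.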

@gregsdennis
Collaborator

We covered this topic over in the JSON Schema repo somewhere with quite a lengthy discussion. I'll see if I can dig it up.

In short, parsers in some languages just can't be made to be deterministic. Reading the same object from a file (where the keys are certainly in a defined sequence) will yield differing orderings when printing or even debugging the value.

This topic, therefore, falls into the "language agnosticism" category. I don't think it's something we can enforce.

@bettio

bettio commented Mar 12, 2021

> Some selectors do not have a natural order of the nodelists they return, e.g., $.* applied to a JSON object

> We could disallow .* on JSON objects

$.* is accepted (and correctly evaluated) by ~93% of the implementations in the JSONPath comparison.

Each of them has a number of users; it is really important not to disrupt their existing expressions and their habits.

> We could prescribe an order (e.g., sorting keys lexicographically or in JavaScript order) if determinism is important

I don't think determinism is that important: $.* on an object evaluates to an array, but semantically it is an unordered set (which is not a JSON type, therefore a different type is used ¹). That is still a meaningful result, and developers should remember that. Hence we shouldn't discard this feature.

Anyway, what I found confusing when implementing ExJSONPath was the lack of any official recommendation about the behaviour of $.object.* on an object.

I think we should clearly state that the result has to be treated as unordered, but we might suggest using lexicographic order for validation purposes when possible (I think a huge number of implementations are using ordered maps for JSON object storage, so it shouldn't be hard).
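The lexicographic suggestion is cheap to sketch. Here is a hypothetical wildcard step (not from any particular implementation) that sorts member names before yielding values, making the result reproducible:

```python
def wildcard(node):
    """Expand '.*' over a node. For objects, sort member names
    lexicographically so the resulting nodelist is reproducible;
    arrays already have a natural order."""
    if isinstance(node, dict):
        return [node[key] for key in sorted(node)]
    if isinstance(node, list):
        return list(node)
    return []  # atomic values have no children

print(wildcard({"b": 2, "a": 1}))  # [1, 2], regardless of insertion order
```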

> Determinism in interop testing is a major benefit for me.

Removing an implemented feature used by a number of users, merely for testing purposes, is something I would rather not discuss.

> hmm .. depends, how deterministic JSON parsers are when iterating over objects ... they should work in ascending chronological order ...

I think a number of JSONPath implementations are completely decoupled from the JSON parser, and they just accept maps/lists/atomic-values as input (ExJSONPath is one of them; it doesn't depend on any JSON parser and it doesn't care about JSON parsing).

I'm not even sure if JSONPath is only used for JSON inputs, it works so nicely with any JSON-like structure.

¹ I'm not 100% sure that all implementations are using JSON types, there might be implementations using unordered sets for real as containers for $.object.* results.

@gregsdennis
Collaborator

> Each of them has a number of users; it is really important not to disrupt their existing expressions and their habits.

A noble goal, to be sure. But I imagine that sudden conformance to a spec would naturally engender breaking changes, especially in a landscape of implementations as large as the one we're dealing with. I don't think it can be avoided.

As such, my feeling is that we get the spec right rather than make concessions just because the majority does it some way.

> I think a number of JSONPath implementations are completely decoupled from the JSON parser...

Absolutely correct.

@cabo
Member Author

cabo commented Mar 12, 2021 via email

@cabo
Member Author

cabo commented Mar 13, 2021

I think the summary so far is

  • We recognize that there are operations in JSONPath whose result is ordered but that order needs to be created based on unordered input.
  • We are not going to cut off our noses to spite our faces.
  • We have a choice between imposing a language-defined order (e.g., sorting) or leaving the result non-deterministic.

I gather that there is a general direction toward non-determinism, as imposing an order would be too expensive, but I'm not sure we have a conclusion here.

Also: https://mailarchive.ietf.org/arch/msg/jsonpath/N_emdQBtZg3xTuVQBFyXcjAR4ZE

@timbray
Contributor

timbray commented Mar 16, 2021

I agree with Carsten's general direction in support of nondeterminism, simply because 8259 explicitly discourages assigning semantics to the ordering of object members. If we want determinism, we probably have to specify RFC 8785 (JCS) conformance, which doesn't feel reasonable. As for testing, anyone who's doing this seriously in the real world has to have a JSONObject.equals() method already, right?

@glyn
Collaborator

glyn commented Mar 17, 2021

I also agree with supporting non-determinism, but please note that JSONObject.equals() won't necessarily help because the non-deterministic portions of the output won't necessarily be part of a JSONObject. For example, applying .* to an input consisting of a JSON object will produce a node list in a non-deterministic order.

I think we'd either need to decorate such tests with an indication of where non-determinism can occur, so that each test framework can compare the output to the expected value while taking non-determinism into account, or we'd need a way of capturing a set of possible alternative outputs. The latter may be preferable: it would complicate the generation of tests, which is a one-off process, but would simplify the implementation of test frameworks, of which there are likely to be many. (There may be other solutions too - I haven't spent very long on this.)

@gregsdennis
Collaborator

FWIW, in implementing JSON Schema, I've found that I need both contents-equality and sequence-equality on arrays depending on the circumstance.

Given that JSON Path outputs an array, and the contents of that array are likely to be generated non-deterministically due to evaluating objects, I think it makes sense for implementations to verify evaluation results (e.g. when processing a test suite) in a "contents-equality" sense. It's not per the JSON definition of array equality (which mandates sequence-equality), but I think the circumstance fits.

The fact that the result array is not guaranteed to be deterministic would need to be declared in the specification, probably accompanied by a remark that this deviates from JSON array equality and perhaps even an explanation why.
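Such a "contents-equality" check cannot simply use a set or hash-based multiset, because JSON objects and arrays are unhashable in many languages. One naive but self-contained sketch (O(n²); the function name is illustrative):

```python
def contents_equal(xs, ys):
    """True if xs and ys contain the same JSON values, ignoring order.
    Quadratic, but JSON values need not be hashable, so a set won't do."""
    if len(xs) != len(ys):
        return False
    remaining = list(ys)
    for x in xs:
        for i, y in enumerate(remaining):
            if x == y:
                del remaining[i]  # consume the match so duplicates count
                break
        else:
            return False  # x has no match left in ys
    return True

print(contents_equal([{"a": 1}, 2], [2, {"a": 1}]))  # True
print(contents_equal([1, 1, 2], [1, 2, 2]))          # False
```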

@cabo
Member Author

cabo commented Mar 17, 2021

Nodelists are not arrays.
(They can be represented in a JSONPath API as arrays. Or as lists. Or as tuples. ...)

@goessner
Collaborator

I do agree with common non-determinism here. Naming results as nodelists is clever. Shouldn't we define the term 'nodelist' then?

@cabo
Member Author

cabo commented Mar 17, 2021

Yes, we should define nodelist. I made a first attempt in PR #72 now.

@gregsdennis
Collaborator

We need to tread carefully here. Are we saying we're comfortable outputting something that's not JSON but acts really similar?

All implementations (including @goessner) output an array (or optionally false) as JSON. If we suddenly call the output a "nodelist" are we breaking that?

@cabo
Member Author

cabo commented Mar 17, 2021

> If we suddenly call the output a "nodelist" are we breaking that?

No. I agree that many APIs will output platform arrays here (which may or may not feel similar to JSON arrays). Saying that the output is a nodelist gives them freedom to define their API and how it presents the nodelist.

More importantly, we make processing for embedded queries less finicky.

@cabo
Member Author

cabo commented Mar 18, 2021

Daniel, nobody can (or will try to) stop you thinking about a nodelist as a JSON array, if that suits your mental model.
Giving a separate name to that concept does have its advantages.
It, for instance, helps finding answers to the role of nested queries in filter expressions.
It also helps reflect the fact that some implementations handle nodelists differently from others.
So I don't know why we are having this discussion.
(And why is it under "Determinism?".)

@gregsdennis
Collaborator

gregsdennis commented Mar 18, 2021

> It also helps reflect the fact that some implementations handle nodelists differently from others.

This is irrelevant because of many of the points that @danielaparker mentioned: essentially, implementations don't use nodelists from what we can tell, they use arrays. And isn't our charter to break existing implementations as little as possible? If all implementations use arrays, changing that is not something we should consider lightly.

@cabo We're discussing it here because of your comment. Following some discussion elsewhere about expressions, you spammed "nodelist" comments anywhere it might be related in order to promote the idea before anyone had a chance to discuss it. So now we're discussing it.


All of that said, I think we should consider the output format (#23 & #44), determinism (this issue), and paths embedded in expressions (#74) together. There is some subtle interplay going on.

If the output is an array of objects that specify location and value, then both that format and the sequence of the items within it need to be considered when returning from an embedded path so that the remainder of the expression can handle it properly.

If embedded paths are to follow some different output that operators can then use, we need to specify what that is as well. Maybe that is the idea of a nodelist. But we need to discuss it first and come to some level of agreement.

@cabo
Member Author

cabo commented Mar 18, 2021

It is a pretty common technique in model building to identify important concepts, and once one has identified them, to give them (tentative) names. That allows the participants to refer to the concept-in-formation, and to gradually increase certainty about it. Once there is enough certainty, it may be possible to identify some concepts with each other and turn them into one, but you don't do that before achieving that certainty.

I think much of the confusion comes from a perception that whatever is in the draft at one time is cast in concrete. That is not a healthy way to run this, and maybe I was making an unwarranted assumption that this would be obvious.

So, indeed, feedback is good, but not so much about reasons wanting to eliminate/equate concepts prematurely and end the discussion about them before understanding them, but more about the properties of the concept itself. The fact that XPath arrived at a specific form of nodelist after trying nodesets is certainly interesting feedback. We may not need that complexity, but it is much easier to eliminate complexity once we understand the concepts.

@glyn
Collaborator

glyn commented Mar 19, 2021

> Given that JSON Path outputs an array, and the contents of that array are likely to be generated non-deterministically due to evaluating objects, I think it makes sense for implementations to verify evaluation results (e.g. when processing a test suite) in a "contents-equality" sense. It's not per the JSON definition of array equality (which mandates sequence-equality), but I think the circumstance fits.

I'm not sure this is sufficient. Let me explain why with an example.

Take the input argument:

{ "a": {"c": 0, "d": 1},
  "b": {"e": 2, "f": 3}
}

and the selector $.*.*.

I think the valid outputs should be:

  • [0, 1, 2, 3]
  • [0, 1, 3, 2]
  • [1, 0, 2, 3]
  • [1, 0, 3, 2]
  • [2, 3, 0, 1]
  • [2, 3, 1, 0]
  • [3, 2, 0, 1]
  • [3, 2, 1, 0]

But I think the following output, for example, should be invalid:

  • [0, 2, 1, 3]

The spec should allow for this level of non-determinism, but not more.

Similarly, if the Compliance Test Suite covers this example, it should allow precisely the valid outputs to pass. I don't see how a "contents-equality" comparison of arrays would help. This makes me favour encoding sets of possible outputs in the CTS as described in #60 (comment).
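For what it's worth, this list of valid outputs can be generated mechanically: permute the member order of each object independently, then evaluate the two wildcard stages in sequence. A sketch (function name is hypothetical):

```python
from itertools import permutations, product

def valid_outputs(doc):
    """Enumerate every nodelist $.*.* may produce for a two-level
    object: the outer object's members may enumerate in any order,
    and so may each inner object's, but stages never interleave
    results across siblings."""
    results = set()
    for outer in permutations(doc.values()):            # first '*' stage
        inner = [permutations(child.values()) for child in outer]
        for choice in product(*inner):                  # second '*' stage
            results.add(tuple(v for order in choice for v in order))
    return sorted(results)

doc = {"a": {"c": 0, "d": 1}, "b": {"e": 2, "f": 3}}
for out in valid_outputs(doc):
    print(list(out))
# Prints exactly the eight orderings listed above; [0, 2, 1, 3] is absent.
```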

@cabo
Member Author

cabo commented Mar 19, 2021

>   • [0, 1, 2, 3]
>   • [0, 1, 3, 2]
>   • [1, 0, 2, 3]
>   • [1, 0, 3, 2]

Actually, [2, 3, 0, 1] etc. as well (the outer container also is a map).

@glyn
Collaborator

glyn commented Mar 19, 2021

> Actually, [2, 3, 0, 1] etc. as well (the outer container also is a map).

Whoops! Yes. Edited the list so it is now complete.

@gregsdennis
Collaborator

> (@glyn) But I think the following output, for example, should be invalid:
>
>   • [0, 2, 1, 3]

I'm curious why you think this should be invalid. I understand how you got the result, and if you include (or replace with) the paths to those values, it's just as (or more) clear.

Partial determinism is going to create a bunch of weird edge cases that will be impossible to completely define.

@cabo
Member Author

cabo commented Mar 19, 2021

That of course depends on the semantics of the nodelist. If, as I'm assuming, it is sequence-preserving, you cannot arrive at [0, 2, 1, 3], which would be (JSON pointer syntax, I'm lazy) /a/c, /b/e, /a/d, /b/f.

@goessner
Collaborator

goessner commented Mar 19, 2021

This is quite a complicated discussion. I'm trying to sum things up:

  • JSON objects as maps are unordered by nature.
  • Due to that, JSONPath query results are non-deterministic regarding their order – which is common sense here.
  • But determinism, i.e. reproducibility of JSONPath query results, is required for compliance tests.
  • Now implementations usually return query results as an array – pragmatically JSON in ... JSON out.
  • JSON arrays are ordered by nature, which might falsely signal determinism of results.
  • So calling the result a "node-list" might help out here, but introduces another level of complexity in terms and their usage.

Having understood that determinism is solely required for compliance testing, I am not sure how to interpret Glyn's example above. Should future compliance tests really spell out every possible valid and invalid result ordering?

I am sure someone among us would be able to implement something that exactly reproduces [0, 2, 1, 3] as a result (no one would, of course). Thus it's implementation dependent.

But if we only need to ensure that we always generate a single reproducible result, we can achieve this by something like ...

  1. Generate a list of output paths.
  2. Sort it lexically or however.
  3. Generate the list of values from that path list – preserving order.

... which we can even communicate in the draft as a recipe.

Regarding "node-list", there is indeed in JavaScript a concept of "array-like objects". Those objects only need to be able to tell the list length and provide an iterator to visit all entries. De facto, DOM node-lists are exactly such things. This is what the above discussion reminds me of.

Maybe I'm still missing another important point.
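The three-step recipe above might look like the following sketch, using JSON Pointer-style paths for illustration (function names are hypothetical; a real implementation would also escape `/` and `~` in member names per RFC 6901):

```python
def wildcard_paths(doc):
    """Step 1: produce the output paths for '$.*' over an object,
    in whatever order the map happens to iterate."""
    return ["/" + key for key in doc]

def deterministic_result(doc):
    paths = sorted(wildcard_paths(doc))   # step 2: sort lexically
    return [doc[p[1:]] for p in paths]    # step 3: values, preserving path order

print(deterministic_result({"b": 2, "a": 1}))  # [1, 2]
```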

@glyn
Collaborator

glyn commented Mar 19, 2021

> * But determinism, i.e. reproducibility of JSONPath query results, is required for compliance tests.

Neither determinism of the spec nor determinism for a given JSONPath implementation are necessary for compliance tests, but non-determinism does add some complexity to testing. That said, I am in favour of non-determinism in the spec because it gives the most freedom to implementations.

@glyn
Collaborator

glyn commented Mar 19, 2021

> (@glyn) But I think the following output, for example, should be invalid:
>
>   • [0, 2, 1, 3]
>
> I'm curious why you think this should be invalid.

Because I would expect each "stage" of a selector to be evaluated in turn. In other words, the intermediate results which are output from one stage to the next can be in a non-deterministic order, but that order won't be modified by other stages as those stages are operating on the individual values from earlier stages.

> I understand how you got the result, and if you include (or replace with) the paths to those values, it's just as (or more) clear.

Sorry, but what is more clear?

> Partial determinism is going to create a bunch of weird edge cases that will be impossible to completely define.

I don't think so. I think once we have addressed the non-determinism of specific selector "stages" applied to objects, that's pretty much it. What you describe as "partial determinism" is simply a result of how the spec is constructed from small pieces in a uniform way.
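The stage-by-stage model can be sketched directly: each stage maps over the previous stage's nodelist in order and concatenates the per-node results, so sibling results never interleave. (Names are illustrative; `wildcard` handles only objects for brevity.)

```python
def apply_stage(nodelist, stage):
    """Apply one selector stage to every node of the incoming
    nodelist in order, concatenating the per-node results. The
    order *within* one object's expansion is free, but the outer
    sequence is preserved, so [0, 2, 1, 3] can never arise."""
    out = []
    for node in nodelist:
        out.extend(stage(node))
    return out

def wildcard(node):
    # Object member order is non-deterministic in principle;
    # Python's insertion-ordered dicts stand in for it here.
    return list(node.values()) if isinstance(node, dict) else []

doc = {"a": {"c": 0, "d": 1}, "b": {"e": 2, "f": 3}}
print(apply_stage(apply_stage([doc], wildcard), wildcard))
# [0, 1, 2, 3] with insertion-ordered dicts
```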

@glyn
Collaborator

glyn commented Mar 19, 2021

To be clear, non-determinism in the spec doesn't force implementations to be non-deterministic. But it allows implementations to trade off efficiency against non-determinism. It also allows for variations between (deterministic) implementations.

@gregsdennis
Collaborator

Also related to (maybe duplicates) #27.

@cabo
Member Author

cabo commented Nov 10, 2021

Closed by #134
