Duplicates in selector output #23

gregsdennis · 2020-09-25T05:59:57Z

Each selector may each return multiple nodes, or even the same node multiple times. This raises the question of whether (and if so, how) the resulting collection of nodes needs to be distinct, i.e. without duplicates.

An example:

JSON instance:
{
  "a": [
    "string",
    null,
    true
  ],
  "b": [
    false,
    "string",
    5.4
  ]
}

Path:
$.*[0,1]

The raw result set of this selector is ["string", null, false, "string"]. If we say that we expect only the distinct set of nodes, then the results would be ["string", null, false], removing the duplicate "string".

So the question up for discussion is, "Should selectors return distinct sets of nodes or are duplicates preferred?"

I had been considering a location-based distinction (based on the location of the value in the original JSON structure) as a possibility, but I don't think it makes much sense given the definition of a selector in Section 2.5, taking multiple nodes and outputing multiple nodes. A selector knows nothing about the original JSON structure that is being evaluated.

However for posterity, consider the following (rather contrived) path evaluated against the above instance:

$.*[0,:5]

In this case, the "string" value which is the first element of the a property would be selected twice, and the "string" value in the second element of the b property would be selected once, resulting in three "string" values in the results. For location-based distinction, only the value from a would be considered duplicate, leaving two instances of "string" in the result set.

Again, I don't think that this is a viable approach due to how a selector works and the information it's given. However, if we were to also require location as part of the output, then this would be a viable solution.

The text was updated successfully, but these errors were encountered:

remorhaz · 2020-09-25T14:14:18Z

As for me I haven't see any implementations that tries to remove duplicate values from result, and I don't think it's a good idea to introduce this behavior as default one. But maybe it's useful to provide some way to filter off duplicated values explicitly, like $.*[0,1].unique().

But your second case (about selecting the same node twice) is more interesting, because we have to provide some default behavior; I will investigate the implementations on this case.

gregsdennis · 2020-09-25T21:46:29Z

I think it might be useful to throw a scenario in the consensus project to get actual results of what the various implementations do.

gregsdennis · 2020-09-28T19:24:21Z

A scenario has been added. Just waiting on the report to regenerate.

It looks like the majority just don't support the example query. For those that do, it seems to be an even mix of all three approaches. No concensus implies no pressure to implement any particular scenario, but we need to discuss and agree on it.

glyn · 2020-10-01T04:11:58Z

An option is to allow some latitude in the spec and say that "duplicates MAY be removed, either by location and value". But I think in that case we'd also need to tie down when that applies. Should it just apply to unions? Could it also apply in other cases, e.g. when using [*] against an array containing duplicate values (e.g. because there are duplicate values in the input document)?

"MAY" will considerably complicate the CTS FWIW, especially if the CTS is purely declarative (e.g. cts.json) as that would put the onus on each CTS implementation to allow duplicates to be removed if appropriate.

Also, does allowing duplicates to be removed imply anything about ordering? Can non-duplicates change order, for example?

gregsdennis · 2020-10-01T05:40:16Z

I think it can apply to any case where the same value appears in the result more than once.

I've opened a new issue for discussing result sequence. I think we need to answer that before we can discuss how it affects duplicates.

gregsdennis · 2021-01-05T00:49:08Z

To expand on the reason for combining duplicates a little, I offer a more extensive example and a somewhat informal proof of why eliminating duplicates would be a good thing (when not including location in the output):

The instance is similar to Goessner's example, but I've added color to the books and moved the other items under a "toy" category so that I can search for array items and use an expression.

{
  "store": {
    "book": [ 
      {
        "category": "reference",
        "author": "Nigel Rees",
        "title": "Sayings of the Century",
        "price": 8.95      },
      {
        "category": "fiction",
        "author": "Evelyn Waugh",
        "title": "Sword of Honour",
        "price": 12.99,
        "color": "blue"
      },
      {
        "category": "fiction",
        "author": "Herman Melville",
        "title": "Moby Dick",
        "isbn": "0-553-21311-3",
        "price": 8.99,
        "color": "red"
      },
      {
        "category": "fiction",
        "author": "J. R. R. Tolkien",
        "title": "The Lord of the Rings",
        "isbn": "0-395-19395-8",
        "price": 22.99,
        "color": "green"
      }
    ],
    "bicycle": [
      {
        "color": "red",
        "price": 19.95
      },
    ],
    "tricycle": [
      {
        "color": "blue",
        "price": 9.95
      },
      {
        "color": "red",
        "price": 19.95
      }
    ]
  }
}

Now if we want the prices of all the things that are red, we can use $..[?(@.color == "red")].price. This results in [8.99, 19.95, 19.95] without removal and [8.99, 19.95] with removal.

Let's ignore the .price for now to see what how the duplicate values are produced.

The path $..[?(@.color == "red")] returns three objects:

// $.store.book[2]
{
  "category": "fiction",
  "author": "Herman Melville",
  "title": "Moby Dick",
  "isbn": "0-553-21311-3",
  "price": 8.99,
  "color": "red"
}

// $.bicycle[0]
{
  "color": "red",
  "price": 19.95
}

// $.tricycle[1]
{
  "color": "red",
  "price": 19.95
}

Notice that the results for $.bicycle[0] and $.tricycle[1] are the same. Any processing on further path segments will result in the same output. Therefore we should only need to perform further selection once on these values.

Duplicate removal resolves this problem.

Even when we include locations, this can be used as an internal optimization. An implementation could identify that these are the same object and keep track of all the locations which contain it.

mkmik · 2021-01-05T01:13:03Z

I'm confused. Why do $.toy[0] and $.toy[2] in your last example lack the name property?

gregsdennis · 2021-01-05T05:29:10Z

Ugh... Oversight of editing while authoring. I'll fix it.

Aaand now I have to rework it to get the duplicates again.

gregsdennis · 2021-01-05T05:34:42Z

@mkmik updated.

glyn · 2021-01-05T15:09:56Z

Even when we include locations, this can be used as an internal optimization.

I think this is trading off performance, in some fairly rare cases, against simplicity of the spec. I'd prefer to keep the spec simpler.

danielaparker · 2021-02-01T04:06:24Z

An option is to allow some latitude in the spec and say that "duplicates MAY be removed, either by location and value".

Allowing latitude defeats interoperability, which I had thought was the raison d'etre for the IETF. But I note there's no mention of interoperability as a design goal in the draft.

From previous discussions, I think it is only necessary to consider removal of duplicates by location (not value.)

Should it just apply to unions?

I don't think so. I think it should apply however duplicates arise, whether unions, wildcards, recursive descent. This also simplifies removal.

"MAY" will considerably complicate the CTS FWIW, especially if the CTS is purely declarative (e.g. cts.json) as that would put the > onus on each CTS implementation to allow duplicates to be removed if appropriate.

Or perhaps, either duplicates must be removed, or duplicates must not be removed, or implementations must default to one of these two, and provide an option for the other.

Also, does allowing duplicates to be removed imply anything about ordering? Can non-duplicates change order, for example?

I don't think it affects order at all. Given the data

{
    "books":
    [
        {
            "title" : "A Wild Sheep Chase",
            "author" : "Haruki Murakami",
            "price" : 22.72
        },
        {
            "title" : "The Night Watch",
            "author" : "Sergei Lukyanenko",
            "price" : 23.58
        },
        {
            "title" : "The Comedians",
            "author" : "Graham Greene",
            "price" : 21.99
        },
        {
            "title" : "The Night Watch",
            "author" : "Phillips, David Atlee"
        }
    ]
}

and selector

$.books[1,1,3].title

The results with duplicates are:

Values	Paths
"The Night Watch"	`$['books'][1]['title']`
"The Night Watch"	`$['books'][1]['title']`
"The Night Watch"	`$['books'][3]['title']`

The results without duplicates are:

Values	Paths
"The Night Watch"	`$['books'][1]['title']`
"The Night Watch"	`$['books'][3]['title']`

Clearly the second set can easily be obtained from the first without rearranging the items, see How do you remove duplicates from a list whilst preserving order? Note that implementations have some latitude as to when to remove duplicates. As long as both values and normalized paths are collected, the implementation could delay removal until the last step, before returning results to the user.

danielaparker · 2021-02-01T05:18:08Z

To expand on the reason for combining duplicates a little, I offer a more extensive example and a somewhat informal proof of
why eliminating duplicates would be a good thing (when not including location in the output):

Following up on gregsdennis's motivation for wanting to remove duplicates, there's a related use case that's applicable when a user wants to modify the original data based on addressing information obtained from a query. Such addressing information could be in the form of normalized paths, or it could be provided by references or pointers into the original data, in languages that support that. In either case, the issue is the same.

So suppose that a user queries for a collection of prices for all books in the store, and wishes to discount them all by twenty percent. If duplicates come back, if the same price gets marked down twice, that's going to be a problem. This is not entirely hypothetical, there are jsonpath libraries that are used in this way.

More generally, it's hard to think of use cases where duplicates would be beneficial. It's possible to think of use cases where they would be detrimental. For an implementation that collects normalized paths along with values, removal is easy. Of course, removal comes with a cost, and some users may prefer not to pay that cost, particularly as most straightforward queries don't result in duplicates anyway.

glyn · 2021-03-26T15:43:15Z

I know the jury is still out on whether we remove duplicate nodes, but I've prepared draft PRs for the compliance test suite (jsonpath-standard/jsonpath-compliance-test-suite#2) and the reference implementation (jsonpath-standard/jsonpath-reference-implementation#27) to handle removal of duplicate nodes. Any comments gratefully received.

goessner · 2021-03-26T16:07:57Z

I am preparing an issue (#88) about unions, where removing duplicate nodes is also a point

danielaparker · 2021-03-27T19:25:32Z

I have a question about duplicates.

Consider a root value

{
    "books": [
        {"title": "A Wild Sheep Chase"},
        {"title": "The Comedians"}
    ]
}

and suppose that a query returned two values

[
    [
        {"title": "A Wild Sheep Chase"},
        {"title": "The Comedians"}
    ],
    {"title": "A Wild Sheep Chase"}
]

corresponding to paths

[
    "$['books']",
    "$['books'][0]"
]

Would the node identified by "$['books'][0]" be considered a duplicate?

gregsdennis · 2021-03-27T22:52:53Z

Would the node identified by "$['books'][0]" be considered a duplicate?

No, we only consider duplicates at the top level of the returned nodelist. So the values to compare are an array and a string. That the array also happens to contain the string makes no difference.

mkmik · 2021-05-11T10:31:59Z

Consider:

{
  "x": {
    "a": {
      "a": {
        "b": 1
      }
    }
  }
}

and this jsonpath: $..a..b:

The first part of the jsonpath $..a selects nodes {"a": {"a": { "b": 1}}} and {"a": { "b": 1}}

applying ..b on that result set, produces the node 1 twice

cabo · 2021-05-11T10:34:27Z

Yes, and we need to agree whether:

(1) the duplicate nodes are always delivered (which is what we decided today)

(2) the duplicate nodes are never delivered (which is what we went into the discussion with)

(3) it depends on the implementation (which IMHO would be a disaster)

mkmik · 2021-05-11T10:59:21Z

IIRC today we were wondering whether:

(5) the behavior of the descendent selector may be a special case

cabo · 2021-11-10T10:57:52Z

Maybe we can record the resolution that there is no removal, neither of duplicate values nor of duplicate nodes.

gregsdennis · 2021-11-11T05:15:21Z

Maybe we can record the resolution that there is no removal

I'd be interested in how this resolution came about because it seems like the discussion here was leaning the other way: removing duplicates to some degree. To what degree was still under discussion.

cabo · 2022-01-17T23:31:58Z

Slides for 112 said this:

Discussion confused by the lack of distinction between duplicate values and duplicate nodes.
Pretty clear that we don't remove duplicate values.
PR #134 makes it less attractive to remove duplicate nodes.
Proposal: Don't.

We didn't have time to discuss this at 112, so let's confirm the resolution on the 2022-01-18 interim.

This was referenced Sep 27, 2020

Add scenario to check for handling of duplicate values in output cburgmer/json-path-comparison#71

Closed

"Union" could have more description, and maybe a new name #21

Closed

gregsdennis mentioned this issue Oct 1, 2020

Sequence of results #27

Closed

ghost added the processing-model label Mar 10, 2021

gregsdennis mentioned this issue Mar 18, 2021

Determinism? #60

Closed

goessner mentioned this issue Apr 30, 2021

Filter Expressions #64

Closed

cabo added the overtaken by events -- can we close this now? label Nov 9, 2021

cabo added the editorial label Jun 14, 2022

glyn self-assigned this Jul 14, 2022

glyn closed this as completed in a6fb8f7 Jul 15, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Duplicates in selector output #23

Duplicates in selector output #23

gregsdennis commented Sep 25, 2020 •

edited

Loading

remorhaz commented Sep 25, 2020 •

edited

Loading

gregsdennis commented Sep 25, 2020

gregsdennis commented Sep 28, 2020 •

edited

Loading

glyn commented Oct 1, 2020

gregsdennis commented Oct 1, 2020

gregsdennis commented Jan 5, 2021 •

edited

Loading

mkmik commented Jan 5, 2021

gregsdennis commented Jan 5, 2021 •

edited

Loading

gregsdennis commented Jan 5, 2021

glyn commented Jan 5, 2021

danielaparker commented Feb 1, 2021 •

edited

Loading

danielaparker commented Feb 1, 2021 •

edited

Loading

glyn commented Mar 26, 2021

goessner commented Mar 26, 2021 •

edited by glyn

Loading

danielaparker commented Mar 27, 2021

gregsdennis commented Mar 27, 2021

mkmik commented May 11, 2021

cabo commented May 11, 2021

mkmik commented May 11, 2021

cabo commented Nov 10, 2021

gregsdennis commented Nov 11, 2021

cabo commented Jan 17, 2022 •

edited

Loading

Duplicates in selector output #23

Duplicates in selector output #23

Comments

gregsdennis commented Sep 25, 2020 • edited Loading

remorhaz commented Sep 25, 2020 • edited Loading

gregsdennis commented Sep 25, 2020

gregsdennis commented Sep 28, 2020 • edited Loading

glyn commented Oct 1, 2020

gregsdennis commented Oct 1, 2020

gregsdennis commented Jan 5, 2021 • edited Loading

mkmik commented Jan 5, 2021

gregsdennis commented Jan 5, 2021 • edited Loading

gregsdennis commented Jan 5, 2021

glyn commented Jan 5, 2021

danielaparker commented Feb 1, 2021 • edited Loading

danielaparker commented Feb 1, 2021 • edited Loading

glyn commented Mar 26, 2021

goessner commented Mar 26, 2021 • edited by glyn Loading

danielaparker commented Mar 27, 2021

gregsdennis commented Mar 27, 2021

mkmik commented May 11, 2021

cabo commented May 11, 2021

mkmik commented May 11, 2021

cabo commented Nov 10, 2021

gregsdennis commented Nov 11, 2021

cabo commented Jan 17, 2022 • edited Loading

gregsdennis commented Sep 25, 2020 •

edited

Loading

remorhaz commented Sep 25, 2020 •

edited

Loading

gregsdennis commented Sep 28, 2020 •

edited

Loading

gregsdennis commented Jan 5, 2021 •

edited

Loading

gregsdennis commented Jan 5, 2021 •

edited

Loading

danielaparker commented Feb 1, 2021 •

edited

Loading

danielaparker commented Feb 1, 2021 •

edited

Loading

goessner commented Mar 26, 2021 •

edited by glyn

Loading

cabo commented Jan 17, 2022 •

edited

Loading