Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GROUP BY support for jsonb array values #35848

Closed
JorritSalverda opened this issue Mar 18, 2019 · 8 comments
Closed

GROUP BY support for jsonb array values #35848

JorritSalverda opened this issue Mar 18, 2019 · 8 comments
Labels
A-sql-json JSON handling in SQL. C-investigation Further steps needed to qualify. C-label will change. no-issue-activity

Comments

@JorritSalverda
Copy link

JorritSalverda commented Mar 18, 2019

I'm trying to do a group by query using values in a JSONB array. What I'm trying to figure out what features still need to be implemented to do what's show below and what alternative to use in the meantime.

CREATE TABLE computed_pipelines (
  id INT PRIMARY KEY DEFAULT unique_rowid(),
  labels JSONB NULL
);

INSERT INTO computed_pipelines
  (labels),
  ('[{"key": "group", "value": "a"}, {"key": "group", "value": "b"}]');
INSERT INTO computed_pipelines
  (labels),
  ('[{"key": "group", "value": "a"}, {"key": "group", "value": "c"}]');
INSERT INTO computed_pipelines
  (labels),
  ('[{"key": "group", "value": "a"}, {"key": "group", "value": "b"}]');

SELECT
  l->>'key' as key,
  l->>'value' as value,
  COUNT(DISTINCT c.id) as nr_computed_pipelines
FROM
  computed_pipelines c,
  jsonb_array_elements (c.labels) l
GROUP BY
  key,
  value
HAVING
  nr_computed_pipelines > 1
ORDER BY
  nr_computed_pipelines DESC,
  key,
  value;

Describe the solution you'd like

I'd like the select query above to return the following result:

  key   | value | nr_computed_pipelines
+-------+-------+-----------------------+
  group | a     | 3
  group | b     | 2

Instead it returns no data source matches prefix: c which is due to correlated subqueries not being supported yet, see #24676

Describe alternatives you've considered

Just retrieving the values from the array without grouping works fine with

SELECT
  jsonb_array_elements (c.labels)->>'key' as key,
  jsonb_array_elements (c.labels)->>'value' as value
FROM
  computed_pipelines c;

As soon as I try

SELECT
  jsonb_array_elements (c.labels)->>'key' as key,
  jsonb_array_elements (c.labels)->>'value' as value
FROM
  computed_pipelines c
GROUP BY
  key,
  value;

It returns column "key" does not exist. I think (partially) due to the fact that using aliases in group by, etc isn't supported yet. See #28059.

And if I try the following to work around this

SELECT
  jsonb_array_elements (c.labels)->>'key' as key,
  jsonb_array_elements (c.labels)->>'value' as value
FROM
  computed_pipelines c
GROUP BY
  jsonb_array_elements (c.labels)->>'key',
  jsonb_array_elements (c.labels)->>'value';

It fails with jsonb_array_elements(): generator functions are not allowed in GROUP BY. On this topic I found closed issue #10520.

An alternative I though about as well is to store the result of the non-grouped query in a temporary table, but for that #5807 needs to be implemented.

Are there any other features that need to be implemented before the original select query is possible? And what are the alternatives currently? Right now I don't group, count and sort in my query, but have to do it in code instead.

@awoods187
Copy link
Contributor

awoods187 commented Mar 18, 2019

I just confirmed that this does not work on master either. The inserts above didn't work for me until I modified them:

INSERT
INTO
    computed_pipelines (labels)
VALUES
    (
        '[{"key": "group", "value": "a"}, {"key": "group", "value": "b"}]'
    );
INSERT
INTO
    computed_pipelines (labels)
VALUES
    (
        '[{"key": "group", "value": "a"}, {"key": "group", "value": "c"}]'
    );
INSERT
INTO
    computed_pipelines (labels)
VALUES
    (
        '[{"key": "group", "value": "a"}, {"key": "group", "value": "b"}]'
    );

What's happening here @justinj ?

@JorritSalverda
Copy link
Author

JorritSalverda commented Mar 18, 2019

Oh sorry, I didn't test the insert query, just did them from the top of my head. I removed the incorrect braces. But can probably be merged into the following:

INSERT INTO
  computed_pipelines (labels)
VALUES
  (
    '[{"key": "group", "value": "a"}, {"key": "group", "value": "b"}]',
    '[{"key": "group", "value": "a"}, {"key": "group", "value": "c"}]',
    '[{"key": "group", "value": "a"}, {"key": "group", "value": "b"}]'
  );

But the inserts themselves are not really the issue here, just included them to illustrate my point :)

@awoods187 awoods187 added the A-sql-json JSON handling in SQL. label Mar 18, 2019
@awoods187 awoods187 added the C-investigation Further steps needed to qualify. C-label will change. label Mar 18, 2019
@justinj
Copy link
Contributor

justinj commented Mar 18, 2019

The problem is what @knz describes in the linked issue which is more about the implicit lateral syntax than it is correlated subqueries.

Here's a somewhat gnarly query that I think does what you want:

SELECT
    key, value, nr_computed_pipelines
FROM
    (
        SELECT
            key, value, count(DISTINCT id) AS nr_computed_pipelines
        FROM
            (
                SELECT
                    l->>'key' AS key, l->>'value' AS value, id
                FROM
                    (SELECT *, jsonb_array_elements(labels) AS l FROM computed_pipelines)
            )
        GROUP BY
            key, value
    )
WHERE
    nr_computed_pipelines > 1
ORDER BY
    nr_computed_pipelines DESC, key, value

You're right that #28059 is also a blocker to making this query a bit nicer.

@awoods187
Copy link
Contributor

Here is the lateral issue for tracking purposes #24560

@JorritSalverda
Copy link
Author

Cool the nasty query works! For making sure i'm actually unnesting an array I need to check the jsonb type, so the most inner select query becomes:

(SELECT id, jsonb_array_elements(labels) AS l FROM computed_pipelines where jsonb_typeof(labels) = 'array')

Should we keep this open until the dependencies are implemented? Or close it so it doesn't bloat your backlog?

@knz
Copy link
Contributor

knz commented Mar 18, 2019

I think at the very least we should integrate this query into the QA / unit test suite when the dependencies are implemented. I'm not sure how to best do this -- should the issue remain open so that the future implementor can look at it? @jordanlewis please advise.

@github-actions
Copy link

github-actions bot commented Jun 4, 2021

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
5 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@jordanlewis
Copy link
Member

The limitation here has been lifted. The OP's query still doesn't work, but that's because you're not allowed to reference an aliased group by column in a having clause even in other db's like postgres.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-json JSON handling in SQL. C-investigation Further steps needed to qualify. C-label will change. no-issue-activity
Projects
None yet
Development

No branches or pull requests

5 participants