Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Broaden the join types that can be used with UNNEST. #8200

Open
JonNorman opened this issue Jun 6, 2017 · 12 comments
Open

Broaden the join types that can be used with UNNEST. #8200

JonNorman opened this issue Jun 6, 2017 · 12 comments

Comments

@JonNorman
Copy link

This is the first time I've raised an issue in this repo, please let me know if this should be raised elsewhere / stated differently / other pre-requirements.

TL;DR: When using UNNEST, joins other than CROSS JOIN should be available to limit the rows being returned.

When expanding an array (using UNNEST and CROSS JOIN), it is often necessary to limit the result set for each row with a subsequent where clause. This can become a performance issue when the expanded array has many elements in it and there are many rows being unnested and returned before then being discarded.

Currently this has to be done as follows:

-- we only want items below 50
select  name, item
from    (
        values
          ('Sarah', sequence(1, 10000000, 1)),
          ('Jose', Array[1, 10000000, 1])
        ) as results (name, items)
cross join unnest(items) as t (item)
where   item < 50;

I want to be able to write:

-- we only want items below 50
select  name, item
from    (
        values
          ('Sarah', sequence(1, 10000000, 1)),
          ('Jose', Array[1, 10000000, 1])
        ) as results (name, items)
left join unnest(items) as t (item) on item < 50;

or with inner join.

Executing the above statement produces the following error:
UNNEST on other than the right side of CROSS JOIN is not supported.

Is there a good reason for this restriction? I can't find any discussion of this in the issues or on other forums and would suggest that the expected behaviour of using different joins is clear and should be supported.

@stale
Copy link

stale bot commented Jun 6, 2019

This issue has been automatically marked as stale because it has not had any activity in the last 2 years. If you feel that this issue is important, just comment and the stale tag will be removed; otherwise it will be closed in 7 days. This is an attempt to ensure that our open issues remain valuable and relevant so that we can keep track of what needs to be done and prioritize the right things.

@stale stale bot added the stale label Jun 6, 2019
@stale stale bot closed this as completed Jun 13, 2019
@akshaymahajans
Copy link

Is there any plan for support on this to be added?

@ThirukkaKarnan
Copy link

https://stackoverflow.com/a/44927437 - Using left join is the only way to prevent unnest from dropping the rows with null values for the column that is unnested. Kindly prioritize this ask.

@shixuan-fan
Copy link
Contributor

Hey thanks for raising this up. I'm not aware of active development on this issue, but I'll reopen it for tracking. If it is specified in SQL spec, then I don't see a reason it should not be supported, but I'll let the committers chime in, and if they feel okay with this issue, it should be ready to pick up.

@shixuan-fan shixuan-fan reopened this Oct 9, 2019
@stale stale bot removed the stale label Oct 9, 2019
@rongrong
Copy link
Contributor

rongrong commented Oct 11, 2019

https://stackoverflow.com/a/44927437 - Using left join is the only way to prevent unnest from dropping the rows with null values for the column that is unnested. Kindly prioritize this ask.

This seems to be reasonable. But the motivation seems to be different from the original issue. I'm wondering what's the motivation for the original (with a filter on the unnest rows). on item < 50 doesn't look like standard SQL.

@JonNorman
Copy link
Author

Just to hop back onto this...

on item < 50 doesn't look like standard SQL.

I think it would be valid SQL - conditions can be placed in a join condition that relate to only one side of the join. This is most useful in an inner join since the conditions act as a join conditions and, effectively, filters.

I'm wondering what's the motivation for the original (with a filter on the unnest rows).

As above, I think that the most illustrative example would be when using an inner join. Instead of expanding every row and then filtering it out, we would only be joining to specific values in the array and then returning those. Unless this inefficiency is already obviated by existing join/filter optimisation logic, I think this would be an improvement.

@rongrong
Copy link
Contributor

@JonNorman The ON part of JOIN defines the condition of the join. I don't think SQL spec necessarily dictates the implementation. "Instead of expanding every row and then filtering it out, we would only be joining to specific values in the array and then returning those." is an implementation optimization. We can potentially do this optimization without changing semantics. Theoretically query engine can decide that we can push down item < 50 to UNNEST in

select  name, item
from    (
        values
          ('Sarah', sequence(1, 10000000, 1)),
          ('Jose', Array[1, 10000000, 1])
        ) as results (name, items)
cross join unnest(items) as t (item)
where   item < 50;

As long as the query engine is smart enough. Presto currently has a quite naive (and inefficient) implementation of UNNEST. Whether we should optimize its performance should be discussed orthogonally from whether we should support more join types.

@JonNorman
Copy link
Author

We can potentially do this optimization without changing semantics. Theoretically query engine can decide that we can push down item < 50 to UNNEST.

@rongrong I totally agree with you here and perhaps that is already happening - it just isn't clear that this is happening when writing the SQL. Theoretically I suppose we could remove general inner join support and use predicate pushdown on a left join and where with the same performance, but I think we'd all agree that this would be suboptimal!

For me the argument for why we want different join types in general applies for why we'd want it when using UNNEST.

At a high level, my motivation for raising this issue is that I it's my opinion that the semantics of a join should be the same, whether we are joining to a table/subquery/unnested dataset, and this issue just highlights an area in which I found that they diverge (for seemingly no reason other than it hasn't been implemented).

@rongrong
Copy link
Contributor

I agree. But again, supporting more join types is a separate topic from the example you gave, which is more related to improving UNNEST performance. Both are valid issues. Do you want to work on these?

@JonNorman
Copy link
Author

Hi @rongrong - apologies for dropping this one. I would like to contribute, but am not able to do so at the moment. I'm keeping this on my list though so hope to be able to at some point.

Whilst I agree that they are two different issues, I think that they are closely related, and so I'd be most interested in expanding join support semantics. This is because I expect (perhaps naively!) that by allowing for explicit inner join during an unnest, it would lead to performance benefits as compared to the current behaviour.

@CONSULTANTSF
Copy link

Any chance that LEFT JOIN UNNEST will be added to Presto?

@linghengqian
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants