The subquery command should be implemented using a clean, logical syntax that integrates with existing PPL structure.
source=logs | where field in [ subquery source=events | where condition | fields field ]
In this example, the primary search (source=logs
) is filtered by results from the subquery (source=events
).
The subquery command should allow nested queries to be as complex as necessary, supporting multiple levels of nesting.
Example:
source=logs | where id in [ subquery source=users | where user in [ subquery source=actions | where action="login" | fields user] | fields uid ]
For additional info See Issue
source = outer | where a in [ source = inner | fields b ]
source = outer | where (a) in [ source = inner | fields b ]
source = outer | where (a,b,c) in [ source = inner | fields d,e,f ]
source = outer | where a not in [ source = inner | fields b ]
source = outer | where (a) not in [ source = inner | fields b ]
source = outer | where (a,b,c) not in [ source = inner | fields d,e,f ]
source = outer a in [ source = inner | fields b ]
(search filtering with subquery)source = outer a not in [ source = inner | fields b ]
(search filtering with subquery)source = outer | where a in [ source = inner1 | where b not in [ source = inner2 | fields c ] | fields b ]
(nested)source = table1 | inner join left = l right = r on l.a = r.a AND r.a in [ source = inner | fields d ] | fields l.a, r.a, b, c
(as join filter)
SQL Migration examples with IN-Subquery PPL:
- tpch q4 (in-subquery with aggregation)
select
o_orderpriority,
count(*) as order_count
from
orders
where
o_orderdate >= date '1993-07-01'
and o_orderdate < date '1993-07-01' + interval '3' month
and o_orderkey in (
select
l_orderkey
from
lineitem
where l_commitdate < l_receiptdate
)
group by
o_orderpriority
order by
o_orderpriority
Rewritten by PPL InSubquery query:
source = orders
| where o_orderdate >= "1993-07-01" and o_orderdate < "1993-10-01" and o_orderkey IN
[ source = lineitem
| where l_commitdate < l_receiptdate
| fields l_orderkey
]
| stats count(1) as order_count by o_orderpriority
| sort o_orderpriority
| fields o_orderpriority, order_count
2.tpch q20 (nested in-subquery)
select
s_name,
s_address
from
supplier,
nation
where
s_suppkey in (
select
ps_suppkey
from
partsupp
where
ps_partkey in (
select
p_partkey
from
part
where
p_name like 'forest%'
)
)
and s_nationkey = n_nationkey
and n_name = 'CANADA'
order by
s_name
Rewritten by PPL InSubquery query:
source = supplier
| where s_suppkey IN [
source = partsupp
| where ps_partkey IN [
source = part
| where like(p_name, "forest%")
| fields p_partkey
]
| fields ps_suppkey
]
| inner join left=l right=r on s_nationkey = n_nationkey and n_name = 'CANADA'
nation
| sort s_name
Assumptions: a
, b
are fields of table outer, c
, d
are fields of table inner, e
, f
are fields of table inner2
source = outer | where exists [ source = inner | where a = c ]
source = outer | where not exists [ source = inner | where a = c ]
source = outer | where exists [ source = inner | where a = c and b = d ]
source = outer | where not exists [ source = inner | where a = c and b = d ]
source = outer exists [ source = inner | where a = c ]
(search filtering with subquery)source = outer not exists [ source = inner | where a = c ]
(search filtering with subquery)source = table as t1 exists [ source = table as t2 | where t1.a = t2.a ]
(table alias is useful in exists subquery)source = outer | where exists [ source = inner1 | where a = c and exists [ source = inner2 | where c = e ] ]
(nested)source = outer | where exists [ source = inner1 | where a = c | where exists [ source = inner2 | where c = e ] ]
(nested)source = outer | where exists [ source = inner | where c > 10 ]
(uncorrelated exists)source = outer | where not exists [ source = inner | where c > 10 ]
(uncorrelated exists)source = outer | where exists [ source = inner ] | eval l = "nonEmpty" | fields l
(special uncorrelated exists)
SQL Migration examples with Exists-Subquery PPL:
tpch q4 (exists subquery with aggregation)
select
o_orderpriority,
count(*) as order_count
from
orders
where
o_orderdate >= date '1993-07-01'
and o_orderdate < date '1993-07-01' + interval '3' month
and exists (
select
l_orderkey
from
lineitem
where l_orderkey = o_orderkey
and l_commitdate < l_receiptdate
)
group by
o_orderpriority
order by
o_orderpriority
Rewritten by PPL ExistsSubquery query:
source = orders
| where o_orderdate >= "1993-07-01" and o_orderdate < "1993-10-01"
and exists [
source = lineitem
| where l_orderkey = o_orderkey and l_commitdate < l_receiptdate
]
| stats count(1) as order_count by o_orderpriority
| sort o_orderpriority
| fields o_orderpriority, order_count
Assumptions: a
, b
are fields of table outer, c
, d
are fields of table inner, e
, f
are fields of table nested
Uncorrelated scalar subquery in Select
source = outer | eval m = [ source = inner | stats max(c) ] | fields m, a
source = outer | eval m = [ source = inner | stats max(c) ] + b | fields m, a
Uncorrelated scalar subquery in Where
source = outer | where a > [ source = inner | stats min(c) ] | fields a
Uncorrelated scalar subquery in Search filter
source = outer a > [ source = inner | stats min(c) ] | fields a
Correlated scalar subquery in Select
source = outer | eval m = [ source = inner | where outer.b = inner.d | stats max(c) ] | fields m, a
source = outer | eval m = [ source = inner | where b = d | stats max(c) ] | fields m, a
source = outer | eval m = [ source = inner | where outer.b > inner.d | stats max(c) ] | fields m, a
Correlated scalar subquery in Where
source = outer | where a = [ source = inner | where outer.b = inner.d | stats max(c) ]
source = outer | where a = [ source = inner | where b = d | stats max(c) ]
source = outer | where [ source = inner | where outer.b = inner.d OR inner.d = 1 | stats count() ] > 0 | fields a
Correlated scalar subquery in Search filter
source = outer a = [ source = inner | where b = d | stats max(c) ]
source = outer [ source = inner | where outer.b = inner.d OR inner.d = 1 | stats count() ] > 0 | fields a
Nested scalar subquery
source = outer | where a = [ source = inner | stats max(c) | sort c ] OR b = [ source = inner | where c = 1 | stats min(d) | sort d ]
source = outer | where a = [ source = inner | where c = [ source = nested | stats max(e) by f | sort f ] | stats max(d) by c | sort c | head 1 ]
SQL Migration examples with Scalar-Subquery PPL: Example 1
SELECT *
FROM outer
WHERE a = (SELECT max(c)
FROM inner1
WHERE c = (SELECT max(e)
FROM inner2
GROUP BY f
ORDER BY f
)
GROUP BY c
ORDER BY c
LIMIT 1)
Rewritten by PPL ScalarSubquery query:
source = spark_catalog.default.outer
| where a = [
source = spark_catalog.default.inner1
| where c = [
source = spark_catalog.default.inner2
| stats max(e) by f
| sort f
]
| stats max(d) by c
| sort c
| head 1
]
Example 2
SELECT * FROM outer
WHERE a = (SELECT max(c)
FROM inner
ORDER BY c)
OR b = (SELECT min(d)
FROM inner
WHERE c = 1
ORDER BY d)
Rewritten by PPL ScalarSubquery query:
source = spark_catalog.default.outer
| where a = [
source = spark_catalog.default.inner | stats max(c) | sort c
] OR b = [
source = spark_catalog.default.inner | where c = 1 | stats min(d) | sort d
]
InSubquery
, ExistsSubquery
and ScalarSubquery
are all subquery expressions. But RelationSubquery
is not a subquery expression, it is a subquery plan which is common used in Join or From clause.
source = table1 | join left = l right = r [ source = table2 | where d > 10 | head 5 ]
(subquery in join right side)source = [ source = table1 | join left = l right = r [ source = table2 | where d > 10 | head 5 ] | stats count(a) by b ] as outer | head 1
SQL Migration examples with Subquery PPL:
tpch q13
select
c_count,
count(*) as custdist
from
(
select
c_custkey,
count(o_orderkey) as c_count
from
customer left outer join orders on
c_custkey = o_custkey
and o_comment not like '%special%requests%'
group by
c_custkey
) as c_orders
group by
c_count
order by
custdist desc,
c_count desc
Rewritten by PPL (Relation) Subquery:
SEARCH source = [
SEARCH source = customer
| LEFT OUTER JOIN left = c right = o ON c_custkey = o_custkey
[
SEARCH source = orders
| WHERE not like(o_comment, '%special%requests%')
]
| STATS COUNT(o_orderkey) AS c_count BY c_custkey
] AS c_orders
| STATS COUNT(o_orderkey) AS c_count BY c_custkey
| STATS COUNT(1) AS custdist BY c_count
| SORT - custdist, - c_count
InSubquery
, ExistsSubquery
and ScalarSubquery
as subquery expressions, their common usage is in where
clause and search filter
.
Where command:
| where <boolean expression> | ...
Search filter:
search source=* <boolean expression> | ...
A subquery expression could be used in boolean expression, for example
| where orders.order_id in [ source=returns | where return_reason="damaged" | field order_id ]
The orders.order_id in [ source=... ]
is a <boolean expression>
.
In general, we name this kind of subquery clause the InSubquery
expression, it is a <boolean expression>
.
Subquery with Different Join Types
In issue description is a ScalarSubquery
:
source=employees
| join source=sales on employees.employee_id = sales.employee_id
| where sales.sale_amount > [ source=targets | where target_met="true" | fields target_value ]
But RelationSubquery
is not a subquery expression, it is a subquery plan.
Recall the join command doc, the example is a subquery/subsearch plan, rather than a expression.
SEARCH source=customer
| FIELDS c_custkey
| LEFT OUTER JOIN left = c, right = o ON c.c_custkey = o.o_custkey
[
SEARCH source=orders
| WHERE o_comment NOT LIKE '%unusual%packages%'
| FIELDS o_orderkey, o_custkey
]
| STATS ...
simply into
SEARCH <leftPlan>
| LEFT OUTER JOIN ON <condition>
[
<rightPlan>
]
| STATS ...
Apply the syntax here and simply into
search <leftPlan> | left join on <condition> [ search ... ]
The [ search ...]
is not a expression
, it's plan
, similar to the relation
plan
Uncorrelated Subquery
An uncorrelated subquery is independent of the outer query. It is executed once, and the result is used by the outer query.
It's less common when using ExistsSubquery
because ExistsSubquery
typically checks for the presence of rows that are dependent on the outer query’s row.
There is a very special exists subquery which highlight by (special uncorrelated exists)
:
SELECT 'nonEmpty'
FROM outer
WHERE EXISTS (
SELECT *
FROM inner
);
Rewritten by PPL ExistsSubquery query:
source = outer
| where exists [
source = inner
]
| eval l = "nonEmpty"
| fields l
This query just print "nonEmpty" if the inner table is not empty.
Table alias in subquery
Table alias is useful in query which contains a subquery, for example
select a, (
select sum(b)
from catalog.schema.table1 as t1
where t1.a = t2.a
) sum_b
from catalog.schema.table2 as t2
t1
and t2
are table aliases which are used in correlated subquery, sum_b
are subquery alias.