diff --git a/docs/source/user-guide/sql/subqueries.md b/docs/source/user-guide/sql/subqueries.md index 6055b0fc5b78..ee75a6a1575c 100644 --- a/docs/source/user-guide/sql/subqueries.md +++ b/docs/source/user-guide/sql/subqueries.md @@ -19,80 +19,391 @@ # Subqueries -DataFusion supports `EXISTS`, `NOT EXISTS`, `IN`, `NOT IN` and Scalar Subqueries. +Subqueries (also known as inner queries or nested queries) are queries within +a query. +Subqueries can be used in `SELECT`, `FROM`, `WHERE`, and `HAVING` clauses. -The examples below are based on the following table. +The examples below are based on the following tables. ```sql -select * from x; +SELECT * FROM x; + +----------+----------+ | column_1 | column_2 | +----------+----------+ | 1 | 2 | +----------+----------+ +| 2 | 4 | ++----------+----------+ +``` + +```sql +SELECT * FROM y; + ++--------+--------+ +| number | string | ++--------+--------+ +| 1 | one | ++--------+--------+ +| 2 | two | ++--------+--------+ +| 3 | three | ++--------+--------+ +| 4 | four | ++--------+--------+ ``` -## EXISTS +## Subquery operators -The `EXISTS` syntax can be used to find all rows in a relation where a correlated subquery produces one or more matches -for that row. Only correlated subqueries are supported. +- [[ NOT ] EXISTS](#-not--exists) +- [[ NOT ] IN](#-not--in) + +### [ NOT ] EXISTS + +The `EXISTS` operator returns all rows where a +_[correlated subquery](#correlated-subqueries)_ produces one or more matches for +that row. `NOT EXISTS` returns all rows where a _correlated subquery_ produces +zero matches for that row. Only _correlated subqueries_ are supported. ```sql -select * from x y where exists (select * from x where x.column_1 = y.column_1); +[NOT] EXISTS (subquery) +``` + +### [ NOT ] IN + +The `IN` operator returns all rows where a given expression’s value can be found +in the results of a _[correlated subquery](#correlated-subqueries)_. +`NOT IN` returns all rows where a given expression’s value cannot be found in +the results of a subquery or list of values. + +```sql +expression [NOT] IN (subquery|list-literal) +``` + +#### Examples + +```sql +SELECT * FROM x WHERE column_1 IN (1,3); + +----------+----------+ | column_1 | column_2 | +----------+----------+ | 1 | 2 | +----------+----------+ -1 row in set. ``` -## NOT EXISTS +```sql +SELECT * FROM x WHERE column_1 NOT IN (1,3); + ++----------+----------+ +| column_1 | column_2 | ++----------+----------+ +| 2 | 4 | ++----------+----------+ +``` + +## SELECT clause subqueries + +`SELECT` clause subqueries use values returned from the inner query as part +of the outer query's `SELECT` list. +The `SELECT` clause only supports [scalar subqueries](#scalar-subqueries) that +return a single value per execution of the inner query. +The returned value can be unique per row. + +```sql +SELECT [expression1[, expression2, ..., expressionN],] () +``` + +**Note**: `SELECT` clause subqueries can be used as an alternative to `JOIN` +operations. -The `NOT EXISTS` syntax can be used to find all rows in a relation where a correlated subquery produces zero matches -for that row. Only correlated subqueries are supported. +### Example ```sql -select * from x y where not exists (select * from x where x.column_1 = y.column_1); -0 rows in set. +SELECT + column_1, + ( + SELECT + first_value(string) + FROM + y + WHERE + number = x.column_1 + ) AS "numeric string" +FROM + x; + ++----------+----------------+ +| column_1 | numeric string | ++----------+----------------+ +| 1 | one | +| 2 | two | ++----------+----------------+ ``` -## IN +## FROM clause subqueries -The `IN` syntax can be used to find all rows in a relation where a given expression's value can be found in the -results of a correlated subquery. +`FROM` clause subqueries return a set of results that is then queried and +operated on by the outer query. ```sql -select * from x where column_1 in (select column_1 from x); +SELECT expression1[, expression2, ..., expressionN] FROM () +``` + +### Example + +The following query returns the average of maximum values per room. +The inner query returns the maximum value for each field from each room. +The outer query uses the results of the inner query and returns the average +maximum value for each field. + +```sql +SELECT + column_2 +FROM + ( + SELECT + * + FROM + x + WHERE + column_1 > 1 + ); + ++----------+ +| column_2 | ++----------+ +| 4 | ++----------+ +``` + +## WHERE clause subqueries + +`WHERE` clause subqueries compare an expression to the result of the subquery +and return _true_ or _false_. +Rows that evaluate to _false_ or NULL are filtered from results. +The `WHERE` clause supports correlated and non-correlated subqueries +as well as scalar and non-scalar subqueries (depending on the the operator used +in the predicate expression). + +```sql +SELECT + expression1[, expression2, ..., expressionN] +FROM + +WHERE + expression operator () +``` + +**Note:** `WHERE` clause subqueries can be used as an alternative to `JOIN` +operations. + +### Examples + +#### `WHERE` clause with scalar subquery + +The following query returns all rows with `column_2` values above the average +of all `number` values in `y`. + +```sql +SELECT + * +FROM + x +WHERE + column_2 > ( + SELECT + AVG(number) + FROM + y + ); + +----------+----------+ | column_1 | column_2 | +----------+----------+ -| 1 | 2 | +| 2 | 4 | +----------+----------+ -1 row in set. ``` -## NOT IN +#### `WHERE` clause with non-scalar subquery -The `NOT IN` syntax can be used to find all rows in a relation where a given expression's value can not be found in the -results of a correlated subquery. +Non-scalar subqueries must use the `[NOT] IN` or `[NOT] EXISTS` operators and +can only return a single column. +The values in the returned column are evaluated as a list. + +The following query returns all rows with `column_2` values in table `x` that +are in the list of numbers with string lengths greater than three from table +`y`. ```sql -select * from x where column_1 not in (select column_1 from x); -0 rows in set. +SELECT + * +FROM + x +WHERE + column_2 IN ( + SELECT + number + FROM + y + WHERE + length(string) > 3 + ); + ++----------+----------+ +| column_1 | column_2 | ++----------+----------+ +| 2 | 4 | ++----------+----------+ ``` -## Scalar Subquery +### `WHERE` clause with correlated subquery -A scalar subquery can be used to produce a single value that can be used in many different contexts in a query. Here -is an example of a filter using a scalar subquery. Only correlated subqueries are supported. +The following query returns rows with `column_2` values from table `x` greater +than the average `string` value length from table `y`. +The subquery in the `WHERE` clause uses the `column_1` value from the outer +query to return the average `string` value length for that specific value. ```sql -select * from x y where column_1 < (select sum(column_2) from x where x.column_1 = y.column_1); +SELECT + * +FROM + x +WHERE + column_2 > ( + SELECT + AVG(length(string)) + FROM + y + WHERE + number = x.column_1 + ); + +----------+----------+ | column_1 | column_2 | +----------+----------+ -| 1 | 2 | +| 2 | 4 | +----------+----------+ -1 row in set. ``` + +## HAVING clause subqueries + +`HAVING` clause subqueries compare an expression that uses aggregate values +returned by aggregate functions in the `SELECT` clause to the result of the +subquery and return _true_ or _false_. +Rows that evaluate to _false_ are filtered from results. +The `HAVING` clause supports correlated and non-correlated subqueries +as well as scalar and non-scalar subqueries (depending on the the operator used +in the predicate expression). + +```sql +SELECT + aggregate_expression1[, aggregate_expression2, ..., aggregate_expressionN] +FROM + +WHERE + +GROUP BY + column_expression1[, column_expression2, ..., column_expressionN] +HAVING + expression operator () +``` + +### Examples + +The following query calculates the averages of even and odd numbers in table `y` +and returns the averages that are equal to the maximum value of `column_1` +in table `x`. + +#### `HAVING` clause with a scalar subquery + +```sql +SELECT + AVG(number) AS avg, + (number % 2 = 0) AS even +FROM + y +GROUP BY + even +HAVING + avg = ( + SELECT + MAX(column_1) + FROM + x + ); + ++-------+--------+ +| avg | even | ++-------+--------+ +| 2 | false | ++-------+--------+ +``` + +#### `HAVING` clause with a non-scalar subquery + +Non-scalar subqueries must use the `[NOT] IN` or `[NOT] EXISTS` operators and +can only return a single column. +The values in the returned column are evaluated as a list. + +The following query calculates the averages of even and odd numbers in table `y` +and returns the averages that are in `column_1` of table `x`. + +```sql +SELECT + AVG(number) AS avg, + (number % 2 = 0) AS even +FROM + y +GROUP BY + even +HAVING + avg IN ( + SELECT + column_1 + FROM + x + ); + ++-------+--------+ +| avg | even | ++-------+--------+ +| 2 | false | ++-------+--------+ +``` + +## Subquery categories + +Subqueries can be categorized as one or more of the following based on the +behavior of the subquery: + +- [correlated](#correlated-subqueries) or + [non-correlated](#non-correlated-subqueries) +- [scalar](#scalar-subqueries) or [non-scalar](#non-scalar-subqueries) + +### Correlated subqueries + +In a **correlated** subquery, the inner query depends on the values of the +current row being processed. + +**Note:** DataFusion internally rewrites correlated subqueries into JOINs to +improve performance. In general correlated subqueries are **less performant** +than non-correlated subqueries. + +### Non-correlated subqueries + +In a **non-correlated** subquery, the inner query _doesn't_ depend on the outer +query and executes independently. +The inner query executes first, and then passes the results to the outer query. + +### Scalar subqueries + +A **scalar** subquery returns a single value (one column of one row). +If no rows are returned, the subquery returns NULL. + +### Non-scalar subqueries + +A **non-scalar** subquery returns 0, 1, or multiple rows, each of which may +contain 1 or multiple columns. For each column, if there is no value to return, +the subquery returns NULL. If no rows qualify to be returned, the subquery +returns 0 rows.