
[SPARK-41271][SQL] Support parameterized SQL queries by sql() #38864

Closed · 43 commits

Conversation

@MaxGekk (Member) commented Dec 1, 2022

What changes were proposed in this pull request?

In the PR, I propose to extend the SparkSession API and overload the sql method with:

  def sql(sqlText: String, args: Map[String, String]): DataFrame

which accepts a map where:

  • keys are parameter names,
  • values are SQL literal values.

The first argument, sqlText, may contain named parameters in the positions of constants such as literal values.

For example:

  spark.sql(
    sqlText = "SELECT * FROM tbl WHERE date > :startDate LIMIT :maxRows",
    args = Map(
      "startDate" -> "DATE'2022-12-01'",
      "maxRows" -> "100"))

The new sql() method parses the input SQL statement and the provided parameter values, and replaces the named parameters with the literal values. It then eagerly runs DDL/DML commands, but not SELECT queries.
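
To illustrate the eager-vs-lazy behavior described above, here is a minimal sketch (not from the PR itself; the table `tbl` matches the earlier example):

```scala
// A DDL command runs eagerly: it executes as soon as sql() returns.
spark.sql(
  sqlText = "CREATE TABLE IF NOT EXISTS tbl(date DATE) USING parquet",
  args = Map.empty[String, String])

// A SELECT stays lazy: the parameters are substituted and the query
// is planned here...
val df = spark.sql(
  sqlText = "SELECT * FROM tbl WHERE date > :startDate",
  args = Map("startDate" -> "DATE'2022-12-01'"))
df.show() // ...and it executes only when an action such as show() is called.
```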

Closes #38712

Why are the changes needed?

  1. To improve the user experience with Spark SQL by:

    • Using Spark as a remote service (microservice).
    • Writing SQL code that powers reports, dashboards, charts and other data presentation solutions that need to account for criteria modifiable by users through an interface.
    • Building a generic integration layer based on the SQL API. The goal is to expose managed data to a wide application ecosystem with a microservice architecture. It is only natural in such a setup to ask for modular and reusable SQL code that can be executed repeatedly with different parameter values.
  2. To achieve feature parity with other systems that support named parameters:

    • Redshift: https://docs.aws.amazon.com/redshift/latest/mgmt/data-api.html#data-api-calling
    • BigQuery: https://cloud.google.com/bigquery/docs/parameterized-queries#api
    • MS DBSQL: https://learn.microsoft.com/en-us/azure/databricks/sql/user/queries/query-parameters

Does this PR introduce any user-facing change?

No, this is an extension of the existing APIs.

How was this patch tested?

By running new tests:

$ build/sbt "core/testOnly *SparkThrowableSuite"
$ build/sbt "test:testOnly *PlanParserSuite"
$ build/sbt "test:testOnly *AnalysisSuite"
$ build/sbt "test:testOnly *ParametersSuite"

@MaxGekk MaxGekk requested a review from cloud-fan December 2, 2022 14:43
@MaxGekk MaxGekk changed the title [WIP][SPARK-41271][SQL] Support parameterized SQL queries by sql() [SPARK-41271][SQL] Support parameterized SQL queries by sql() Dec 2, 2022
@MaxGekk MaxGekk marked this pull request as ready for review December 2, 2022 14:52
@MaxGekk (Member, Author) commented Dec 2, 2022

@cloud-fan @entong Could you take a look at this PR, please?

@xkrogen (Contributor) commented Dec 2, 2022

What is the relationship between this PR and #38712? Why do we have two PRs?

If this PR is superseding #38712, can we continue the discussion here on which identifier to use, based on my last comment on the old PR?

@MaxGekk MaxGekk requested review from cloud-fan and entong and removed request for entong December 12, 2022 19:14
@MaxGekk (Member, Author) commented Dec 13, 2022

@cloud-fan @entong Could you review this PR one more time, please?

@MaxGekk (Member, Author) commented Dec 15, 2022

Merging to master. The last commit is a minor one.
Thank you, @cloud-fan @xkrogen @entong @srielau for review.

@MaxGekk MaxGekk closed this in 35fa5e6 Dec 15, 2022
@@ -795,6 +795,11 @@
}
}
},
"INVALID_SQL_ARG" : {
"message" : [
"The argument <name> of `sql()` is invalid. Consider to replace it by a SQL literal statement."
Contributor:
Can we be more explicit about why it is invalid? For example: "The argument of sql() is not a literal." Also, what is a "SQL Literal statement"?

Member Author:
What is a "SQL Literal statement"?

Any SQL statement that can produce a literal, see https://spark.apache.org/docs/latest/sql-ref-literals.html

Contributor:
That's an expression then. Statements are top-level (like SELECT, UPDATE, CREATE, SET).

Contributor:
How about just saying "SQL literal"?

Member Author:
I will change this in #39183.
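
For context, a hedged sketch of an input that would trigger this error under the string-based API (assumed behavior, not a test from the PR):

```scala
// The value of an argument must parse as a SQL literal; a full
// statement such as "SELECT 1" is not a literal, so it is rejected
// with the error class INVALID_SQL_ARG.
spark.sql("SELECT :arg", args = Map("arg" -> "SELECT 1"))
```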

@@ -1130,6 +1135,11 @@
"Unable to convert SQL type <toType> to Protobuf type <protobufType>."
]
},
"UNBOUND_SQL_PARAMETER" : {
"message" : [
"Found the unbound parameter: <name>. Please, fix `args` and provide a mapping of the parameter to a SQL literal statement."
Contributor:
Won't this same error be used for all other APIs (JDBC, SQL) when we support them? So we may not want to refer to `args`.

Member Author:
> So we may not want to refer to `args`.

This is a premature generalisation. Let's make it more generic when we need that.
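
For context, a hedged sketch of an input that would trigger this error (assumed behavior, not a test from the PR):

```scala
// The query references :limit, but `args` provides no value for it,
// so analysis fails with the error class UNBOUND_SQL_PARAMETER.
spark.sql("SELECT :limit", args = Map.empty[String, String])
```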

beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022

Closes apache#38864 from MaxGekk/parameterized-sql-2.

Lead-authored-by: Max Gekk <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
MaxGekk added a commit that referenced this pull request Dec 23, 2022
### What changes were proposed in this pull request?
In the PR, I propose to extend the `sql()` method in PySpark to support parameterized SQL queries (see #38864) and to add a new parameter, `args`, of type `Dict[str, str]`. This parameter maps named parameters that can occur in the input SQL query to SQL literals like 1, INTERVAL '1-1' YEAR TO MONTH, DATE'2022-12-22' (see [the doc](https://spark.apache.org/docs/latest/sql-ref-literals.html) of supported literals).

For example:
```python
    >>> spark.sql("SELECT * FROM range(10) WHERE id > :minId", args = {"minId" : "7"})
       id
    0   8
    1   9
```

Closes #39159

### Why are the changes needed?
To achieve feature parity with the Scala/Java API and provide PySpark users with the same feature.

### Does this PR introduce _any_ user-facing change?
No, it shouldn't.

### How was this patch tested?
Checked the examples locally, and ran the tests:
```
$ python/run-tests --modules=pyspark-sql --parallelism=1
```

Closes #39183 from MaxGekk/parameterized-sql-pyspark-dict.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
cloud-fan pushed a commit that referenced this pull request Apr 4, 2023
### What changes were proposed in this pull request?
In the PR, I propose to change the API of parameterized SQL and replace the type of argument values from `string` to `Any` in Scala/Java/Python, and to `Expression.Literal` in the protobuf API. The language APIs can accept `Any` objects from which it is possible to construct literal expressions.

#### Scala/Java:

```scala
  def sql(sqlText: String, args: Map[String, Any]): DataFrame
```
Values of the `args` map are wrapped by the `lit()` function, which leaves a `Column` as is and creates a literal from other Java/Scala objects (for more details, see the `Scala` tab at https://spark.apache.org/docs/latest/sql-ref-datatypes.html).
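
A minimal sketch of what this enables (illustrative, assuming a `spark` session in scope):

```scala
import org.apache.spark.sql.functions.lit

// lit() creates a literal from the plain Int and passes the Column
// through unchanged, so both kinds of values can be mixed in `args`.
spark.sql("SELECT :a + :b AS s", args = Map("a" -> 1, "b" -> lit(2))).show()
```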

#### Python:

```python
def sql(self, sqlQuery: str, args: Optional[Dict[str, Any]] = None, **kwargs: Any) -> DataFrame:
```
Similarly to the Scala/Java `sql`, Python's `sql()` accepts Python objects as values of the `args` dictionary (see more details about acceptable Python objects at https://spark.apache.org/docs/latest/sql-ref-datatypes.html). `sql()` converts dictionary values to `Column` literal expressions via `lit()`.

#### Protobuf:

```proto
message SqlCommand {
  // (Required) SQL Query.
  string sql = 1;

  // (Optional) A map of parameter names to literal expressions.
  map<string, Expression.Literal> args = 2;
}
```

For example:
```scala
scala> val sqlText = """SELECT s FROM VALUES ('Jeff /*__*/ Green'), ('E\'Twaun Moore') AS t(s) WHERE s = :player_name"""
sqlText: String = SELECT s FROM VALUES ('Jeff /*__*/ Green'), ('E\'Twaun Moore') AS t(s) WHERE s = :player_name

scala> sql(sqlText, args = Map("player_name" -> lit("E'Twaun Moore"))).show(false)
+-------------+
|s            |
+-------------+
|E'Twaun Moore|
+-------------+
```

### Why are the changes needed?
The current implementation of the parameterized `sql()` requires arguments as string values that are parsed to SQL literal expressions, which causes the following issues:
1. SQL comments are skipped while parsing, so some fragments of the input might be lost. For example, in `'Europe -- Amsterdam'`, the `-- Amsterdam` part is excluded from the input (see the sketch after this list).
2. Special chars in string values must be escaped, for instance `'E\'Twaun Moore'`.
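
A sketch of how the new API avoids issue 1 (expected output shown, illustrative only):

```scala
// The string value becomes a literal directly instead of being
// re-parsed as SQL text, so "-- Amsterdam" is not dropped as a comment.
spark.sql("SELECT :place AS place", args = Map("place" -> "Europe -- Amsterdam")).show(false)
// +-------------------+
// |place              |
// +-------------------+
// |Europe -- Amsterdam|
// +-------------------+
```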

### Does this PR introduce _any_ user-facing change?
No, since the parameterized SQL feature (#38864) hasn't been released yet.

### How was this patch tested?
By running the affected tests:
```
$ build/sbt "test:testOnly *ParametersSuite"
$ python/run-tests --parallelism=1 --testnames 'pyspark.sql.tests.connect.test_connect_basic SparkConnectBasicTests.test_sql_with_args'
$ python/run-tests --parallelism=1 --testnames 'pyspark.sql.session SparkSession.sql'
```

Closes #40623 from MaxGekk/parameterized-sql-any.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
MaxGekk added a commit to MaxGekk/spark that referenced this pull request Apr 4, 2023

Closes apache#40623 from MaxGekk/parameterized-sql-any.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 156a12e)
Signed-off-by: Max Gekk <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Apr 5, 2023
This is a backport of #40623.

Authored-by: Max Gekk <max.gekkgmail.com>
(cherry picked from commit 156a12e)

Closes #40666 from MaxGekk/parameterized-sql-any-3.4-2.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
This is a backport of apache#40623.

Authored-by: Max Gekk <max.gekkgmail.com>
(cherry picked from commit 156a12e)

Closes apache#40666 from MaxGekk/parameterized-sql-any-3.4-2.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
MaxGekk pushed a commit that referenced this pull request Nov 26, 2024
[SPARK-50422][SQL] Make `Parameterized SQL queries` of `SparkSession.sql` API GA

### What changes were proposed in this pull request?

This PR aims to make `Parameterized SQL queries` of `SparkSession.sql` API GA in Apache Spark 4.0.0.

### Why are the changes needed?

Apache Spark has supported `Parameterized SQL queries` because they are very convenient for users:
- #38864 (Since Spark 3.4.0)
- #41568 (Since Spark 3.5.0)

It's time to make this feature GA by removing the `Experimental` tags, since it has been serving users well for a long time.

### Does this PR introduce _any_ user-facing change?

No, there is no behavior change.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #48965 from dongjoon-hyun/SPARK-50422.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Max Gekk <[email protected]>