Enable regular expressions by default #5591

Merged 4 commits on May 31, 2022
19 changes: 13 additions & 6 deletions docs/compatibility.md
@@ -565,9 +565,6 @@ The boolean, byte, short, int, long, float, double, string are supported in curr

## Regular Expressions

Regular expression evaluation on the GPU can potentially have high memory overhead and cause out-of-memory errors so
this is disabled by default. To enable regular expressions on the GPU, set `spark.rapids.sql.regexp.enabled=true`.

The following Apache Spark regular expression functions and expressions are supported on the GPU:

- `RLIKE`
@@ -578,10 +575,20 @@ The following Apache Spark regular expression functions and expressions are supp
- `string_split`
- `str_to_map`

There are instances where regular expression operations will fall back to CPU when the RAPIDS Accelerator determines
that a pattern is either unsupported or would produce incorrect results on the GPU.
Regular expression evaluation on the GPU is enabled by default. Execution will fall back to the CPU for
regular expressions that are not yet supported on the GPU. However, there are some edge cases that will
still execute on the GPU and produce different results to the CPU. To disable regular expressions on the GPU,
set `spark.rapids.sql.regexp.enabled=false`.
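
As a minimal sketch of the setting described above, the flag could be passed when the session is built. This assumes Spark is available as a dependency, the RAPIDS Accelerator jar is on the classpath, and the job is launched through `spark-submit` so that a master is already configured. The object name, application name, and the `spark.plugins` line are illustrative and not part of this change.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical driver program; only the regexp.enabled key comes from this PR.
object RegexpOnCpuExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("regexp-on-cpu") // hypothetical application name
      // Assumption: the RAPIDS Accelerator is loaded through the plugin mechanism.
      .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
      // Send all regular expressions to the CPU, as the paragraph above describes.
      .config("spark.rapids.sql.regexp.enabled", "false")
      .getOrCreate()

    // Read the value back to confirm what the session was started with.
    println(spark.conf.get("spark.rapids.sql.regexp.enabled"))

    spark.stop()
  }
}
```

The same key can also be supplied with `--conf` on the command line, which avoids hard-coding it in the application.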

These are the known edge cases where running on the GPU will produce different results to the CPU:

- Using regular expressions with Unicode data can produce incorrect results if the system `LANG` is not set
to `en_US.UTF-8` ([#5549](https://github.com/NVIDIA/spark-rapids/issues/5549))
- Regular expressions that contain an end of line anchor `$` or end of string anchor `\Z` or `\z` immediately
next to a newline or a repetition that produces zero or more results can produce different results
([#5610](https://github.com/NVIDIA/spark-rapids/pull/5610))
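
For reference when checking the anchor edge case above, the sketch below shows how `java.util.regex` treats the three anchors when the input ends in a newline. It assumes the CPU path follows `java.util.regex` semantics and says nothing about what the GPU returns. `AnchorDemo` and `found` are hypothetical names used only for illustration.

```scala
import java.util.regex.Pattern

object AnchorDemo {
  // Returns true if the regex matches anywhere in the input.
  def found(regex: String, input: String): Boolean =
    Pattern.compile(regex).matcher(input).find()

  def main(args: Array[String]): Unit = {
    val input = "a\n" // value followed by a single trailing newline

    // `$` and `\Z` match before a final line terminator; `\z` matches only at the very end.
    println(found("a$", input))   // true
    println(found("a\\Z", input)) // true
    println(found("a\\z", input)) // false
  }
}
```

Comparing these results with the output of the same pattern run through a GPU-enabled query is one way to tell whether a given workload hits the documented difference.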

Here are some examples of regular expression patterns that are not supported on the GPU and will fall back to the CPU.
The following regular expression patterns are not yet supported on the GPU and will fall back to the CPU.

- Line anchor `^` is not supported in some contexts, such as when combined with a choice (`^|a`).
- Line anchor `$` is not supported by `regexp_replace`, and in some rare contexts.
2 changes: 1 addition & 1 deletion docs/configs.md
@@ -120,7 +120,7 @@ Name | Description | Default Value
<a name="sql.python.gpu.enabled"></a>spark.rapids.sql.python.gpu.enabled|This is an experimental feature and is likely to change in the future. Enable (true) or disable (false) support for scheduling Python Pandas UDFs with GPU resources. When enabled, pandas UDFs are assumed to share the same GPU that the RAPIDs accelerator uses and will honor the python GPU configs|false
<a name="sql.reader.batchSizeBytes"></a>spark.rapids.sql.reader.batchSizeBytes|Soft limit on the maximum number of bytes the reader reads per batch. The readers will read chunks of data until this limit is met or exceeded. Note that the reader may estimate the number of bytes that will be used on the GPU in some cases based on the schema and number of rows in each batch.|2147483647
<a name="sql.reader.batchSizeRows"></a>spark.rapids.sql.reader.batchSizeRows|Soft limit on the maximum number of rows the reader will read per batch. The orc and parquet readers will read row groups until this limit is met or exceeded. The limit is respected by the csv reader.|2147483647
<a name="sql.regexp.enabled"></a>spark.rapids.sql.regexp.enabled|Specifies whether regular expressions should be evaluated on GPU. Complex expressions can cause out of memory issues so this is disabled by default. Setting this config to true will make supported regular expressions run on the GPU. See the compatibility guide for more information about which regular expressions are supported on the GPU.|false
<a name="sql.regexp.enabled"></a>spark.rapids.sql.regexp.enabled|Specifies whether supported regular expressions will be evaluated on the GPU. Unsupported expressions will fall back to CPU. However, there are some known edge cases that will still execute on GPU and produce incorrect results and these are documented in the compatibility guide. Setting this config to false will make all regular expressions run on the CPU instead.|true
<a name="sql.replaceSortMergeJoin.enabled"></a>spark.rapids.sql.replaceSortMergeJoin.enabled|Allow replacing sortMergeJoin with HashJoin|true
<a name="sql.rowBasedUDF.enabled"></a>spark.rapids.sql.rowBasedUDF.enabled|When set to true, optimizes a row-based UDF in a GPU operation by transferring only the data it needs between GPU and CPU inside a query operation, instead of falling this operation back to CPU. This is an experimental feature, and this config might be removed in the future.|false
<a name="sql.shuffle.spillThreads"></a>spark.rapids.sql.shuffle.spillThreads|Number of threads used to spill shuffle data to disk in the background.|6
@@ -1037,12 +1037,13 @@ object RapidsConf {
.createWithDefault(true)

val ENABLE_REGEXP = conf("spark.rapids.sql.regexp.enabled")
.doc("Specifies whether regular expressions should be evaluated on GPU. Complex expressions " +
"can cause out of memory issues so this is disabled by default. Setting this config to " +
"true will make supported regular expressions run on the GPU. See the compatibility " +
"guide for more information about which regular expressions are supported on the GPU.")
.doc("Specifies whether supported regular expressions will be evaluated on the GPU. " +
"Unsupported expressions will fall back to CPU. However, there are some known edge cases " +
"that will still execute on GPU and produce incorrect results and these are documented in " +
"the compatibility guide. Setting this config to false will make all regular expressions " +
"run on the CPU instead.")
.booleanConf
.createWithDefault(false)
.createWithDefault(true)

// INTERNAL TEST AND DEBUG CONFIGS
