[auto-merge] branch-0.5 to branch-0.6 [skip ci] [bot] #2162

Merged 1 commit on Apr 16, 2021.
57 changes: 48 additions & 9 deletions docs/compatibility.md
@@ -103,16 +103,41 @@ will produce a different result compared to the plugin.

## CSV Reading

Due to inconsistencies in how CSV data is parsed, CSV parsing is off by default. The CSV reader
as a whole can be disabled by setting the config
[`spark.rapids.sql.format.csv.read.enabled`](configs.md#sql.format.csv.read.enabled) to `false`.
Each data type can be enabled or disabled independently using the following configs.

* [spark.rapids.sql.csv.read.bool.enabled](configs.md#sql.csv.read.bool.enabled)
* [spark.rapids.sql.csv.read.byte.enabled](configs.md#sql.csv.read.byte.enabled)
* [spark.rapids.sql.csv.read.date.enabled](configs.md#sql.csv.read.date.enabled)
* [spark.rapids.sql.csv.read.double.enabled](configs.md#sql.csv.read.double.enabled)
* [spark.rapids.sql.csv.read.float.enabled](configs.md#sql.csv.read.float.enabled)
* [spark.rapids.sql.csv.read.integer.enabled](configs.md#sql.csv.read.integer.enabled)
* [spark.rapids.sql.csv.read.long.enabled](configs.md#sql.csv.read.long.enabled)
* [spark.rapids.sql.csv.read.short.enabled](configs.md#sql.csv.read.short.enabled)
* [spark.rapids.sql.csvTimestamps.enabled](configs.md#sql.csvTimestamps.enabled)

If you know that your particular data will be parsed correctly enough, you may enable each
type you expect to use. Often the performance improvement is large enough that it is worth
checking whether your data is parsed correctly.
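
As a minimal sketch of enabling a subset of these configs, assuming a Spark 3.x application
with the RAPIDS Accelerator jars on the classpath (the application name and the chosen
subset of types are illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Enable the RAPIDS plugin, then turn on only the CSV types that have been
// verified to parse correctly for this data set.
val spark = SparkSession.builder()
  .appName("csv-on-gpu") // illustrative name
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.rapids.sql.csv.read.integer.enabled", "true")
  .config("spark.rapids.sql.csv.read.double.enabled", "true")
  .getOrCreate()
```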

Spark is generally very strict when reading CSV: if the data does not conform to the
expected format exactly, the result is a `null` value. The underlying parser that the RAPIDS
Accelerator uses is much more lenient. If you have badly formatted CSV data you may get data
back instead of nulls.

Spark allows stripping leading and trailing whitespace using various options that are off by
default. The plugin will strip leading and trailing whitespace for all values except strings.
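
For reference, a hedged sketch of the standard Spark CSV reader options involved, reusing the
`spark` session from the sketch above (the schema and file path are hypothetical):

```scala
// ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace are standard Spark CSV
// reader options; both default to false when reading.
val df = spark.read
  .option("ignoreLeadingWhiteSpace", "true")
  .option("ignoreTrailingWhiteSpace", "true")
  .schema("name STRING, amount INT") // hypothetical schema
  .csv("/data/example.csv")          // hypothetical path
```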

There are also discrepancies/issues with specific types that are detailed below.

### CSV Boolean

Invalid values like `BAD` show up as `true`, as described in this
[issue](https://github.com/NVIDIA/spark-rapids/issues/2071).

The same leniency applies to all other types, but because this is the only known issue with
boolean parsing we call it out specifically here.

### CSV Strings
Writing strings to a CSV file in general for Spark can be problematic unless you can ensure that
your data does not have any line delimiters in it. The GPU accelerated CSV parser handles quoted
@@ -140,7 +165,12 @@ Only a limited set of formats are supported when parsing dates.
The reality is that all of these formats are supported at the same time. The plugin will only
disable itself if you set a format that it does not support.

As a workaround you can parse the column as a timestamp and then cast it to a date.
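
A minimal sketch of that workaround, reusing the `spark` session from above (the schema and
file path are hypothetical):

```scala
import org.apache.spark.sql.functions.col

// Read the column as a timestamp, then cast it down to a date.
val events = spark.read
  .schema("id INT, event_time TIMESTAMP") // hypothetical schema
  .csv("/data/events.csv")                // hypothetical path
  .withColumn("event_date", col("event_time").cast("date"))
```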

Invalid dates in Spark, meaning values that have the correct format but whose numbers do not
make a real date, result in an exception by default, and how they are parsed can be controlled
through a config. The RAPIDS Accelerator does not support any of this and will produce an
incorrect date, typically one that overflowed.
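
As a concrete illustration on hypothetical data: `2021-04-31` has the expected `yyyy-MM-dd`
shape but April has only 30 days.

```scala
import java.nio.file.{Files, Paths}

// One valid date and one value with the right shape but no real calendar day.
Files.write(Paths.get("/tmp/dates.csv"), "2021-04-30\n2021-04-31\n".getBytes)

val dates = spark.read.schema("d DATE").csv("/tmp/dates.csv")
// On the CPU, Spark throws or returns null depending on configuration; on the
// GPU the invalid row can come back as an incorrect date instead.
dates.show()
```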

### CSV Timestamps
The CSV parser does not support time zones. It will ignore any trailing time zone information,
@@ -163,9 +193,14 @@ portion followed by one of the following formats:
Just like with dates, all timestamp formats are actually supported at the same time. The plugin
will disable itself if it sees a format it cannot support.

Invalid timestamps in Spark, meaning values that have the correct format but whose numbers do
not make a valid date or time, result in an exception by default, and how they are parsed can
be controlled through a config. The RAPIDS Accelerator does not support any of this and will
produce an incorrect timestamp, typically one that overflowed.
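
A hedged sketch of reading CSV timestamps on the GPU, reusing the `spark` session from above;
note that `spark.rapids.sql.csvTimestamps.enabled` from the list earlier must be on (the data
and path are hypothetical):

```scala
import java.nio.file.{Files, Paths}

spark.conf.set("spark.rapids.sql.csvTimestamps.enabled", "true")
Files.write(Paths.get("/tmp/ts.csv"), "2021-04-16T12:34:56\n".getBytes)

// Any trailing time zone information in the data would be ignored by the GPU parser.
val ts = spark.read.schema("ts TIMESTAMP").csv("/tmp/ts.csv")
ts.show(truncate = false)
```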

### CSV Floating Point

The CSV parser is not able to parse `NaN` values. These are
likely to be turned into null values, as described in this
[issue](https://github.com/NVIDIA/spark-rapids/issues/125).

@@ -174,6 +209,10 @@ Some floating-point values also appear to overflow but do not for the CPU as described

Any number that overflows will not be turned into a null value.

Also, parsing of some values will not produce bit-for-bit identical results to what the CPU does.
They are within round-off error, except when they are close enough to overflow to Inf or -Inf, in
which case a number is returned when the CPU would have returned null.
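
To see this behavior on hypothetical data, again reusing the `spark` session from above:

```scala
import java.nio.file.{Files, Paths}

// 3.5e38 overflows a 32-bit float; per the notes above it is not turned into null.
Files.write(Paths.get("/tmp/floats.csv"), "1.25\nNaN\n3.5e38\n".getBytes)

val floats = spark.read.schema("x FLOAT").csv("/tmp/floats.csv")
// NaN is likely to come back as null (issue #125), and values near the overflow
// boundary may differ from the CPU result within round-off error.
floats.show()
```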

### CSV Integer

Any number that overflows will not be turned into a null value.
8 changes: 8 additions & 0 deletions docs/configs.md
@@ -60,6 +60,14 @@ Name | Description | Default Value
<a name="sql.castStringToInteger.enabled"></a>spark.rapids.sql.castStringToInteger.enabled|When set to true, enables casting from strings to integer types (byte, short, int, long) on the GPU. Casting from string to integer types on the GPU returns incorrect results when the string represents a number larger than Long.MaxValue or smaller than Long.MinValue.|false
<a name="sql.castStringToTimestamp.enabled"></a>spark.rapids.sql.castStringToTimestamp.enabled|When set to true, casting from string to timestamp is supported on the GPU. The GPU only supports a subset of formats when casting strings to timestamps. Refer to the CAST documentation for more details.|false
<a name="sql.concurrentGpuTasks"></a>spark.rapids.sql.concurrentGpuTasks|Set the number of tasks that can execute concurrently per GPU. Tasks may temporarily block when the number of concurrent tasks in the executor exceeds this amount. Allowing too many concurrent tasks on the same GPU may lead to GPU out of memory errors.|1
<a name="sql.csv.read.bool.enabled"></a>spark.rapids.sql.csv.read.bool.enabled|Parsing an invalid CSV boolean value produces true instead of null|false
<a name="sql.csv.read.byte.enabled"></a>spark.rapids.sql.csv.read.byte.enabled|Parsing CSV bytes is much more lenient and will return 0 for some malformed values instead of null|false
<a name="sql.csv.read.date.enabled"></a>spark.rapids.sql.csv.read.date.enabled|Parsing invalid CSV dates produces different results from Spark|false
<a name="sql.csv.read.double.enabled"></a>spark.rapids.sql.csv.read.double.enabled|Parsing CSV double has some issues at the min and max values for floatingpoint numbers and can be more lenient on parsing inf and -inf values|false
<a name="sql.csv.read.float.enabled"></a>spark.rapids.sql.csv.read.float.enabled|Parsing CSV floats has some issues at the min and max values for floatingpoint numbers and can be more lenient on parsing inf and -inf values|false
<a name="sql.csv.read.integer.enabled"></a>spark.rapids.sql.csv.read.integer.enabled|Parsing CSV integers is much more lenient and will return 0 for some malformed values instead of null|false
<a name="sql.csv.read.long.enabled"></a>spark.rapids.sql.csv.read.long.enabled|Parsing CSV longs is much more lenient and will return 0 for some malformed values instead of null|false
<a name="sql.csv.read.short.enabled"></a>spark.rapids.sql.csv.read.short.enabled|Parsing CSV shorts is much more lenient and will return 0 for some malformed values instead of null|false
<a name="sql.csvTimestamps.enabled"></a>spark.rapids.sql.csvTimestamps.enabled|When set to true, enables the CSV parser to read timestamps. The default output format for Spark includes a timezone at the end. Anything except the UTC timezone is not supported. Timestamps after 2038 and before 1902 are also not supported.|false
<a name="sql.decimalType.enabled"></a>spark.rapids.sql.decimalType.enabled|Enable decimal type support on the GPU. Decimal support on the GPU is limited to less than 18 digits. This can result in a lot of data movement to and from the GPU, which can slow down processing in some cases.|false
<a name="sql.enabled"></a>spark.rapids.sql.enabled|Enable (true) or disable (false) sql operations on the GPU|true