---
layout: page
title: Compatibility
nav_order: 5
---
The SQL plugin tries to produce results that are bit-for-bit identical with Apache Spark, but there are a number of cases where differences remain. In most cases operators that produce different results are off by default, and you can look at the configs for more information on how to enable them. In some cases we felt that enabling the incompatibility by default was worth the performance gain. All of those operators can be disabled through configs if it becomes a problem. Please also look at the current list of bugs, which are typically incompatibilities that we have not yet addressed.
There are some operators where Spark does not guarantee the order of the output. These are typically things like aggregates and joins that may use a hash to distribute the work load among downstream tasks. In these cases the plugin does not guarantee that it will produce the same output order that Spark does. In cases such as an `order by` operation where the ordering is explicit, the plugin will produce an ordering that is compatible with Spark's guarantee. It may not be 100% identical if the ordering is ambiguous. The one known issue with this is a bug where `-0.0` and `0.0` compare as equal with the GPU plugin enabled, but on the CPU `-0.0 < 0.0`.
For most basic floating point operations like addition, subtraction, multiplication, and division the plugin will produce a bit-for-bit identical result as Spark does. For other functions like `sin`, `cos`, etc. the output may be different, but within the rounding error inherent in floating point calculations. The ordering of operations to calculate the value may differ between the underlying JVM implementation used by the CPU and the C++ standard library implementation used by the GPU.
For aggregations the underlying implementation is doing the aggregations in parallel and due to race conditions within the computation itself the result may not be the same each time the query is run. This is inherent in how the plugin speeds up the calculations and cannot be "fixed." If a query joins on a floating point value, which is not wise to do anyways, and the value is the result of a floating point aggregation then the join may fail to work properly with the plugin but would have worked with plain Spark. Because of this most floating point aggregations are off by default but can be enabled with the config `spark.rapids.sql.variableFloatAgg.enabled`.
Additionally, some aggregations on floating point columns that contain `NaN` can produce incorrect results. More details on this behavior can be found here, here, and in this cudf feature request. If it is known with certainty that the floating point columns do not contain `NaN`, set `spark.rapids.sql.hasNans` to `false` to run GPU enabled aggregations on them.
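As a minimal sketch of how these two configs might be set when you know your data is free of `NaN` values and you accept variable floating point aggregation results (the config keys come from this page; the session setup itself is only illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative session setup: enable floating point aggregations and tell the
// plugin that the floating point columns are known to be free of NaN values.
val spark = SparkSession.builder()
  .appName("rapids-float-agg-example")
  .config("spark.rapids.sql.variableFloatAgg.enabled", "true")
  .config("spark.rapids.sql.hasNans", "false")
  .getOrCreate()
```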
In the case of a distinct count on `NaN` values the issue only shows up if you have different `NaN` values. There are several different binary values that are all considered to be `NaN` by floating point. The plugin treats all of these as the same value, whereas Spark treats them all as different values. Because this is considered to be rare we do not disable distinct count for floating point values even if `spark.rapids.sql.hasNans` is `true`.
Floating point allows zero to be encoded as `0.0` and `-0.0`, but the IEEE standard says that they should be interpreted as the same. Most databases normalize these values to always be `0.0`. Spark does this in some cases but not all, as is documented here. The underlying implementation of this plugin treats them as the same for essentially all processing. This can result in some differences with Spark for operations like sorting, distinct count, joins, and comparisons. We do not disable operations that produce different results due to `-0.0` in the data because it is considered to be a rare occurrence.
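For instance, a sort over a column that contains both encodings of zero can legitimately come back in either relative order. A small sketch, assuming a `SparkSession` named `spark` is in scope and using an arbitrary column name:

```scala
import spark.implicits._

// A column containing both encodings of zero. With the plugin enabled the two
// values are treated as equal, so their relative order after the sort is not
// guaranteed to match what the CPU produces.
val zeros = Seq(0.0, -0.0, 1.0).toDF("x")
zeros.orderBy("x").show()
```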
Spark delegates Unicode operations to the underlying JVM. Each version of Java complies with a specific version of the Unicode standard. The SQL plugin does not use the JVM for Unicode support and is compatible with Unicode version 12.1. Because of this there may be corner cases where Spark will produce a different result compared to the plugin.
Spark is very strict when reading CSV and if the data does not conform with the expected format exactly it will result in a `null` value. The underlying parser that the SQL plugin uses is much more lenient. If you have badly formatted CSV data you may get data back instead of nulls. If this is a problem you can disable the CSV reader by setting the config `spark.rapids.sql.format.csv.read.enabled` to `false`. Because the speed up is so large and the issues typically only show up in error conditions we felt it was worth having the CSV reader enabled by default.
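If the lenient parsing is a problem for your data, a minimal sketch of switching GPU CSV reads off on an existing session (assuming a `SparkSession` named `spark`):

```scala
// Force CSV reads back onto the CPU so Spark's stricter parser is used.
spark.conf.set("spark.rapids.sql.format.csv.read.enabled", "false")
```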
There are also discrepancies/issues with specific types that are detailed below.
Writing strings to a CSV file in general for Spark can be problematic unless you can ensure that your data does not have any line delimiters in it. The GPU accelerated CSV parser handles quoted line delimiters similar to `multiLine` mode. But there are still a number of issues surrounding it and they should be avoided.
Escaped quote characters `'\"'` are not supported well, as described by this issue. Null values are not respected as described here even though they are supported for other types.
Parsing a `timestamp` as a `date` does not work. The details are documented in this issue.
Only a limited set of formats are supported when parsing dates.

- `"yyyy-MM-dd"`
- `"yyyy/MM/dd"`
- `"yyyy-MM"`
- `"yyyy/MM"`
- `"MM-yyyy"`
- `"MM/yyyy"`
- `"MM-dd-yyyy"`
- `"MM/dd/yyyy"`

The reality is that all of these formats are supported at the same time. The plugin will only disable itself if you set a format that it does not support. As a workaround you can parse the column as a timestamp and then cast it to a date, as sketched below.
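A sketch of that workaround, assuming a `SparkSession` named `spark` and a hypothetical file `dates.csv` whose single column holds date strings:

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StructField, StructType, TimestampType}

// Read the column as a timestamp first, then cast it down to a date.
val schema = StructType(Seq(StructField("event_ts", TimestampType)))
val withDates = spark.read
  .schema(schema)
  .csv("dates.csv")
  .withColumn("event_date", col("event_ts").cast("date"))
```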
The CSV parser does not support time zones. It will ignore any trailing time zone information, despite the format asking for a `XXX` or `[XXX]`. As such it is off by default and you can enable it by setting `spark.rapids.sql.csvTimestamps.enabled` to `true`.
The formats supported for timestamps are limited similar to dates. The first part of the format must be a supported date format. The second part must start with a `'T'` to separate the time portion, followed by one of the following formats:

- `HH:mm:ss.SSSXXX`
- `HH:mm:ss[.SSS][XXX]`
- `HH:mm`
- `HH:mm:ss`
- `HH:mm[:ss]`
- `HH:mm:ss.SSS`
- `HH:mm:ss[.SSS]`

Just like with dates, all timestamp formats are actually supported at the same time. The plugin will disable itself if it sees a format it cannot support.
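A sketch of reading CSV timestamps with GPU timestamp parsing turned on, combining one of the supported date formats with one of the supported time formats above (the file name, column name, and exact pattern are only an illustration; assume a `SparkSession` named `spark`):

```scala
import org.apache.spark.sql.types.{StructField, StructType, TimestampType}

// Opt in to GPU CSV timestamp parsing (off by default, see above).
spark.conf.set("spark.rapids.sql.csvTimestamps.enabled", "true")

// Supported date format + 'T' + supported time format.
val schema = StructType(Seq(StructField("ts", TimestampType)))
val events = spark.read
  .schema(schema)
  .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
  .csv("events.csv")
```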
The CSV parser is not able to parse `Infinity`, `-Infinity`, or `NaN` values. All of these are likely to be turned into null values, as described in this issue. Some floating-point values also appear to overflow on the GPU but do not on the CPU, as described in this issue. Any number that overflows will not be turned into a null value.
The ORC format has fairly complete support for both reads and writes. There are only a few known issues. The first is for reading timestamps and dates around the transition between Julian and Gregorian calendars, as described here. A similar issue exists for writing dates, as described here. Writing timestamps, however, only appears to work for dates after the epoch, as described here.
The plugin supports reading `uncompressed`, `snappy` and `zlib` ORC files and writing `uncompressed` and `snappy` ORC files. At this point, the plugin does not have the ability to fall back to the CPU when reading an unsupported compression format, and will error out in that case.
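A sketch of writing ORC with one of the supported codecs, assuming an existing DataFrame `df` and an output path of your choosing:

```scala
// snappy (and uncompressed) are the ORC write codecs the plugin supports.
df.write
  .option("compression", "snappy")
  .orc("/tmp/output.orc")
```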
The Parquet format has more configs because there are multiple versions with some compatibility issues between them. Dates and timestamps are where the known issues exist. For reads when `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` is set to `CORRECTED`, timestamps before the transition between the Julian and Gregorian calendars are wrong, but dates are fine. When `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` is set to `LEGACY`, however, both dates and timestamps are read incorrectly before the Gregorian calendar transition, as described here. When writing, `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` is currently ignored, as described here.
The plugin supports reading `uncompressed`, `snappy` and `gzip` Parquet files and writing `uncompressed` and `snappy` Parquet files. At this point, the plugin does not have the ability to fall back to the CPU when reading an unsupported compression format, and will error out in that case.
The RAPIDS Accelerator for Apache Spark currently supports string literal matches, not wildcard matches.
If a null char `'\0'` is in a string that is being matched by a regular expression, `LIKE` sees it as the end of the string. This will be fixed in a future release. The issue is here.
Spark stores timestamps internally relative to the JVM time zone. Converting an arbitrary timestamp between time zones is not currently supported on the GPU. Therefore operations involving timestamps will only be GPU-accelerated if the time zone used by the JVM is UTC.
Because of ordering differences between the CPU and the GPU, window functions, especially row-based window functions like `row_number`, `lead`, and `lag`, can produce different results if the ordering includes both `-0.0` and `0.0`, or if the ordering is ambiguous. Spark can produce different results from one run to another if the ordering is ambiguous on a window function too.
When converting strings to dates or timestamps using functions like `to_date` and `unix_timestamp`, only a subset of possible formats are supported on GPU with full compatibility with Spark. The supported formats are:

- `dd/MM/yyyy`
- `yyyy/MM`
- `yyyy/MM/dd`
- `yyyy-MM`
- `yyyy-MM-dd`
- `yyyy-MM-dd HH:mm:ss`
Other formats may result in incorrect results and will not run on the GPU by default. Some specific issues with other formats are:
- Spark supports partial microseconds but the plugin does not
- The plugin will produce incorrect results for input data that is not in the correct format in some cases
To enable all formats on GPU, set `spark.rapids.sql.incompatibleDateFormats.enabled` to `true`.
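A sketch using one of the supported patterns, assuming a `SparkSession` named `spark`; switch the config on only if you need one of the other formats and accept the incompatibilities described above:

```scala
import spark.implicits._
import org.apache.spark.sql.functions.{col, to_date, unix_timestamp}

// Parsing with a supported pattern stays on the GPU without extra configuration.
val parsed = Seq("2024-03-01", "2024-03-02").toDF("s")
  .withColumn("d", to_date(col("s"), "yyyy-MM-dd"))
  .withColumn("epoch_sec", unix_timestamp(col("s"), "yyyy-MM-dd"))

// Only needed for formats outside the supported list, with the caveats noted above.
// spark.conf.set("spark.rapids.sql.incompatibleDateFormats.enabled", "true")
```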
In general, performing `cast` and `ansi_cast` operations on the GPU is compatible with the same operations on the CPU. However, there are some exceptions. For this reason, certain casts are disabled on the GPU by default and require configuration options to be specified to enable them.
The GPU will use different precision than Java's toString method when converting floating-point data types to strings and this can produce results that differ from the default behavior in Spark.
To enable this operation on the GPU, set `spark.rapids.sql.castFloatToString.enabled` to `true`.
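A sketch of opting in and casting a double column to string, assuming a `SparkSession` named `spark`:

```scala
import spark.implicits._
import org.apache.spark.sql.functions.col

// Accept the GPU's formatting of floating point values when casting to string.
spark.conf.set("spark.rapids.sql.castFloatToString.enabled", "true")

val asStrings = Seq(3.14159, 2.0e-7).toDF("x")
  .withColumn("x_str", col("x").cast("string"))
```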
Casting from string to floating-point types on the GPU returns incorrect results when the string represents any number in the following ranges. In both cases the GPU returns `Double.MaxValue`. The default behavior in Apache Spark is to return `+Infinity` and `-Infinity`, respectively.

- `1.7976931348623158E308 <= x < 1.7976931348623159E308`
- `-1.7976931348623159E308 < x <= -1.7976931348623158E308`

Also, the GPU does not support casting from strings containing hex values.
To enable this operation on the GPU, set `spark.rapids.sql.castStringToFloat.enabled` to `true`.
The GPU will return incorrect results for strings representing values greater than Long.MaxValue or less than Long.MinValue. The correct behavior would be to return null for these values, but the GPU currently overflows and returns an incorrect integer value.
To enable this operation on the GPU, set `spark.rapids.sql.castStringToInteger.enabled` to `true`.
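A sketch enabling both string-to-numeric casts, with the caveats above about extreme values, assuming a `SparkSession` named `spark`:

```scala
import spark.implicits._
import org.apache.spark.sql.functions.col

// Opt in to GPU casts from string to floating point and integral types.
spark.conf.set("spark.rapids.sql.castStringToFloat.enabled", "true")
spark.conf.set("spark.rapids.sql.castStringToInteger.enabled", "true")

val casted = Seq(("1.5", "42")).toDF("f", "i")
  .withColumn("f_double", col("f").cast("double"))
  .withColumn("i_long", col("i").cast("long"))
```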
The following formats/patterns are supported on the GPU. Timezone of UTC is assumed.
| Format or Pattern | Supported on GPU? |
|---|---|
| `"yyyy"` | Yes |
| `"yyyy-[M]M"` | Yes |
| `"yyyy-[M]M "` | Yes |
| `"yyyy-[M]M-[d]d"` | Yes |
| `"yyyy-[M]M-[d]d "` | Yes |
| `"yyyy-[M]M-[d]d *"` | Yes |
| `"yyyy-[M]M-[d]d T*"` | Yes |
| `"epoch"` | Yes |
| `"now"` | Yes |
| `"today"` | Yes |
| `"tomorrow"` | Yes |
| `"yesterday"` | Yes |
To allow casts from string to timestamp on the GPU, enable the configuration property `spark.rapids.sql.castStringToTimestamp.enabled`.
Casting from string to timestamp currently has the following limitations.
| Format or Pattern | Supported on GPU? |
|---|---|
| `"yyyy"` | Yes |
| `"yyyy-[M]M"` | Yes |
| `"yyyy-[M]M "` | Yes |
| `"yyyy-[M]M-[d]d"` | Yes |
| `"yyyy-[M]M-[d]d "` | Yes |
| `"yyyy-[M]M-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]"` | Partial [1] |
| `"yyyy-[M]M-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]"` | Partial [1] |
| `"[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]"` | Partial [1] |
| `"T[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]"` | Partial [1] |
| `"epoch"` | Yes |
| `"now"` | Yes |
| `"today"` | Yes |
| `"tomorrow"` | Yes |
| `"yesterday"` | Yes |
- [1] The timestamp portion must be complete in terms of hours, minutes, seconds, and milliseconds, with 2 digits each for hours, minutes, and seconds, and 6 digits for milliseconds. Only timezone 'Z' (UTC) is supported. Casting unsupported formats will result in null values.
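A sketch of casting a fully specified, UTC-zoned string to a timestamp with the config enabled, assuming a `SparkSession` named `spark` (the sample value is only an illustration of the complete format described in [1]):

```scala
import spark.implicits._
import org.apache.spark.sql.functions.col

// Opt in to GPU casts from string to timestamp.
spark.conf.set("spark.rapids.sql.castStringToTimestamp.enabled", "true")

// Two-digit hours/minutes/seconds, six sub-second digits, and the 'Z' zone.
val ts = Seq("2024-03-01T12:34:56.123456Z").toDF("s")
  .withColumn("ts", col("s").cast("timestamp"))
```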
To speed up UDF execution, spark-rapids introduces a udf-compiler extension to translate UDFs to Catalyst expressions.
To enable this operation on the GPU, set `spark.rapids.sql.udfCompiler.enabled` to `true`.
However, Spark may produce different results for a compiled udf and the non-compiled one. For example, for a udf of `x/y` where `y` happens to be `0`, the compiled Catalyst expression will return `NULL` while the original udf would fail the entire job with a `java.lang.ArithmeticException: / by zero`.
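A sketch of the kind of UDF where this difference shows up, assuming a `SparkSession` named `spark` (column names are arbitrary):

```scala
import spark.implicits._
import org.apache.spark.sql.functions.{col, udf}

// When translated to a Catalyst divide, x / 0 yields NULL; run as a plain JVM
// UDF, the same row throws java.lang.ArithmeticException: / by zero and fails the job.
spark.conf.set("spark.rapids.sql.udfCompiler.enabled", "true")
val div = udf((x: Int, y: Int) => x / y)

val result = Seq((4, 2), (1, 0)).toDF("x", "y")
  .withColumn("q", div(col("x"), col("y")))
```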
When translating UDFs to Catalyst expressions, the supported UDF functions are limited:
| Operand type | Operation |
|---|---|
| Arithmetic Unary | +x |
| | -x |
| Arithmetic Binary | lhs + rhs |
| | lhs - rhs |
| | lhs * rhs |
| | lhs / rhs |
| | lhs % rhs |
| Logical | lhs && rhs |
| | lhs \|\| rhs |
| | !x |
| Equality and Relational | lhs == rhs |
| | lhs < rhs |
| | lhs <= rhs |
| | lhs > rhs |
| | lhs >= rhs |
| Bitwise | lhs & rhs |
| | lhs \| rhs |
| | lhs ^ rhs |
| | ~x |
| | lhs << rhs |
| | lhs >> rhs |
| | lhs >>> rhs |
| Conditional | if |
| | case |
| Math | abs(x) |
| | cos(x) |
| | acos(x) |
| | asin(x) |
| | tan(x) |
| | atan(x) |
| | tanh(x) |
| | cosh(x) |
| | ceil(x) |
| | floor(x) |
| | exp(x) |
| | log(x) |
| | log10(x) |
| | sqrt(x) |
| | x.isNaN |
| Type Cast | * |
| String | lhs + rhs |
| | lhs.equalsIgnoreCase(String rhs) |
| | x.toUpperCase() |
| | x.trim() |
| | x.substring(int begin) |
| | x.substring(int begin, int end) |
| | x.replace(char oldChar, char newChar) |
| | x.replace(CharSequence target, CharSequence replacement) |
| | x.startsWith(String prefix) |
| | lhs.equals(Object rhs) |
| | x.toLowerCase() |
| | x.length() |
| | x.endsWith(String suffix) |
| | lhs.concat(String rhs) |
| | x.isEmpty() |
| | String.valueOf(boolean b) |
| | String.valueOf(char c) |
| | String.valueOf(double d) |
| | String.valueOf(float f) |
| | String.valueOf(int i) |
| | String.valueOf(long l) |
| | x.contains(CharSequence s) |
| | x.indexOf(String str) |
| | x.indexOf(String str, int fromIndex) |
| | x.replaceAll(String regex, String replacement) |
| | x.split(String regex) |
| | x.split(String regex, int limit) |
| | x.getBytes() |
| | x.getBytes(String charsetName) |
| Date and Time | LocalDateTime.parse(x, DateTimeFormatter.ofPattern(pattern)).getYear |
| | LocalDateTime.parse(x, DateTimeFormatter.ofPattern(pattern)).getMonthValue |
| | LocalDateTime.parse(x, DateTimeFormatter.ofPattern(pattern)).getDayOfMonth |
| | LocalDateTime.parse(x, DateTimeFormatter.ofPattern(pattern)).getHour |
| | LocalDateTime.parse(x, DateTimeFormatter.ofPattern(pattern)).getMinute |
| | LocalDateTime.parse(x, DateTimeFormatter.ofPattern(pattern)).getSecond |
| Empty array creation | Array.empty[Boolean] |
| | Array.empty[Byte] |
| | Array.empty[Short] |
| | Array.empty[Int] |
| | Array.empty[Long] |
| | Array.empty[Float] |
| | Array.empty[Double] |
| | Array.empty[String] |
| ArrayBuffer | new ArrayBuffer() |
| | x.distinct |
| | x.toArray |
| | lhs += rhs |
| | lhs :+ rhs |
| Method call | Only if the method being called |