support casting Decimal to String (#2046)
* support casting decimal to string

Signed-off-by: sperlingxx <[email protected]>

* update CastChecks

Signed-off-by: sperlingxx <[email protected]>

* sync to latest main

* add RapidsConfig isCastDecimalToStringEnabled

* fix

Signed-off-by: sperlingxx <[email protected]>

* update configs.md
sperlingxx authored Apr 5, 2021
1 parent 555a318 commit 7cbbc12
Showing 9 changed files with 63 additions and 6 deletions.
2 changes: 2 additions & 0 deletions docs/configs.md
@@ -51,11 +51,13 @@ Name | Description | Default Value
<a name="shuffle.ucx.managementServerHost"></a>spark.rapids.shuffle.ucx.managementServerHost|The host to be used to start the management server|null
<a name="shuffle.ucx.useWakeup"></a>spark.rapids.shuffle.ucx.useWakeup|When set to true, use UCX's event-based progress (epoll) in order to wake up the progress thread when needed, instead of a hot loop.|true
<a name="sql.batchSizeBytes"></a>spark.rapids.sql.batchSizeBytes|Set the target number of bytes for a GPU batch. Splits sizes for input data is covered by separate configs. The maximum setting is 2 GB to avoid exceeding the cudf row count limit of a column.|2147483647
+<a name="sql.castDecimalToString.enabled"></a>spark.rapids.sql.castDecimalToString.enabled|When set to true, casting from decimal to string is supported on the GPU. The GPU does NOT produce exact same string as spark produces, but producing strings which are semantically equal. For instance, given input BigDecimal(123, -2), the GPU produces "12300", which spark produces "1.23E+4".|false
<a name="sql.castFloatToDecimal.enabled"></a>spark.rapids.sql.castFloatToDecimal.enabled|Casting from floating point types to decimal on the GPU returns results that have tiny difference compared to results returned from CPU.|false
<a name="sql.castFloatToIntegralTypes.enabled"></a>spark.rapids.sql.castFloatToIntegralTypes.enabled|Casting from floating point types to integral types on the GPU supports a slightly different range of values when using Spark 3.1.0 or later. Refer to the CAST documentation for more details.|false
<a name="sql.castFloatToString.enabled"></a>spark.rapids.sql.castFloatToString.enabled|Casting from floating point types to string on the GPU returns results that have a different precision than the default results of Spark.|false
<a name="sql.castStringToDecimal.enabled"></a>spark.rapids.sql.castStringToDecimal.enabled|When set to true, enables casting from strings to decimal type on the GPU. Currently string to decimal type on the GPU might produce results which slightly differed from the correct results when the string represents any number exceeding the max precision that CAST_STRING_TO_FLOAT can keep. For instance, the GPU returns 99999999999999987 given input string "99999999999999999". The cause of divergence is that we can not cast strings containing scientific notation to decimal directly. So, we have to cast strings to floats firstly. Then, cast floats to decimals. The first step may lead to precision loss.|false
<a name="sql.castStringToFloat.enabled"></a>spark.rapids.sql.castStringToFloat.enabled|When set to true, enables casting from strings to float types (float, double) on the GPU. Currently hex values aren't supported on the GPU. Also note that casting from string to float types on the GPU returns incorrect results when the string represents any number "1.7976931348623158E308" <= x < "1.7976931348623159E308" and "-1.7976931348623158E308" >= x > "-1.7976931348623159E308" in both these cases the GPU returns Double.MaxValue while CPU returns "+Infinity" and "-Infinity" respectively|false
+<a name="sql.castStringToInteger.enabled"></a>spark.rapids.sql.castStringToInteger.enabled|When set to true, enables casting from strings to integer types (byte, short, int, long) on the GPU. Casting from string to integer types on the GPU returns incorrect results when the string represents a number larger than Long.MaxValue or smaller than Long.MinValue.|false
<a name="sql.castStringToTimestamp.enabled"></a>spark.rapids.sql.castStringToTimestamp.enabled|When set to true, casting from string to timestamp is supported on the GPU. The GPU only supports a subset of formats when casting strings to timestamps. Refer to the CAST documentation for more details.|false
<a name="sql.concurrentGpuTasks"></a>spark.rapids.sql.concurrentGpuTasks|Set the number of tasks that can execute concurrently per GPU. Tasks may temporarily block when the number of concurrent tasks in the executor exceeds this amount. Allowing too many concurrent tasks on the same GPU may lead to GPU out of memory errors.|1
<a name="sql.csvTimestamps.enabled"></a>spark.rapids.sql.csvTimestamps.enabled|When set to true, enables the CSV parser to read timestamps. The default output format for Spark includes a timezone at the end. Anything except the UTC timezone is not supported. Timestamps after 2038 and before 1902 are also not supported.|false
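
Note on the `castDecimalToString` description above: the two strings it quotes are different renderings of the same number. The sketch below is not part of the commit. It uses only `java.math.BigDecimal` from the JDK, on the assumption that Spark's CPU output follows `BigDecimal.toString` and that the GPU output resembles `toPlainString`. The object name `DecimalStringForms` is hypothetical.

```scala
import java.math.{BigDecimal => JBigDecimal, BigInteger}

object DecimalStringForms {
  // Unscaled value 123 with scale -2 represents 123 * 10^2 = 12300.
  val value: JBigDecimal = new JBigDecimal(BigInteger.valueOf(123L), -2)

  // Scientific notation: the form the description attributes to Spark.
  val scientific: String = value.toString

  // Plain digits: the form the description attributes to the GPU.
  val plain: String = value.toPlainString

  // Parsing either string back gives numerically equal values.
  val sameNumber: Boolean =
    new JBigDecimal(scientific).compareTo(new JBigDecimal(plain)) == 0

  def main(args: Array[String]): Unit = {
    println(s"toString      = $scientific")
    println(s"toPlainString = $plain")
    println(s"numerically equal = $sameNumber")
  }
}
```

Because the strings differ character by character, a byte-wise comparison of CPU and GPU output would fail even though the values agree, which is why the option defaults to false.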
4 changes: 2 additions & 2 deletions docs/supported_ops.md
@@ -18117,7 +18117,7 @@ and the accelerator produces the same result.
<td><b>NS</b></td>
<td> </td>
<td><b>NS</b></td>
-<td><b>NS</b></td>
+<td>S</td>
<td>S*</td>
<td> </td>
<td> </td>
@@ -18521,7 +18521,7 @@ and the accelerator produces the same result.
<td><b>NS</b></td>
<td> </td>
<td><b>NS</b></td>
-<td><b>NS</b></td>
+<td>S</td>
<td>S*</td>
<td> </td>
<td> </td>
@@ -139,7 +139,7 @@ class Spark311Shims extends Spark301Shims {

// stringChecks are the same
// binaryChecks are the same
-override val decimalChecks: TypeSig = none
+override val decimalChecks: TypeSig = DECIMAL + STRING
override val sparkDecimalSig: TypeSig = numeric + BOOLEAN + STRING

// calendarChecks are the same
@@ -409,6 +409,9 @@ case class GpuCast(
castDecimalToDecimal(inputVector, from, to)
}

+case (_: DecimalType, StringType) =>
+input.castTo(DType.STRING)

case _ =>
input.castTo(GpuColumnVector.getNonNestedRapidsType(dataType))
}
18 changes: 18 additions & 0 deletions sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala
@@ -591,6 +591,22 @@ object RapidsConf {
.booleanConf
.createWithDefault(false)

+val ENABLE_CAST_STRING_TO_INTEGER = conf("spark.rapids.sql.castStringToInteger.enabled")
+.doc("When set to true, enables casting from strings to integer types (byte, short, " +
+"int, long) on the GPU. Casting from string to integer types on the GPU returns incorrect " +
+"results when the string represents a number larger than Long.MaxValue or smaller than " +
+"Long.MinValue.")
+.booleanConf
+.createWithDefault(false)
+
+val ENABLE_CAST_DECIMAL_TO_STRING = conf("spark.rapids.sql.castDecimalToString.enabled")
+.doc("When set to true, casting from decimal to string is supported on the GPU. The GPU " +
+"does NOT produce exact same string as spark produces, but producing strings which are " +
+"semantically equal. For instance, given input BigDecimal(123, -2), the GPU produces " +
+"\"12300\", which spark produces \"1.23E+4\".")
+.booleanConf
+.createWithDefault(false)

val ENABLE_CSV_TIMESTAMPS = conf("spark.rapids.sql.csvTimestamps.enabled")
.doc("When set to true, enables the CSV parser to read timestamps. The default output " +
"format for Spark includes a timezone at the end. Anything except the UTC timezone is not " +
@@ -1200,6 +1216,8 @@ class RapidsConf(conf: Map[String, String]) extends Logging {

lazy val isCastFloatToIntegralTypesEnabled: Boolean = get(ENABLE_CAST_FLOAT_TO_INTEGRAL_TYPES)

+lazy val isCastDecimalToStringEnabled: Boolean = get(ENABLE_CAST_DECIMAL_TO_STRING)

lazy val isCsvTimestampEnabled: Boolean = get(ENABLE_CSV_TIMESTAMPS)

lazy val isParquetEnabled: Boolean = get(ENABLE_PARQUET)
@@ -772,7 +772,7 @@ class CastChecks extends ExprChecks {
val binaryChecks: TypeSig = none
val sparkBinarySig: TypeSig = STRING + BINARY

-val decimalChecks: TypeSig = DECIMAL
+val decimalChecks: TypeSig = DECIMAL + STRING
val sparkDecimalSig: TypeSig = numeric + BOOLEAN + TIMESTAMP + STRING

val calendarChecks: TypeSig = none
14 changes: 14 additions & 0 deletions tests/src/test/scala/com/nvidia/spark/rapids/AnsiCastOpSuite.scala
@@ -381,6 +381,20 @@ class AnsiCastOpSuite extends GpuExpressionTestSuite {
comparisonFunc = Some(compareStringifiedFloats))
}

+test("ansi_cast decimal to string") {
+val sqlCtx = SparkSession.getActiveSession.get.sqlContext
+sqlCtx.setConf("spark.sql.legacy.allowNegativeScaleOfDecimal", "true")
+sqlCtx.setConf("spark.rapids.sql.castDecimalToString.enabled", "true")
+
+Seq(10, 15, 18).foreach { precision =>
+Seq(-precision, -5, 0, 5, precision).foreach { scale =>
+testCastToString(DataTypes.createDecimalType(precision, scale),
+ansiMode = true,
+comparisonFunc = Some(compareStringifiedDecimalsInSemantic))
+}
+}
+}

private def castToStringExpectedFun[T]: T => Option[String] = (d: T) => Some(String.valueOf(d))

private def testCastToString[T](dataType: DataType, ansiMode: Boolean,
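
Note on the test above: the nested `foreach` loops cover a small grid of decimal types, including negative scales, which is presumably why the test sets `spark.sql.legacy.allowNegativeScaleOfDecimal`. The sketch below is not part of the commit. It enumerates the same grid in plain Scala to show which (precision, scale) pairs are exercised. The object name `DecimalCastTestGrid` and the helper `scalesFor` are hypothetical.

```scala
object DecimalCastTestGrid {
  // Same precision list as the new tests.
  val precisions: Seq[Int] = Seq(10, 15, 18)

  // Same scale list as the new tests, derived from the precision.
  def scalesFor(precision: Int): Seq[Int] =
    Seq(-precision, -5, 0, 5, precision)

  // All (precision, scale) pairs, in the order the nested foreach visits them.
  val grid: Seq[(Int, Int)] =
    for (p <- precisions; s <- scalesFor(p)) yield (p, s)

  def main(args: Array[String]): Unit =
    grid.foreach { case (p, s) => println(s"DecimalType($p, $s)") }
}
```

Each precision contributes two negative scales, so six of the fifteen cases depend on the legacy negative-scale setting.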
15 changes: 15 additions & 0 deletions tests/src/test/scala/com/nvidia/spark/rapids/CastOpSuite.scala
@@ -228,6 +228,19 @@ class CastOpSuite extends GpuExpressionTestSuite {
testCastToString[Double](DataTypes.DoubleType, comparisonFunc = Some(compareStringifiedFloats))
}

+test("cast decimal to string") {
+val sqlCtx = SparkSession.getActiveSession.get.sqlContext
+sqlCtx.setConf("spark.sql.legacy.allowNegativeScaleOfDecimal", "true")
+sqlCtx.setConf("spark.rapids.sql.castDecimalToString.enabled", "true")
+
+Seq(10, 15, 18).foreach { precision =>
+Seq(-precision, -5, 0, 5, precision).foreach { scale =>
+testCastToString(DataTypes.createDecimalType(precision, scale),
+comparisonFunc = Some(compareStringifiedDecimalsInSemantic))
+}
+}
+}

private def testCastToString[T](
dataType: DataType,
comparisonFunc: Option[(String, String) => Boolean] = None) {
@@ -481,6 +494,7 @@ class CastOpSuite extends GpuExpressionTestSuite {
customRandGenerator = Some(new scala.util.Random(1234L)))
testCastToDecimal(DataTypes.createDecimalType(18, 2),
scale = 2,
+ansiEnabled = true,
customRandGenerator = Some(new scala.util.Random(1234L)))

// fromScale > toScale
@@ -489,6 +503,7 @@ class CastOpSuite extends GpuExpressionTestSuite {
customRandGenerator = Some(new scala.util.Random(1234L)))
testCastToDecimal(DataTypes.createDecimalType(18, 10),
scale = 2,
+ansiEnabled = true,
customRandGenerator = Some(new scala.util.Random(1234L)))
testCastToDecimal(DataTypes.createDecimalType(18, 18),
scale = 15,
@@ -1,5 +1,5 @@
/*
-* Copyright (c) 2020, NVIDIA CORPORATION.
+* Copyright (c) 2020-2021, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -16,7 +16,7 @@

package com.nvidia.spark.rapids

-import org.apache.spark.sql.types.{DataType, DataTypes, DecimalType, StructType}
+import org.apache.spark.sql.types.{DataType, DataTypes, Decimal, DecimalType, StructType}

abstract class GpuExpressionTestSuite extends SparkQueryCompareTestSuite {

@@ -172,6 +172,11 @@ abstract class GpuExpressionTestSuite extends SparkQueryCompareTestSuite {
}
}

+def compareStringifiedDecimalsInSemantic(expected: String, actual: String): Boolean = {
+(expected == null && actual == null) ||
+(expected != null && actual != null && Decimal(expected) == Decimal(actual))
+}

private def getAs(column: RapidsHostColumnVector, index: Int, dataType: DataType): Option[Any] = {
if (column.isNullAt(index)) {
None
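
Note on `compareStringifiedDecimalsInSemantic` above: it treats two strings as equal when both are null or when both parse to the same decimal value. The sketch below is not part of the commit and runs without Spark. It substitutes `java.math.BigDecimal.compareTo` for Spark's `Decimal` equality, which is an assumption about how that equality behaves. The names `SemanticDecimalCompare` and `sameDecimal` are hypothetical.

```scala
import java.math.{BigDecimal => JBigDecimal}

object SemanticDecimalCompare {
  // Null-safe numeric comparison of two decimal strings.
  // java.math.BigDecimal stands in for Spark's Decimal (assumption of this sketch).
  def sameDecimal(expected: String, actual: String): Boolean =
    (expected == null && actual == null) ||
      (expected != null && actual != null &&
        new JBigDecimal(expected).compareTo(new JBigDecimal(actual)) == 0)

  def main(args: Array[String]): Unit = {
    println(sameDecimal("1.23E+4", "12300")) // differing notation, same value
    println(sameDecimal(null, "12300"))      // only one side is null
  }
}
```

`compareTo` is used rather than `equals` because `java.math.BigDecimal.equals` also compares scale, so it would treat "1.50" and "1.5" as different.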