Replace toTitle with capitalize for GpuInitCap #2838

Merged
merged 15 commits on Jul 7, 2021
2 changes: 1 addition & 1 deletion docs/configs.md
@@ -197,7 +197,7 @@ Name | SQL Function(s) | Description | Default Value | Notes
<a name="sql.expression.If"></a>spark.rapids.sql.expression.If|`if`|IF expression|true|None|
<a name="sql.expression.In"></a>spark.rapids.sql.expression.In|`in`|IN operator|true|None|
<a name="sql.expression.InSet"></a>spark.rapids.sql.expression.InSet| |INSET operator|true|None|
<a name="sql.expression.InitCap"></a>spark.rapids.sql.expression.InitCap|`initcap`|Returns str with the first letter of each word in uppercase. All other letters are in lowercase|false|This is not 100% compatible with the Spark version because in some cases Unicode characters change byte width when changing the case. The GPU string conversion does not support these characters for capitalize now. This will be fixed in the future per https://github.com/rapidsai/cudf/issues/8644. For a full list of unsupported characters see https://github.com/rapidsai/cudf/issues/3132.|
<a name="sql.expression.InitCap"></a>spark.rapids.sql.expression.InitCap|`initcap`|Returns str with the first letter of each word in uppercase. All other letters are in lowercase|false|This is not 100% compatible with the Spark version because the Unicode version used by cuDF and the JVM may differ, resulting in some corner-case characters not changing case correctly.|
<a name="sql.expression.InputFileBlockLength"></a>spark.rapids.sql.expression.InputFileBlockLength|`input_file_block_length`|Returns the length of the block being read, or -1 if not available|true|None|
<a name="sql.expression.InputFileBlockStart"></a>spark.rapids.sql.expression.InputFileBlockStart|`input_file_block_start`|Returns the start offset of the block being read, or -1 if not available|true|None|
<a name="sql.expression.InputFileName"></a>spark.rapids.sql.expression.InputFileName|`input_file_name`|Returns the name of the file being read, or empty string if not available|true|None|
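The `initcap` semantics described in the config table above (first letter of each word uppercased, all other letters lowercased) can be sketched in plain Python. This is an illustrative stand-in, not the plugin's implementation; `spark_initcap` is a hypothetical helper, and it assumes words are delimited by single spaces, which matches the common case for Spark's `initcap`:

```python
def spark_initcap(s: str) -> str:
    # Illustrative reimplementation of initcap semantics:
    # uppercase the first letter of each space-delimited word,
    # lowercase the remaining letters.
    return ' '.join(w[:1].upper() + w[1:].lower() for w in s.split(' '))

print(spark_initcap('sPark rAPIDS'))  # Spark Rapids
```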
2 changes: 1 addition & 1 deletion docs/supported_ops.md
@@ -8013,7 +8013,7 @@ Accelerator support is described below.
<td rowSpan="4">InitCap</td>
<td rowSpan="4">`initcap`</td>
<td rowSpan="4">Returns str with the first letter of each word in uppercase. All other letters are in lowercase</td>
<td rowSpan="4">This is not 100% compatible with the Spark version because in some cases Unicode characters change byte width when changing the case. The GPU string conversion does not support these characters for capitalize now. This will be fixed in the future per https://github.com/rapidsai/cudf/issues/8644. For a full list of unsupported characters see https://github.com/rapidsai/cudf/issues/3132.</td>
<td rowSpan="4">This is not 100% compatible with the Spark version because the Unicode version used by cuDF and the JVM may differ, resulting in some corner-case characters not changing case correctly.</td>
<td rowSpan="2">project</td>
<td>input</td>
<td> </td>
8 changes: 4 additions & 4 deletions integration_tests/src/main/python/string_test.py
@@ -320,15 +320,15 @@ def test_initcap():
# the character set to something more reasonable
# upper and lower should cover the corner cases, this is mostly to
# see if there are issues with spaces
gen = mk_str_gen('([aAbB1357_@%-]{0,12}[ \r\n\t]{1,2}){1,5}')
gen = mk_str_gen('([aAbB1357ȺéŸ_@%-]{0,15}[ \r\n\t]{1,2}){1,5}')
assert_gpu_and_cpu_are_equal_collect(
lambda spark: unary_op_df(spark, gen).select(
f.initcap(f.col('a'))))

@incompat
@pytest.mark.xfail(reason='https://github.com/rapidsai/cudf/issues/8644')
def test_initcap_width_change():
gen = mk_str_gen('ʼn([aAbB13ʼnȺéʼnŸ]{0,5}){1,5}')
@pytest.mark.xfail(reason='Spark initcap will not convert ʼn to ʼN')
def test_initcap_special_chars():
gen = mk_str_gen('ʼn([aAbB13ȺéŸ]{0,5}){1,5}')
assert_gpu_and_cpu_are_equal_collect(
lambda spark: unary_op_df(spark, gen).select(
f.initcap(f.col('a'))))
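The xfail reason above notes that Spark's `initcap` will not convert ʼn to ʼN. A quick plain-Python check (not part of the test suite) shows why such characters are corner cases: some Unicode code points expand to multiple code points when their case changes, so a cased string can differ in length from its input:

```python
# U+0149 (ʼn) uppercases to the two-code-point sequence U+02BC U+004E (ʼN),
# and German sharp s (ß) uppercases to "SS": case conversion can change
# the length of a string.
n_preceded_by_apostrophe = '\u0149'
print(len(n_preceded_by_apostrophe), len(n_preceded_by_apostrophe.upper()))  # 1 2
print('\u00df'.upper())  # SS
```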
@@ -1249,10 +1249,7 @@ object GpuOverrides {
ExprChecks.unaryProjectNotLambdaInputMatchesOutput(TypeSig.STRING, TypeSig.STRING),
(a, conf, p, r) => new UnaryExprMeta[InitCap](a, conf, p, r) {
override def convertToGpu(child: Expression): GpuExpression = GpuInitCap(child)
}).incompat("in some cases Unicode characters change byte width when changing the case." +
" The GPU string conversion does not support these characters for capitalize now. This" +
" will be fixed in the future per https://github.com/rapidsai/cudf/issues/8644. For a" +
" full list of unsupported characters see https://github.com/rapidsai/cudf/issues/3132."),
}).incompat(CASE_MODIFICATION_INCOMPAT),
expr[Log](
"Natural log",
ExprChecks.mathUnary,