Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Struct support for ParquetWriter #2514

Merged
merged 8 commits into from
May 28, 2021
4 changes: 2 additions & 2 deletions docs/supported_ops.md
Original file line number Diff line number Diff line change
Expand Up @@ -537,7 +537,7 @@ Accelerator supports are described below.
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><em>PS* (Only supported for Parquet; missing nested NULL, BINARY, CALENDAR, ARRAY, MAP, UDT)</em></td>
<td><b>NS</b></td>
</tr>
<tr>
Expand Down Expand Up @@ -20629,7 +20629,7 @@ dates or timestamps, or for a lack of type coercion support.
<td> </td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><em>PS* (missing nested BINARY, ARRAY, MAP, UDT)</em></td>
<td><b>NS</b></td>
</tr>
</table>
10 changes: 5 additions & 5 deletions integration_tests/src/main/python/parquet_write_test.py
Original file line number Diff line number Diff line change
Expand Up @@ -42,11 +42,11 @@
TimestampGen(start=datetime(1677, 9, 22, tzinfo=timezone.utc), end=datetime(2262, 4, 11, tzinfo=timezone.utc))]
parquet_basic_struct_gen = StructGen([['child'+str(ind), sub_gen] for ind, sub_gen in enumerate(parquet_basic_gen)])

parquet_write_gens_list = [
parquet_basic_gen,
pytest.param(parquet_basic_gen + [decimal_gen_default,
decimal_gen_scale_precision, decimal_gen_same_scale_precision, decimal_gen_64bit],
marks=pytest.mark.allow_non_gpu("CoalesceExec"))]
parquet_write_gens_list = [parquet_basic_gen,
pytest.param([decimal_gen_default, parquet_basic_struct_gen,
StructGen([['child0', StructGen([['child1', byte_gen]])]]),
decimal_gen_scale_precision, decimal_gen_same_scale_precision, decimal_gen_64bit],
marks=pytest.mark.allow_non_gpu("CoalesceExec"))]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is CoalesceExec not on the GPU? and is there an open issue that will put it on the GPU? If not we should file one and reference it here. If so we should just reference it here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this is fixed by #2530 which added struct support for coalesce.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the PR to not allow Coalesce with Structs to run on CPU

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code still has allow_non_gpu("CoalesceExec")?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, its not refflecting it here for some reason, unless you go to the last commit

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK you are seeing the latest, I wasn't seeing it until I looked at the commit specifically.

So that is needed because CoalesceExec doesn't support Decimals yet

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that is needed because CoalesceExec doesn't support Decimals yet

#2531 explicitly states it adds decimal support and appears to be testing it. So if that is somehow not working in practice an issue needs to be filed and fixed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for wasting your time. I didn't see _commonTypes was added and thought it was TypeSig.commonCudfTypes


parquet_ts_write_options = ['INT96', 'TIMESTAMP_MICROS', 'TIMESTAMP_MILLIS']

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -761,7 +761,7 @@ object GpuOverrides {
(ParquetFormatType, FileFormatChecks(
cudfRead = (TypeSig.commonCudfTypes + TypeSig.DECIMAL + TypeSig.STRUCT + TypeSig.ARRAY +
TypeSig.MAP).nested(),
cudfWrite = TypeSig.commonCudfTypes + TypeSig.DECIMAL,
cudfWrite = (TypeSig.commonCudfTypes + TypeSig.DECIMAL + TypeSig.STRUCT).nested(),
sparkSig = (TypeSig.atomics + TypeSig.STRUCT + TypeSig.ARRAY + TypeSig.MAP +
TypeSig.UDT).nested())),
(OrcFormatType, FileFormatChecks(
Expand Down Expand Up @@ -2835,7 +2835,8 @@ object GpuOverrides {
exec[DataWritingCommandExec](
"Writing data",
ExecChecks((TypeSig.commonCudfTypes +
TypeSig.DECIMAL.withPsNote(TypeEnum.DECIMAL, "Only supported for Parquet")).nested(),
TypeSig.DECIMAL.withPsNote(TypeEnum.DECIMAL, "Only supported for Parquet") +
TypeSig.STRUCT.withPsNote(TypeEnum.STRUCT, "Only supported for Parquet")).nested(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we missing a test? When the struct support went in, the fact that we forgot to update this should have been caught by the test.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out @jlowe I remembered that we added struct support for GpuInMemoryTableScanExec and confused it with DataWritingCommandExec. We have to turn this on as part of this PR. I have updated the PR name to reflect it

TypeSig.all),
(p, conf, parent, r) => new SparkPlanMeta[DataWritingCommandExec](p, conf, parent, r) {
override val childDataWriteCmds: scala.Seq[DataWritingCommandMeta[_]] =
Expand Down