
Preserve integer dtype of hive-partitioned column containing nulls #12930

Merged
merged 8 commits into rapidsai:branch-23.04 on Mar 15, 2023

Conversation

@rjzamora (Member) commented Mar 13, 2023

Description

This is a follow-up "fix" for #12866. While that PR enables writing and reading null hive partitions with dask_cudf, it does not preserve the dtype of integer partition columns containing nulls. This PR should address the remaining issue.
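
For context, here is a minimal sketch of the round trip this PR targets; the path, column names, and data below are illustrative stand-ins, not taken from the PR.

import cudf
import dask_cudf

# Integer partition column containing nulls (hypothetical example data).
df = cudf.DataFrame({"x": range(6), "part": [1, 1, 2, 2, None, None]})
ddf = dask_cudf.from_cudf(df, npartitions=2)

# Write a hive-partitioned dataset; the null keys get their own partition
# directory (writing/reading null partitions was enabled by #12866).
ddf.to_parquet("dataset", partition_on=["part"])

# Before this change, "part" could come back with a different dtype because of
# the null partition; with it, the original integer dtype should be preserved.
print(dask_cudf.read_parquet("dataset").dtypes)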

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rjzamora added the bug (Something isn't working), 2 - In Progress (Currently a work in progress), dask (Dask issue), and non-breaking (Non-breaking change) labels on Mar 13, 2023
@rjzamora (Member Author)

cc @randerzander

@github-actions bot added the Python (Affects Python cuDF API) label on Mar 13, 2023
Comment on lines 633 to 637
 dfs[-1][name] = as_column(
-    value,
-    length=len(dfs[-1]),
+    _value,
+    length=_len,
     dtype=partition_meta[name].dtype,
 )
rjzamora (Member Author)

@galipremsagar - Do you know the most efficient way to construct a completely-null column in cudf?

@galipremsagar (Contributor) commented Mar 13, 2023

Yup, you can do cudf.core.column.column_empty(row_count=..., dtype=..., masked=True) for a completely null column

galipremsagar (Contributor)

or if you want to fill with a specific value, we have cudf.core.column.full(size=..., fill_value=..., dtype=...)
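
For illustration, a rough sketch of how those two helpers could be used, assuming the signatures quoted above; both are internal cudf APIs, so treat this as a sketch of the discussion rather than documented usage.

import cudf

# A completely null int64 column of length 5 (masked=True allocates a null mask).
null_col = cudf.core.column.column_empty(row_count=5, dtype="int64", masked=True)

# An int64 column of length 5 filled with a single value.
filled_col = cudf.core.column.full(size=5, fill_value=3, dtype="int64")

# Wrapping the columns in Series just to inspect them.
print(cudf.Series(null_col))    # five <NA> values
print(cudf.Series(filled_col))  # five 3s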

rjzamora (Member Author)

Thanks!

@rjzamora marked this pull request as ready for review on March 13, 2023 at 23:10
@rjzamora requested review from a team as code owners on March 13, 2023 at 23:10
@rjzamora requested review from vyasr and skirui-source on March 13, 2023 at 23:10
@rjzamora added the 3 - Ready for Review (Ready for review by team) label and removed the 2 - In Progress (Currently a work in progress) label on Mar 14, 2023
@wence- (Contributor) left a comment

I think this looks fine (although I don't know a lot about this partitioning scheme).

I had a comment about the uniquifying of names, but I see that this is a long piece of string to pull on.

python/cudf/cudf/io/parquet.py

part_names = (
    part_keys.to_pandas(nullable=True).unique().to_frame(index=False)
)
wence- (Contributor)

Is there a reason this is not:

part_keys.unique().to_pandas(nullable=True).to_frame(index=False)

It seems strange to copy the big thing and then uniquify. The only difference I can see is that pandas puts <NA> at the end of a unique'd index, whereas cudf puts it at the beginning (in general, its position is unspecified). I guess that is probably the reason: if these names are going to be used to save the groups one by one, there'll be a mismatch in the naming.

Related: #11597 added preserve_order to Column.unique but didn't propagate it to either Index.unique or Series.unique.

See also performance-related discussions in #5286 et seq.

rjzamora (Member Author)

This is a good point. It's a shame to be using pandas to perform unique, but we need to preserve the initial order (including nulls), and we cannot rely on Column.unique, because part_keys may be a MultiIndex here. The good news is that we should be able to do part_keys.take(part_offsets[:-1]) instead of unique anyway.
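
For intuition, a small sketch of the take-on-offsets idea with made-up data; part_keys is shown as a plain Series here (in the PR it may be a MultiIndex), and part_offsets is a hypothetical stand-in for the grouped-partition offsets.

import cudf

# Rows already grouped by partition key, with the start of each group (plus the
# total row count) recorded in the offsets.
part_keys = cudf.Series([1, 1, 2, 2, 2, None], dtype="int64")
part_offsets = [0, 2, 5, 6]

# Taking the first row of each group yields one key per partition, in the
# grouped order and including the null group, without a separate unique() pass.
print(part_keys.take(part_offsets[:-1]))  # 1, 2, <NA>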

wence- (Contributor)

Assuming the number of partition keys is small, I don't think pandas usage is really a performance footgun...

@rjzamora added the 5 - Ready to Merge (Testing and reviews complete, ready to merge) label and removed the 3 - Ready for Review (Ready for review by team) label on Mar 15, 2023
@rjzamora (Member Author)

/merge

@rapids-bot bot merged commit ced3fdf into rapidsai:branch-23.04 on Mar 15, 2023
@rjzamora deleted the preserve-int-partitions branch on March 15, 2023 at 14:58