
Preserve integer dtype of hive-partitioned column containing nulls #12930

Merged
merged 8 commits into rapidsai:branch-23.04 on Mar 15, 2023

Conversation

@rjzamora (Member) commented Mar 13, 2023

Description

This is a follow-up "fix" for #12866. While that PR enables writing and reading null hive partitions with dask_cudf, it does not preserve the dtype of integer partition columns containing nulls. This PR should address the remaining issue.
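
For context, here is a minimal sketch of the round trip this PR targets; the path, column names, and data below are illustrative stand-ins, not taken from the PR.

import cudf
import dask_cudf

# Integer partition column containing nulls (hypothetical example data).
df = cudf.DataFrame({"x": range(6), "part": [1, 1, 2, 2, None, None]})
ddf = dask_cudf.from_cudf(df, npartitions=2)

# Write a hive-partitioned dataset; the null keys get their own partition
# directory (writing/reading null partitions was enabled by #12866).
ddf.to_parquet("dataset", partition_on=["part"])

# Before this change, "part" could come back with a different dtype because of
# the null partition; with it, the original integer dtype should be preserved.
print(dask_cudf.read_parquet("dataset").dtypes)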

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rjzamora added the bug (Something isn't working), 2 - In Progress (Currently a work in progress), dask (Dask issue), and non-breaking (Non-breaking change) labels on Mar 13, 2023
@rjzamora (Member Author)

cc @randerzander

@github-actions bot added the Python (Affects Python cuDF API) label on Mar 13, 2023
Comment on lines 633 to 637
 dfs[-1][name] = as_column(
-    value,
-    length=len(dfs[-1]),
+    _value,
+    length=_len,
     dtype=partition_meta[name].dtype,
 )
rjzamora (Member Author)

@galipremsagar - Do you know the most efficient way to construct a completely-null column in cudf?

@galipremsagar (Contributor) commented Mar 13, 2023

Yup, you can do cudf.core.column.column_empty(row_count=..., dtype=..., masked=True) for a completely null column

galipremsagar (Contributor)

or if you want to fill with a specific value, we have cudf.core.column.full(size=..., fill_value=..., dtype=...)
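
For illustration, a rough sketch of how those two helpers could be used, assuming the signatures quoted above; both are internal cudf APIs, so treat this as a sketch of the discussion rather than documented usage.

import cudf

# A completely null int64 column of length 5 (masked=True allocates a null mask).
null_col = cudf.core.column.column_empty(row_count=5, dtype="int64", masked=True)

# An int64 column of length 5 filled with a single value.
filled_col = cudf.core.column.full(size=5, fill_value=3, dtype="int64")

# Wrapping the columns in Series just to inspect them.
print(cudf.Series(null_col))    # five <NA> values
print(cudf.Series(filled_col))  # five 3s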

rjzamora (Member Author)

Thanks!

@rjzamora marked this pull request as ready for review on March 13, 2023 at 23:10
@rjzamora requested review from a team as code owners on March 13, 2023 at 23:10
@rjzamora requested review from vyasr and skirui-source on March 13, 2023 at 23:10
@rjzamora added the 3 - Ready for Review (Ready for review by team) label and removed the 2 - In Progress (Currently a work in progress) label on Mar 14, 2023
@wence- (Contributor) left a comment

I think this looks fine (although I don't know a lot about this partitioning scheme).

I had a comment about the uniquifying of names, but I see that this is a long piece of string to pull on.

python/cudf/cudf/io/parquet.py

part_names = (
    part_keys.to_pandas(nullable=True).unique().to_frame(index=False)
)
wence- (Contributor)

Is there a reason this is not:

part_keys.unique().to_pandas(nullable=True).to_frame(index=False)

It seems strange to copy the big thing and then uniquify. The only difference I can see is that pandas puts <NA> at the end of a unique'd index, whereas cudf puts it at the beginning (in general, its position is unspecified). I guess that is probably the reason: if these names are going to be used to save the groups one by one, there'll be a mismatch in the naming.

Related: #11597 added preserve_order to Column.unique but didn't propagate it to either Index.unique or Series.unique.

See also performance-related discussions in #5286 et seq.

rjzamora (Member Author)

This is a good point. It's a shame to be using pandas to perform unique, but we need to preserve the initial order (including nulls), and we cannot rely on Column.unique, because part_keys may be a MultiIndex here. The good news is that we should be able to do part_keys.take(part_offsets[:-1]) instead of unique anyway.
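
For intuition, a small sketch of the take-on-offsets idea with made-up data; part_keys is shown as a plain Series here (in the PR it may be a MultiIndex), and part_offsets is a hypothetical stand-in for the grouped-partition offsets.

import cudf

# Rows already grouped by partition key, with the start of each group (plus the
# total row count) recorded in the offsets.
part_keys = cudf.Series([1, 1, 2, 2, 2, None], dtype="int64")
part_offsets = [0, 2, 5, 6]

# Taking the first row of each group yields one key per partition, in the
# grouped order and including the null group, without a separate unique() pass.
print(part_keys.take(part_offsets[:-1]))  # 1, 2, <NA>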

wence- (Contributor)

Assuming the number of partition keys is small, I don't think pandas usage is really a performance footgun...

@rjzamora added the 5 - Ready to Merge (Testing and reviews complete, ready to merge) label and removed the 3 - Ready for Review (Ready for review by team) label on Mar 15, 2023
@rjzamora (Member Author)

/merge

@rapids-bot bot merged commit ced3fdf into rapidsai:branch-23.04 on Mar 15, 2023
@rjzamora deleted the preserve-int-partitions branch on March 15, 2023 at 14:58