Preserve integer dtype of hive-partitioned column containing nulls #12930
Conversation
python/cudf/cudf/io/parquet.py
Outdated
     dfs[-1][name] = as_column(
-        value,
-        length=len(dfs[-1]),
+        _value,
+        length=_len,
         dtype=partition_meta[name].dtype,
     )
@galipremsagar - Do you know the most efficient way to construct a completely-null column in cudf?
Yup, you can do cudf.core.column.column_empty(row_count=..., dtype=..., masked=True) for a completely null column.
or if you want to fill with a specific value, we have cudf.core.column.full(size=..., fill_value=..., dtype=...)
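For readers without a GPU handy, the two construction patterns suggested above can be sketched with pandas nullable dtypes. This is only a CPU-side analogy to the cudf calls, not the cudf implementation itself:

```python
import pandas as pd

# Analog of column_empty(..., masked=True): an all-null integer column
# that keeps its Int64 dtype instead of degrading to float64.
all_null = pd.array([pd.NA] * 4, dtype="Int64")

# Analog of full(size=..., fill_value=..., dtype=...): a constant column.
filled = pd.array([7] * 4, dtype="Int64")

print(all_null.dtype, filled.dtype)
```

The key point in both cases is that the requested dtype is preserved even when every value is null.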
Thanks!
I think this looks fine (although I don't know a lot about this partitioning scheme).
I had a comment about the uniquifying of names but I see that this is a long piece of string to pull on.
    part_names = (
        part_keys.to_pandas(nullable=True).unique().to_frame(index=False)
    )
Is there a reason this is not:
part_keys.unique().to_pandas(nullable=True).to_frame(index=False)
It seems strange to copy the big thing and then uniquify it. The only difference I can see is that pandas puts <NA> at the end of a unique'd index, whereas cudf puts it in an unspecified position (often at the beginning). I guess that is probably the reason: if these names are going to be used to save the groups one by one, there would be a mismatch in the naming.
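As a small illustration of the ordering question, pandas' unique on a nullable-integer Series preserves first-appearance order, so <NA> lands wherever it first occurs (a sketch of the pandas side only, not the cudf behavior):

```python
import pandas as pd

s = pd.Series([2, pd.NA, 2, 1], dtype="Int64")
u = s.unique()  # first-appearance order: 2, <NA>, 1
print(u)
```

Since cudf does not guarantee the same placement of nulls, swapping the order of unique() and to_pandas() can change which group each name refers to.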
Related: #11597 added preserve_order to Column.unique but didn't propagate it to either Index.unique or Series.unique.
See also performance-related discussions in #5286 et seq.
This is a good point. It's a shame to be using pandas to perform unique, but we need to preserve the initial order (including nulls), and we cannot rely on Column.unique, because part_keys may be a MultiIndex here. The good news is that we should be able to do part_keys.take(part_offsets[:-1]) instead of unique anyway.
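The take-based suggestion can be sketched with pandas and synthetic data. The keys and part_offsets values here are made up for illustration; in the PR, the offsets come from the partitioning step:

```python
import numpy as np
import pandas as pd

# Keys already grouped into contiguous runs; part_offsets delimits the
# runs, with a trailing entry equal to len(keys).
keys = pd.Series([10, 10, 20, 20, 20, pd.NA], dtype="Int64")
part_offsets = np.array([0, 2, 5, 6])

# One representative row per partition, in the original group order,
# without ever computing a full unique().
first_of_each = keys.take(part_offsets[:-1])
print(first_of_each)
```

Because the offsets already identify the first row of each group, this avoids both the ordering ambiguity of unique() and the cost of deduplicating the full column.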
Assuming the number of partition keys is small, I don't think the pandas usage is really a performance footgun...
Description
This is a follow-up "fix" for #12866.
While that PR enables the writing/reading of null hive partitions using dask_cudf, it does not preserve the dtype of integer partition columns containing nulls. This PR should address the remaining issue.

Checklist
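To illustrate the dtype issue this PR addresses: when a hive partition key is null, naively materializing the column degrades an integer key to float64, while a nullable integer dtype preserves it. A CPU-side sketch with pandas (the directory-name parsing and the __HIVE_DEFAULT_PARTITION__ sentinel follow general hive conventions and are not taken from the cudf code):

```python
import pandas as pd

# Hive encodes a null partition key in the directory name.
raw = ["a=1", "a=2", "a=__HIVE_DEFAULT_PARTITION__"]
vals = [
    None if v.split("=", 1)[1] == "__HIVE_DEFAULT_PARTITION__"
    else int(v.split("=", 1)[1])
    for v in raw
]

lossy = pd.Series(vals)                      # the null forces float64
preserved = pd.Series(vals, dtype="Int64")   # nullable dtype keeps integers

print(lossy.dtype, preserved.dtype)
```

The fix amounts to constructing the partition column with the recorded integer dtype (via a masked/nullable column) rather than letting the null promote it to floating point.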