Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor LogicalType for Parquet #14264

Merged
merged 17 commits into from
Oct 20, 2023

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Oct 10, 2023

Description

Continuation of #14097, this PR refactors the LogicalType struct to use the new way of treating unions defined in the parquet thrift (more enum like than struct like).

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@etseidl etseidl requested a review from a team as a code owner October 10, 2023 00:27
@copy-pr-bot
Copy link

copy-pr-bot bot commented Oct 10, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Oct 10, 2023
@etseidl
Copy link
Contributor Author

etseidl commented Oct 10, 2023

The ultimate goal is to allow greater use of LogicalType in the parquet reader and writer, so we can rely less on special cases that test the input or output length.

@vuule vuule self-requested a review October 10, 2023 17:21
@vuule vuule added code quality non-breaking Non-breaking change cuIO cuIO issue improvement Improvement / enhancement to an existing function labels Oct 10, 2023
@ttnghia
Copy link
Contributor

ttnghia commented Oct 10, 2023

/ok to test

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple of small questions/suggestions

// isset.TIME or isset.TIMESTAMP or isset.INTEGER or isset.UNKNOWN or isset.JSON or isset.BSON)
// {
if (isset.TIMESTAMP or isset.TIME) { c.field_struct(10, s.logical_type); }
if (s.field_id.has_value()) { c.field_int(9, s.field_id.value()); }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: we could probably get rid of this if has_value -> write value pattern with SFINAE in ProtobufWriter, assuming that the field_xyz names map to the actual type of the parameter.
If all types were supported with an overload set, optional support could be a part of the set, and simply delegate to <T::value> implementation if has_value

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a lot of template foo to avoid a few invocations of has_value. We can leave this for another day :D

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would have a larger impact than that, but maybe I'm mixing it up with the reader side. Either way it's not a suggestion for this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Not sure if I'm on the same page)
How about std::visit to deal with such template issue and optional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That can be in the next refactor of CompactProtocolReader/Writer 🤣 (or should I say compact_protocol_reader/writer?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think std::visit is applicable here. Outside of reflection that would allow us to iterate over data members (which AFAIK does not exist), I think an overload set is as good as this gets. I'd be happy to learn about other solutions.

cpp/src/io/parquet/parquet.hpp Show resolved Hide resolved
cpp/src/io/parquet/compact_protocol_writer.cpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/parquet.hpp Show resolved Hide resolved
cpp/src/io/parquet/writer_impl.cu Outdated Show resolved Hide resolved
@etseidl etseidl requested a review from vuule October 11, 2023 23:57
@vuule
Copy link
Contributor

vuule commented Oct 12, 2023

/ok to test

NullType UNKNOWN;
JsonType JSON;
BsonType BSON;
enum Type {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enum class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Type enums are already buried inside the struct, so we're already getting the benefits of a scoped enum. And keeping it non-scoped allows me to use the enum values as the positional argument to the field_struct calls in the writer without having to do a cast.

@vuule
Copy link
Contributor

vuule commented Oct 20, 2023

/ok to test

@vuule vuule requested a review from ttnghia October 20, 2023 18:50
@vuule
Copy link
Contributor

vuule commented Oct 20, 2023

/merge

@rapids-bot rapids-bot bot merged commit 253f6a6 into rapidsai:branch-23.12 Oct 20, 2023
54 checks passed
@etseidl etseidl deleted the refactor_logical_type branch October 20, 2023 21:20
@rjzamora
Copy link
Member

@galipremsagar - As far as I can tell the test_create_metadata_file_inconsistent_schema failures must be coming from a change in this PR. Test passes for the commit just before this was merged.

@vuule
Copy link
Contributor

vuule commented Oct 24, 2023

Hi @rjzamora, we identified other issue with changes to timestamp logical type and opened #14322 to fix them. Do you think this change would fix your failing test?

@rjzamora
Copy link
Member

Not sure, but I don't immediately think so. The test failure we are investigating suggests that the behavior of cudf.io.merge_parquet_filemetadata has changed. We have a test where we aggregate footer metadata into a global _metadata file. One of those files contains all null values in one of the columns, while the other file contains int. We expect the aggregated metadata to "promote" the null type to align with the int type.

@vuule
Copy link
Contributor

vuule commented Oct 24, 2023

@rjzamora can you please point us to the failing tests? Is there an issue open?

@rjzamora
Copy link
Member

rjzamora commented Oct 24, 2023

Failing test is here:

def test_create_metadata_file_inconsistent_schema(tmpdir):
- I don't believe it is showing up in CI anywhere because all dask-cudf tests are being skipped for some reason.

I'll trying to boil this down to a simpler reproducer (if possible).

@vuule
Copy link
Contributor

vuule commented Oct 24, 2023

are being skipped for some reason.

oof

@galipremsagar
Copy link
Contributor

are being skipped for some reason.

🤯 🤯 🤯

@etseidl
Copy link
Contributor Author

etseidl commented Oct 25, 2023

@rjzamora I think I'm running this to ground...it seems that the blob passed in to merge_row_group_metadata has the logical type annotation UNKNOWN, which is used for all null columns where the schema type can't be discovered. The change in this PR is causing that UNKNOWN to be written back out in the merged metadata, even though the physical type is written as INT32. I suspect that having the logical type present is throwing off whatever dask is doing to infer the type when it uses the _metadata file. I can think of two ways to fix this...either change the thrift protocol writer to not write the logical type if the type is 'UNKNOWN', or override an UNKNOWN logical type in merge_row_group_metadata. I'll have to discuss with @vuule offline to see which approach to take.

@rjzamora
Copy link
Member

Thanks for looking into this @etseidl !

I suspect that having the logical type present is throwing off whatever dask is doing to infer the type when it uses the _metadata file.

Dask is using whatever type pyarrow.dataset happens to infer. I wouldn't be surprised if it was just using the logical type.

I can think of two ways to fix this...either change the thrift protocol writer to not write the logical type if the type is 'UNKNOWN', or override an UNKNOWN logical type in merge_row_group_metadata. I'll have to discuss with @vuule offline to see which approach to take.

Okay, either approach sounds reasonable to me. To be completely honest, I'm somewhat doubtful that there is anyone actually depending on the behavior covered in this test. In fact, we are already expecting the "wrong" result for dask.dataframe. Therefore, if this proves tricky to fix, I think it should be fine to modify or xfail the test for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuIO cuIO issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants