Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write cuDF version in Parquet "created_by" metadata field #14721

Merged
merged 12 commits into from
Jan 10, 2024

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Jan 8, 2024

Description

Populate the informational created_by field in the Parquet file metadata. Identifying the source of a parquet file can help with tracking down interoperability problems.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@etseidl etseidl requested review from a team as code owners January 8, 2024 19:20
@etseidl etseidl requested review from bdice and vuule January 8, 2024 19:20
Copy link

copy-pr-bot bot commented Jan 8, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Jan 8, 2024
@vuule vuule added feature request New feature or request cuIO cuIO issue non-breaking Non-breaking change labels Jan 8, 2024
@vuule
Copy link
Contributor

vuule commented Jan 8, 2024

/ok to test

@@ -171,7 +171,7 @@ struct aggregate_writer_metadata {
std::vector<std::vector<uint8_t>> column_indexes;
};
std::vector<per_file_metadata> files;
std::string created_by = "";
std::string created_by = "cuDF Version " CUDF_STRINGIFY(CUDF_VERSION);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do other libraries write here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stuff like "parquet-cpp-arrow version 12.0.1" or "parquet-mr version 1.11.1 (build 765bd5cd7fdef2af1cecd0755000694b992bfadd)"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we align with those libraries on using version (lowercase) instead of Version (uppercase)? What about cuDF vs. cudf vs. libcudf?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm good with any of those...just checked arrow-rs and they also use lower case "version", so I'll change that. I wasn't sure what to use for the name..."libcudf" was my first choice actually. I'll go with that if there are no objections.

Copy link
Contributor Author

@etseidl etseidl Jan 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently the parquet spec actually specifies a format 😮

<Application> version <App Version> (build <App Build Hash>)

Is there a build hash available anywhere? I found a suggestion to use git log -1 --format=%H. I could add that to the properties for building writer_impl.cu.

Managed to get a build hash, but keeping that up to date is beyond my minimal CMake skills. Given that neither arrow-cpp nor arrow-rs feel the need to include the build number, perhaps it's best to punt on that for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with punting on build hashes. How do you feel about that @vuule? If it's important we can inject some CMake define for CUDF_BUILD_HASH and use CUDF_STRINGIFY(CUDF_BUILD_HASH) or similar.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with punting on the hash

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cudf is my preference, after further consideration. Python libraries will also call this code, and there is only one "cudf" implementation. Also the other libraries aren't including a "lib" prefix for compiled libraries, so it might be more natural to call it cudf.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with cuDF or cudf. FWIW, the ORC specs call us "CUDF".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about cudF....kinda looks like a race car then 🤣

@@ -658,6 +658,12 @@ set_source_files_properties(
PROPERTIES COMPILE_DEFINITIONS "_FILE_OFFSET_BITS=64"
)

set_property(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The line 650 above this is using set_source_files_properties so should we use it here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a CMake expert 😅. I used set_property since it's also used below for setting the same property for jit/cache.cpp.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also not cmake expert. Okay probably they both can achieve the same output 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set_property is probably easier to work with here because you can APPEND with it. This solution is fine.

@vuule
Copy link
Contributor

vuule commented Jan 9, 2024

/ok to test

@vuule
Copy link
Contributor

vuule commented Jan 9, 2024

That's it, I'm not okaying to test until I see @ttnghia 's approval! :D

Comment on lines 102 to 104
#ifndef CUDF_VERSION
#error "CUDF_VERSION is not defined"
#endif
Copy link
Contributor

@ttnghia ttnghia Jan 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is in the middle of the aggregate_writer_metadata class so the indentation here is terrible. I would recommend moving it to the top of this file, or moving it inside of the get_metadata below, just right before line 115.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted it close to where it's used, but it looked uglier to me in the middle of the function. Moved it to the top just after the includes.

@ttnghia
Copy link
Contributor

ttnghia commented Jan 9, 2024

/ok to test

@ttnghia
Copy link
Contributor

ttnghia commented Jan 9, 2024

That's it, I'm not okaying to test until I see @ttnghia 's approval! :D

Approved, finally 😄

@@ -658,6 +658,12 @@ set_source_files_properties(
PROPERTIES COMPILE_DEFINITIONS "_FILE_OFFSET_BITS=64"
)

set_property(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set_property is probably easier to work with here because you can APPEND with it. This solution is fine.

@vuule
Copy link
Contributor

vuule commented Jan 10, 2024

/merge

@rapids-bot rapids-bot bot merged commit 1078326 into rapidsai:branch-24.02 Jan 10, 2024
67 checks passed
@etseidl etseidl deleted the writer_version branch January 10, 2024 22:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake build issue cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants