Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] CompactProtocolFieldWriter does not write empty value string in key-value pair #14024

Open
ttnghia opened this issue Aug 31, 2023 · 7 comments
Labels
2 - In Progress Currently a work in progress bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code.

Comments

@ttnghia
Copy link
Contributor

ttnghia commented Aug 31, 2023

During debugging a Parquet issue, I found that the Parquet metadata is not correctly written by cudf. In particular, in a round trip test, a pair of metadata {"key_str", ""} is written then read as {"key_str", null} in Spark. This is due to the CompactProtocolFieldWriter does not write empty value string:

size_t CompactProtocolWriter::write(KeyValue const& k)
{
  CompactProtocolFieldWriter c(*this);
  c.field_string(1, k.key);
  if (not k.value.empty()) { c.field_string(2, k.value); }                 <================ here
  return c.value();
}

We should write all value strings even for empty value.

However, I'm not sure if this is the expected behavior of compact protocol?

@ttnghia ttnghia added bug Something isn't working Needs Triage Need team to review and classify cuIO cuIO issue labels Aug 31, 2023
@etseidl
Copy link
Contributor

etseidl commented Aug 31, 2023

The value in KeyValue is an optional field in the spec, so it's correct to write nothing if it isn't set. In #14000 I've started using std::optional for optional fields, to distinguish from merely empty but present fields. Perhaps that could be done throughout the Parquet thrift objects.

@vuule
Copy link
Contributor

vuule commented Aug 31, 2023

@etseidl I assume that the key-value pair is optional, but is it okay to include the key without the value?

@etseidl
Copy link
Contributor

etseidl commented Aug 31, 2023

@etseidl I assume that the key-value pair is optional, but is it okay to include the key without the value?

According to the thrift idl, yes.

 struct KeyValue {
  1: required string key
  2: optional string value
}

Also optional in FileMetaData

struct FileMetaData {
  ...
  /** Optional key/value metadata **/
  5: optional list<KeyValue> key_value_metadata
}

@ttnghia
Copy link
Contributor Author

ttnghia commented Aug 31, 2023

That's interesting. Then such "optional" should mean the user can specify it as an optional, instead of the current behavior (the library ignores writing empty string without the user's knowledge). So #14000 should be the way to go instead.

@etseidl
Copy link
Contributor

etseidl commented Aug 31, 2023

That's interesting. Then such "optional" should mean the user can specify it as an optional, instead of the current behavior (the library ignores writing empty string without the user's knowledge). So #14000 should be the way to go instead.

Yes, I think making more use of optional would make things clearer (although I didn't notice someone beat me to it 😅 @PointKernel apparently)

class ParquetFieldOptionalInt32 {

@PointKernel
Copy link
Member

Yeah, using std::optional to properly handle optional field reading/writing was on my TODO list but then got distracted by other tasks.

rapids-bot bot pushed a commit that referenced this issue Sep 6, 2023
…from writing (#14026)

When writing to Parquet files, Spark needs to write pairs of key-value strings into files' metadata. Sometimes the value strings are just an empty string. Such empty string is ignored from writing into the file, causing other applications (such as Spark) to read the value and interpret it as a `null` instead of an empty string as in the original input, as described in #14024. This is wrong and led to data corruption as I tested.

This PR intentionally modifies the empty value string into a space character to workaround the bug. This is a temporary fix while waiting for a better fix to be worked on.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)

URL: #14026
rapids-bot bot pushed a commit that referenced this issue Sep 20, 2023
Refactors the current `CompactProtocolReader` used to parse parquet file metadata. The main goal of the refactor is to allow easier use of `std::optional` fields in the thrift structs to prevent situations as in #14024 where an optional field is an empty string. The writer cannot distinguish between present-but-empty and not-present, so chooses the latter when writing the field. This PR adds a `ParquetFieldOptional` functor that can wrap the other field functors, obviating the need to write a new optional functor for each type.

Authors:
  - Ed Seidl (https://github.com/etseidl)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #14097
@GregoryKimball GregoryKimball added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Sep 27, 2023
@vuule
Copy link
Contributor

vuule commented Feb 16, 2024

This issue has been at least partially addressed. We need to make another pass over CompactProtocolReader::read functions to verify that all optional fields are treated as such.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2 - In Progress Currently a work in progress bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

No branches or pull requests

5 participants