Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read FIXED_LEN_BYTE_ARRAY as binary in parquet reader #13437

Merged
merged 34 commits into from
Aug 24, 2023
Merged
Show file tree
Hide file tree
Changes from 8 commits
Commits
Show all changes
34 commits
Select commit Hold shift + click to select a range
138cf3f
Read fixed len byte array as string
PointKernel May 24, 2023
f39a02f
Merge remote-tracking branch 'upstream/branch-23.08' into parquet-fix…
PointKernel May 24, 2023
64783d2
Merge with upstream
PointKernel Jul 3, 2023
fdc8b7e
Merge remote-tracking branch 'upstream/branch-23.08' into parquet-fix…
PointKernel Jul 3, 2023
c7b32b2
TEST
PointKernel Jul 5, 2023
2c685a2
Cleanups
PointKernel Jul 5, 2023
cb6825f
Revert test changes
PointKernel Jul 5, 2023
c75675b
Cleanups
PointKernel Jul 5, 2023
fac2eff
Set up dtype_len properly
PointKernel Jul 5, 2023
86e41a7
updates
PointKernel Jul 5, 2023
7f363b1
Cleanup
PointKernel Jul 5, 2023
70ef046
Use switch instead of if conditions
PointKernel Jul 5, 2023
8eee919
Merge remote-tracking branch 'upstream/branch-23.08' into parquet-fix…
PointKernel Jul 5, 2023
de90782
Merge remote-tracking branch 'upstream/branch-23.10' into parquet-fix…
PointKernel Aug 10, 2023
b393b7b
Fix a bug in string col determination
PointKernel Aug 10, 2023
aab1ced
Fix a bug in string col determination
PointKernel Aug 10, 2023
eeafbf7
Fix a bug: break non-default switchcase
PointKernel Aug 10, 2023
7209027
Cleanups: shortcircuit for FIXED_LEN_BYTE_ARRAY
PointKernel Aug 10, 2023
ac3b7a4
Read FIXED_LEN_BYTE_ARRAY always as binary
PointKernel Aug 11, 2023
a2a622d
Add FIXED_LEN_BYTE_ARRAY pytest
PointKernel Aug 11, 2023
5d01183
Minor fix: compare the lower 3 bits only
PointKernel Aug 11, 2023
9f5ff20
Merge branch 'branch-23.10' into parquet-fixed-len-binary
PointKernel Aug 14, 2023
50ec387
Use a smaller test file
PointKernel Aug 16, 2023
f756e56
Merge remote-tracking branch 'upstream/branch-23.10' into parquet-fix…
PointKernel Aug 16, 2023
6304001
Merge branch 'parquet-fixed-len-binary' of github.com:PointKernel/cud…
PointKernel Aug 16, 2023
a969c00
Minor cleanup: init right after declaration to get rid of if branches
PointKernel Aug 17, 2023
277557c
Cleanups: descriptive name for intermediate vars
PointKernel Aug 17, 2023
b712f5f
Add comment
PointKernel Aug 17, 2023
efc495d
Merge remote-tracking branch 'upstream/branch-23.10' into parquet-fix…
PointKernel Aug 17, 2023
2427416
Update cpp/src/io/parquet/reader_impl.cpp
PointKernel Aug 17, 2023
77b3d38
Merge branch 'branch-23.10' into parquet-fixed-len-binary
PointKernel Aug 21, 2023
6453662
Merge branch 'branch-23.10' into parquet-fixed-len-binary
vuule Aug 23, 2023
dfdcb76
Merge branch 'branch-23.10' into parquet-fixed-len-binary
vuule Aug 24, 2023
ed1c81c
Update python/cudf/cudf/tests/test_parquet.py
PointKernel Aug 24, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 13 additions & 4 deletions cpp/src/io/parquet/page_data.cu
Original file line number Diff line number Diff line change
Expand Up @@ -41,7 +41,7 @@ inline __device__ void gpuOutputString(volatile page_state_s* s,
void* dstv)
{
auto [ptr, len] = gpuGetStringData(s, sb, src_pos);
if (s->dtype_len == 4) {
if (s->dtype_len == 4 and (s->col.data_type & 7) == BYTE_ARRAY) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this really be a logical and? It seems in the past we would take this path if the data size was 4 bytes, but now we only do it if the data size if 4 bytes AND it is a byte array.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before this PR, only BYTE_ARRAY invokes gpuOutputString and the input length cannot be 4 in that case. Now with function being potentially invoked by FIXED_LEN_BYTE_ARRAY where the length could be 4, this and logic is needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe a comment along the lines of "// make sure to only hash variable length byte arrays when specified with the output type size"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really related to this PR, but we have this dtype_len == 4 condition as a stand-in for "output hash" in a few places and it really requires detailed knowledge of the code to understand. Nothing actionable, I just don't like it :D

// Output hash. This hash value is used if the option to convert strings to
// categoricals is enabled. The seed value is chosen arbitrarily.
uint32_t constexpr hash_seed = 33;
Expand Down Expand Up @@ -773,8 +773,10 @@ __global__ void __launch_bounds__(decode_block_size) gpuDecodePageData(
if (s->dict_base) {
out_thread0 = (s->dict_bits > 0) ? 64 : 32;
} else {
out_thread0 =
((s->col.data_type & 7) == BOOLEAN || (s->col.data_type & 7) == BYTE_ARRAY) ? 64 : 32;
out_thread0 = ((s->col.data_type & 7) == BOOLEAN || (s->col.data_type & 7) == BYTE_ARRAY ||
(s->col.data_type & 7) == FIXED_LEN_BYTE_ARRAY)
? 64
: 32;
}

PageNestingDecodeInfo* nesting_info_base = s->nesting_info;
Expand Down Expand Up @@ -811,7 +813,8 @@ __global__ void __launch_bounds__(decode_block_size) gpuDecodePageData(
src_target_pos = gpuDecodeDictionaryIndices<false>(s, sb, src_target_pos, t & 0x1f).first;
} else if ((s->col.data_type & 7) == BOOLEAN) {
src_target_pos = gpuDecodeRleBooleans(s, sb, src_target_pos, t & 0x1f);
} else if ((s->col.data_type & 7) == BYTE_ARRAY) {
} else if ((s->col.data_type & 7) == BYTE_ARRAY or
(s->col.data_type & 7) == FIXED_LEN_BYTE_ARRAY) {
gpuInitStringDescriptors<false>(s, sb, src_target_pos, t & 0x1f);
}
if (t == 32) { *(volatile int32_t*)&s->dict_pos = src_target_pos; }
Expand Down Expand Up @@ -881,6 +884,12 @@ __global__ void __launch_bounds__(decode_block_size) gpuDecodePageData(
}
break;
}
} else if (dtype == FIXED_LEN_BYTE_ARRAY) {
gpuOutputString(s,
sb,
val_src_pos,
nesting_info_base[leaf_level_index].data_out +
static_cast<size_t>(dst_pos) * sizeof(string_index_pair));
PointKernel marked this conversation as resolved.
Show resolved Hide resolved
} else if (dtype == INT96) {
gpuOutputInt96Timestamp(s, sb, val_src_pos, static_cast<int64_t*>(dst));
} else if (dtype_len == 8) {
Expand Down
23 changes: 17 additions & 6 deletions cpp/src/io/parquet/page_decode.cuh
Original file line number Diff line number Diff line change
Expand Up @@ -418,12 +418,20 @@ __device__ size_type gpuInitStringDescriptors(page_state_s volatile* s,

while (pos < target_pos) {
int len;
PointKernel marked this conversation as resolved.
Show resolved Hide resolved
if (k + 4 <= dict_size) {
len = (cur[k]) | (cur[k + 1] << 8) | (cur[k + 2] << 16) | (cur[k + 3] << 24);
k += 4;
if (k + len > dict_size) { len = 0; }
if ((s->col.data_type & 7) == FIXED_LEN_BYTE_ARRAY) {
if (k < dict_size) {
len = s->dtype_len_in;
} else {
len = 0;
}
} else {
len = 0;
if (k + 4 <= dict_size) {
len = (cur[k]) | (cur[k + 1] << 8) | (cur[k + 2] << 16) | (cur[k + 3] << 24);
k += 4;
if (k + len > dict_size) { len = 0; }
} else {
len = 0;
}
}
if constexpr (!sizes_only) {
sb->dict_idx[rolling_index(pos)] = k;
Expand Down Expand Up @@ -1198,7 +1206,10 @@ inline __device__ bool setupLocalPageInfo(page_state_s* const s,
uint32_t len = idx < max_depth - 1 ? sizeof(cudf::size_type) : s->dtype_len;
// if this is a string column, then dtype_len is a lie. data will be offsets rather
// than (ptr,len) tuples.
if (data_type == BYTE_ARRAY && s->dtype_len != 4) { len = sizeof(cudf::size_type); }
if ((data_type == BYTE_ARRAY && s->dtype_len != 4) ||
data_type == FIXED_LEN_BYTE_ARRAY) {
len = sizeof(cudf::size_type);
}
nesting_info->data_out += (output_offset * len);
}
if (nesting_info->string_out != nullptr) {
Expand Down
38 changes: 36 additions & 2 deletions cpp/src/io/parquet/page_string_decode.cu
Original file line number Diff line number Diff line change
Expand Up @@ -487,6 +487,36 @@ __device__ size_t totalDictEntriesSize(uint8_t const* data,
return sum_l;
}

/**
* @brief Compute string size information for fixed len byte array strings.
*
* @param input_size Size of each fixed length byte array
* @param data_size Length of data
* @param start_value Do not count values that occur before this index
* @param end_value Do not count values that occur after this index
*/
__device__ size_t totalFixedLenByteArraySize(int input_size,
PointKernel marked this conversation as resolved.
Show resolved Hide resolved
int data_size,
int start_value,
int end_value)
{
int const t = threadIdx.x;
int pos = 0;
size_t total_len = 0;

// This step is purely serial
if (!t) {
int k = 0;
while (pos < end_value && k < data_size) {
k += input_size;
if (pos >= start_value) { total_len += input_size; } // TODO not sure this is necessary
pos++;
}
}

return total_len;
}

/**
* @brief Compute string size information for plain encoded strings.
*
Expand Down Expand Up @@ -625,8 +655,12 @@ __global__ void __launch_bounds__(preprocess_block_size) gpuComputePageStringSiz
break;
PointKernel marked this conversation as resolved.
Show resolved Hide resolved
case Encoding::PLAIN:
dict_size = static_cast<int32_t>(end - data);
str_bytes = is_bounds_pg ? totalPlainEntriesSize(data, dict_size, start_value, end_value)
: dict_size - sizeof(int) * (pp->num_input_values - pp->num_nulls);
if ((s->col.data_type & 7) == FIXED_LEN_BYTE_ARRAY) {
str_bytes = totalFixedLenByteArraySize(s->dtype_len_in, dict_size, start_value, end_value);
PointKernel marked this conversation as resolved.
Show resolved Hide resolved
} else {
str_bytes = is_bounds_pg ? totalPlainEntriesSize(data, dict_size, start_value, end_value)
: dict_size - sizeof(int) * (pp->num_input_values - pp->num_nulls);
PointKernel marked this conversation as resolved.
Show resolved Hide resolved
}
break;
}

Expand Down
5 changes: 3 additions & 2 deletions cpp/src/io/parquet/parquet_gpu.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -429,8 +429,9 @@ struct EncPage {
*/
constexpr bool is_string_col(ColumnChunkDesc const& chunk)
{
return (chunk.data_type & 7) == BYTE_ARRAY and (chunk.data_type >> 3) != 4 and
chunk.converted_type != DECIMAL;
return ((chunk.data_type & 7) == BYTE_ARRAY and (chunk.data_type >> 3) != 4 and
PointKernel marked this conversation as resolved.
Show resolved Hide resolved
chunk.converted_type != DECIMAL) or
(chunk.data_type & 7) == FIXED_LEN_BYTE_ARRAY;
}

/**
Expand Down