Add generic type inference for cuIO #11121

Merged 78 commits on Sep 15, 2022
Changes from 72 commits
Commits
5b4a6fc
Squashed with initial test set
elstehle Mar 28, 2022
377358a
style fix & additional test scenario
elstehle Mar 29, 2022
4186004
removed forceinline
elstehle Mar 29, 2022
a921c66
tagging host device function
elstehle Mar 30, 2022
75a1853
Added utility to debug print & instrumented code to use it
elstehle Mar 31, 2022
a23668a
switched to using rmm also inside algorithm
elstehle Mar 31, 2022
aa5f5c4
header include order & SFINAE macro
elstehle Mar 31, 2022
4ee2253
debug print cleanups
elstehle Apr 4, 2022
0f35852
renaming key-value store op to stack_op
elstehle Apr 4, 2022
ca5d465
device_span
elstehle Apr 4, 2022
f5960bd
addressing review comments & minor cleanups
elstehle Apr 6, 2022
80226b7
error on unsupported unsigned_t and fixed typos
elstehle Apr 7, 2022
e8bc8a5
minor style changes addressing review comments
elstehle Apr 13, 2022
c5274b5
squashed with bracket/brace test
elstehle Apr 11, 2022
bb16254
clean up & addressing review comments
elstehle Apr 20, 2022
4e42d0e
refactored lookup tables
elstehle Apr 25, 2022
e439320
put lookup tables into their own cudf file
elstehle Apr 25, 2022
05840b3
Change interface for FST to not need temp storage
elstehle Apr 27, 2022
6da9360
removing unused var post-cleanup
elstehle May 4, 2022
702dfa1
unified usage of pragma unrolls
elstehle May 9, 2022
26a39ea
Adding hostdevice macros to in-reg array
elstehle May 9, 2022
8c685c0
making const vars const
elstehle May 9, 2022
5c94521
refactor lut sanity check
elstehle May 9, 2022
03b2c20
rebase on latest FST
elstehle May 3, 2022
ff22f19
squash & rebase on latest tokenizer version
elstehle May 13, 2022
419d3c8
Add type inference test in CMake
PointKernel Jun 17, 2022
d2e2a0b
Merge remote-tracking branch 'upstream/branch-22.08' into type-inference
PointKernel Jun 17, 2022
f32b9b9
Add type inference prototype and basic test
PointKernel Jun 17, 2022
35bcde7
Code formatting
PointKernel Jun 17, 2022
a76d7bc
Merge remote-tracking branch 'upstream/branch-22.08' into type-inference
PointKernel Jul 25, 2022
29c6525
Merge remote-tracking branch 'upstream/branch-22.10' into type-inference
PointKernel Aug 19, 2022
a659817
Move type inference to utilities
PointKernel Aug 19, 2022
4410488
Resolve conflicts + relocate type inference test file
PointKernel Aug 19, 2022
6409a5f
Get rid of narrow conversion + add string handling
PointKernel Aug 19, 2022
ec07bca
Updates: make column string iter compatible with zip iterator
PointKernel Aug 19, 2022
640eb00
Minor updates
PointKernel Aug 19, 2022
b0fac83
Add missing header
PointKernel Aug 19, 2022
67fcaf5
Fix the infinite loop bug with while
PointKernel Aug 19, 2022
51997be
Update cpp/src/io/utilities/type_inference.cuh
PointKernel Aug 20, 2022
955f480
Merge remote-tracking branch 'upstream/branch-22.10' into type-inference
PointKernel Aug 30, 2022
cc1a04c
Update cpp/src/io/utilities/type_inference.cuh
PointKernel Aug 30, 2022
756a3e2
Update cpp/src/io/utilities/type_inference.cuh
PointKernel Aug 30, 2022
8961126
Update cpp/src/io/utilities/type_inference.cuh
PointKernel Aug 30, 2022
bb69c14
Minor doc updates
PointKernel Aug 30, 2022
baff56f
Merge branch 'type-inference' of github.com:PointKernel/cudf into typ…
PointKernel Aug 30, 2022
ec593ba
Update docs
PointKernel Aug 30, 2022
2edccc8
Merge remote-tracking branch 'upstream/branch-22.10' into type-inference
PointKernel Sep 6, 2022
b461fd8
Update cpp/src/io/utilities/type_inference.cuh
PointKernel Sep 6, 2022
c2b0393
Update cpp/src/io/utilities/type_inference.cuh
PointKernel Sep 6, 2022
529d5f2
Cleanups: comment, remove unused header + use const var
PointKernel Sep 6, 2022
630c94c
Merge branch 'type-inference' of github.com:PointKernel/cudf into typ…
PointKernel Sep 6, 2022
9f4907c
Code formatting
PointKernel Sep 6, 2022
be98e84
Use fixed-width integers
PointKernel Sep 6, 2022
f864f19
Add omission null count handling
PointKernel Sep 6, 2022
ea8493a
Update tests
PointKernel Sep 6, 2022
5b46d33
Add null test
PointKernel Sep 6, 2022
0d81a33
Add more tests
PointKernel Sep 6, 2022
e3bb216
Cleanups: renaming, const parameter + namespace
PointKernel Sep 7, 2022
b71f665
Code cleanup
PointKernel Sep 7, 2022
0a56800
Cleanup: use parse_options consistently
PointKernel Sep 7, 2022
07b2fff
Update cpp/tests/io/type_inference_test.cu
PointKernel Sep 9, 2022
5ecff99
Revert ommited null changes
PointKernel Sep 9, 2022
9bfbf38
Merge branch 'type-inference' of github.com:PointKernel/cudf into typ…
PointKernel Sep 9, 2022
ee588dd
Minor cleanups
PointKernel Sep 9, 2022
be717a7
Add all null test
PointKernel Sep 9, 2022
168efdd
Renaming json inference options view struct
PointKernel Sep 9, 2022
eb930a2
Minor improvement: use block reduce to minimize global atomic
PointKernel Sep 9, 2022
6e5062f
Minor cleanup
PointKernel Sep 9, 2022
223cd37
Update cpp/src/io/utilities/type_inference.cuh
PointKernel Sep 9, 2022
13fb10d
Code formatting
PointKernel Sep 9, 2022
ee65f28
Treat all date-like input as string
PointKernel Sep 12, 2022
0a7bdfa
Add invalid input test
PointKernel Sep 12, 2022
fb9ac99
Reinforce string condition
PointKernel Sep 14, 2022
c595ac0
Use per-thread histogram with custom sum reduction
PointKernel Sep 14, 2022
107c0cc
Use string scalar instead of char array
PointKernel Sep 14, 2022
80840b1
Add default member initializer to column_type_histogram
PointKernel Sep 14, 2022
8deec54
Minor updates
PointKernel Sep 14, 2022
1a79f37
Merge remote-tracking branch 'upstream/branch-22.10' into type-inference
PointKernel Sep 14, 2022
2 changes: 1 addition & 1 deletion cpp/src/io/fst/agent_dfa.cuh
@@ -422,7 +422,7 @@ struct AgentDFA {
OffsetT const num_total_symbols,
StateIndexT& state,
CallbackOpT& callback_op,
- cub::Int2Type<BYPASS_LOAD> /**/)
+ cub::Int2Type<BYPASS_LOAD>)
{
using StateTransitionOpT = StateTransitionOp<CallbackOpT, TransitionTableT>;

18 changes: 18 additions & 0 deletions cpp/src/io/utilities/parsing_utils.cuh
@@ -42,6 +42,16 @@ using cudf::device_span;
namespace cudf {
namespace io {

/**
* @brief Non-owning view for json type inference options
*/
struct json_inference_options_view {
char quote_char;
cudf::detail::trie_view trie_true;
cudf::detail::trie_view trie_false;
cudf::detail::trie_view trie_na;
};

/**
* @brief Structure for holding various options used when parsing and
* converting CSV/json data to cuDF data type values.
@@ -79,6 +89,14 @@ struct parse_options {
cudf::detail::optional_trie trie_na;
bool multi_delimiter;

[[nodiscard]] json_inference_options_view json_view() const
{
return {quotechar,
cudf::detail::make_trie_view(trie_true),
cudf::detail::make_trie_view(trie_false),
cudf::detail::make_trie_view(trie_na)};
}

[[nodiscard]] parse_options_view view() const
{
return {delimiter,
295 changes: 295 additions & 0 deletions cpp/src/io/utilities/type_inference.cuh
@@ -0,0 +1,295 @@
/*
* Copyright (c) 2022, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#pragma once

#include <io/utilities/column_type_histogram.hpp>
#include <io/utilities/parsing_utils.cuh>
#include <io/utilities/trie.cuh>

#include <cudf/types.hpp>
#include <cudf/utilities/error.hpp>
#include <cudf/utilities/span.hpp>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_scalar.hpp>

#include <thrust/distance.h>
#include <thrust/tuple.h>

#include <cub/block/block_reduce.cuh>

#include <cstddef>

namespace cudf::io::detail {
/**
* @brief Returns true if the input character is a valid digit.
* Supports both decimal and hexadecimal digits (uppercase and lowercase).
*
* @param c Character to check
* @param is_hex Whether to check as a hexadecimal
*
* @return `true` if it is digit-like, `false` otherwise
*/
__device__ __inline__ bool is_digit(char const c, bool const is_hex = false)
{
if (c >= '0' && c <= '9') return true;

if (is_hex) {
if (c >= 'A' && c <= 'F') return true;
if (c >= 'a' && c <= 'f') return true;
}

return false;
}

/**
* @brief Returns true if the counters indicate a potentially valid float.
* False positives are possible because positions are not taken into account.
* For example, field "e.123-" would match the pattern.
*/
__device__ __inline__ bool is_like_float(std::size_t len,
uint32_t digit_cnt,
uint32_t decimal_cnt,
uint32_t dash_cnt,
uint32_t exponent_cnt)
{
// Can't have more than one exponent and one decimal point
if (decimal_cnt > 1) return false;
if (exponent_cnt > 1) return false;
// Without the exponent or a decimal point, this is an integer, not a float
if (decimal_cnt == 0 && exponent_cnt == 0) return false;

// Can only have one '-' per component
if (dash_cnt > 1 + exponent_cnt) return false;

// If anything other than these characters is present, it's not a float
if (digit_cnt + decimal_cnt + dash_cnt + exponent_cnt != len) return false;

// Needs at least 1 digit, 2 if exponent is present
if (digit_cnt < 1 + exponent_cnt) return false;

return true;
}
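To see how the counter-only check behaves, here is a minimal host-side sketch in plain C++ (no CUDA). `is_like_float_host` repeats the checks above. `looks_like_float` is a hypothetical helper that counts characters roughly the way the kernel later in this file does, leaving out the hexadecimal handling. Both names are assumptions made for illustration and are not part of cuDF.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Host-side copy of the counter check (hypothetical name, not a cuDF symbol).
bool is_like_float_host(std::size_t len,
                        std::uint32_t digit_cnt,
                        std::uint32_t decimal_cnt,
                        std::uint32_t dash_cnt,
                        std::uint32_t exponent_cnt)
{
  // At most one decimal point and one exponent
  if (decimal_cnt > 1) return false;
  if (exponent_cnt > 1) return false;
  // Neither present: the field is an integer candidate, not a float
  if (decimal_cnt == 0 && exponent_cnt == 0) return false;
  // One sign for the mantissa, plus one for the exponent if there is one
  if (dash_cnt > 1 + exponent_cnt) return false;
  // Any character outside the counted classes disqualifies the field
  if (digit_cnt + decimal_cnt + dash_cnt + exponent_cnt != len) return false;
  // At least one digit, two if an exponent is present
  if (digit_cnt < 1 + exponent_cnt) return false;
  return true;
}

// Simplified character counting (assumption: decimal input only, no hex prefix).
bool looks_like_float(std::string_view field)
{
  std::uint32_t digits = 0, decimals = 0, signs = 0, exponents = 0;
  for (std::size_t i = 0; i < field.size(); ++i) {
    char const c = field[i];
    if (c >= '0' && c <= '9') {
      ++digits;
      continue;
    }
    switch (c) {
      case '.': ++decimals; break;
      case '-':
      case '+': ++signs; break;
      case 'e':
      case 'E':
        // An exponent marker only counts when it is neither first nor last
        if (i > 0 && i + 1 < field.size()) ++exponents;
        break;
      default: break;  // uncounted characters make the length check fail
    }
  }
  return is_like_float_host(field.size(), digits, decimals, signs, exponents);
}
```

With this sketch, `looks_like_float("1.5e-3")` is accepted and `looks_like_float("42")` is rejected because it has neither a decimal point nor an exponent. `looks_like_float("1-.5")` is also accepted, which is the kind of position-related false positive the doc comment warns about.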

/**
* @brief Constructs column type histogram for a given column string input `data`.
*
* @tparam BlockSize Number of threads in each block
* @tparam OptionsView Type of inference options view
* @tparam ColumnStringIter Iterator type whose `value_type` is a
* `thrust::tuple<offset_t, length_t>`, where `offset_t` and `length_t` are of integral type and
* `offset_t` needs to be convertible to `std::size_t`.
*
* @param[in] options View of inference options
* @param[in] data JSON string input
* @param[in] column_strings_begin The beginning of an offset-length tuple sequence
* @param[in] size Number of offset-length tuples (fields) in the sequence
* @param[out] column_info Histogram of column type counters
*/
template <int BlockSize, typename OptionsView, typename ColumnStringIter>
__global__ void infer_column_type_kernel(OptionsView options,
device_span<char const> data,
ColumnStringIter column_strings_begin,
std::size_t size,
cudf::io::column_type_histogram* column_info)
{
cudf::size_type null_count = 0;
cudf::size_type string_count = 0;
cudf::size_type bool_count = 0;
cudf::size_type float_count = 0;

for (auto idx = threadIdx.x + blockDim.x * blockIdx.x; idx < size;
idx += gridDim.x * blockDim.x) {
auto const field_offset = thrust::get<0>(*(column_strings_begin + idx));
auto const field_len = thrust::get<1>(*(column_strings_begin + idx));
auto const field_begin = data.begin() + field_offset;

if (cudf::detail::serialized_trie_contains(
options.trie_na, {field_begin, static_cast<std::size_t>(field_len)})) {
++null_count;
continue;
}

// Handling strings
if (*field_begin == options.quote_char && field_begin[field_len - 1] == options.quote_char) {
++string_count;
continue;
}

uint32_t digit_count = 0;
uint32_t decimal_count = 0;
uint32_t slash_count = 0;
uint32_t dash_count = 0;
uint32_t plus_count = 0;
uint32_t colon_count = 0;
uint32_t exponent_count = 0;
uint32_t other_count = 0;

auto const maybe_hex =
(field_len > 2 && field_begin[0] == '0' && field_begin[1] == 'x') ||
(field_len > 3 && field_begin[0] == '-' && field_begin[1] == '0' && field_begin[2] == 'x');
auto const field_end = field_begin + field_len;

for (auto pos = field_begin; pos < field_end; ++pos) {
if (is_digit(*pos, maybe_hex)) {
digit_count++;
continue;
}
// Looking for unique characters that will help identify column types
switch (*pos) {
case '.': decimal_count++; break;
case '-': dash_count++; break;
case '+': plus_count++; break;
case '/': slash_count++; break;
case ':': colon_count++; break;
case 'e':
case 'E':
if (!maybe_hex && pos > field_begin && pos < field_end - 1) exponent_count++;
break;
default: other_count++; break;
}
}

// All characters must be digits in an integer, except for the starting sign and 'x' in the
// hexadecimal prefix
auto const int_req_number_cnt =
static_cast<uint32_t>(field_len) -
((*field_begin == '-' || *field_begin == '+') && field_len > 1) - maybe_hex;
if (cudf::detail::serialized_trie_contains(
options.trie_true, {field_begin, static_cast<std::size_t>(field_len)}) ||
cudf::detail::serialized_trie_contains(
options.trie_false, {field_begin, static_cast<std::size_t>(field_len)})) {
++bool_count;
} else if (digit_count == int_req_number_cnt) {
auto const is_negative = (*field_begin == '-');
char const* data_begin = field_begin + (is_negative || (*field_begin == '+'));
cudf::size_type* ptr = cudf::io::gpu::infer_integral_field_counter(
data_begin, data_begin + digit_count, is_negative, *column_info);
atomicAdd(ptr, 1);
} else if (is_like_float(
field_len, digit_count, decimal_count, dash_count + plus_count, exponent_count)) {
++float_count;
}
// All invalid JSON values are treated as string
else {
++string_count;
}
} // grid-stride for loop

using BlockReduce = cub::BlockReduce<cudf::size_type, BlockSize>;
__shared__ typename BlockReduce::TempStorage temp_storage;
// temp_storage is reused by each reduction, so a barrier is needed between calls
auto block_null_count = BlockReduce(temp_storage).Sum(null_count);
__syncthreads();
auto block_string_count = BlockReduce(temp_storage).Sum(string_count);
__syncthreads();
auto block_bool_count = BlockReduce(temp_storage).Sum(bool_count);
__syncthreads();
auto block_float_count = BlockReduce(temp_storage).Sum(float_count);
if (threadIdx.x == 0) {
atomicAdd(&column_info->null_count, block_null_count);
atomicAdd(&column_info->string_count, block_string_count);
atomicAdd(&column_info->bool_count, block_bool_count);
atomicAdd(&column_info->float_count, block_float_count);
}
}
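The integer test inside the kernel is plain arithmetic: the digit count has to equal the field length minus one for a leading sign and minus one for the `x` of a hex prefix. A minimal host-side sketch of that rule follows. `is_digit_host` and `is_integer_candidate` are hypothetical names, and the empty-field guard is an assumption added here that the kernel itself does not have.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Host-side copy of the digit test from this file (hypothetical name).
bool is_digit_host(char c, bool is_hex = false)
{
  if (c >= '0' && c <= '9') return true;
  if (is_hex) {
    if (c >= 'A' && c <= 'F') return true;
    if (c >= 'a' && c <= 'f') return true;
  }
  return false;
}

// True when every character is a digit, apart from an optional leading sign
// and the 'x' of a "0x" / "-0x" prefix.
bool is_integer_candidate(std::string_view field)
{
  // Assumption: reject empty fields up front (not done in the kernel)
  if (field.empty()) return false;

  std::size_t const len = field.size();
  bool const maybe_hex =
    (len > 2 && field[0] == '0' && field[1] == 'x') ||
    (len > 3 && field[0] == '-' && field[1] == '0' && field[2] == 'x');

  std::uint32_t digit_count = 0;
  for (char const c : field) {
    if (is_digit_host(c, maybe_hex)) ++digit_count;
  }

  // A lone "-" or "+" is not treated as a sign
  bool const has_sign = (field[0] == '-' || field[0] == '+') && len > 1;
  auto const required =
    static_cast<std::uint32_t>(len) - (has_sign ? 1u : 0u) - (maybe_hex ? 1u : 0u);
  return digit_count == required;
}
```

Under this rule `"-42"`, `"0x1F"` and `"-0x1f"` all qualify, while `"+0x1F"` does not, because only the unsigned and `-` forms of the hex prefix are recognized.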

/**
* @brief Constructs column type histogram for a given column string input `data`.
*
* @tparam OptionsView Type of inference options view
* @tparam ColumnStringIter Iterator type whose `value_type` is a
* `thrust::tuple<offset_t, length_t>`, where `offset_t` and `length_t` are of integral type and
* `offset_t` needs to be convertible to `std::size_t`.
*
* @param options View of inference options
* @param data JSON string input
* @param column_strings_begin The beginning of an offset-length tuple sequence
* @param size Number of offset-length tuples (fields) in the sequence
* @param stream CUDA stream used for device memory operations and kernel launches
* @return A histogram containing column-specific type counters
*/
template <typename OptionsView, typename ColumnStringIter>
cudf::io::column_type_histogram infer_column_type(OptionsView const& options,
cudf::device_span<char const> data,
ColumnStringIter column_strings_begin,
std::size_t const size,
rmm::cuda_stream_view stream)
{
constexpr int block_size = 128;

auto const grid_size = (size + block_size - 1) / block_size;
auto d_column_info = rmm::device_scalar<cudf::io::column_type_histogram>(stream);
CUDF_CUDA_TRY(cudaMemsetAsync(
d_column_info.data(), 0, sizeof(cudf::io::column_type_histogram), stream.value()));

infer_column_type_kernel<block_size><<<grid_size, block_size, 0, stream.value()>>>(
options, data, column_strings_begin, size, d_column_info.data());

return d_column_info.value(stream);
}

/**
* @brief Infers data type for a given JSON string input `data`.
*
* @throw cudf::logic_error if input size is 0
* @throw cudf::logic_error if date time is not inferred as string
* @throw cudf::logic_error if data type inference failed
*
* @tparam OptionsView Type of inference options view
* @tparam ColumnStringIter Iterator type whose `value_type` is a
* `thrust::tuple<offset_t, length_t>`, where `offset_t` and `length_t` are of integral type
*
* @param options View of inference options
* @param data JSON string input
* @param column_strings_begin The beginning of an offset-length tuple sequence
* @param size Number of offset-length tuples (fields) in the sequence
* @param stream CUDA stream used for device memory operations and kernel launches
* @return The inferred data type
*/
template <typename OptionsView, typename ColumnStringIter>
cudf::data_type infer_data_type(OptionsView const& options,
device_span<char const> data,
ColumnStringIter column_strings_begin,
std::size_t const size,
rmm::cuda_stream_view stream)
{
CUDF_EXPECTS(size != 0, "No data available for data type inference.\n");

auto const h_column_info = infer_column_type(options, data, column_strings_begin, size, stream);

auto get_type_id = [&](auto const& cinfo) {
auto int_count_total =
cinfo.big_int_count + cinfo.negative_small_int_count + cinfo.positive_small_int_count;
if (cinfo.null_count == static_cast<cudf::size_type>(size)) {
// Entire column is NULL; allocate the smallest amount of memory
return type_id::INT8;
} else if (cinfo.string_count > 0) {
return type_id::STRING;
} else if (cinfo.datetime_count > 0) {
CUDF_FAIL("Date time is inferred as string.\n");
} else if (cinfo.float_count > 0 || (int_count_total > 0 && cinfo.null_count > 0)) {
return type_id::FLOAT64;
} else if (cinfo.big_int_count == 0 && int_count_total != 0) {
return type_id::INT64;
} else if (cinfo.big_int_count != 0 && cinfo.negative_small_int_count != 0) {
return type_id::STRING;
} else if (cinfo.big_int_count != 0) {
return type_id::UINT64;
} else if (cinfo.bool_count > 0) {
return type_id::BOOL8;
}
CUDF_FAIL("Data type inference failed.\n");
};
return cudf::data_type{get_type_id(h_column_info)};
}
} // namespace cudf::io::detail
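The precedence among the histogram counters in `get_type_id` is easier to follow when run on the host. The sketch below repeats the same branch order with stand-in types. `histogram_sketch`, `type_sketch` and `pick_type` are hypothetical names used only for illustration, and `std::logic_error` stands in for `CUDF_FAIL`.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for cudf::io::column_type_histogram,
// limited to the counters this diff reads.
struct histogram_sketch {
  std::int32_t null_count               = 0;
  std::int32_t float_count              = 0;
  std::int32_t datetime_count           = 0;
  std::int32_t string_count             = 0;
  std::int32_t negative_small_int_count = 0;
  std::int32_t positive_small_int_count = 0;
  std::int32_t big_int_count            = 0;
  std::int32_t bool_count               = 0;
};

// Hypothetical stand-in for the cudf::type_id values used here.
enum class type_sketch { INT8, INT64, UINT64, FLOAT64, BOOL8, STRING };

// Same branch order as the lambda in infer_data_type.
type_sketch pick_type(histogram_sketch const& h, std::int32_t size)
{
  auto const int_total =
    h.big_int_count + h.negative_small_int_count + h.positive_small_int_count;

  if (h.null_count == size) return type_sketch::INT8;  // all-null column
  if (h.string_count > 0) return type_sketch::STRING;  // any string wins
  if (h.datetime_count > 0) throw std::logic_error("Date time is inferred as string.");
  // Floats, or integers mixed with nulls, become FLOAT64
  if (h.float_count > 0 || (int_total > 0 && h.null_count > 0)) return type_sketch::FLOAT64;
  if (h.big_int_count == 0 && int_total != 0) return type_sketch::INT64;
  // Big integers next to negative ones fit no single integer type
  if (h.big_int_count != 0 && h.negative_small_int_count != 0) return type_sketch::STRING;
  if (h.big_int_count != 0) return type_sketch::UINT64;
  if (h.bool_count > 0) return type_sketch::BOOL8;
  throw std::logic_error("Data type inference failed.");
}
```

One consequence of this ordering is that a column mixing booleans and small integers, with no nulls, resolves to `INT64`, because the integer branches are tested before the boolean one.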
1 change: 1 addition & 0 deletions cpp/tests/CMakeLists.txt
@@ -229,6 +229,7 @@ ConfigureTest(MULTIBYTE_SPLIT_TEST io/text/multibyte_split_test.cpp)
ConfigureTest(DATA_CHUNK_SOURCE_TEST io/text/data_chunk_source_test.cpp)
ConfigureTest(LOGICAL_STACK_TEST io/fst/logical_stack_test.cu)
ConfigureTest(FST_TEST io/fst/fst_test.cu)
ConfigureTest(TYPE_INFERENCE_TEST io/type_inference_test.cu)
if(CUDF_ENABLE_ARROW_S3)
target_compile_definitions(ARROW_IO_SOURCE_TEST PRIVATE "S3_ENABLED")
endif()