JSON single quote normalization API #14729

Merged
merged 40 commits into from
Jan 24, 2024
Changes from 9 commits
a1120f6
single quote normalization api
shrshi Jan 9, 2024
c6b0ba3
test for normalization api
shrshi Jan 10, 2024
d9a8acf
fixes to test
shrshi Jan 11, 2024
cfe89e6
fix to tests
shrshi Jan 11, 2024
b2ce13b
pre-commit formatting fixes
shrshi Jan 11, 2024
2134cf8
finally, the test passes
shrshi Jan 11, 2024
04e9d82
try again with test stream
shrshi Jan 11, 2024
9f53d42
Merge branch 'branch-24.02' into fst-integration
shrshi Jan 11, 2024
907aba9
Merge branch 'branch-24.02' into fst-integration
shrshi Jan 11, 2024
fa11424
added option to normalize single quotes in read_json
shrshi Jan 12, 2024
adcbddf
Merge branch 'fst-integration' of github.com:shrshi/cudf into fst-int…
shrshi Jan 12, 2024
2e86d89
formatting fixes
shrshi Jan 12, 2024
9925c10
adding testing_main
shrshi Jan 13, 2024
2838c74
java bindings
shrshi Jan 13, 2024
2313955
formatting fixes
shrshi Jan 13, 2024
a5bb42e
compile fix
shrshi Jan 13, 2024
3a6f267
Merge branch 'branch-24.02' into fst-integration
shrshi Jan 13, 2024
0926a2f
Merge branch 'branch-24.02' into fst-integration
shrshi Jan 16, 2024
e63bca0
Update java/src/test/java/ai/rapids/cudf/TableTest.java
shrshi Jan 16, 2024
005b5c2
Update java/src/test/java/ai/rapids/cudf/TableTest.java
shrshi Jan 16, 2024
1a8f5f3
added an error test for when normalize quotes is not enabled
shrshi Jan 16, 2024
a999ca4
Merge branch 'fst-integration' of github.com:shrshi/cudf into fst-int…
shrshi Jan 16, 2024
6a151f5
Merge branch 'branch-24.02' into fst-integration
shrshi Jan 17, 2024
2001866
addressing PR reviews; adding comments
shrshi Jan 18, 2024
b30e130
Merge branch 'fst-integration' of github.com:shrshi/cudf into fst-int…
shrshi Jan 18, 2024
d0fefbd
moved tests; removed duplicated fst code
shrshi Jan 18, 2024
7520e03
Merge branch 'branch-24.02' into fst-integration
shrshi Jan 18, 2024
55503e3
moved preprocess step to read_json
shrshi Jan 18, 2024
85e8053
Merge branch 'fst-integration' of github.com:shrshi/cudf into fst-int…
shrshi Jan 18, 2024
a885277
PR reviews - modifiable input buffer in normalize quotes parameter
shrshi Jan 19, 2024
64135df
Merge branch 'branch-24.02' into fst-integration
shrshi Jan 19, 2024
de1f1b3
don't need fully qualified name in enclosing namespace
shrshi Jan 20, 2024
df6c0f3
Merge branch 'fst-integration' of github.com:shrshi/cudf into fst-int…
shrshi Jan 20, 2024
8441b39
header files cleanup; more fully-qualified names cleanup
shrshi Jan 22, 2024
d5b9707
alphabetizing the new file in add_library
shrshi Jan 23, 2024
4e358fd
more header file cleanup
shrshi Jan 23, 2024
a79683d
guiding the consts eastwards
shrshi Jan 23, 2024
bcc2285
Merge branch 'branch-24.02' into fst-integration
shrshi Jan 23, 2024
890d09b
formatting fix
shrshi Jan 23, 2024
ace46d3
Merge branch 'branch-24.02' into fst-integration
shrshi Jan 24, 2024
1 change: 1 addition & 0 deletions cpp/CMakeLists.txt
@@ -376,6 +376,7 @@ add_library(
src/io/json/legacy/json_gpu.cu
src/io/json/legacy/reader_impl.cu
src/io/json/write_json.cu
src/io/json/json_quote_normalization.cu
src/io/orc/aggregate_orc_metadata.cpp
src/io/orc/dict_enc.cu
src/io/orc/orc.cpp
7 changes: 6 additions & 1 deletion cpp/include/cudf/io/detail/json.hpp
@@ -1,5 +1,5 @@
/*
- * Copyright (c) 2020-2023, NVIDIA CORPORATION.
+ * Copyright (c) 2020-2024, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -51,4 +51,9 @@ void write_json(data_sink* sink,
json_writer_options const& options,
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr);

std::unique_ptr<rmm::device_uvector<char>> normalize_quotes(
const cudf::device_span<std::byte>& inbuf,
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr);
} // namespace cudf::io::json::detail
208 changes: 208 additions & 0 deletions cpp/src/io/json/json_quote_normalization.cu
@@ -0,0 +1,208 @@
/*
* Copyright (c) 2024, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <io/fst/lookup_tables.cuh>
#include <io/utilities/hostdevice_vector.hpp>

#include <cudf/io/detail/json.hpp>
#include <cudf/scalar/scalar_factories.hpp>
#include <cudf/strings/repeat_strings.hpp>
#include <cudf/types.hpp>

#include <rmm/cuda_stream.hpp>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>
#include <rmm/device_uvector.hpp>

#include <thrust/iterator/discard_iterator.h>

#include <cstdlib>
#include <string>
#include <vector>

namespace cudf::io::json {

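using SymbolT = char;
using StateT = char;
// Type sufficiently large to index symbols within the input and output (may be unsigned)
using SymbolOffsetT = uint32_t;

namespace normalize_quotes {

// DFA states, as inferred from the transition table below:
// OOS: outside of any string; DQS / SQS: inside a double- / single-quoted string;
// DEC / SEC: directly after an escape character inside a double- / single-quoted string
enum class dfa_states : char { TT_OOS = 0U, TT_DQS, TT_SQS, TT_DEC, TT_SEC, TT_NUM_STATES };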
enum class dfa_symbol_group_id : uint32_t {
DOUBLE_QUOTE_CHAR, ///< Quote character SG: "
SINGLE_QUOTE_CHAR, ///< Quote character SG: '
ESCAPE_CHAR, ///< Escape character SG: '\'
NEWLINE_CHAR, ///< Newline character SG: '\n'
OTHER_SYMBOLS, ///< SG implicitly matching all other characters
NUM_SYMBOL_GROUPS ///< Total number of symbol groups
};

// Aliases for readability of the transition table
constexpr auto TT_OOS = dfa_states::TT_OOS;
constexpr auto TT_DQS = dfa_states::TT_DQS;
constexpr auto TT_SQS = dfa_states::TT_SQS;
constexpr auto TT_DEC = dfa_states::TT_DEC;
constexpr auto TT_SEC = dfa_states::TT_SEC;
constexpr auto TT_NUM_STATES = static_cast<char>(dfa_states::TT_NUM_STATES);
constexpr auto NUM_SYMBOL_GROUPS = static_cast<uint32_t>(dfa_symbol_group_id::NUM_SYMBOL_GROUPS);

// The i-th string representing all the characters of a symbol group
std::array<std::vector<SymbolT>, NUM_SYMBOL_GROUPS - 1> const qna_sgs{
{{'\"'}, {'\''}, {'\\'}, {'\n'}}};

// Transition table
std::array<std::array<dfa_states, NUM_SYMBOL_GROUPS>, TT_NUM_STATES> const qna_state_tt{{
/* IN_STATE " ' \ \n OTHER */
/* TT_OOS */ {{TT_DQS, TT_SQS, TT_OOS, TT_OOS, TT_OOS}},
/* TT_DQS */ {{TT_OOS, TT_DQS, TT_DEC, TT_OOS, TT_DQS}},
/* TT_SQS */ {{TT_SQS, TT_OOS, TT_SEC, TT_OOS, TT_SQS}},
/* TT_DEC */ {{TT_DQS, TT_DQS, TT_DQS, TT_OOS, TT_DQS}},
/* TT_SEC */ {{TT_SQS, TT_SQS, TT_SQS, TT_OOS, TT_SQS}},
}};

// The DFA's starting state
constexpr char start_state = static_cast<char>(TT_OOS);

struct TransduceToNormalizedQuotes {
/**
* @brief Returns the <relative_offset>-th output symbol on the transition (state_id, match_id).
*/
template <typename StateT, typename SymbolGroupT, typename RelativeOffsetT, typename SymbolT>
constexpr CUDF_HOST_DEVICE SymbolT operator()(StateT const state_id,
SymbolGroupT const match_id,
RelativeOffsetT const relative_offset,
SymbolT const read_symbol) const
{
// -------- TRANSLATION TABLE ------------
// Let the alphabet set be Sigma
// ---------------------------------------
// ---------- NON-SPECIAL CASES: ----------
// Output symbol same as input symbol <s>
// state | read_symbol <s> -> output_symbol <s>
// DQS | Sigma -> Sigma
// DEC | Sigma -> Sigma
// OOS | Sigma\{'} -> Sigma\{'}
// SQS | Sigma\{', ", \} -> Sigma\{', ", \}
// ---------- SPECIAL CASES: --------------
// Input symbol translates to output symbol
// OOS | {'} -> {"}
// SQS | {'} -> {"}
// SQS | {"} -> {\"}
// SQS | {\} -> <nop>
// SEC | {'} -> {'}
// SEC | Sigma\{'} -> {\*}

// Whether this transition translates to the escape sequence: \"
const bool outputs_escape_sequence =
(state_id == static_cast<StateT>(dfa_states::TT_SQS)) &&
(match_id == static_cast<SymbolGroupT>(dfa_symbol_group_id::DOUBLE_QUOTE_CHAR));
// Case when a double quote needs to be replaced by the escape sequence: \"
if (outputs_escape_sequence) { return (relative_offset == 0) ? '\\' : '"'; }
// Case when a single quote needs to be replaced by a double quote
if ((match_id == static_cast<SymbolGroupT>(dfa_symbol_group_id::SINGLE_QUOTE_CHAR)) &&
((state_id == static_cast<StateT>(dfa_states::TT_SQS)) ||
(state_id == static_cast<StateT>(dfa_states::TT_OOS)))) {
return '"';
}
// Case when the read symbol is an escape character - the actual translation for \<s> for some
// symbol <s> is handled by transitions from SEC. For now, there is no output for this
// transition
if ((match_id == static_cast<SymbolGroupT>(dfa_symbol_group_id::ESCAPE_CHAR)) &&
((state_id == static_cast<StateT>(dfa_states::TT_SQS)))) {
return 0;
}
// Case when an escaped single quote in an input single-quoted string needs to be replaced by an
// unescaped single quote
if ((match_id == static_cast<SymbolGroupT>(dfa_symbol_group_id::SINGLE_QUOTE_CHAR)) &&
((state_id == static_cast<StateT>(dfa_states::TT_SEC)))) {
return '\'';
}
// Case when an escaped symbol <s> that is not a single-quote needs to be replaced with \<s>
if (state_id == static_cast<StateT>(dfa_states::TT_SEC)) {
return (relative_offset == 0) ? '\\' : read_symbol;
}
// In all other cases we simply output the input symbol
return read_symbol;
}

/**
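* @brief Returns the number of output characters for a given transition. During quote
* normalization, most transitions emit exactly one output character (either the input character
* or a double quote replacing a single quote). The exceptions are: a double quote inside a
* single-quoted string, emitted as the two-character sequence \"; an escaped symbol <s> other
* than a single quote inside a single-quoted string, emitted as \<s>; and the escape character
* inside a single-quoted string, which emits nothing on its own transition.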
*/
template <typename StateT, typename SymbolGroupT, typename SymbolT>
constexpr CUDF_HOST_DEVICE int32_t operator()(StateT const state_id,
SymbolGroupT const match_id,
SymbolT const read_symbol) const
{
// Whether this transition translates to the escape sequence: \"
const bool sqs_outputs_escape_sequence =
(state_id == static_cast<StateT>(dfa_states::TT_SQS)) &&
(match_id == static_cast<SymbolGroupT>(dfa_symbol_group_id::DOUBLE_QUOTE_CHAR));
// Two output characters: the escape character followed by the double quote
if (sqs_outputs_escape_sequence) { return 2; }
// Whether this transition translates to the escape sequence \<s>, where <s> is any symbol other
// than a single quote (an escaped single quote is emitted unescaped, as a single character)
const bool sec_outputs_escape_sequence =
(state_id == static_cast<StateT>(dfa_states::TT_SEC)) &&
(match_id != static_cast<SymbolGroupT>(dfa_symbol_group_id::SINGLE_QUOTE_CHAR));
// Two output characters: the escape character followed by <s>
if (sec_outputs_escape_sequence) { return 2; }
// Whether this transition translates to no output <nop>
const bool sqs_outputs_nop =
(state_id == static_cast<StateT>(dfa_states::TT_SQS)) &&
(match_id == static_cast<SymbolGroupT>(dfa_symbol_group_id::ESCAPE_CHAR));
// The escape character is held back; it is emitted (or dropped) on the following transition
if (sqs_outputs_nop) { return 0; }
return 1;
}
};

} // namespace normalize_quotes

namespace detail {

std::unique_ptr<rmm::device_uvector<char>> normalize_quotes(
const cudf::device_span<std::byte>& inbuf,
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr)
{
auto parser = cudf::io::fst::detail::make_fst(
cudf::io::fst::detail::make_symbol_group_lut(cudf::io::json::normalize_quotes::qna_sgs),
cudf::io::fst::detail::make_transition_table(cudf::io::json::normalize_quotes::qna_state_tt),
cudf::io::fst::detail::make_translation_functor(
cudf::io::json::normalize_quotes::TransduceToNormalizedQuotes{}),
stream);

std::unique_ptr<rmm::device_uvector<char>> outbuf_ptr =
std::make_unique<rmm::device_uvector<char>>(inbuf.size() * 2, stream, mr);
rmm::device_scalar<SymbolOffsetT> outbuf_size(stream, mr);
parser.Transduce(reinterpret_cast<char*>(inbuf.data()),
static_cast<SymbolOffsetT>(inbuf.size()),
outbuf_ptr->data(),
thrust::make_discard_iterator(),
outbuf_size.data(),
cudf::io::json::normalize_quotes::start_state,
stream);

outbuf_ptr->resize(outbuf_size.value(stream), stream);
return outbuf_ptr;
}

} // namespace detail
} // namespace cudf::io::json
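For readers who want to check the translation rules without a GPU, the transition and translation tables in the file above can be replayed sequentially on the host. The following is a minimal sketch and not the PR's implementation: `State`, `next_state`, and `normalize_quotes_reference` are hypothetical names introduced here, and the loop handles one character at a time instead of using the parallel FST.

```cpp
#include <cassert>
#include <string>

// Hypothetical host-side reference; none of these names exist in cudf.
enum class State { OOS, DQS, SQS, DEC, SEC };

// Mirrors the rows of qna_state_tt: a newline always returns to OOS.
State next_state(State s, char c)
{
  if (c == '\n') { return State::OOS; }
  switch (s) {
    case State::OOS: return c == '"' ? State::DQS : c == '\'' ? State::SQS : State::OOS;
    case State::DQS: return c == '"' ? State::OOS : c == '\\' ? State::DEC : State::DQS;
    case State::SQS: return c == '\'' ? State::OOS : c == '\\' ? State::SEC : State::SQS;
    case State::DEC: return State::DQS;
    case State::SEC: return State::SQS;
  }
  return State::OOS;
}

std::string normalize_quotes_reference(std::string const& in)
{
  std::string out;
  out.reserve(in.size() * 2);
  State s = State::OOS;
  for (char c : in) {
    // Emit the output for the transition (s, c), in the same order of checks
    // as TransduceToNormalizedQuotes.
    if (s == State::SQS && c == '"') {
      out += "\\\"";  // double quote inside a single-quoted string becomes \"
    } else if ((s == State::SQS || s == State::OOS) && c == '\'') {
      out += '"';  // opening or closing single quote becomes a double quote
    } else if (s == State::SQS && c == '\\') {
      // no output yet; the SEC transition decides what to emit
    } else if (s == State::SEC && c == '\'') {
      out += '\'';  // \' no longer needs escaping inside a double-quoted string
    } else if (s == State::SEC) {
      out += '\\';  // any other escaped symbol keeps its backslash
      out += c;
    } else {
      out += c;
    }
    s = next_state(s, c);
  }
  // Every input character produces at most two output characters.
  assert(out.size() <= 2 * in.size());
  return out;
}
```

Under this sketch, the input used in the PR's test, `{"A":'TEST"'}`, comes out as `{"A":"TEST\""}`, and `{'a':'it\'s'}` comes out as `{"a":"it's"}`. The final assertion reflects why an output buffer of twice the input size is sufficient for these rules.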
4 changes: 2 additions & 2 deletions cpp/src/io/json/read_json.hpp
@@ -1,5 +1,5 @@
/*
- * Copyright (c) 2022-2023, NVIDIA CORPORATION.
+ * Copyright (c) 2022-2024, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -22,6 +22,7 @@
#include <cudf/utilities/span.hpp>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>

#include <memory>
@@ -41,5 +42,4 @@ size_type find_first_delimiter_in_chunk(host_span<std::unique_ptr<cudf::io::data
json_reader_options const& reader_opts,
char const delimiter,
rmm::cuda_stream_view stream);

} // namespace cudf::io::json::detail
1 change: 1 addition & 0 deletions cpp/tests/CMakeLists.txt
@@ -308,6 +308,7 @@ ConfigureTest(JSON_TYPE_CAST_TEST io/json_type_cast_test.cu)
ConfigureTest(NESTED_JSON_TEST io/nested_json_test.cpp io/json_tree.cpp)
ConfigureTest(ARROW_IO_SOURCE_TEST io/arrow_io_source_test.cpp)
ConfigureTest(MULTIBYTE_SPLIT_TEST io/text/multibyte_split_test.cpp)
ConfigureTest(JSON_QUOTE_NORMALIZATION io/json_quote_normalization_test.cpp)
ConfigureTest(
DATA_CHUNK_SOURCE_TEST io/text/data_chunk_source_test.cpp
GPUS 1
86 changes: 86 additions & 0 deletions cpp/tests/io/json_quote_normalization_test.cpp
@@ -0,0 +1,86 @@
/*
* Copyright (c) 2024, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <cudf/io/detail/json.hpp>
#include <cudf/io/json.hpp>
#include <cudf/scalar/scalar.hpp>
#include <cudf/scalar/scalar_factories.hpp>
#include <cudf/table/table.hpp>
#include <cudf/table/table_view.hpp>
#include <cudf/types.hpp>
#include <cudf/utilities/span.hpp>
#include <cudf/utilities/type_dispatcher.hpp>

#include <cudf_test/base_fixture.hpp>
#include <cudf_test/cudf_gtest.hpp>
#include <cudf_test/default_stream.hpp>
#include <cudf_test/table_utilities.hpp>

#include <rmm/device_uvector.hpp>
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>

#include <string>

// Base test fixture for tests
struct JsonNormalizationTest : public cudf::test::BaseFixture {};

TEST_F(JsonNormalizationTest, ValidOutput)
{
// RMM memory resource
std::shared_ptr<rmm::mr::device_memory_resource> rsc =
std::make_shared<rmm::mr::cuda_memory_resource>();

// Test input
std::string const host_input = R"({"A":'TEST"'})";
rmm::device_uvector<char> device_input(
host_input.size(), cudf::test::get_default_stream(), rsc.get());
for (size_t i = 0; i < host_input.size(); i++)
device_input.set_element_async(i, host_input[i], cudf::test::get_default_stream());
auto device_input_span = cudf::device_span<std::byte>(
reinterpret_cast<std::byte*>(device_input.data()), device_input.size());

// Preprocessing FST
auto device_fst_output_ptr = cudf::io::json::detail::normalize_quotes(
device_input_span, cudf::test::get_default_stream(), rsc.get());
/*
for(size_t i = 0; i < device_fst_output_ptr->size(); i++)
std::printf("%c", device_fst_output_ptr->element(i, cudf::test::get_default_stream()));
std::printf("\n");
*/

// Initialize parsing options (reading json lines)
auto device_fst_output_span = cudf::device_span<std::byte>(
reinterpret_cast<std::byte*>(device_fst_output_ptr->data()), device_fst_output_ptr->size());
cudf::io::json_reader_options input_options =
cudf::io::json_reader_options::builder(cudf::io::source_info{device_fst_output_span})
.lines(true);

cudf::io::table_with_metadata processed_table =
cudf::io::read_json(input_options, cudf::test::get_default_stream(), rsc.get());

// Expected table
std::string const expected_input = R"({"A":"TEST\""})";
cudf::io::json_reader_options expected_input_options =
cudf::io::json_reader_options::builder(
cudf::io::source_info{expected_input.data(), expected_input.size()})
.lines(true);
cudf::io::table_with_metadata expected_table =
cudf::io::read_json(expected_input_options, cudf::test::get_default_stream(), rsc.get());
CUDF_TEST_EXPECT_TABLES_EQUAL(expected_table.tbl->view(), processed_table.tbl->view());
}

CUDF_TEST_PROGRAM_MAIN()
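The test reads the normalized buffer with `.lines(true)`. One property of the transition table in this PR that seems relevant for JSON Lines input is that the newline column sends every state back to `TT_OOS`, so an unterminated quote on one line does not carry over into the next. The sketch below replays only the state transitions on the host. `State`, `Group`, `group_of`, `transitions`, and `state_after` are hypothetical names for illustration, with the table values copied from `qna_state_tt`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical host-side names; the values follow qna_state_tt from the PR.
enum State : std::size_t { OOS, DQS, SQS, DEC, SEC, NUM_STATES };
enum Group : std::size_t { DQ, SQ, ESC, NL, OTHER, NUM_GROUPS };

// Maps a character to its symbol group (column of the transition table).
Group group_of(char c)
{
  switch (c) {
    case '"': return DQ;
    case '\'': return SQ;
    case '\\': return ESC;
    case '\n': return NL;
    default: return OTHER;
  }
}

std::array<std::array<State, NUM_GROUPS>, NUM_STATES> const transitions{{
  /* IN_STATE    "    '    \    \n   OTHER */
  /* OOS */ {{DQS, SQS, OOS, OOS, OOS}},
  /* DQS */ {{OOS, DQS, DEC, OOS, DQS}},
  /* SQS */ {{SQS, OOS, SEC, OOS, SQS}},
  /* DEC */ {{DQS, DQS, DQS, OOS, DQS}},
  /* SEC */ {{SQS, SQS, SQS, OOS, SQS}},
}};

// Returns the state the DFA is in after consuming the whole input.
State state_after(std::string const& input, State s = OOS)
{
  for (char c : input) {
    assert(s < NUM_STATES);
    s = transitions[s][group_of(c)];
  }
  return s;
}
```

With this table, `state_after("{'a':'unterminated")` ends in `SQS`, while appending a newline to the same input ends in `OOS`. A test along these lines could complement `ValidOutput` by covering multi-line input, though whether resetting on newline is the desired behavior for non-lines JSON is a separate question that this sketch does not address.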