Adds Nested Json benchmark #11466

Merged 88 commits on Sep 1, 2022
Commits
0557d41
squashed with bracket/brace test
elstehle Apr 11, 2022
355d1e4
clean up & addressing review comments
elstehle Apr 20, 2022
39a6b65
refactored lookup tables
elstehle Apr 25, 2022
239f138
put lookup tables into their own cudf file
elstehle Apr 25, 2022
39cff80
Change interface for FST to not need temp storage
elstehle Apr 27, 2022
e24a133
removing unused var post-cleanup
elstehle May 4, 2022
caf6195
unified usage of pragma unrolls
elstehle May 9, 2022
ea79a81
Adding hostdevice macros to in-reg array
elstehle May 9, 2022
17dcbfd
making const vars const
elstehle May 9, 2022
6fdd24a
refactor lut sanity check
elstehle May 9, 2022
eccf970
fixes sg-count & uses rmm stream in fst tests
elstehle Jun 2, 2022
9fe8e4b
minor doxygen fix
elstehle Jun 14, 2022
694a365
adopts suggested fst test changes
elstehle Jun 15, 2022
f656f49
adopts device-side test data gen
elstehle Jul 7, 2022
485a1c6
adopts c++17 namespaces declarations
elstehle Jul 9, 2022
5f1c4b5
removes state vector-wrapper in favor of vanilla array
elstehle Jul 11, 2022
e6f8def
some west-const remainders & unifies StateIndexT
elstehle Jul 11, 2022
a798852
adds check for state transition narrowing conversion
elstehle Jul 11, 2022
eb24962
fixes logical stack test includes
elstehle Jul 12, 2022
f52e614
replaces enum with typed constexpr
elstehle Jul 14, 2022
3038058
adds explicit error checking
elstehle Jul 14, 2022
d351e5c
addresses style review comments & fixes a todo
elstehle Jul 14, 2022
3f47952
replaces gtest asserts with expects
elstehle Jul 14, 2022
cba1619
fixes style in dispatch dfa
elstehle Jul 14, 2022
bea2a02
replaces vanilla loop with iota
elstehle Jul 15, 2022
8a184e9
rephrases documentation on in-reg array
elstehle Jul 16, 2022
78dd893
Merge remote-tracking branch 'upstream/branch-22.08' into feature/fin…
elstehle Jul 16, 2022
7a19f64
Merge remote-tracking branch 'upstream/branch-22.08' into feature/fin…
elstehle Jul 19, 2022
4783aae
improves style in fst test
elstehle Jul 20, 2022
6203709
adds comments in in_reg array
elstehle Jul 20, 2022
ad5817a
adds comments to lookup tables
elstehle Jul 20, 2022
dc55653
fixes formatting
elstehle Jul 20, 2022
378be9f
exchanges loops in favor of copy and fills
elstehle Jul 20, 2022
4ba5472
clarifies documentation in agent dfa
elstehle Jul 20, 2022
7980978
disambiguates transition and translation tables
elstehle Jul 20, 2022
2bce061
minor style fix
elstehle Jul 21, 2022
b37f716
if constexprs and doxy on DFA helper
elstehle Jul 21, 2022
d42869a
minor documentation fix
elstehle Jul 21, 2022
6c889f7
replaces loop for comparing vectors with generic macro
elstehle Jul 21, 2022
8a54c72
uses new vector comparison for logical stack test
elstehle Jul 21, 2022
cc1e135
Added utility to debug print & instrumented code to use it
elstehle Mar 31, 2022
7dba177
switched to using rmm also inside algorithm
elstehle Mar 31, 2022
ff7144a
renaming key-value store op to stack_op
elstehle Apr 4, 2022
61a76b7
device_span
elstehle Apr 4, 2022
c28e327
minor style changes addressing review comments
elstehle Apr 13, 2022
a2f27ae
squashed with bracket/brace test
elstehle Apr 11, 2022
fe4762d
refactored lookup tables
elstehle Apr 25, 2022
a064bdd
put lookup tables into their own cudf file
elstehle Apr 25, 2022
2c729c0
fixes sg-count & uses rmm stream in fst tests
elstehle Jun 2, 2022
dbefb6c
rebase on latest FST
elstehle May 3, 2022
d54f3e5
fixes breaking changes from dependent-FST-PR
elstehle Jun 2, 2022
5fc3399
fixes for breaking downstream interface changes
elstehle Jul 13, 2022
6f65947
wraps if with stream params into detail ns
elstehle Jul 13, 2022
6ffc7f3
renames enums & moving from device_span to ptr params
elstehle Jul 14, 2022
0a7821e
fixes rebase conflicts
elstehle Jul 21, 2022
7396335
fixes escape sequence inside strings and field names and adds test fo…
elstehle Jul 21, 2022
6252208
adds comments on pda transition table states
elstehle Jul 21, 2022
191d71d
adopts new verification macro in test
elstehle Jul 22, 2022
4e99962
Merge branch 'branch-22.08' of https://github.com/rapidsai/cudf into …
karthikeyann Jul 22, 2022
3b9a1ed
removes superfluous semicolons
elstehle Jul 22, 2022
632be35
rearranges token order in enum and adds documentation
elstehle Jul 23, 2022
3237772
uses namespace alias and switch to rmm stream in test
elstehle Jul 23, 2022
d2832b9
drops the gpu namespace
elstehle Jul 24, 2022
618ed3f
renames header file extension from h to hpp
elstehle Jul 24, 2022
19b37b7
squashed with minimal example
elstehle Jul 27, 2022
f015594
add parse_json_to_columns -> cudf::column
karthikeyann Jul 27, 2022
389e8e8
+ wraps dbg print in macro
elstehle Jul 27, 2022
42f7c4a
disables debug print by default
elstehle Jul 27, 2022
3e9db89
Merge remote-tracking branch 'origin/feature/json-to-columnar' into f…
elstehle Jul 27, 2022
ecf68fc
changing interface of get_json_columns to also take device_span
elstehle Jul 28, 2022
93cbe1a
parsing to table_with_metadata
elstehle Jul 28, 2022
a1d8901
removes debug print examples
elstehle Jul 28, 2022
b9296d6
renames lists child col name to elements
elstehle Jul 28, 2022
5eddcc7
adds validity
elstehle Jul 28, 2022
cfcd7a1
fixes style
elstehle Jul 28, 2022
4c2ea7b
minor cleanup
karthikeyann Jul 28, 2022
8fc3adc
use device_uvector at few places
karthikeyann Jul 28, 2022
c1b9213
fixes metadata to match parquets metadata
elstehle Jul 30, 2022
e5b1ba6
Merge branch 'branch-22.08' of https://github.com/rapidsai/cudf into …
karthikeyann Aug 1, 2022
fd02437
use make_device_uvector_async
karthikeyann Aug 1, 2022
f5810f8
Apply suggestions from code review (nvdbaranec)
karthikeyann Aug 1, 2022
342d3c3
added mr
karthikeyann Aug 1, 2022
f6a531a
add utf fail, pass test cases
karthikeyann Aug 2, 2022
2f22899
Merge branch 'branch-22.10' of https://github.com/rapidsai/cudf into …
karthikeyann Aug 4, 2022
a141110
add NESTED_JSON_NVBENCH
karthikeyann Aug 4, 2022
188b140
Merge branch 'branch-22.10' of github.com:rapidsai/cudf into fea-json…
karthikeyann Aug 23, 2022
ba3483f
remove merge missed files
karthikeyann Aug 23, 2022
cad1060
address review comments
karthikeyann Aug 23, 2022
1 change: 1 addition & 0 deletions cpp/benchmarks/CMakeLists.txt
@@ -294,6 +294,7 @@ ConfigureBench(
# * json benchmark -------------------------------------------------------------------
ConfigureBench(JSON_BENCH string/json.cu)
ConfigureNVBench(FST_NVBENCH io/fst.cu)
ConfigureNVBench(NESTED_JSON_NVBENCH io/json/nested_json.cpp)

# ##################################################################################################
# * io benchmark ---------------------------------------------------------------------
84 changes: 84 additions & 0 deletions cpp/benchmarks/io/json/nested_json.cpp
@@ -0,0 +1,84 @@
/*
 * Copyright (c) 2022, NVIDIA CORPORATION.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

#include <benchmarks/common/generate_input.hpp>
#include <benchmarks/fixture/rmm_pool_raii.hpp>

#include <nvbench/nvbench.cuh>
Contributor

Does this file need to be a .cu file in order to include this header?

Contributor Author

Is there an .hpp version of this header available?

Contributor

The file only needs to be a .cu file if it makes device calls.
Including a .cuh file is fine as long as you do not use any of the device functions (or data) in it.
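
To illustrate the point above, here is a minimal sketch of one common way a header can stay includable from a plain `.cpp` file: device-only code is guarded by `__CUDACC__`, which nvcc defines when compiling CUDA sources. The names `SKETCH_HOST_DEVICE`, `sketch::round_up`, and `fill_kernel` are hypothetical and are not taken from nvbench or cudf; whether a given `.cuh` header is written this way has to be checked per header.

```cpp
#include <cassert>

// Under nvcc, __CUDACC__ is defined and the CUDA qualifiers are available;
// under a plain host compiler the macro expands to nothing.
#ifdef __CUDACC__
#define SKETCH_HOST_DEVICE __host__ __device__
#else
#define SKETCH_HOST_DEVICE
#endif

namespace sketch {

// Host-usable helper (hypothetical): safe to call from a .cpp translation unit.
SKETCH_HOST_DEVICE inline int round_up(int value, int multiple)
{
  assert(multiple > 0);
  return ((value + multiple - 1) / multiple) * multiple;
}

#ifdef __CUDACC__
// Device-only code (hypothetical): compiled only when nvcc processes the
// including file, so a .cpp file never sees it.
__global__ void fill_kernel(int* out, int n, int value)
{
  int const idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) { out[idx] = value; }
}
#endif

}  // namespace sketch
```

A `.cpp` file that includes such a header can call `sketch::round_up(10, 4)`, which yields 12, while launching `fill_kernel` would require a `.cu` file.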

Member @PointKernel, Aug 29, 2022

Should we place this include right before #include <cstdlib>, since the nvbench header is further away than the cudf ones?

karthikeyann marked this conversation as resolved.

#include <io/json/nested_json.hpp>

#include <tests/io/fst/common.hpp>

#include <cudf/scalar/scalar_factories.hpp>
#include <cudf/strings/repeat_strings.hpp>
#include <cudf/types.hpp>

#include <cstdlib>

namespace cudf {
namespace {
auto make_test_json_data(size_type string_size, rmm::cuda_stream_view stream)
{
  // Test input
  std::string input = R"(
{"a":1,"b":2,"c":[3], "d": {}},
{"a":1,"b":4.0,"c":[], "d": {"year":1882,"author": "Bharathi"}},
{"a":1,"b":6.0,"c":[5, 7], "d": null},
{"a":1,"b":null,"c":null},
{
"a" : 1
},
{"a":1,"b":Infinity,"c":[null], "d": {"year":-600,"author": "Kaniyan"}},
{"a": 1, "b": 8.0, "d": { "author": "Jean-Jacques Rousseau"}},)";

  const size_type repeat_times = string_size / input.size();

  auto d_input_scalar = cudf::make_string_scalar(input, stream);
  auto& d_string_scalar = static_cast<cudf::string_scalar&>(*d_input_scalar);
  auto d_scalar = cudf::strings::repeat_string(d_string_scalar, repeat_times);
  auto& d_input = static_cast<cudf::scalar_type_t<std::string>&>(*d_scalar);

  auto generated_json = std::string(d_input);
  generated_json.front() = '[';
  generated_json.back() = ']';
davidwendt marked this conversation as resolved.
  return generated_json;
}
} // namespace
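
The input construction in `make_test_json_data` can be sketched on the host alone. This is a hypothetical helper, not code from the PR: it assumes the sample starts with a newline and ends with a comma, as the raw string literal in the benchmark does, so that overwriting the first and last characters turns the repeated rows into one JSON array. Unlike the benchmark code, the sketch returns `"[]"` when the target size is smaller than one sample instead of touching an empty string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical host-only helper: repeats `sample` until it reaches roughly
// `target_size` bytes, then patches both ends into array brackets.
std::string make_json_array(std::string const& sample, std::size_t target_size)
{
  assert(!sample.empty());
  std::size_t const repeat_times = target_size / sample.size();

  std::string out;
  out.reserve(repeat_times * sample.size());
  for (std::size_t i = 0; i < repeat_times; ++i) { out += sample; }

  // Guard added for the sketch: nothing to patch if no copy fits.
  if (out.empty()) { return "[]"; }

  out.front() = '[';  // overwrites the leading newline of the first copy
  out.back()  = ']';  // overwrites the trailing comma of the last copy
  return out;
}
```

With the sample `"\n{\"a\":1},"` (9 bytes) and a target of 20 bytes, two copies fit and the result is `[{"a":1},` followed by a newline and `{"a":1}]`.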

void BM_NESTED_JSON(nvbench::state& state)
{
  // TODO: to be replaced by nvbench fixture once it's ready
  cudf::rmm_pool_raii rmm_pool;

  auto const string_size{size_type(state.get_int64("string_size"))};

  auto input = make_test_json_data(string_size, cudf::default_stream_value);
  state.add_element_count(input.size());
Contributor

Will this show or compute throughput?
Here is an example of an nvbench throughput measurement: https://github.com/NVIDIA/nvbench/blob/main/docs/benchmarks.md#throughput-measurements

Contributor Author

Thank you, this is very helpful. 👍
I only wanted to see the overall chars/sec throughput. Manually tracking the number of global memory reads and writes across the entire algorithm is difficult, so I skipped it.


  // Run algorithm
  state.set_cuda_stream(nvbench::make_cuda_stream_view(cudf::default_stream_value.value()));
  state.exec(nvbench::exec_tag::sync, [&](nvbench::launch& launch) {
    // Allocate device-side temporary storage & run algorithm
    cudf::io::json::detail::parse_nested_json(input, cudf::default_stream_value);
  });
}

NVBENCH_BENCH(BM_NESTED_JSON)
  .set_name("nested_json_gpu_parser")
  .add_int64_power_of_two_axis("string_size", nvbench::range(20, 31, 1));

} // namespace cudf
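
The `string_size` axis can be sanity-checked with a small host-side sketch. It assumes that `nvbench::range(20, 31, 1)` is inclusive at both ends, so the axis covers 2^20 through 2^31, and that `cudf::size_type` is a 32-bit signed integer. The helpers `power_of_two_sizes` and `fits_in_int32` are hypothetical and not part of nvbench or cudf. Under those assumptions the largest value, 2^31, would not survive the `size_type(...)` conversion in `BM_NESTED_JSON`, which may be worth verifying against the actual benchmark run.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Expands an inclusive exponent range into power-of-two sizes, mirroring
// what a power-of-two axis is assumed to produce.
std::vector<std::int64_t> power_of_two_sizes(int first_exp, int last_exp, int stride)
{
  assert(stride > 0 && first_exp >= 0 && last_exp < 63);
  std::vector<std::int64_t> sizes;
  for (int e = first_exp; e <= last_exp; e += stride) {
    sizes.push_back(std::int64_t{1} << e);
  }
  return sizes;
}

// True when the value can be stored in a 32-bit signed integer unchanged.
bool fits_in_int32(std::int64_t value)
{
  return value >= std::numeric_limits<std::int32_t>::min() &&
         value <= std::numeric_limits<std::int32_t>::max();
}
```

Calling `power_of_two_sizes(20, 31, 1)` gives twelve sizes; every one up to 2^30 passes `fits_in_int32`, and the last one, 2^31, does not.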