Fix column selection read_parquet benchmarks (#13082)
The helper function `get_col_names` in the Parquet reader benchmarks throws when the schema contains nested columns. It should instead ignore the child columns and return only the top-level column names.
Also renamed the function to better reflect what it does.
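
For illustration, the intended behavior can be sketched without cuDF. This is a minimal sketch, not the benchmark code: `schema_node` and `top_level_names` are hypothetical stand-ins introduced here, and the only assumption is that each schema entry carries a name and a list of children. Nested entries are kept by name and their children are skipped, rather than being rejected.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical stand-in for a schema entry (not the cuDF type):
// a column name plus any nested child columns.
struct schema_node {
  std::string name;
  std::vector<schema_node> children;
};

// Collects the names of the top-level columns only.
// Children are never visited, so nested columns do not cause an error.
std::vector<std::string> top_level_names(std::vector<schema_node> const& schema)
{
  std::vector<std::string> names;
  names.reserve(schema.size());
  std::transform(schema.cbegin(),
                 schema.cend(),
                 std::back_inserter(names),
                 [](schema_node const& c) { return c.name; });
  return names;
}
```

With a schema holding a flat column `id` and a struct column `point` with children `x` and `y`, this sketch would return only `id` and `point`.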

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - https://github.com/nvdbaranec
  - Yunsong Wang (https://github.com/PointKernel)

URL: #13082
vuule authored Apr 7, 2023
1 parent e28c9c5 commit 46b5900
Showing 2 changed files with 4 additions and 5 deletions.
4 changes: 2 additions & 2 deletions cpp/benchmarks/io/orc/orc_reader_options.cpp
@@ -33,7 +33,7 @@ constexpr int64_t data_size = 512 << 20;
// Each call reads roughly equal amounts of data
constexpr int32_t chunked_read_num_chunks = 8;

-std::vector<std::string> get_col_names(cudf::io::source_info const& source)
+std::vector<std::string> get_top_level_col_names(cudf::io::source_info const& source)
{
auto const top_lvl_cols = cudf::io::read_orc_metadata(source).schema().root().children();
std::vector<std::string> col_names;
@@ -79,7 +79,7 @@ void BM_orc_read_varying_options(nvbench::state& state,
cudf::io::write_orc(options);

auto const cols_to_read =
-select_column_names(get_col_names(source_sink.make_source_info()), ColSelection);
+select_column_names(get_top_level_col_names(source_sink.make_source_info()), ColSelection);
cudf::io::orc_reader_options read_options =
cudf::io::orc_reader_options::builder(source_sink.make_source_info())
.columns(cols_to_read)
5 changes: 2 additions & 3 deletions cpp/benchmarks/io/parquet/parquet_reader_options.cpp
@@ -30,7 +30,7 @@
constexpr std::size_t data_size = 512 << 20;
constexpr std::size_t row_group_size = 128 << 20;

-std::vector<std::string> get_col_names(cudf::io::source_info const& source)
+std::vector<std::string> get_top_level_col_names(cudf::io::source_info const& source)
{
cudf::io::parquet_reader_options const read_options =
cudf::io::parquet_reader_options::builder(source);
@@ -39,7 +39,6 @@ std::vector<std::string> get_col_names(cudf::io::source_info const& source)
std::vector<std::string> names;
names.reserve(schema.size());
std::transform(schema.cbegin(), schema.cend(), std::back_inserter(names), [](auto const& c) {
-CUDF_EXPECTS(c.children.empty(), "nested types are not supported");
return c.name;
});
return names;
@@ -81,7 +80,7 @@ void BM_parquet_read_options(nvbench::state& state,
cudf::io::write_parquet(options);

auto const cols_to_read =
-select_column_names(get_col_names(source_sink.make_source_info()), ColSelection);
+select_column_names(get_top_level_col_names(source_sink.make_source_info()), ColSelection);
cudf::io::parquet_reader_options read_options =
cudf::io::parquet_reader_options::builder(source_sink.make_source_info())
.columns(cols_to_read)