Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add TPC-H inspired examples for Libcudf #16088

Merged
merged 125 commits into from
Jul 17, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
125 commits
Select commit Hold shift + click to select a range
6e34bf8
Add basic cudf code to project / select
JayjeetAtGithub Jun 19, 2024
b20a8e6
wip q1
JayjeetAtGithub Jun 25, 2024
c36cc16
q1 done
JayjeetAtGithub Jun 25, 2024
90fbcb3
filtering using timestamp
JayjeetAtGithub Jun 25, 2024
6ec4748
Finish q1
JayjeetAtGithub Jun 25, 2024
18a9209
Finish q1
JayjeetAtGithub Jun 25, 2024
4f1aad8
Add column names
JayjeetAtGithub Jun 25, 2024
c86ce5e
Remove unnecessary memory copies
JayjeetAtGithub Jun 25, 2024
8174b99
Add a query plan diagram
JayjeetAtGithub Jun 25, 2024
b02cd03
Add helper functions
JayjeetAtGithub Jun 26, 2024
2fb3198
Convert scalar to device buffer
JayjeetAtGithub Jun 26, 2024
08e6e7e
Extract create metadata into utils
JayjeetAtGithub Jun 26, 2024
22bda7c
Add copyright to tpch q6
JayjeetAtGithub Jun 26, 2024
e68fcaa
Add initial README
JayjeetAtGithub Jun 26, 2024
143d32c
Extract order by into utils
JayjeetAtGithub Jun 26, 2024
7dd2cdb
Use make_column_from_scalar factory function
JayjeetAtGithub Jun 26, 2024
06f02c1
Add data gen instructions in README
JayjeetAtGithub Jun 26, 2024
62f4960
Fixes
JayjeetAtGithub Jun 26, 2024
eb03a3b
Fixes
JayjeetAtGithub Jun 26, 2024
d06ac1c
Move append_col_to_table to utils
JayjeetAtGithub Jun 26, 2024
c3631d8
Fix scale errors in q1
JayjeetAtGithub Jun 26, 2024
60e8fef
misc fixes
JayjeetAtGithub Jun 26, 2024
5b5237e
Remove useless headers
JayjeetAtGithub Jun 26, 2024
512b0ec
Finish q6
JayjeetAtGithub Jun 27, 2024
210b3d8
Update README
JayjeetAtGithub Jun 27, 2024
f420cf2
Cleanup q1/q6
JayjeetAtGithub Jun 27, 2024
182d9c2
measure query exec time
JayjeetAtGithub Jun 27, 2024
b6b5985
start working on q5
JayjeetAtGithub Jun 27, 2024
79cd0ea
Add q5
JayjeetAtGithub Jun 27, 2024
bcee9d8
Update README
JayjeetAtGithub Jun 27, 2024
3c95cc2
Implement fixed point scalar
JayjeetAtGithub Jun 28, 2024
e85cf41
Fix fixed_point_scalar init in q1
JayjeetAtGithub Jun 28, 2024
3d50688
Remove fixed point scalar
JayjeetAtGithub Jun 28, 2024
46387c4
Extract groupby into utils
JayjeetAtGithub Jun 28, 2024
b0ac3ee
Clean up Q5
JayjeetAtGithub Jun 28, 2024
4ff6293
Remove test.cp
JayjeetAtGithub Jun 29, 2024
d20bac0
Add copyright notice
JayjeetAtGithub Jun 29, 2024
b297e98
Clean up Q5
JayjeetAtGithub Jun 29, 2024
1cd06f8
Fix naming of variables
JayjeetAtGithub Jun 29, 2024
6d786c4
Remove more headers from Q5
JayjeetAtGithub Jun 29, 2024
38997ac
Refactor Q1
JayjeetAtGithub Jun 29, 2024
dc625c4
Add finished q9
JayjeetAtGithub Jun 29, 2024
e398abc
Add finished q9
JayjeetAtGithub Jun 29, 2024
0f4516c
Update README
JayjeetAtGithub Jun 29, 2024
5ca3420
Fix sql query in Q5
JayjeetAtGithub Jun 29, 2024
dd3a1d5
Fix comments
JayjeetAtGithub Jun 29, 2024
6e87bc9
Refactor Q6
JayjeetAtGithub Jun 29, 2024
12ddb78
Extract base dataset dir into utils
JayjeetAtGithub Jun 30, 2024
c45577f
Remove unnecessary rmm imports
JayjeetAtGithub Jun 30, 2024
07cfd36
Rearrange utils.hpp
JayjeetAtGithub Jul 1, 2024
da87ff4
Push down projections into read parquet
JayjeetAtGithub Jul 2, 2024
a423ea2
Add comments to utils.hpp
JayjeetAtGithub Jul 2, 2024
5312f34
Remove implementation status from README
JayjeetAtGithub Jul 1, 2024
6ef9d12
Add nvtx ranges to helper functions
JayjeetAtGithub Jul 2, 2024
903cb67
Add more nvtx ranges
JayjeetAtGithub Jul 2, 2024
c0be319
Add script to run the benchmarks
JayjeetAtGithub Jul 2, 2024
9ed2a09
Fix README
JayjeetAtGithub Jul 2, 2024
7145fe8
Pass the base dir as a cli arg
JayjeetAtGithub Jul 2, 2024
bbf40e2
Pass dataset path as an argument to tpch/run.sh
JayjeetAtGithub Jul 2, 2024
a8b8255
Use memory resource for q1
JayjeetAtGithub Jul 2, 2024
8f66bfb
Use a memory pool
JayjeetAtGithub Jul 2, 2024
1335ff9
Add info to check args
JayjeetAtGithub Jul 3, 2024
e105141
Measure the query execution time using the Timer implementation from …
JayjeetAtGithub Jul 3, 2024
33b90ba
Turn on/off memory pol usage
JayjeetAtGithub Jul 3, 2024
5dda317
Add view creation sql to q1/q6
JayjeetAtGithub Jul 3, 2024
684e5e2
Push down filters in Q5
JayjeetAtGithub Jul 3, 2024
45a4b62
remove run.sh
JayjeetAtGithub Jul 3, 2024
68377ce
Cleanup parsing cli arguments
JayjeetAtGithub Jul 3, 2024
dec52ae
Rename groupby_context to groupby_context_t
JayjeetAtGithub Jul 3, 2024
f7ec78d
Add support for managed memory
JayjeetAtGithub Jul 3, 2024
4b9972e
Refactor mem management code
JayjeetAtGithub Jul 3, 2024
e69c34b
Fix indentation of queries
JayjeetAtGithub Jul 3, 2024
47199d5
Rename predicates
JayjeetAtGithub Jul 3, 2024
b7a57a7
Fix comments
JayjeetAtGithub Jul 3, 2024
2cb21ed
Dynamically determine scale of bin op
JayjeetAtGithub Jul 4, 2024
057f54a
use east const
JayjeetAtGithub Jul 4, 2024
87648c5
Rename rmm utilties
JayjeetAtGithub Jul 5, 2024
4798527
Fix append function
JayjeetAtGithub Jul 5, 2024
3032572
Address col id by name
JayjeetAtGithub Jul 5, 2024
2f1defa
Fix col id addressing
JayjeetAtGithub Jul 5, 2024
adde65b
Add name to col_id addressing
JayjeetAtGithub Jul 5, 2024
b805c41
Add comments
JayjeetAtGithub Jul 5, 2024
b7e25c9
Remove plot.png
JayjeetAtGithub Jul 5, 2024
43733e2
Fix the calc functions
JayjeetAtGithub Jul 5, 2024
4ae4538
Run clang-format
JayjeetAtGithub Jul 5, 2024
f0d6f6a
Change q5 for benchmarks in dt04
JayjeetAtGithub Jul 5, 2024
91ac503
Fix q5
JayjeetAtGithub Jul 5, 2024
cb0550c
Use float64 instead of decimal64
JayjeetAtGithub Jul 5, 2024
545cfb9
Add stream/mr params to new col calc functions
JayjeetAtGithub Jul 6, 2024
cadc195
Add stream / mr params
JayjeetAtGithub Jul 6, 2024
c552510
Fix the SQL queries
JayjeetAtGithub Jul 6, 2024
adf9456
Fix the join order for Q9
JayjeetAtGithub Jul 6, 2024
48d108a
Fix trailing whitespace
JayjeetAtGithub Jul 10, 2024
1bb2793
Update cpp/examples/tpch/CMakeLists.txt
JayjeetAtGithub Jul 11, 2024
174f998
Update cpp/examples/tpch/README.md
JayjeetAtGithub Jul 11, 2024
94e3b4e
Add docstring to remaining functions in utils
JayjeetAtGithub Jul 11, 2024
e8c8abb
Make the one scalars const
JayjeetAtGithub Jul 11, 2024
d830680
Add docstrings to column calculation functions
JayjeetAtGithub Jul 11, 2024
f75430d
Merge branch 'branch-24.08' into tpch-bench
mhaseeb123 Jul 11, 2024
e297deb
Add file-level docstring to Q1
JayjeetAtGithub Jul 11, 2024
146d45b
Add file-level docstring to Q5
JayjeetAtGithub Jul 11, 2024
90903f2
Add file-level docstring to Q6
JayjeetAtGithub Jul 11, 2024
0c78691
Add file-level docstring to Q9
JayjeetAtGithub Jul 11, 2024
194f08f
Add docstring to join_and_gather function
JayjeetAtGithub Jul 11, 2024
6b12762
Add consts to join_and_gather
JayjeetAtGithub Jul 11, 2024
9dd31a7
Add more const literals
JayjeetAtGithub Jul 11, 2024
0e98994
Add more const literals
JayjeetAtGithub Jul 11, 2024
792f33e
Add consts, nodiscards, and other qualifiers
JayjeetAtGithub Jul 11, 2024
5e103b9
Add more const references
JayjeetAtGithub Jul 11, 2024
a32e899
More improvements
JayjeetAtGithub Jul 11, 2024
781a460
Add [[nodiscard]] to append
JayjeetAtGithub Jul 12, 2024
4578a4b
Use std transform
JayjeetAtGithub Jul 12, 2024
fcdde79
Change to append in place
JayjeetAtGithub Jul 15, 2024
b0e764f
Address reviews
JayjeetAtGithub Jul 16, 2024
2435179
Add consts
JayjeetAtGithub Jul 16, 2024
d02cf08
Add consts
JayjeetAtGithub Jul 16, 2024
750d6a9
Misc fixes
JayjeetAtGithub Jul 16, 2024
0f984f0
Extract timer into cudf/utilities/timer.hpp
JayjeetAtGithub Jul 16, 2024
14be022
Add a cudf::timer
JayjeetAtGithub Jul 16, 2024
0ae2863
Fix the invalid cli args message
JayjeetAtGithub Jul 16, 2024
2ce182d
Allow chaining of appends
JayjeetAtGithub Jul 16, 2024
38f33ed
Move timer to a utilities dir in examples
JayjeetAtGithub Jul 16, 2024
c368953
Fix parquet_io and add the utilities dir
JayjeetAtGithub Jul 16, 2024
0fdf2f3
Move timer into cudf::example namespacE
JayjeetAtGithub Jul 16, 2024
c153636
Fix the consts for table with names fn
JayjeetAtGithub Jul 17, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions cpp/examples/build.sh
Original file line number Diff line number Diff line change
Expand Up @@ -57,6 +57,7 @@ build_example() {
}

build_example basic
build_example tpch
build_example strings
build_example nested_types
build_example parquet_io
4 changes: 3 additions & 1 deletion cpp/examples/parquet_io/parquet_io.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,8 @@

#include "parquet_io.hpp"

#include "../utilities/timer.hpp"

/**
* @file parquet_io.cpp
* @brief Demonstrates usage of the libcudf APIs to read and write
Expand Down Expand Up @@ -140,7 +142,7 @@ int main(int argc, char const** argv)
<< page_stat_string << ".." << std::endl;

// `timer` is automatically started here
Timer timer;
cudf::examples::timer timer;
write_parquet(input->view(), metadata, output_filepath, encoding, compression, page_stats);
timer.print_elapsed_millis();

Expand Down
31 changes: 0 additions & 31 deletions cpp/examples/parquet_io/parquet_io.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -124,34 +124,3 @@ std::shared_ptr<rmm::mr::device_memory_resource> create_memory_resource(bool is_

return std::nullopt;
}

/**
* @brief Light-weight timer for parquet reader and writer instrumentation
*
* Timer object constructed from std::chrono, instrumenting at microseconds
* precision. Can display elapsed durations at milli and micro second
* scales. Timer starts at object construction.
*/
class Timer {
public:
using micros = std::chrono::microseconds;
using millis = std::chrono::milliseconds;

Timer() { reset(); }
void reset() { start_time = std::chrono::high_resolution_clock::now(); }
auto elapsed() { return (std::chrono::high_resolution_clock::now() - start_time); }
void print_elapsed_micros()
{
std::cout << "Elapsed Time: " << std::chrono::duration_cast<micros>(elapsed()).count()
<< "us\n\n";
}
void print_elapsed_millis()
{
std::cout << "Elapsed Time: " << std::chrono::duration_cast<millis>(elapsed()).count()
<< "ms\n\n";
}

private:
using time_point_t = std::chrono::time_point<std::chrono::high_resolution_clock>;
time_point_t start_time;
};
32 changes: 32 additions & 0 deletions cpp/examples/tpch/CMakeLists.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
# Copyright (c) 2024, NVIDIA CORPORATION.

cmake_minimum_required(VERSION 3.26.4)

include(../set_cuda_architecture.cmake)

rapids_cuda_init_architectures(tpch_example)
rapids_cuda_set_architectures(RAPIDS)

project(
tpch_example
VERSION 0.0.1
LANGUAGES CXX CUDA
)

include(../fetch_dependencies.cmake)

add_executable(tpch_q1 q1.cpp)
target_link_libraries(tpch_q1 PRIVATE cudf::cudf)
target_compile_features(tpch_q1 PRIVATE cxx_std_17)

add_executable(tpch_q5 q5.cpp)
target_link_libraries(tpch_q5 PRIVATE cudf::cudf)
target_compile_features(tpch_q5 PRIVATE cxx_std_17)

add_executable(tpch_q6 q6.cpp)
target_link_libraries(tpch_q6 PRIVATE cudf::cudf)
target_compile_features(tpch_q6 PRIVATE cxx_std_17)

add_executable(tpch_q9 q9.cpp)
target_link_libraries(tpch_q9 PRIVATE cudf::cudf)
target_compile_features(tpch_q9 PRIVATE cxx_std_17)
mhaseeb123 marked this conversation as resolved.
Show resolved Hide resolved
38 changes: 38 additions & 0 deletions cpp/examples/tpch/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
# TPC-H Inspired Examples

Implements TPC-H queries using `libcudf`. We leverage the data generator (wrapper around official TPC-H datagen) from [Apache Datafusion](https://github.com/apache/datafusion) for generating data in Parquet format.

## Requirements

- Rust

## Generating the Dataset

1. Clone the datafusion repository.
```bash
git clone [email protected]:apache/datafusion.git
```

2. Run the data generator. The data will be placed in a `data/` subdirectory.
```bash
cd datafusion/benchmarks/
./bench.sh data tpch

# for scale factor 10,
./bench.sh data tpch10
```

## Running Queries

1. Build the examples.
```bash
cd cpp/examples
./build.sh
```
The TPC-H query binaries would be built inside `examples/tpch/build`.

2. Execute the queries.
```bash
./tpch/build/tpch_q1
```
A parquet file named `q1.parquet` would be generated holding the results of the query.
174 changes: 174 additions & 0 deletions cpp/examples/tpch/q1.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,174 @@
/*
* Copyright (c) 2024, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include "../utilities/timer.hpp"
#include "utils.hpp"

#include <cudf/ast/expressions.hpp>
#include <cudf/column/column.hpp>
#include <cudf/scalar/scalar.hpp>

/**
* @file q1.cpp
* @brief Implement query 1 of the TPC-H benchmark.
*
* create view lineitem as select * from '/tables/scale-1/lineitem.parquet';
*
* select
* l_returnflag,
* l_linestatus,
* sum(l_quantity) as sum_qty,
* sum(l_extendedprice) as sum_base_price,
* sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
* sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
* avg(l_quantity) as avg_qty,
* avg(l_extendedprice) as avg_price,
* avg(l_discount) as avg_disc,
* count(*) as count_order
* from
* lineitem
* where
* l_shipdate <= date '1998-09-02'
* group by
* l_returnflag,
* l_linestatus
* order by
* l_returnflag,
* l_linestatus;
*/

/**
* @brief Calculate the discount price column
*
* @param discount The discount column
* @param extendedprice The extended price column
* @param stream The CUDA stream used for device memory operations and kernel launches.
* @param mr Device memory resource used to allocate the returned column's device memory.
*/
[[nodiscard]] std::unique_ptr<cudf::column> calc_disc_price(
cudf::column_view const& discount,
cudf::column_view const& extendedprice,
rmm::cuda_stream_view stream = cudf::get_default_stream(),
rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())
{
auto const one = cudf::numeric_scalar<double>(1);
auto const one_minus_discount =
cudf::binary_operation(one, discount, cudf::binary_operator::SUB, discount.type(), stream, mr);
auto const disc_price_type = cudf::data_type{cudf::type_id::FLOAT64};
auto disc_price = cudf::binary_operation(extendedprice,
one_minus_discount->view(),
cudf::binary_operator::MUL,
disc_price_type,
stream,
mr);
return disc_price;
}

/**
* @brief Calculate the charge column
*
* @param tax The tax column
* @param disc_price The discount price column
* @param stream The CUDA stream used for device memory operations and kernel launches.
* @param mr Device memory resource used to allocate the returned column's device memory.
*/
[[nodiscard]] std::unique_ptr<cudf::column> calc_charge(
cudf::column_view const& tax,
cudf::column_view const& disc_price,
rmm::cuda_stream_view stream = cudf::get_default_stream(),
rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())
{
auto const one = cudf::numeric_scalar<double>(1);
auto const one_plus_tax =
cudf::binary_operation(one, tax, cudf::binary_operator::ADD, tax.type(), stream, mr);
auto const charge_type = cudf::data_type{cudf::type_id::FLOAT64};
auto charge = cudf::binary_operation(
disc_price, one_plus_tax->view(), cudf::binary_operator::MUL, charge_type, stream, mr);
return charge;
}

int main(int argc, char const** argv)
{
auto const args = parse_args(argc, argv);

// Use a memory pool
auto resource = create_memory_resource(args.memory_resource_type);
rmm::mr::set_current_device_resource(resource.get());

cudf::examples::timer timer;

// Define the column projections and filter predicate for `lineitem` table
std::vector<std::string> const lineitem_cols = {"l_returnflag",
"l_linestatus",
"l_quantity",
"l_extendedprice",
"l_discount",
"l_shipdate",
"l_orderkey",
"l_tax"};
auto const shipdate_ref = cudf::ast::column_reference(std::distance(
lineitem_cols.begin(), std::find(lineitem_cols.begin(), lineitem_cols.end(), "l_shipdate")));
auto shipdate_upper =
cudf::timestamp_scalar<cudf::timestamp_D>(days_since_epoch(1998, 9, 2), true);
auto const shipdate_upper_literal = cudf::ast::literal(shipdate_upper);
auto lineitem_pred = std::make_unique<cudf::ast::operation>(
cudf::ast::ast_operator::LESS_EQUAL, shipdate_ref, shipdate_upper_literal);

// Read out the `lineitem` table from parquet file
auto lineitem =
JayjeetAtGithub marked this conversation as resolved.
Show resolved Hide resolved
read_parquet(args.dataset_dir + "/lineitem.parquet", lineitem_cols, std::move(lineitem_pred));

// Calculate the discount price and charge columns and append to lineitem table
auto disc_price =
calc_disc_price(lineitem->column("l_discount"), lineitem->column("l_extendedprice"));
auto charge = calc_charge(lineitem->column("l_tax"), disc_price->view());
(*lineitem).append(disc_price, "disc_price").append(charge, "charge");

// Perform the group by operation
auto const groupedby_table = apply_groupby(
lineitem,
groupby_context_t{
{"l_returnflag", "l_linestatus"},
{
{"l_extendedprice",
{{cudf::aggregation::Kind::SUM, "sum_base_price"},
{cudf::aggregation::Kind::MEAN, "avg_price"}}},
{"l_quantity",
{{cudf::aggregation::Kind::SUM, "sum_qty"}, {cudf::aggregation::Kind::MEAN, "avg_qty"}}},
{"l_discount",
{
{cudf::aggregation::Kind::MEAN, "avg_disc"},
}},
{"disc_price",
{
{cudf::aggregation::Kind::SUM, "sum_disc_price"},
}},
{"charge",
{{cudf::aggregation::Kind::SUM, "sum_charge"},
{cudf::aggregation::Kind::COUNT_ALL, "count_order"}}},
}});

// Perform the order by operation
auto const orderedby_table = apply_orderby(groupedby_table,
{"l_returnflag", "l_linestatus"},
{cudf::order::ASCENDING, cudf::order::ASCENDING});

timer.print_elapsed_millis();

// Write query result to a parquet file
orderedby_table->to_parquet("q1.parquet");
return 0;
}
Loading
Loading