Parquet decimal128 support #9706

Closed
wants to merge 138 commits into from
Changes from 1 commit
138 commits
c8a171c
Initial changes
Jul 20, 2021
afe6ec6
More changes
Jul 20, 2021
43b615a
Small cleanup
Jul 20, 2021
ebedcad
Small cleanup
Jul 20, 2021
1d2e0b4
Removal of device_storage_type_id, formatting and more
Jul 21, 2021
2ea39fe
Formatting
Jul 21, 2021
606d6e3
`cudf::round` support for `__int128_t`
Jul 21, 2021
ee70203
Enable tests & fixes
Jul 21, 2021
fd6157b
Missing changes
Jul 23, 2021
d4506af
Scan, column_wrapper, orc, etc
Jul 23, 2021
791e91c
Binop changes
Jul 23, 2021
ad5fe35
detail::to_string
Jul 24, 2021
7cc9db1
Aggregation changes
Jul 24, 2021
5dd6874
Small fix in fixed_point.hpp
Jul 25, 2021
a89f958
Enable quantile
Jul 25, 2021
a16a2b8
Comment update
Jul 26, 2021
e89a9ba
REDUCTION_TEST working changes
Jul 26, 2021
7ef28bf
ROLLING_TEST changes
Jul 26, 2021
7fd4ac4
Initial changes for STRINGS_TEST
Jul 26, 2021
016c35a
STRINGS changes
Jul 27, 2021
dbd0504
Clean up
Jul 27, 2021
9c764e6
Merge remote-tracking branch 'upstream/branch-21.10' into decimal128
Jul 27, 2021
bf34d20
std::is_same_v
Jul 27, 2021
103a4db
is_integral & is_arithmetic
Jul 27, 2021
575fca7
Clean up
Jul 27, 2021
8549753
Fixes / cleanup
Jul 27, 2021
22de55a
DECIMAL128 custom reduction tests
Jul 27, 2021
5b69c0c
Another REDUCTION test
Jul 27, 2021
95667c8
numeric_limits / temporary cleanup
Jul 28, 2021
825ab86
More changes, 10+ files
Jul 29, 2021
f6c0938
Merge branch 'branch-21.10' into decimal128
Jul 29, 2021
321761c
Fix for TRANSFORM_TEST
Jul 29, 2021
02b0044
Rename FixedPointTestBothReps
Jul 29, 2021
95a107c
test group_by for only decimal32/64
Aug 3, 2021
0d8aa36
Using cuda::std:: for utility functions
Aug 4, 2021
73b3682
cudf::fill(_in_place) fix for decimal128
Aug 4, 2021
bcd1836
Remove TODOs
Aug 5, 2021
84f394b
Initial string conversion changes
Aug 23, 2021
754156a
Merge branch 'branch-21.10' into decimal128
Aug 24, 2021
7031551
Final string changes
Aug 24, 2021
ea97b9d
Enhance casting tests for decimal128
Aug 25, 2021
b98290c
Merge branch 'branch-21.10' into decimal128
Aug 25, 2021
363e0ed
Merge branch 'branch-21.10' into decimal128
Aug 25, 2021
655ccee
Merge conflict fixes
Aug 26, 2021
2a894bd
Missed STRINGS fixes
Aug 26, 2021
d881321
Enhance STRINGS_TEST
Aug 26, 2021
1380a0c
Enhance ROUND tests
Aug 26, 2021
b5d4493
Fix FIXED_POINT_TESTs
Aug 26, 2021
8715196
Enhance GROUPBY_TEST for decimal128
Aug 27, 2021
7952e90
Delete commented out code
Aug 27, 2021
3115666
Merge branch 'branch-21.10' into decimal128
Aug 27, 2021
932747e
Merge branch 'branch-21.10' into decimal128
Aug 28, 2021
10d58a3
Support hash groupby decimal128 (by making is sort) - initial change
Aug 31, 2021
60ce655
has_atomic_support
Aug 31, 2021
28aca7d
TEMPORARY - will revert later
Sep 1, 2021
4b52596
Merge branch 'branch-21.10' into decimal128
Sep 2, 2021
2951b2f
Merge branch 'branch-21.10' into decimal128
Sep 7, 2021
b515a93
Merge branch 'branch-21.10' into decimal128
Sep 8, 2021
fe446a4
Block group_by mean for decimal types
Sep 9, 2021
39d2573
Merge branch 'branch-21.10' into decimal128
Sep 9, 2021
efd0b62
Revert non-comprehensive fix
Sep 9, 2021
c52769a
Merge branch 'branch-21.10' into decimal128
Sep 9, 2021
5622a84
binary op changes
Sep 17, 2021
5ebd1bb
add checks to jit binary op
Sep 17, 2021
cb4e389
Final changes for binary ops
Sep 21, 2021
4c81f57
Add more binop tests
Sep 21, 2021
58b23cd
Temporary fix for chrono groupby min_tests
Sep 22, 2021
1f3284f
decimal128 comparision tests
Sep 22, 2021
7713bc4
Enhance decimal128 comparison tests
Sep 22, 2021
2de00b8
small cleanup
Sep 22, 2021
ea36188
cleanup
Sep 23, 2021
c7c0d9d
Merge branch 'decimal128' of https://gitlab-master.nvidia.com/choekst…
Oct 5, 2021
3bf389b
Merge branch 'branch-21.12' into decimal128
Oct 5, 2021
d093ae8
Fix rounding issues with DECIMAL128
revans2 Oct 5, 2021
4d82d30
Merge branch 'decimal128_round' into 'decimal128'
Oct 6, 2021
7eedaea
Use numeric::detail::abs in round.cu
Oct 6, 2021
892df4f
Merge branch 'branch-21.12' into decimal128
codereport Oct 20, 2021
a810927
Add cuda:: and if constexpr check
codereport Oct 20, 2021
9286b43
Clang format :)
codereport Oct 20, 2021
4ad26f4
Cleanup
codereport Oct 22, 2021
3892e73
Cleanup
codereport Oct 22, 2021
8e9bd90
Missing clang-format
codereport Oct 22, 2021
41cc23a
digits10
codereport Oct 22, 2021
921ff12
Clean up
codereport Oct 22, 2021
a5e4187
IO changes
codereport Oct 22, 2021
d87c9d4
Fix and partial test updates
codereport Oct 22, 2021
3b9a611
Clean up
codereport Oct 25, 2021
5bab167
Update libcudacxx
codereport Oct 25, 2021
a4c03e5
Fixing OrcWriterTestDecimal.Decimal64 test
codereport Oct 26, 2021
976fb74
Fix rest of ORC_TEST
codereport Oct 26, 2021
c9c7250
ORC changes for decimal128
codereport Oct 26, 2021
46bd2d8
ORC fixes for decima128
codereport Oct 26, 2021
8a86d76
Binary op changes / GROUPBY_TEST working
codereport Oct 27, 2021
e54d3fa
Test for blog
codereport Oct 28, 2021
85c52ad
Merge branch 'branch-21.12' into decimal128
codereport Nov 1, 2021
92694b8
Merge conflict fix
codereport Nov 1, 2021
44d0573
Temporary fix
codereport Nov 2, 2021
99a82ee
Update CONTRIBUTING.md
codereport Nov 2, 2021
c061a54
Merge branch 'branch-21.12' into decimal128
codereport Nov 2, 2021
99ad08b
Temporary
codereport Nov 2, 2021
abcc4db
Merge branch 'branch-21.12' into decimal128
codereport Nov 2, 2021
95a2402
Sum Aggregation uses same type for accumulator
codereport Nov 3, 2021
5ecd793
ORC changes
codereport Nov 3, 2021
f55e050
Full ORC fix
codereport Nov 3, 2021
216385a
clang-format
codereport Nov 3, 2021
67858b6
Merge branch 'branch-21.12' into decimal128
codereport Nov 4, 2021
7ba47c7
Reapply temporary fix
codereport Nov 4, 2021
1034057
Perf improvement for rescale
codereport Nov 4, 2021
d6e9ee8
default to dec64;make1128 slectable;fix tests;add options test
vuule Nov 3, 2021
4411d8e
use paths for decimal types API; iron out generated column names
vuule Nov 5, 2021
61b3677
small clean up
vuule Nov 5, 2021
fb04067
Merge branch 'branch-21.12' into decimal128
codereport Nov 8, 2021
7c01f21
ROLLING_TEST fix
codereport Nov 8, 2021
63a0004
clang-format
codereport Nov 8, 2021
d3c589c
Update meta.yaml
codereport Nov 8, 2021
27a2e58
Cmake formatting
codereport Nov 8, 2021
9e2184f
Cleaning up has_atomic_support
codereport Nov 8, 2021
8634dea
Cleanup
codereport Nov 8, 2021
4b5dbe2
Use has_atomic_support
codereport Nov 9, 2021
420abcc
Merge branch 'branch-21.12' into decimal128
codereport Nov 9, 2021
860bcbb
Fix silent failure
codereport Nov 9, 2021
89004c7
docs cleanup
codereport Nov 9, 2021
12e5b20
Cleanup
codereport Nov 9, 2021
46368f3
Merge branch 'branch-21.12' into decimal128
codereport Nov 10, 2021
3ef6a09
Additional decimal128 string tests
codereport Nov 10, 2021
287cfaf
Merge branch 'branch-21.12' into decimal128
codereport Nov 10, 2021
ec8e74a
count_digits
codereport Nov 11, 2021
e365080
final string changes
codereport Nov 11, 2021
0b4fd80
Merge branch 'branch-21.12' into decimal128
codereport Nov 12, 2021
a0d5d0c
use enable_if
codereport Nov 12, 2021
dd37950
clang-format
codereport Nov 12, 2021
fc4c1d1
Fix fix
codereport Nov 12, 2021
08da157
Cleanup
codereport Nov 15, 2021
c23038b
Merge branch 'branch-21.12' into decimal128
codereport Nov 15, 2021
201a091
is_chrono min/max identity
codereport Nov 16, 2021
f0afd8d
Use exp10
codereport Nov 16, 2021
95ee95c
clang-format
codereport Nov 16, 2021
0b7c32e
Writer changes
devavret Nov 16, 2021
1 change: 1 addition & 0 deletions cpp/src/io/parquet/chunk_dict.cu
@@ -188,6 +188,7 @@ __global__ void __launch_bounds__(block_size, 1)
return 4 + data_col.element<string_view>(val_idx).size_bytes();
}
case Type::FIXED_LEN_BYTE_ARRAY:
if (data_col.type().id() == type_id::DECIMAL128) { return 16; }
default: cudf_assert(false && "Unsupported type for dictionary encoding"); return 0;
}
}();
29 changes: 27 additions & 2 deletions cpp/src/io/parquet/page_enc.cu
@@ -176,7 +176,8 @@ __global__ void __launch_bounds__(block_size)
}
}
dtype = s->col.physical_type;
dtype_len = (dtype == INT96) ? 12
dtype_len = (dtype == FIXED_LEN_BYTE_ARRAY) ? 16
[Review comment, Contributor Author] This should probably also check whether the column type is decimal128.
: (dtype == INT96) ? 12
: (dtype == INT64 || dtype == DOUBLE) ? 8
: (dtype == BOOLEAN) ? 1
: 4;
@@ -878,7 +879,8 @@ __global__ void __launch_bounds__(128, 8)
// Encode data values
__syncthreads();
dtype = s->col.physical_type;
dtype_len_out = (dtype == INT96) ? 12
dtype_len_out = (dtype == FIXED_LEN_BYTE_ARRAY) ? 16
: (dtype == INT96) ? 12
: (dtype == INT64 || dtype == DOUBLE) ? 8
: (dtype == BOOLEAN) ? 1
: 4;
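
The width selection in the two hunks above maps every FIXED_LEN_BYTE_ARRAY column to 16 bytes, and the review comment asks for an extra check on the column type. A minimal host-side sketch of that stricter selection follows. The `physical_type` and `column_type` enums and the `value_width` function are hypothetical stand-ins, not the library's definitions, and returning 0 for an unknown width is an assumption made for illustration.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the Parquet physical type and the column type id.
enum class physical_type {
  BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY
};
enum class column_type { DECIMAL128, OTHER };

// Returns the encoded width in bytes of one value.
// A fixed-length byte array is only given a width when the column is decimal128;
// any other fixed-length column yields 0 (width not known in this sketch).
constexpr std::uint32_t value_width(physical_type dtype, column_type col)
{
  switch (dtype) {
    case physical_type::FIXED_LEN_BYTE_ARRAY:
      return col == column_type::DECIMAL128 ? 16 : 0;
    case physical_type::INT96: return 12;
    case physical_type::INT64:
    case physical_type::DOUBLE: return 8;
    case physical_type::BOOLEAN: return 1;
    default: return 4;
  }
}
```

Keeping the selection in one function would also let both kernels share it instead of repeating the ternary chain.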
@@ -1087,6 +1089,29 @@ __global__ void __launch_bounds__(128, 8)
dst[pos + 3] = v >> 24;
if (v != 0) memcpy(dst + pos + 4, str.data(), v);
} break;
case FIXED_LEN_BYTE_ARRAY: {
if (s->col.leaf_column->type().id() == type_id::DECIMAL128) {
// When using FIXED_LEN_BYTE_ARRAY for decimals, the rep is encoded in big-endian
auto v = s->col.leaf_column->element<numeric::decimal128>(val_idx).value();
auto v_ = reinterpret_cast<char*>(&v);
dst[pos + 0] = v_[15];
dst[pos + 1] = v_[14];
dst[pos + 2] = v_[13];
dst[pos + 3] = v_[12];
dst[pos + 4] = v_[11];
dst[pos + 5] = v_[10];
dst[pos + 6] = v_[9];
dst[pos + 7] = v_[8];
dst[pos + 8] = v_[7];
dst[pos + 9] = v_[6];
dst[pos + 10] = v_[5];
dst[pos + 11] = v_[4];
dst[pos + 12] = v_[3];
dst[pos + 13] = v_[2];
dst[pos + 14] = v_[1];
dst[pos + 15] = v_[0];
}
} break;
}
}
__syncthreads();
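
The sixteen assignments in the FIXED_LEN_BYTE_ARRAY case reverse the in-memory bytes of the 128-bit value, which gives big-endian output when the value is stored little-endian. The same result can be sketched with shifts, which does not depend on the byte order of the machine. This is a host-side illustration only: `store_big_endian_128` is a hypothetical helper, and `__int128` (a GCC/Clang extension) stands in for the decimal128 representation type.

```cpp
#include <cstddef>
#include <cstdint>

// Writes the two's-complement representation of `value` into dst[0..15]
// in big-endian order (most significant byte first).
inline void store_big_endian_128(__int128 value, std::uint8_t* dst)
{
  auto bits = static_cast<unsigned __int128>(value);
  for (std::size_t i = 0; i < 16; ++i) {
    // Peel off the least significant byte and place it at the far end.
    dst[15 - i] = static_cast<std::uint8_t>(bits & 0xFF);
    bits >>= 8;
  }
}
```

A loop like this is shorter than the unrolled form, though whether it performs the same inside the kernel is not something the diff shows.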
9 changes: 6 additions & 3 deletions cpp/src/io/parquet/writer_impl.cu
@@ -343,7 +343,9 @@ struct leaf_schema_fn {
col_schema.type = Type::INT64;
col_schema.stats_dtype = statistics_dtype::dtype_decimal64;
} else if (std::is_same_v<T, numeric::decimal128>) {
CUDF_FAIL("decimal128 currently not supported for parquet writer");
col_schema.type = Type::FIXED_LEN_BYTE_ARRAY;
col_schema.type_length = 16;
col_schema.stats_dtype = statistics_dtype::dtype_decimal128;
} else {
CUDF_FAIL("Unsupported fixed point type for parquet writer");
}
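
The schema change above stores decimal128 as a FIXED_LEN_BYTE_ARRAY with `type_length` 16. As a rough check on what that width can hold, the sketch below computes the largest decimal precision that always fits in an n-byte signed integer, floor(log10(2^(8n-1) - 1)), which is the bound the Parquet format documentation gives for fixed-length decimals as far as I recall. `max_decimal_precision` is a hypothetical helper and is not part of the writer.

```cpp
#include <cmath>

// Largest number of base-10 digits that always fits in a signed
// two's-complement integer of `num_bytes` bytes.
// floor(log10(2^k - 1)) equals floor(k * log10(2)) for k >= 1, because
// 2^k is never an exact power of ten.
inline int max_decimal_precision(int num_bytes)
{
  return static_cast<int>(std::floor((8 * num_bytes - 1) * std::log10(2.0)));
}
```

Under this bound, 4, 8 and 16 bytes allow 9, 18 and 38 digits respectively.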
@@ -1208,8 +1210,9 @@ void writer::impl::write(table_view const& table)
hostdevice_2dvector<gpu::EncColumnChunk> chunks(num_rowgroups, num_columns, stream);
for (uint32_t r = 0, global_r = global_rowgroup_base, f = 0, start_row = 0; r < num_rowgroups;
r++, global_r++) {
uint32_t fragments_in_chunk = (uint32_t)(
(md.row_groups[global_r].num_rows + max_page_fragment_size - 1) / max_page_fragment_size);
uint32_t fragments_in_chunk =
(uint32_t)((md.row_groups[global_r].num_rows + max_page_fragment_size - 1) /
max_page_fragment_size);
md.row_groups[global_r].total_byte_size = 0;
md.row_groups[global_r].columns.resize(num_columns);
for (int i = 0; i < num_columns; i++) {
52 changes: 52 additions & 0 deletions cpp/tests/io/parquet_test.cpp
@@ -463,6 +463,58 @@ TEST_F(ParquetWriterTest, MultiColumn)
cudf::test::expect_metadata_equal(expected_metadata, result.metadata);
}

TEST_F(ParquetWriterTest, DecimalColumns)
{
constexpr auto num_rows = 5;

// auto col0_data = random_values<bool>(num_rows);
auto col6_vals = random_values<int32_t>(num_rows);
auto col7_vals = random_values<int64_t>(num_rows);
auto col6_data = cudf::detail::make_counting_transform_iterator(0, [col6_vals](auto i) {
return numeric::decimal32{col6_vals[i], numeric::scale_type{5}};
});
auto col7_data = cudf::detail::make_counting_transform_iterator(0, [col6_vals](auto i) {
return numeric::decimal64{col6_vals[i], numeric::scale_type{5}};
});
auto col8_data = cudf::detail::make_counting_transform_iterator(0, [col6_vals](auto i) {
return numeric::decimal128{i * 10000, numeric::scale_type{2}};
});
auto validity = cudf::detail::make_counting_transform_iterator(0, [](auto i) { return true; });

// column_wrapper<bool> col0{
// col0_data.begin(), col0_data.end(), validity};
column_wrapper<numeric::decimal32> col6{col6_data, col6_data + num_rows, validity};
column_wrapper<numeric::decimal64> col7{col7_data, col7_data + num_rows, validity};
column_wrapper<numeric::decimal128> col8{col8_data, col8_data + num_rows, validity};

std::vector<std::unique_ptr<column>> cols;
// cols.push_back(col0.release());
cols.push_back(col6.release());
cols.push_back(col7.release());
cols.push_back(col8.release());
auto expected = std::make_unique<table>(std::move(cols));
EXPECT_EQ(3, expected->num_columns());

cudf_io::table_input_metadata expected_metadata(*expected);
// expected_metadata.column_metadata[0].set_name( "bools");
expected_metadata.column_metadata[0].set_name("decimal32s").set_decimal_precision(10);
expected_metadata.column_metadata[1].set_name("decimal64s").set_decimal_precision(10);
expected_metadata.column_metadata[2].set_name("decimal128s").set_decimal_precision(10);

auto filepath = ("MultiColumn.parquet");
[Review comment, Contributor Author] This test fails because the reader doesn't support reading decimal128, but the file created by this test can be read with pyarrow to confirm that the writing is correct.
cudf_io::parquet_writer_options out_opts =
cudf_io::parquet_writer_options::builder(cudf_io::sink_info{filepath}, expected->view())
.metadata(&expected_metadata);
cudf_io::write_parquet(out_opts);

cudf_io::parquet_reader_options in_opts =
cudf_io::parquet_reader_options::builder(cudf_io::source_info{filepath});
auto result = cudf_io::read_parquet(in_opts);

CUDF_TEST_EXPECT_TABLES_EQUAL(expected->view(), result.tbl->view());
cudf::test::expect_metadata_equal(expected_metadata, result.metadata);
}

TEST_F(ParquetWriterTest, MultiColumnWithNulls)
{
constexpr auto num_rows = 100;