Add struct support to parquet writer #7461

Merged 105 commits on Mar 19, 2021
Changes from 103 commits
Commits
26f96bf
Add column_device_view pointers to EncColumnDesc
kaatish Jan 7, 2021
1f5b6c4
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into p…
kaatish Jan 8, 2021
9aba2b5
Fix compilation
kaatish Jan 8, 2021
1946fed
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into p…
kaatish Jan 12, 2021
d507132
Replace outdated calls in page_dict
kaatish Jan 12, 2021
d286962
Add GetDtypeLogicalLen for column_device_view
kaatish Jan 12, 2021
f52fa95
PR comment fixes
kaatish Jan 14, 2021
bc4c771
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into p…
kaatish Jan 14, 2021
a5064bb
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into p…
kaatish Jan 20, 2021
e5d02f2
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into p…
kaatish Jan 21, 2021
d19b8f1
Fix tests
kaatish Jan 21, 2021
f7658c6
Schema fix
kaatish Jan 21, 2021
90b5da0
PR review fixes
kaatish Jan 22, 2021
9d54927
Built schema and linked_column_view
devavret Jan 7, 2021
0caffde
Converting linked_column_view to single inheritance column_view in pa…
devavret Jan 7, 2021
4ebb4fd
Merge remote-tracking branch 'kaatish/parquet-writer-col-device-view'…
devavret Jan 27, 2021
d0e2613
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into p…
kaatish Jan 28, 2021
57d6afd
PR review fixes
kaatish Jan 28, 2021
c2393a8
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into p…
kaatish Jan 28, 2021
f5203f5
PR review fixes
kaatish Jan 28, 2021
c80636a
Plumb new parquet column and schema into write()
devavret Jan 28, 2021
9b2070b
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into p…
kaatish Jan 29, 2021
25301a9
Format fixes
kaatish Jan 29, 2021
3522264
Removed get_mask_offset_word
kaatish Jan 29, 2021
bd0691f
PR review fixes
kaatish Feb 1, 2021
610c6b4
Merge remote-tracking branch 'kaatish/parquet-writer-col-device-view'…
devavret Feb 1, 2021
b828c31
Non-list structs can now be written
devavret Feb 3, 2021
73e8631
Fixed non-nested case.
devavret Feb 3, 2021
6b648d4
Re-enable non-int96 chrono type support
devavret Feb 3, 2021
dbfae18
Added column name to struct
devavret Feb 4, 2021
2731bc3
Merge remote-tracking branch 'origin/branch-0.19' into parquet-writer…
devavret Feb 5, 2021
b4494d5
1 level list working again
devavret Feb 8, 2021
5c722b1
Add new heirarchical input metadata
devavret Feb 9, 2021
48e7ad4
Fix bug in linked column view
devavret Feb 9, 2021
39be46d
Recovered commits from git corruption
devavret Feb 19, 2021
5d1e375
Fixing tests by using new input schema
devavret Feb 19, 2021
0952b10
Chunked writing struct with prescribed nullability
devavret Feb 22, 2021
3109323
Fixing an issue in struct chunked writing.
devavret Feb 22, 2021
6a9aefc
Little bit of test cleanups
devavret Feb 22, 2021
98c6ceb
Chunked writing List of struct
devavret Feb 22, 2021
aa35a44
Chunked: List of struct of struct of list of list
devavret Feb 23, 2021
1932646
Small enhancement that removes an unnecessary counting iterator from …
devavret Feb 23, 2021
6d91143
Sliced Non-list struct column
devavret Feb 23, 2021
7c62c9f
Initial commit
kaatish Feb 24, 2021
59da82b
Refactor stats calculation
kaatish Feb 24, 2021
97d5bb8
Writing of list/struct mixed case of sliced table
devavret Feb 24, 2021
5c9df3c
Style fix
kaatish Feb 24, 2021
e4730d5
Remove unnecessary max_string_sentinel
kaatish Feb 24, 2021
1f37237
Reorganize string_view functions
kaatish Feb 24, 2021
18c4ed9
Format fix
kaatish Feb 24, 2021
a80be62
Cleanup
kaatish Feb 24, 2021
274850b
Remove unnecessary header file
kaatish Feb 24, 2021
22fd3b0
Default names for children of structs
devavret Feb 24, 2021
2922fa2
PR review fixes
kaatish Feb 25, 2021
9547f59
Cleanups:
devavret Feb 25, 2021
751187e
Verify metadata structure is same as table
devavret Feb 25, 2021
2caea09
TODO annotations
devavret Feb 25, 2021
ec2deda
Fix regression introduced in sliced writing commit
devavret Feb 26, 2021
cb614d2
Fix max vals logic in dremel
devavret Feb 26, 2021
5ce33b4
Fix input metadata for strings.
devavret Feb 26, 2021
4abba0e
Cleanup dremel_data
devavret Feb 26, 2021
90cfc46
Merge remote-tracking branch 'rapidsai/branch-0.19' into parquet-writ…
devavret Feb 26, 2021
9e2d948
PR review fixes
kaatish Mar 1, 2021
1bb1815
Merge remote-tracking branch 'kaatish/statistics-cleanup' into parque…
devavret Mar 1, 2021
2a94bd0
Fixing merge issue.
devavret Mar 1, 2021
ad88e10
Thanks to kaatish's PR, strings now just work.
devavret Mar 1, 2021
ba956ee
Merge remote-tracking branch 'rapidsai/branch-0.19' into parquet-writ…
devavret Mar 2, 2021
67cca79
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into s…
kaatish Mar 3, 2021
b621654
Merge remote-tracking branch 'rapidsai/branch-0.19' into parquet-writ…
devavret Mar 3, 2021
4127bd6
PR review fixes
kaatish Mar 3, 2021
90c6714
PR review fixes
kaatish Mar 4, 2021
5c5a925
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into s…
kaatish Mar 4, 2021
fd41672
Fix compilation issues
kaatish Mar 4, 2021
dffb14f
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into s…
kaatish Mar 4, 2021
8d6f503
Style fix
kaatish Mar 4, 2021
eaa6f8d
Cython bindings and Add setters and getters to column input metadata
devavret Mar 4, 2021
39fdbd5
Fix compilation issues
kaatish Mar 4, 2021
5d7700c
Merge remote-tracking branch 'kaatish/statistics-cleanup' into parque…
devavret Mar 5, 2021
fd6d692
Cython bindings for chunked parquet writer
devavret Mar 5, 2021
f83a3b9
Enable int96 timestamps. Remove dependence on converted type in encod…
devavret Mar 5, 2021
42bb155
style fixes
devavret Mar 7, 2021
afb7e40
Merge remote-tracking branch 'rapidsai/branch-0.19' into parquet-writ…
devavret Mar 7, 2021
df948b0
Cleanups
devavret Mar 8, 2021
e762634
Update metadata API
devavret Mar 8, 2021
9d9efec
Move comon def level logic from dremel data calc
devavret Mar 9, 2021
5fe19b6
Fix dictionary allocation for sliced columns
devavret Mar 9, 2021
f248eb7
Add metadata roundtrip test.
devavret Mar 9, 2021
84349c8
Fix issue with empty string and list columns
devavret Mar 10, 2021
3b5eae1
Enable struct writing in python. Add pytest
devavret Mar 11, 2021
1cb80b3
API name change and spelling fixes
devavret Mar 11, 2021
afc0337
Attempting to fix broken CI
devavret Mar 11, 2021
44d9ae5
Use type dispatcher
devavret Mar 11, 2021
c53bd77
Merge branch 'branch-0.19' into parquet-writer-struct-schema
devavret Mar 12, 2021
b7eef93
Review changes
devavret Mar 15, 2021
7936ba7
Make API fluent and other API related review edits
devavret Mar 15, 2021
30f1c6c
Dave Baranec review fixes
devavret Mar 15, 2021
d8b1e77
Review fixes
devavret Mar 16, 2021
03d92c3
Add documentation regarding struct levels in get_dremel_data
devavret Mar 16, 2021
ee192c1
spelling mistake again in hierarchy
devavret Mar 16, 2021
a7c64f7
replace list with vector for path in schema and nullability
devavret Mar 16, 2021
075a571
Review cleanups
devavret Mar 17, 2021
8414464
Move long functions out of parquet_column_view
devavret Mar 17, 2021
4c2e632
Short explanation of indices used for level
devavret Mar 17, 2021
491b9c5
Review fix in cython
devavret Mar 18, 2021
401d11e
Move metadata copying to writer ctor
devavret Mar 18, 2021
233 changes: 172 additions & 61 deletions cpp/include/cudf/io/parquet.hpp
@@ -376,6 +376,162 @@ table_with_metadata read_parquet(
* @{
* @file
*/
class table_input_metadata;

class column_in_metadata {
friend table_input_metadata;
std::string _name = "";
thrust::optional<bool> _nullable;
// TODO: This isn't implemented yet
bool _list_column_is_map = false;
bool _use_int96_timestamp = false;
// bool _output_as_binary = false;
thrust::optional<uint8_t> _decimal_precision;
std::vector<column_in_metadata> children;

public:
/**
* @brief Set the name of this column
*
* @return this for chaining
*/
column_in_metadata& set_name(std::string const& name)
{
_name = name;
return *this;
}

/**
* @brief Set the nullability of this column
*
* Only valid in case of chunked writes. In single writes, this option is ignored.
*
* @return this for chaining
*/
column_in_metadata& set_nullability(bool nullable)
{
_nullable = nullable;
return *this;
}

/**
* @brief Specify that this list column should be encoded as a map in the written parquet file
*
* The column must have the structure list<struct<key, value>>. This option is invalid otherwise
*
* @return this for chaining
*/
column_in_metadata& set_list_column_as_map()
{
_list_column_is_map = true;
return *this;
}

/**
* @brief Specifies whether this timestamp column should be encoded using the deprecated int96
* physical type. Only valid for the following column types:
* timestamp_s, timestamp_ms, timestamp_us, timestamp_ns
*
* @param req True = use int96 physical type. False = use int64 physical type
* @return this for chaining
*/
column_in_metadata& set_int96_timestamps(bool req)
{
_use_int96_timestamp = req;
return *this;
}

/**
* @brief Set the decimal precision of this column. Only valid if this column is a decimal
* (fixed-point) type
*
* @param precision The integer precision to set for this decimal column
* @return this for chaining
*/
column_in_metadata& set_decimal_precision(uint8_t precision)
{
_decimal_precision = precision;
return *this;
}

/**
* @brief Get reference to a child of this column
*
* @param i Index of the child to get
* @return Reference to the child's metadata
*/
column_in_metadata& child(size_type i) { return children[i]; }

/**
* @brief Get const reference to a child of this column
*
* @param i Index of the child to get
* @return Const reference to the child's metadata
*/
column_in_metadata const& child(size_type i) const { return children[i]; }

/**
* @brief Get the name of this column
*/
std::string get_name() const { return _name; }

/**
* @brief Get whether nullability has been explicitly set for this column.
*/
bool is_nullability_defined() const { return _nullable.has_value(); }

/**
* @brief Gets the explicitly set nullability for this column.
* @throws If nullability is not explicitly defined for this column.
* Check using `is_nullability_defined()` first.
*/
bool nullable() const { return _nullable.value(); }

/**
* @brief If this is the metadata of a list column, returns whether it is to be encoded as a map.
*/
bool is_map() const { return _list_column_is_map; }

/**
* @brief Get whether to encode this timestamp column using deprecated int96 physical type
*/
bool is_enabled_int96_timestamps() const { return _use_int96_timestamp; }

/**
* @brief Get whether precision has been set for this decimal column
*/
bool is_decimal_precision_set() const { return _decimal_precision.has_value(); }

/**
* @brief Get the decimal precision that was set for this column.
* @throws If decimal precision was not set for this column.
* Check using `is_decimal_precision_set()` first.
*/
uint8_t get_decimal_precision() const { return _decimal_precision.value(); }

/**
* @brief Get the number of children of this column
*/
size_type num_children() const { return children.size(); }
};

class table_input_metadata {
public:
table_input_metadata() = default; // Required by cython

/**
* @brief Construct a new table_input_metadata from a table_view.
*
* The constructed table_input_metadata has the same structure as the passed table_view
*
* @param table The table_view to construct metadata for
* @param user_data Optional additional metadata to encode, as key-value pairs
*/
table_input_metadata(table_view const& table, std::map<std::string, std::string> user_data = {});

std::vector<column_in_metadata> column_metadata;
std::map<std::string, std::string> user_data; //!< Format-dependent metadata as key-value pairs
};

/**
* @brief Class to build `parquet_writer_options`.
@@ -395,14 +551,12 @@ class parquet_writer_options {
// Sets of columns to output
table_view _table;
// Optional associated metadata
const table_metadata* _metadata = nullptr;
// Parquet writes can write INT96 or TIMESTAMP_MICROS. Defaults to TIMESTAMP_MICROS.
table_input_metadata const* _metadata = nullptr;
// Parquet writer can write INT96 or TIMESTAMP_MICROS. Defaults to TIMESTAMP_MICROS.
// If true then overrides any per-column setting in _metadata.
bool _write_timestamps_as_int96 = false;
// Column chunks file path to be set in the raw output metadata
std::string _column_chunks_file_path;
/// vector of precision values for decimal writing. Exactly one entry
/// per decimal column. Optional unless decimals are being written.
std::vector<uint8_t> _decimal_precision;

/**
* @brief Constructor from sink and table.
@@ -465,7 +619,7 @@ class parquet_writer_options {
/**
* @brief Returns associated metadata.
*/
table_metadata const* get_metadata() const { return _metadata; }
table_input_metadata const* get_metadata() const { return _metadata; }

/**
* @brief Returns `true` if timestamps will be written as INT96
@@ -477,17 +631,12 @@ class parquet_writer_options {
*/
std::string get_column_chunks_file_path() const { return _column_chunks_file_path; }

/**
* @brief Returns a constant reference to the decimal precision vector.
*/
std::vector<uint8_t> const& get_decimal_precision() const { return _decimal_precision; }

/**
* @brief Sets metadata.
*
* @param metadata Associated metadata.
*/
void set_metadata(table_metadata const* metadata) { _metadata = metadata; }
void set_metadata(table_input_metadata const* metadata) { _metadata = metadata; }

/**
* @brief Sets the level of statistics.
@@ -520,11 +669,6 @@ class parquet_writer_options {
{
_column_chunks_file_path.assign(file_path);
}

/**
* @brief Sets the decimal precision vector data.
*/
void set_decimal_precision(std::vector<uint8_t> dp) { _decimal_precision = std::move(dp); }
};

class parquet_writer_options_builder {
@@ -555,7 +699,7 @@ class parquet_writer_options_builder {
* @param metadata Associated metadata.
* @return this for chaining.
*/
parquet_writer_options_builder& metadata(table_metadata const* metadata)
parquet_writer_options_builder& metadata(table_input_metadata const* metadata)
{
options._metadata = metadata;
return *this;
@@ -672,11 +816,10 @@ class chunked_parquet_writer_options {
// Specify the level of statistics in the output file
statistics_freq _stats_level = statistics_freq::STATISTICS_ROWGROUP;
// Optional associated metadata.
const table_metadata_with_nullability* _nullable_metadata = nullptr;
// Parquet writes can write INT96 or TIMESTAMP_MICROS. Defaults to TIMESTAMP_MICROS.
table_input_metadata const* _metadata = nullptr;
// Parquet writer can write INT96 or TIMESTAMP_MICROS. Defaults to TIMESTAMP_MICROS.
// If true then overrides any per-column setting in _metadata.
bool _write_timestamps_as_int96 = false;
// Optional decimal precision data - must be present if writing decimals
std::vector<uint8_t> _decimal_precision = {};

/**
* @brief Constructor from sink.
@@ -711,40 +854,21 @@ class chunked_parquet_writer_options {
statistics_freq get_stats_level() const { return _stats_level; }

/**
* @brief Returns nullable metadata information.
* @brief Returns metadata information.
*/
const table_metadata_with_nullability* get_nullable_metadata() const
{
return _nullable_metadata;
}

/**
* @brief Returns decimal precision pointer.
*/
std::vector<uint8_t> const& get_decimal_precision() const { return _decimal_precision; }
table_input_metadata const* get_metadata() const { return _metadata; }

/**
* @brief Returns `true` if timestamps will be written as INT96
*/
bool is_enabled_int96_timestamps() const { return _write_timestamps_as_int96; }

/**
* @brief Sets nullable metadata.
* @brief Sets metadata.
*
* @param metadata Associated metadata.
*/
void set_nullable_metadata(const table_metadata_with_nullability* metadata)
{
_nullable_metadata = metadata;
}

/**
* @brief Sets decimal precision data.
*
* @param v Vector of precision data flattened with exactly one entry per
* decimal column.
*/
void set_decimal_precision_data(std::vector<uint8_t> const& v) { _decimal_precision = v; }
void set_metadata(table_input_metadata const* metadata) { _metadata = metadata; }

/**
* @brief Sets the level of statistics in parquet_writer_options.
@@ -797,15 +921,14 @@ class chunked_parquet_writer_options_builder {
chunked_parquet_writer_options_builder(sink_info const& sink) : options(sink){};

/**
* @brief Sets nullable metadata to chunked_parquet_writer_options.
* @brief Sets metadata to chunked_parquet_writer_options.
*
* @param metadata Associated metadata.
* @return this for chaining.
*/
chunked_parquet_writer_options_builder& nullable_metadata(
const table_metadata_with_nullability* metadata)
chunked_parquet_writer_options_builder& metadata(table_input_metadata const* metadata)
{
options._nullable_metadata = metadata;
options._metadata = metadata;
return *this;
}

@@ -821,18 +944,6 @@ class chunked_parquet_writer_options_builder {
return *this;
}

/**
* @brief Sets decimal precision data.
*
* @param v Vector of precision data flattened with exactly one entry per
* decimal column.
*/
chunked_parquet_writer_options_builder& decimal_precision(std::vector<uint8_t> const& v)
{
options._decimal_precision = v;
return *this;
}

/**
* @brief Sets compression type to chunked_parquet_writer_options.
*
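The setters added to `column_in_metadata` in this file all return `*this`, so a column's options can be configured in one chained expression, and the optional-backed getters (`nullable()`, `get_decimal_precision()`) are meant to be guarded by their `is_..._defined()` / `is_..._set()` counterparts. The following is a minimal sketch of that pattern, not the cudf class itself: `column_meta_sketch`, `add_child()`, `make_person_metadata()` and `nullable_or()` are hypothetical names introduced for illustration, and `std::optional` stands in for `thrust::optional` so the example compiles without CUDA.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for column_in_metadata, reduced to a few members.
// Assumption: std::optional replaces thrust::optional from the PR.
class column_meta_sketch {
  std::string _name;
  std::optional<bool> _nullable;
  std::optional<std::uint8_t> _decimal_precision;
  std::vector<column_meta_sketch> _children;

 public:
  // Each setter returns *this so that calls can be chained.
  column_meta_sketch& set_name(std::string const& name)
  {
    _name = name;
    return *this;
  }
  column_meta_sketch& set_nullability(bool nullable)
  {
    _nullable = nullable;
    return *this;
  }
  column_meta_sketch& set_decimal_precision(std::uint8_t precision)
  {
    _decimal_precision = precision;
    return *this;
  }

  // Hypothetical helper (not part of the PR): append an empty child, return it.
  column_meta_sketch& add_child()
  {
    _children.emplace_back();
    return _children.back();
  }

  column_meta_sketch& child(std::size_t i) { return _children[i]; }
  column_meta_sketch const& child(std::size_t i) const { return _children[i]; }

  std::string const& get_name() const { return _name; }
  bool is_nullability_defined() const { return _nullable.has_value(); }
  bool nullable() const { return _nullable.value(); }  // throws if unset
  bool is_decimal_precision_set() const { return _decimal_precision.has_value(); }
  std::uint8_t get_decimal_precision() const { return _decimal_precision.value(); }
  std::size_t num_children() const { return _children.size(); }
};

// Describe a struct column "person" with two children using chained setters.
inline column_meta_sketch make_person_metadata()
{
  column_meta_sketch person;
  person.set_name("person").set_nullability(false);
  person.add_child().set_name("name").set_nullability(true);
  person.add_child().set_name("balance").set_decimal_precision(9);
  return person;
}

// Read nullability without risking a throw: use a fallback when it was never set.
inline bool nullable_or(column_meta_sketch const& col, bool fallback)
{
  return col.is_nullability_defined() ? col.nullable() : fallback;
}
```

Returning a reference from every setter is what lets per-column options replace the flat `std::vector<uint8_t> _decimal_precision` that this diff removes: each option lives on the column it applies to, at any nesting depth.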
16 changes: 16 additions & 0 deletions cpp/src/io/functions.cpp
@@ -419,6 +419,22 @@ std::unique_ptr<std::vector<uint8_t>> merge_rowgroup_metadata(
return detail_parquet::writer::merge_rowgroup_metadata(metadata_list);
}

table_input_metadata::table_input_metadata(table_view const& table,
std::map<std::string, std::string> user_data)
: user_data{std::move(user_data)}
{
// Create a metadata hierarchy using `table`
std::function<column_in_metadata(column_view const&)> get_children = [&](column_view const& col) {
auto col_meta = column_in_metadata{};
std::transform(
col.child_begin(), col.child_end(), std::back_inserter(col_meta.children), get_children);
return col_meta;
};

std::transform(
table.begin(), table.end(), std::back_inserter(this->column_metadata), get_children);
}

/**
* @copydoc cudf::io::write_parquet
*/
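The `table_input_metadata` constructor above builds a metadata tree with the same shape as the input table by recursing through each column's children with a self-referencing `std::function`. A minimal sketch of the same recursion on plain structs, assuming hypothetical stand-in types (`column_sketch`, `meta_sketch`) and helper names (`mirror_hierarchy`, `count_nodes`) that are not part of cudf:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <iterator>
#include <vector>

// Hypothetical stand-ins: a column and its metadata are both plain trees.
struct column_sketch {
  std::vector<column_sketch> children;
};
struct meta_sketch {
  std::vector<meta_sketch> children;
};

// Build one default-initialized metadata node per column node, recursively.
inline std::vector<meta_sketch> mirror_hierarchy(std::vector<column_sketch> const& table)
{
  // std::function is used because a lambda cannot refer to itself by name;
  // the lambda captures `mirror` by reference and calls it for every child.
  std::function<meta_sketch(column_sketch const&)> mirror = [&](column_sketch const& col) {
    meta_sketch m{};
    std::transform(
      col.children.begin(), col.children.end(), std::back_inserter(m.children), mirror);
    return m;
  };

  std::vector<meta_sketch> out;
  std::transform(table.begin(), table.end(), std::back_inserter(out), mirror);
  return out;
}

// Count a node plus all of its descendants (used only to check the shape).
inline std::size_t count_nodes(meta_sketch const& m)
{
  std::size_t n = 1;
  for (auto const& c : m.children) { n += count_nodes(c); }
  return n;
}
```

Because every node starts out default-initialized, a caller only has to touch the columns whose names or options differ from the defaults, addressing them by position through the mirrored tree.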
2 changes: 1 addition & 1 deletion cpp/src/io/parquet/page_dict.cu
@@ -36,7 +36,7 @@ struct dict_state_s {
uint32_t num_dict_entries; //!< Dictionary entries in current fragment to add
uint32_t frag_dict_size;
EncColumnChunk ck;
EncColumnDesc col;
parquet_column_device_view col;
PageFragment frag;
volatile uint32_t scratch_red[32];
uint16_t frag_dict[max_page_fragment_size];