Adding support for list and struct type in ORC Reader #8599

Merged
Changes from 69 commits
77 commits
7c0dcb5
python portion of supporting multiple ORC input files
jdye64 Apr 25, 2021
739c32d
removed commented out python code
jdye64 Apr 25, 2021
b7818c9
change verb to plural form to make intention more clear
jdye64 Apr 25, 2021
c57acf5
daily update checkpoint
jdye64 Apr 29, 2021
1294ee0
updated to use multiple stripes. Use stripes as input drivers
jdye64 Apr 30, 2021
3da9b29
Checkpoint; codebase compiles, except for tests and test runs
jdye64 Apr 30, 2021
39c13e3
updated cpp test to ensure vector<vector<>> is passed to options builder
jdye64 May 3, 2021
374e9e6
updates where row counts match now
jdye64 May 12, 2021
6197eb2
gather_column_info probing
jdye64 May 13, 2021
8d09d18
update orc column mapping to include hidden struct col
jdye64 May 13, 2021
3c2ab1d
merge conflicts
jdye64 May 13, 2021
124b7ec
compilation syntax error
jdye64 May 13, 2021
76a55b2
inittial chnages
rgsl888prabhu May 24, 2021
a259433
fixing style
rgsl888prabhu May 25, 2021
38aa4d8
Support for users specify more than a single ORC file to read to cudf…
jdye64 May 25, 2021
17ba956
Introduce insertion indexes for situations where the same stripe migh…
jdye64 May 26, 2021
b2eb976
changes
rgsl888prabhu May 27, 2021
cc66a23
updates which prevent illegal device memory access when a user specif…
jdye64 May 27, 2021
906a246
preservation commit before upstream merge in case of rollback needed
jdye64 May 27, 2021
3419416
Merge remote-tracking branch 'upstream/branch-21.08' into orc_list_files
jdye64 May 27, 2021
a4eddc3
changes
rgsl888prabhu May 27, 2021
e4f19f1
changes
rgsl888prabhu Jun 2, 2021
b7926fe
Several tests passing now, but still a few that do not, checkpoint co…
jdye64 Jun 3, 2021
aee5f76
list multiple rougroup breaking
rgsl888prabhu Jun 3, 2021
fce67b1
Merge remote-tracking branch 'upstream/branch-21.08' into orc_list_files
jdye64 Jun 4, 2021
599730a
updates for all test but decimal32 working
jdye64 Jun 4, 2021
7a64f5f
uncommented all tests
jdye64 Jun 4, 2021
422757b
add new line to end of file that vscode had removed
jdye64 Jun 4, 2021
2c8a629
nested list works
rgsl888prabhu Jun 4, 2021
ac116ba
nesting and num_rows works properly
rgsl888prabhu Jun 8, 2021
a7a2124
Merge remote-tracking branch 'upstream/branch-21.08' into orc_list_files
jdye64 Jun 8, 2021
da2d898
nested table works
rgsl888prabhu Jun 9, 2021
d3c1e24
handling empty rows
rgsl888prabhu Jun 10, 2021
828c3c5
Merge remote-tracking branch 'upstream/branch-21.08' into orc_list_files
jdye64 Jun 15, 2021
b0bb41c
Merge branch 'orc_list_files' of https://github.com/jdye64/cudf into …
rgsl888prabhu Jun 15, 2021
a358293
Merge branch 'branch-21.08' of https://github.com/rapidsai/cudf into …
rgsl888prabhu Jun 15, 2021
14b6046
changes before merge
jdye64 Jun 16, 2021
8847efe
upstream merge
jdye64 Jun 16, 2021
0aff037
cleaning
rgsl888prabhu Jun 16, 2021
095c959
changes
rgsl888prabhu Jun 16, 2021
9d09350
Merge branch 'branch-21.08' of https://github.com/rapidsai/cudf into …
rgsl888prabhu Jun 16, 2021
a9126e0
re-enable orc tests
jdye64 Jun 16, 2021
ce6bf65
Updates for filter files and stripes when multiple input sources are …
jdye64 Jun 17, 2021
08640b7
all the previous tests pass
rgsl888prabhu Jun 17, 2021
29fbc28
Merge branch 'orc_list_files' of https://github.com/jdye64/cudf into …
rgsl888prabhu Jun 17, 2021
99f830f
Merge remote-tracking branch 'upstream/branch-21.08' into orc_list_files
jdye64 Jun 17, 2021
5b159fd
Fix byteRle issue with null mask
rgsl888prabhu Jun 18, 2021
f9651f4
adding test cases
rgsl888prabhu Jun 21, 2021
32a0f3a
Modified read_orc to accept a single stripes list and expand it
jdye64 Jun 22, 2021
58dfdd0
cleaning
rgsl888prabhu Jun 22, 2021
3da6663
partial review fix
jdye64 Jun 23, 2021
dc41fdc
address remaining review comments
jdye64 Jun 23, 2021
9df73d5
Merge remote-tracking branch 'upstream/branch-21.08' into orc_list_files
jdye64 Jun 23, 2021
f27a930
update to thrust::pair from std::pair
jdye64 Jun 23, 2021
4984cc6
remove unneeded stripe_idx_in_source variable
jdye64 Jun 23, 2021
45c538e
review updates
jdye64 Jun 23, 2021
086b043
use assert_eq
jdye64 Jun 23, 2021
af94db0
compare gdf to pdf
jdye64 Jun 23, 2021
5295c26
update multiple input files python test to include num_rows as well
jdye64 Jun 23, 2021
39cbe71
Merge remote-tracking branch 'upstream/branch-21.08' into orc_list_files
jdye64 Jun 23, 2021
6f39bad
fix documentation typo
jdye64 Jun 23, 2021
ddee11c
cleaning
rgsl888prabhu Jun 23, 2021
6f204ce
test to remove fixture and create dataset with 2 stripes
rgsl888prabhu Jun 23, 2021
3e4575c
Merge remote-tracking branch 'upstream/branch-21.08' into orc_list_files
jdye64 Jun 24, 2021
7206a74
Reenabled example Cmake CUDA due to cmake 3.22 bug
jdye64 Jun 24, 2021
89300ff
cleaning
rgsl888prabhu Jun 24, 2021
e7e3b86
Merge branch 'orc_list_files' of https://github.com/jdye64/cudf into …
rgsl888prabhu Jun 25, 2021
35f3708
Merge branch 'branch-21.08' of https://github.com/rapidsai/cudf into …
rgsl888prabhu Jun 25, 2021
82bffe7
fix failing test and cleaning
rgsl888prabhu Jun 25, 2021
f46fdb4
Merge branch 'branch-21.08' of https://github.com/rapidsai/cudf into …
rgsl888prabhu Jun 25, 2021
19fbaa5
add exclusive scan and cleaning
rgsl888prabhu Jun 25, 2021
d3e9198
row group fix
rgsl888prabhu Jul 2, 2021
61a6eb0
addressing review changes
rgsl888prabhu Jul 2, 2021
e5920c2
review changes to add 2d span
rgsl888prabhu Jul 3, 2021
87f4ce8
Merge branch 'branch-21.08' of https://github.com/rapidsai/cudf into …
rgsl888prabhu Jul 3, 2021
4eb0af4
review changes
rgsl888prabhu Jul 5, 2021
b1d9483
review changes
rgsl888prabhu Jul 5, 2021
23 changes: 6 additions & 17 deletions cpp/src/io/orc/orc.cpp
@@ -37,9 +37,7 @@ void ProtobufReader::skip_struct_field(int t)
     case PB_TYPE_FIXED64: skip_bytes(8); break;
     case PB_TYPE_FIXEDLEN: skip_bytes(get<uint32_t>()); break;
     case PB_TYPE_FIXED32: skip_bytes(4); break;
-    default:
-      // printf("invalid type (%d)\n", t);
-      break;
+    default: break;
   }
 }

@@ -471,20 +469,11 @@ void metadata::init_column_names() const
   auto const &types = ff.types;
   for (int32_t col_id = 0; col_id < get_num_columns(); ++col_id) {
     std::string col_name;
-    uint32_t parent_idx = col_id;
-    uint32_t idx = col_id;
-    do {
-      idx = parent_idx;
-      parent_idx = (idx < types.size()) ? static_cast<uint32_t>(schema_idxs[idx].parent) : ~0;
-      if (parent_idx >= types.size()) break;
-
-      auto const field_idx =
-        (parent_idx < types.size()) ? static_cast<uint32_t>(schema_idxs[idx].field) : ~0;
-      if (field_idx < types[parent_idx].fieldNames.size()) {
-        col_name =
-          types[parent_idx].fieldNames[field_idx] + (col_name.empty() ? "" : ("." + col_name));
-      }
-    } while (parent_idx != idx);
+    uint32_t parent_idx = static_cast<uint32_t>(schema_idxs[col_id].parent);
+    uint32_t field_idx = static_cast<uint32_t>(schema_idxs[col_id].field);
Contributor

How come we don't need to check if these are static_cast<uint32_t>(-1)?

Contributor Author

The if condition in the next line will take care of that scenario.

Contributor

Not a big deal, but the logic here is implicit/fragile. The invalid value (-1) is only covered in the next line because of unsigned integer underflow.
IMO there should be a validity check for schema_idxs[col_id].field (and maybe schema_idxs[col_id].parent, not sure) before we compare against fieldNames.size() and potentially set the column name.

+    if (field_idx < types[parent_idx].fieldNames.size()) {
+      col_name = types[parent_idx].fieldNames[field_idx];
+    }
     // If we have no name (root column), generate a name
     column_names.push_back(col_name.empty() ? "col" + std::to_string(col_id) : col_name);
   }
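The concern raised in the review thread on `init_column_names` can be illustrated with a small standalone sketch. It is not the code that was merged: `schema_index`, `schema_type`, and `column_name` are hypothetical stand-ins, and the sketch assumes `parent` and `field` are signed 32-bit values that use -1 to mean "no entry", as the thread implies. The idea is to validate the signed values explicitly before indexing, instead of relying on -1 converting to a very large unsigned value that then fails the size comparison.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-ins for the reader's schema types (assumption: -1 marks "no entry").
struct schema_index {
  int32_t parent;
  int32_t field;
};
struct schema_type {
  std::vector<std::string> fieldNames;
};

// Returns the field name for col_id, or a generated "col<N>" name when the
// column has no valid parent/field entry (for example, the root column).
std::string column_name(std::vector<schema_type> const& types,
                        std::vector<schema_index> const& schema_idxs,
                        int32_t col_id)
{
  assert(col_id >= 0 && static_cast<std::size_t>(col_id) < schema_idxs.size());
  auto const fallback = "col" + std::to_string(col_id);
  auto const parent   = schema_idxs[col_id].parent;
  auto const field    = schema_idxs[col_id].field;

  // Explicit validity checks on the signed values, before any cast or indexing.
  if (parent < 0 || field < 0) return fallback;
  if (static_cast<std::size_t>(parent) >= types.size()) return fallback;

  auto const& names = types[parent].fieldNames;
  if (static_cast<std::size_t>(field) >= names.size()) return fallback;

  return names[field].empty() ? fallback : names[field];
}
```

With `types[0].fieldNames = {"a", "b"}`, an entry `{0, 1}` yields `"b"`, while `{-1, -1}` or an out-of-range parent falls back to `col<N>`. The sketch also bounds-checks `parent` before indexing `types`, which the reviewer mentioned as possibly needed.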
26 changes: 26 additions & 0 deletions cpp/src/io/orc/orc.h
@@ -537,6 +537,32 @@ class OrcDecompressor {
   std::vector<uint8_t> m_buf;
 };

+/**
+ * @brief Stores orc id for each column and its adjacent number of children
+ * in case of struct or number of children in case of list column.
+ * If list column has struct column, then all child columns of that struct are treated as child
+ * column of list.
+ *
+ * @code{.pseudo}
+ * Consider following data where a struct has two members and a list column
+ * {"struct": [{"a": 1, "b": 2}, {"a":3, "b":5}], "list":[[1, 2], [2, 3]]}
+ *
+ * `orc_column_meta` for struct column would be
+ * id = 0
+ * num_children = 2
+ *
+ * `orc_column_meta` for list column would be
+ * id = 3
+ * num_children = 1
+ * @endcode
+ *
+ */
+struct orc_column_meta {
+  orc_column_meta(uint32_t _id, uint32_t _num_children) : id(_id), num_children(_num_children){};
+  uint32_t id; // orc id for the column
+  uint32_t num_children; // number of children at the same level of nesting in case of struct
+};
+
 /**
  * @brief A helper class for ORC file metadata. Provides some additional
  * convenience methods for initializing and accessing metadata.
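The numbers in the `orc_column_meta` doc comment can be reproduced with a minimal sketch. It assumes ids are assigned by a zero-based pre-order walk over the top-level columns, which matches the comment's example but is not confirmed here as the reader's actual numbering. `schema_node`, `collect_nested`, and `nested_column_meta` are hypothetical helpers, and the special case where a list's struct children are counted as children of the list is not modeled.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Same two fields as the struct added in this hunk (constructor omitted).
struct orc_column_meta {
  uint32_t id;            // id of the nested column
  uint32_t num_children;  // number of direct children
};

// Hypothetical schema node, used only for this illustration.
struct schema_node {
  std::string name;
  std::vector<schema_node> children;
};

// Pre-order walk: gives every column the next id and records one entry
// for each column that has children (struct or list).
void collect_nested(schema_node const& node,
                    uint32_t& next_id,
                    std::vector<orc_column_meta>& out)
{
  uint32_t const id = next_id++;
  if (!node.children.empty()) {
    out.push_back({id, static_cast<uint32_t>(node.children.size())});
  }
  for (auto const& child : node.children) { collect_nested(child, next_id, out); }
}

// Assumption: ids start at 0 at the first top-level column.
std::vector<orc_column_meta> nested_column_meta(std::vector<schema_node> const& top_level)
{
  std::vector<orc_column_meta> out;
  uint32_t next_id = 0;
  for (auto const& col : top_level) { collect_nested(col, next_id, out); }
  return out;
}
```

For a schema with a top-level `struct` holding `a` and `b`, followed by a `list` with one element column, the walk assigns ids 0 through 4 and records `{0, 2}` for the struct and `{3, 1}` for the list, the same values as in the comment.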
11 changes: 5 additions & 6 deletions cpp/src/io/orc/orc_gpu.h
@@ -90,7 +90,9 @@ struct ColumnDesc {
   uint32_t *valid_map_base; // base pointer of valid bit map for this column
   void *column_data_base; // base pointer of column data
   uint32_t start_row; // starting row of the stripe
-  uint32_t num_rows; // starting row of the stripe
+  uint32_t num_rows; // number of rows in stripe
+  uint32_t column_num_rows; // number of rows in whole column
+  uint32_t num_child_rows; // store number of child rows if its list column
   uint32_t dictionary_start; // start position in global dictionary
   uint32_t dict_len; // length of local dictionary
   uint32_t null_count; // number of null values in this stripe's column
@@ -110,6 +112,7 @@ struct RowGroup {
   uint32_t chunk_id; // Column chunk this entry belongs to
   uint32_t strm_offset[2]; // Index offset for CI_DATA and CI_DATA2 streams
   uint16_t run_pos[2]; // Run position for CI_DATA and CI_DATA2
+  bool valid_row_group; // To check if it is a valid rowgroup
 };

 /**
@@ -237,15 +240,13 @@ void ParseRowGroupIndex(RowGroup *row_groups,
  * @param[in] global_dictionary Global dictionary device array
  * @param[in] num_columns Number of columns
  * @param[in] num_stripes Number of stripes
- * @param[in] max_rows Maximum number of rows to load
  * @param[in] first_row Crop all rows below first_row
  * @param[in] stream CUDA stream to use, default `rmm::cuda_stream_default`
  */
 void DecodeNullsAndStringDictionaries(ColumnDesc *chunks,
                                       DictionaryEntry *global_dictionary,
                                       uint32_t num_columns,
                                       uint32_t num_stripes,
-                                      size_t max_rows = ~0,
                                       size_t first_row = 0,
                                       rmm::cuda_stream_view stream = rmm::cuda_stream_default);

@@ -256,7 +257,6 @@ void DecodeNullsAndStringDictionaries(ColumnDesc *chunks,
  * @param[in] global_dictionary Global dictionary device array
  * @param[in] num_columns Number of columns
  * @param[in] num_stripes Number of stripes
- * @param[in] max_rows Maximum number of rows to load
  * @param[in] first_row Crop all rows below first_row
  * @param[in] tz_table Timezone translation table
  * @param[in] tz_len Length of timezone translation table
@@ -265,11 +265,10 @@ void DecodeNullsAndStringDictionaries(ColumnDesc *chunks,
  * @param[in] rowidx_stride Row index stride
  * @param[in] stream CUDA stream to use, default `rmm::cuda_stream_default`
  */
-void DecodeOrcColumnData(ColumnDesc const *chunks,
+void DecodeOrcColumnData(ColumnDesc *chunks,
                          DictionaryEntry *global_dictionary,
                          uint32_t num_columns,
                          uint32_t num_stripes,
-                         size_t max_rows = ~0,
                          size_t first_row = 0,
                          timezone_table_view tz_table = {},
                          const RowGroup *row_groups = 0,