Refactor Parquet reader #12046

Merged
merged 172 commits into from
Nov 10, 2022
Changes from 168 commits
Commits
c2e8e87
Fix an issue where using num_rows and skip_rows on a parquet file con…
nvdbaranec Sep 23, 2022
f330431
Merge branch 'branch-22.12' into reader_preprocess_fix_and_opt
nvdbaranec Sep 23, 2022
eadfd63
Fixed an issue with the tests: input columns cannot have unsanitary …
nvdbaranec Sep 27, 2022
c4de038
Merge branch 'branch-22.12' into reader_preprocess_fix_and_opt
nvdbaranec Sep 27, 2022
222c9fe
Copy `parquet_reader_*` into `chunked_parquet_reader_*`
ttnghia Sep 30, 2022
f49cfed
Modify `chunked_parquet_reader_options`
ttnghia Sep 30, 2022
dd39804
Exploit inheritance to extend the options and options_builder classes
ttnghia Oct 4, 2022
81bc68f
Remove unnecessary variable
ttnghia Oct 5, 2022
f8126be
Misc
ttnghia Oct 5, 2022
0e7692c
Add docs
ttnghia Oct 5, 2022
9f9eeb0
PR feedback changes.
nvdbaranec Oct 5, 2022
9b3ea62
Merge branch 'branch-22.12' into reader_preprocess_fix_and_opt
nvdbaranec Oct 5, 2022
d2e409a
Fixed some compile errors from merging.
nvdbaranec Oct 5, 2022
ed41ac1
Add `chunked_parquet_reader`
ttnghia Oct 5, 2022
be782f2
Add empty implementation
ttnghia Oct 5, 2022
7908b66
Merge branch 'branch-22.12' into parquet_reader
ttnghia Oct 5, 2022
a7175c8
Add a destructor and `close`
ttnghia Oct 5, 2022
63a7bd6
Update docs
ttnghia Oct 6, 2022
16c12d9
Fix comment
ttnghia Oct 6, 2022
cd85385
Construct `chunked_parquet_reader`
ttnghia Oct 6, 2022
5944beb
Add comment
ttnghia Oct 6, 2022
7cfa72a
Rename function and implementing
ttnghia Oct 7, 2022
4696bd3
MISC
ttnghia Oct 7, 2022
99dc786
Bare bones implementation. Many types still not working.
nvdbaranec Oct 10, 2022
ad9c399
Merge branch 'branch-22.12' into parquet_reader
ttnghia Oct 11, 2022
ecf225d
Merge branch 'chunked_reader_gpu' into parquet_reader
ttnghia Oct 11, 2022
583a7ef
Add test
ttnghia Oct 11, 2022
b250d6f
Cleanup
ttnghia Oct 11, 2022
e7a9e3e
Modify docs
ttnghia Oct 11, 2022
811354a
Cleanup
ttnghia Oct 11, 2022
12ba72e
Add TODO
ttnghia Oct 11, 2022
45668ff
Add `read_intermediate_data`
ttnghia Oct 11, 2022
1bb8254
Use `read_intermediate_data`
ttnghia Oct 11, 2022
b1c44dd
Merge branch 'branch-22.12' into parquet_reader
ttnghia Oct 12, 2022
56715ef
Fix bug
ttnghia Oct 12, 2022
a7e7e93
Simplify code
ttnghia Oct 12, 2022
8fe87b1
Implement `file_intermediate_data`
ttnghia Oct 12, 2022
464f4f9
Add `make_output`
ttnghia Oct 12, 2022
56756d6
Implement `read_chunk`
ttnghia Oct 12, 2022
3044ac5
Cleanup
ttnghia Oct 12, 2022
ffb8a19
Fix bug when `skip_rows` and `num_rows` are modified inside a called …
ttnghia Oct 12, 2022
baf3603
Fix comment
ttnghia Oct 13, 2022
8bdab44
Store preprocess data
ttnghia Oct 13, 2022
ec4abfb
Implement `chunked_reader` detail class
ttnghia Oct 13, 2022
cb1dea4
Refactoring
ttnghia Oct 13, 2022
a8dfd82
Rename structs
ttnghia Oct 13, 2022
7889e5a
Increment `current_read_chunk`
ttnghia Oct 13, 2022
63a6511
Call preprocessing in `read_chunk`
ttnghia Oct 13, 2022
c1269d1
Fix `has_next`
ttnghia Oct 14, 2022
95e6c1d
Refactoring
ttnghia Oct 14, 2022
bd7b510
Fix errors
ttnghia Oct 14, 2022
eb78526
Merge branch 'branch-22.12' into chunked_reader_gpu. Also: work to…
nvdbaranec Oct 17, 2022
66aeaf4
Change param
ttnghia Oct 18, 2022
1d700e3
Rename variables
ttnghia Oct 18, 2022
4af948b
Remove intermediate variables
ttnghia Oct 18, 2022
28cfc6f
Modify tests
ttnghia Oct 18, 2022
8135ed5
First pass of string support.
nvdbaranec Oct 18, 2022
fbeabfc
Fix bug
ttnghia Oct 18, 2022
df074e0
Remove debug print
ttnghia Oct 18, 2022
5653090
Merge branch 'chunked_reader_gpu' into parquet_reader
ttnghia Oct 18, 2022
3071fd7
Merge branch 'branch-22.12' into parquet_reader
ttnghia Oct 18, 2022
631eff1
Fix tests
ttnghia Oct 18, 2022
d1b4e4c
Fix chunk size limit
ttnghia Oct 18, 2022
3f2f8a4
Turn back to do preprocess once
ttnghia Oct 18, 2022
974e7ef
The read limit parameter is now no longer const but truly runtime pa…
ttnghia Oct 19, 2022
0be096b
Add new test file
ttnghia Oct 19, 2022
f7018fe
Reverse `parquet_test.cpp`
ttnghia Oct 19, 2022
81097eb
Modify `read` to add exception and preprocess once
ttnghia Oct 19, 2022
fcffac8
Rewrite tests
ttnghia Oct 19, 2022
43dd802
Store `decomp_page_data`
ttnghia Oct 19, 2022
eeec023
Rewrite tests
ttnghia Oct 19, 2022
14dfd3f
Simple test
ttnghia Oct 19, 2022
66e9f09
Store `raw_page_data`
ttnghia Oct 19, 2022
669b8cf
Cleanup test
ttnghia Oct 19, 2022
001c6c7
Fix empty output
ttnghia Oct 19, 2022
f50603a
Add `preprocess_file_and_columns`
ttnghia Oct 19, 2022
66976aa
Misc
ttnghia Oct 19, 2022
0b0040a
Fixed some incorrect logic in preprocess step.
nvdbaranec Oct 19, 2022
467de78
Removed debug stuff. Added some comments.
nvdbaranec Oct 19, 2022
d68bf80
Merge branch 'chunked_reader_gpu' into parquet_reader
ttnghia Oct 20, 2022
1888aa3
Merge branch 'branch-22.12' into parquet_reader
ttnghia Oct 20, 2022
5861747
Change function
ttnghia Oct 20, 2022
721c052
Disable debug printing
ttnghia Oct 20, 2022
7cda8c2
Fixed an issue with non-first reads in the chunked reader. Made an a…
nvdbaranec Oct 20, 2022
6bc073d
Merge branch 'chunked_reader_gpu' into parquet_reader
ttnghia Oct 20, 2022
2aef5cc
Fix off-by-one bug
ttnghia Oct 20, 2022
89ca1a6
Merge branch 'branch-22.12' into parquet_reader
ttnghia Oct 20, 2022
5245b9b
Fix an issue related to aliased output pointers in the chunked read c…
nvdbaranec Oct 20, 2022
62283c7
Do not keep reference---copy object instead
ttnghia Oct 21, 2022
44424a2
Optimization: don't do any decoding or page size computation for pag…
nvdbaranec Oct 24, 2022
95df356
Merge branch 'branch-22.12' into chunked_reader_gpu
nvdbaranec Oct 24, 2022
445db9b
Fix build issue for spark-rapids-jni
nvdbaranec Oct 24, 2022
db56908
Cleanup: Remove `chunked_parquet_reader_options` and `chunked_parquet…
ttnghia Oct 24, 2022
ef5eaee
Move `preprocess_file` into `reader_preprocess.cu`
ttnghia Oct 24, 2022
d19260d
Move common implementation into `reader_impl_helpers.*`
ttnghia Oct 24, 2022
6569d62
Cleanup
ttnghia Oct 24, 2022
0a1e2c3
Merge branch 'branch-22.12' into parquet_reader
ttnghia Oct 25, 2022
04e1320
More cleanup
ttnghia Oct 25, 2022
395413d
Rewrite docs for `parquet.hpp` files
ttnghia Oct 25, 2022
e3e19e8
Extract functions for `reader` and `chunked_reader`
ttnghia Oct 25, 2022
52339da
Fix issues with string length computation.
nvdbaranec Oct 25, 2022
f7e8694
Merge branch 'nghia_parquet_reader' into chunked_reader_gpu
nvdbaranec Oct 25, 2022
83fa31a
Remove redundant changes
ttnghia Oct 25, 2022
02ccdec
Add simple structs test
ttnghia Oct 25, 2022
ee0ffad
Rewrite tests
ttnghia Oct 25, 2022
09a89e4
Merge branch 'chunked_reader_gpu' into parquet_reader
ttnghia Oct 25, 2022
034a5b7
Merge branch 'branch-22.12' into parquet_reader
ttnghia Oct 25, 2022
c149d64
Add lists test
ttnghia Oct 25, 2022
9335bb7
MISC
ttnghia Oct 25, 2022
3769fff
Cleanup comments
ttnghia Oct 25, 2022
dc9ef5c
Construct output table metadata just once
ttnghia Oct 25, 2022
0366b7a
Construct `_output_columns` just once
ttnghia Oct 26, 2022
ea2fe9c
Remove `options` member variable
ttnghia Oct 26, 2022
1804056
Make the chunked_read_limit a soft limit - if we can't find a split, …
nvdbaranec Oct 26, 2022
5671c36
Merge branch 'chunked_reader_gpu' into parquet_reader
ttnghia Oct 26, 2022
3d6e13b
Add tests for structs of lists and lists of structs
ttnghia Oct 26, 2022
826c46f
Fixed an issue in split generation code causing indexing off the end …
nvdbaranec Oct 26, 2022
36c1972
Merge branch 'chunked_reader_gpu' into parquet_reader
ttnghia Oct 26, 2022
d32c602
Merge branch 'nghia_parquet_reader' into chunked_reader_gpu
nvdbaranec Oct 26, 2022
88ca034
Just reformat
ttnghia Oct 26, 2022
d76ee06
Change variable names in tests
ttnghia Oct 26, 2022
cf7b786
Merge branch 'nghia_parquet_reader' into chunked_reader_gpu
nvdbaranec Oct 27, 2022
b806323
Optimization: store off global nesting sizes per page so that during…
nvdbaranec Oct 27, 2022
0c2178d
Adding doxygen, refactoring and cleaning up
ttnghia Oct 27, 2022
f759103
Merge branch 'branch-22.12' into parquet_reader
ttnghia Oct 27, 2022
c2bf7f5
Fixed issues with list, and validity size calculations.
nvdbaranec Oct 27, 2022
e7e74c5
More refactoring
ttnghia Oct 27, 2022
ee9edbc
Merge branch 'chunked_reader_gpu' into parquet_reader
ttnghia Oct 27, 2022
6fd3e90
Add more tests
ttnghia Oct 27, 2022
9dda980
Add test with empty data
ttnghia Oct 28, 2022
bb35f9f
Add tests
ttnghia Oct 28, 2022
59166cf
Rewrite null tests
ttnghia Oct 28, 2022
eb6e996
Add more extreme tests
ttnghia Oct 28, 2022
1c56794
Rewrite tests to generate input files just once
ttnghia Oct 28, 2022
4777419
Fix tests with structs of lists
ttnghia Oct 28, 2022
aedc37a
Handle nulls for more complex types
ttnghia Oct 28, 2022
c00eb3c
Fix another nulls handling bug for strings
ttnghia Oct 28, 2022
4d24f88
Simplify the null purging process
ttnghia Oct 28, 2022
b15bb39
Cleanup.
nvdbaranec Oct 31, 2022
7e38a56
Merge branch 'nghia_parquet_reader' into chunked_reader_gpu
nvdbaranec Oct 31, 2022
321815d
Fleshed out list-of-structs and struct-of-lists tests.
nvdbaranec Oct 31, 2022
8cf95e8
Docs and cleanup.
nvdbaranec Oct 31, 2022
af35c4d
Update doxygen
ttnghia Oct 31, 2022
e31a767
Remove new test file
ttnghia Nov 1, 2022
a232241
Remove chunked reader
ttnghia Nov 1, 2022
c3d0097
Remove chunked reading
ttnghia Nov 1, 2022
e7808f9
Reversing changes...
ttnghia Nov 1, 2022
9da46c7
Fix page decoding
ttnghia Nov 1, 2022
c77893e
Remove `chunk_intermediate_data`
ttnghia Nov 1, 2022
f63b0b5
Cleanup
ttnghia Nov 1, 2022
78cd349
Merge branch 'branch-22.12' into refactor_parquet_reader
ttnghia Nov 1, 2022
fd4df91
Correct `const`
ttnghia Nov 2, 2022
8198ce8
Move `allocate_nesting_info`
ttnghia Nov 2, 2022
2cc8235
Add doxygen
ttnghia Nov 2, 2022
aa47f23
Clean up `reader_impl.hpp`
ttnghia Nov 2, 2022
f560ab7
Add doxygen
ttnghia Nov 2, 2022
e7f80ec
Add `const`
ttnghia Nov 2, 2022
1c0308a
Type alias for `load_and_decompress_data`
ttnghia Nov 2, 2022
c791faa
Revert "Move `allocate_nesting_info`"
ttnghia Nov 2, 2022
1503674
Cleanup `allocate_nesting_info`
ttnghia Nov 2, 2022
5440662
Further cleanup
ttnghia Nov 2, 2022
78e3f10
Cleanup headers
ttnghia Nov 2, 2022
cbdaf52
Merge branch 'branch-22.12' into refactor_parquet_reader
ttnghia Nov 3, 2022
c35aebe
Extract device code out of `parquet_gpu.hpp`
ttnghia Nov 3, 2022
aa64f81
Change `std::vector<> const&` to `span<>`
ttnghia Nov 4, 2022
a4547bf
Merge branch 'branch-22.12' into refactor_parquet_reader
ttnghia Nov 4, 2022
e85b1da
Merge branch 'branch-22.12' into refactor_parquet_reader
ttnghia Nov 4, 2022
2a37e5c
Merge branch 'branch-22.12' into refactor_parquet_reader
ttnghia Nov 7, 2022
453a1b7
Remove `using namespace`
ttnghia Nov 8, 2022
bdb15be
Add `nodiscard` and fix `const`
ttnghia Nov 8, 2022
dcdeee8
Fix input check
ttnghia Nov 8, 2022
72c888f
Merge branch 'branch-22.12' into refactor_parquet_reader
ttnghia Nov 10, 2022
5 changes: 4 additions & 1 deletion cpp/CMakeLists.txt
@@ -344,7 +344,10 @@ add_library(
src/io/parquet/chunk_dict.cu
src/io/parquet/page_enc.cu
src/io/parquet/page_hdr.cu
-src/io/parquet/reader_impl.cu
+src/io/parquet/reader.cpp
+src/io/parquet/reader_impl.cpp
+src/io/parquet/reader_impl_helpers.cpp
+src/io/parquet/reader_impl_preprocess.cu
src/io/parquet/writer_impl.cu
src/io/statistics/orc_column_statistics.cu
src/io/statistics/parquet_column_statistics.cu
21 changes: 11 additions & 10 deletions cpp/include/cudf/io/detail/parquet.hpp
@@ -30,25 +30,28 @@
#include <string>
#include <vector>

-namespace cudf {
-namespace io {
+namespace cudf::io {

// Forward declaration
class parquet_reader_options;
class parquet_writer_options;
class chunked_parquet_writer_options;

-namespace detail {
-namespace parquet {
+namespace detail::parquet {

/**
* @brief Class to read Parquet dataset data into columns.
*/
class reader {
-private:
+protected:
class impl;
std::unique_ptr<impl> _impl;

+/**
+* @brief Default constructor, needed for subclassing.
+*/
+reader();
+
public:
/**
* @brief Constructor from an array of datasources
@@ -66,7 +69,7 @@ class reader {
/**
* @brief Destructor explicitly-declared to avoid inlined in header
*/
-~reader();
+virtual ~reader();

/**
* @brief Reads the dataset as per given options.
@@ -154,7 +157,5 @@ class writer {
const std::vector<std::unique_ptr<std::vector<uint8_t>>>& metadata_list);
};

-}; // namespace parquet
-}; // namespace detail
-}; // namespace io
-}; // namespace cudf
+} // namespace detail::parquet
+} // namespace cudf::io
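
Reviewer note: the switch from `private` to `protected`, the new default constructor, and the virtual destructor together suggest that a derived reader is meant to populate `_impl` itself. Below is a minimal sketch of that pattern under stated assumptions. `base_reader`, `derived_reader`, and `read_value` are hypothetical stand-ins, not cuDF classes or functions, and the nested `impl` here is a toy.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for detail::parquet::reader; not the cuDF class.
class base_reader {
 protected:
  class impl;                   // defined out of line, as in a pimpl design
  std::unique_ptr<impl> _impl;  // protected so subclasses can populate it

  base_reader();  // default constructor, only reachable from subclasses

 public:
  explicit base_reader(std::string source);
  virtual ~base_reader();  // virtual: safe deletion through a base pointer

  std::string read_value() const;
};

// The impl definition would normally live in a .cpp file.
class base_reader::impl {
 public:
  explicit impl(std::string source) : _source(std::move(source)) {}
  std::string read_value() const { return "data from " + _source; }

 private:
  std::string _source;
};

base_reader::base_reader() = default;
base_reader::base_reader(std::string source) : _impl(std::make_unique<impl>(std::move(source))) {}
base_reader::~base_reader() = default;

std::string base_reader::read_value() const
{
  assert(_impl != nullptr);  // a subclass must have filled in _impl by now
  return _impl->read_value();
}

// Hypothetical subclass: default-constructs the base, then fills in _impl itself.
class derived_reader : public base_reader {
 public:
  explicit derived_reader(std::string source)
  {
    _impl = std::make_unique<impl>("derived:" + std::move(source));
  }
};
```

With this layout, a `std::unique_ptr<base_reader>` that owns a `derived_reader` is destroyed correctly because the base destructor is virtual.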
7 changes: 3 additions & 4 deletions cpp/include/cudf/io/parquet.hpp
@@ -30,8 +30,7 @@
#include <string>
#include <vector>

-namespace cudf {
-namespace io {
+namespace cudf::io {
/**
* @addtogroup io_readers
* @{
@@ -1452,5 +1451,5 @@ class parquet_chunked_writer {
};

/** @} */ // end of group
-} // namespace io
-} // namespace cudf
+
+} // namespace cudf::io
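
Reviewer note: the namespace edits in this file replace one block per namespace level with the C++17 nested namespace definition. The sketch below shows that both spellings open the same namespace. The names `demo`, `old_style_value`, `new_style_value`, and `sum_of_both` are hypothetical and exist only for illustration.

```cpp
#include <cassert>

// Pre-C++17 spelling: one block per namespace level.
namespace demo {
namespace io {
inline int old_style_value() { return 1; }
}  // namespace io
}  // namespace demo

// C++17 nested namespace definition: the same namespace, opened in one block.
namespace demo::io {
inline int new_style_value() { return old_style_value() + 1; }
}  // namespace demo::io

// Both functions live in demo::io and are reachable through the same qualified name.
inline int sum_of_both()
{
  int const total = demo::io::old_style_value() + demo::io::new_style_value();
  assert(total == 3);
  return total;
}
```

The change is purely syntactic, so symbol names and linkage are unaffected. It does require compiling as C++17 or later.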
2 changes: 1 addition & 1 deletion cpp/src/io/parquet/chunk_dict.cu
@@ -14,7 +14,7 @@
* limitations under the License.
*/

-#include <io/parquet/parquet_gpu.hpp>
+#include "parquet_gpu.cuh"

#include <cudf/detail/iterator.cuh>
#include <cudf/detail/utilities/cuda.cuh>
3 changes: 2 additions & 1 deletion cpp/src/io/parquet/page_enc.cu
@@ -13,7 +13,8 @@
* See the License for the specific language governing permissions and
* limitations under the License.
*/
-#include "parquet_gpu.hpp"
+
+#include "parquet_gpu.cuh"

#include <io/utilities/block_utils.cuh>

85 changes: 85 additions & 0 deletions cpp/src/io/parquet/parquet_gpu.cuh
@@ -0,0 +1,85 @@
/*
* Copyright (c) 2022, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#pragma once

#include "parquet_gpu.hpp"

#include <cudf/lists/lists_column_device_view.cuh>
#include <cudf/types.hpp>

#include <cuco/static_map.cuh>

namespace cudf::io::parquet::gpu {

auto constexpr KEY_SENTINEL = size_type{-1};
auto constexpr VALUE_SENTINEL = size_type{-1};

using map_type = cuco::static_map<size_type, size_type>;

/**
* @brief The alias of `map_type::pair_atomic_type` class.
*
* Declare this struct by trivial subclassing instead of type aliasing so we can have forward
* declaration of this struct somewhere else.
*/
struct slot_type : public map_type::pair_atomic_type {
};

/**
* @brief Return the byte length of parquet dtypes that are physically represented by INT32
*/
inline uint32_t __device__ int32_logical_len(type_id id)
{
switch (id) {
case cudf::type_id::INT8: [[fallthrough]];
case cudf::type_id::UINT8: return 1;
case cudf::type_id::INT16: [[fallthrough]];
case cudf::type_id::UINT16: return 2;
case cudf::type_id::DURATION_SECONDS: [[fallthrough]];
case cudf::type_id::DURATION_MILLISECONDS: return 8;
default: return 4;
}
}

/**
* @brief Translate the row index of a parent column_device_view into the index of the first value
* in the leaf child.
* Only works in the context of parquet writer where struct columns are previously modified s.t.
* they only have one immediate child.
*/
inline size_type __device__ row_to_value_idx(size_type idx,
parquet_column_device_view const& parquet_col)
{
// with a byte array, we can't go all the way down to the leaf node, but instead we want to leave
// the size at the parent level because we are writing out parent row byte arrays.
auto col = *parquet_col.parent_column;
while (col.type().id() == type_id::LIST or col.type().id() == type_id::STRUCT) {
if (col.type().id() == type_id::STRUCT) {
idx += col.offset();
col = col.child(0);
} else {
auto list_col = cudf::detail::lists_column_device_view(col);
auto child = list_col.child();
if (parquet_col.output_as_byte_array && child.type().id() == type_id::UINT8) { break; }
idx = list_col.offset_at(idx);
col = child;
}
}
return idx;
}

} // namespace cudf::io::parquet::gpu
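
Reviewer note: the doc comment on `slot_type` says it is declared by trivial subclassing so that it can be forward-declared elsewhere. The sketch below illustrates why an alias would not allow that. It uses `std::pair` as a stand-in for the cuco pair type, and `map_like`, `slot`, and `first_key` are hypothetical names, not part of cuDF or cuCollections.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for a library map type that exposes a nested pair type.
struct map_like {
  using pair_type = std::pair<int, int>;
};

// A header that must not include the map library can still write this,
// because slot is a real struct and not an alias:
struct slot;

// That header can then declare functions that take pointers or references to slot.
int first_key(slot const* slots);

// Only the file that includes the map library defines slot, by trivial subclassing.
// An alias (`using slot = map_like::pair_type;`) could not have been forward-declared.
struct slot : public map_like::pair_type {
};

int first_key(slot const* slots)
{
  assert(slots != nullptr);
  return slots[0].first;
}

// The subclass adds no members, so it has the same size as the base pair type.
static_assert(sizeof(slot) == sizeof(map_like::pair_type));
static_assert(std::is_base_of_v<map_like::pair_type, slot>);
```

The cost of this approach is that `slot` is a distinct type from the base pair, so code that needs the base type has to rely on the derived-to-base conversion.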
77 changes: 17 additions & 60 deletions cpp/src/io/parquet/parquet_gpu.hpp
@@ -23,14 +23,10 @@
#include "io/utilities/column_buffer.hpp"
#include "io/utilities/hostdevice_vector.hpp"

-#include <cudf/column/column_device_view.cuh>
-#include <cudf/lists/lists_column_device_view.cuh>
-#include <cudf/table/table_device_view.cuh>
+#include <cudf/io/datasource.hpp>
#include <cudf/types.hpp>
#include <cudf/utilities/span.hpp>

-#include <cuco/static_map.cuh>
-
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_scalar.hpp>
#include <rmm/device_uvector.hpp>
@@ -39,9 +35,7 @@

#include <vector>

-namespace cudf {
-namespace io {
-namespace parquet {
+namespace cudf::io::parquet {

using cudf::io::detail::string_index_pair;

@@ -72,11 +66,6 @@ struct input_column_info {

namespace gpu {

-auto constexpr KEY_SENTINEL = size_type{-1};
-auto constexpr VALUE_SENTINEL = size_type{-1};
-using map_type = cuco::static_map<size_type, size_type>;
-using slot_type = map_type::pair_atomic_type;
-
/**
* @brief Enums for the flags in the page header
*/
@@ -108,7 +97,8 @@ struct PageNestingInfo {
int32_t max_rep_level;

// set during preprocessing
-int32_t size; // this page/nesting-level's size contribution to the output column
+int32_t size; // this page/nesting-level's row count contribution to the output column, if fully
+// decoded
int32_t page_start_value; // absolute output start index in output column data

// set during data decoding
@@ -247,6 +237,17 @@ struct ColumnChunkDesc {
int32_t src_col_schema; // my schema index in the file
};

+/**
+* @brief The struct to store raw/intermediate file data before parsing.
+*/
+struct file_intermediate_data {
+std::vector<std::unique_ptr<datasource::buffer>> raw_page_data;
+rmm::device_buffer decomp_page_data;
+hostdevice_vector<gpu::ColumnChunkDesc> chunks{};
+hostdevice_vector<gpu::PageInfo> pages_info{};
+hostdevice_vector<gpu::PageNestingInfo> page_nesting_info{};
+};

Review comment on lines +243 to +249 (Contributor Author):
These variables are used to store temporary data in the old read() function.

+
/**
* @brief Struct describing an encoder column
*/
@@ -293,50 +294,8 @@ struct PageFragment {
constexpr unsigned int kDictHashBits = 16;
constexpr size_t kDictScratchSize = (1 << kDictHashBits) * sizeof(uint32_t);

-/**
-* @brief Return the byte length of parquet dtypes that are physically represented by INT32
-*/
-inline uint32_t __device__ int32_logical_len(type_id id)
-{
-switch (id) {
-case cudf::type_id::INT8: [[fallthrough]];
-case cudf::type_id::UINT8: return 1;
-case cudf::type_id::INT16: [[fallthrough]];
-case cudf::type_id::UINT16: return 2;
-case cudf::type_id::DURATION_SECONDS: [[fallthrough]];
-case cudf::type_id::DURATION_MILLISECONDS: return 8;
-default: return 4;
-}
-}
-
-/**
-* @brief Translate the row index of a parent column_device_view into the index of the first value
-* in the leaf child.
-* Only works in the context of parquet writer where struct columns are previously modified s.t.
-* they only have one immediate child.
-*/
-inline size_type __device__ row_to_value_idx(size_type idx,
-parquet_column_device_view const& parquet_col)
-{
-// with a byte array, we can't go all the way down to the leaf node, but instead we want to leave
-// the size at the parent level because we are writing out parent row byte arrays.
-auto col = *parquet_col.parent_column;
-while (col.type().id() == type_id::LIST or col.type().id() == type_id::STRUCT) {
-if (col.type().id() == type_id::STRUCT) {
-idx += col.offset();
-col = col.child(0);
-} else {
-auto list_col = cudf::detail::lists_column_device_view(col);
-auto child = list_col.child();
-if (parquet_col.output_as_byte_array && child.type().id() == type_id::UINT8) { break; }
-idx = list_col.offset_at(idx);
-col = child;
-}
-}
-return idx;
-}
-
struct EncPage;
+struct slot_type;

/**
* @brief Struct describing an encoder column chunk
@@ -630,6 +589,4 @@ void EncodeColumnIndexes(device_span<EncColumnChunk> chunks,
rmm::cuda_stream_view stream);

} // namespace gpu
-} // namespace parquet
-} // namespace io
-} // namespace cudf
+} // namespace cudf::io::parquet
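
Reviewer note: per the author's comment, `file_intermediate_data` holds what used to be temporaries inside the old `read()` function. The sketch below shows the general refactoring idea only: locals of one monolithic function become members of a struct so that separate stages can share them. It uses standard containers in place of cuDF and RMM types, and `intermediate_data`, `staged_reader`, `load`, and `decode` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical stand-in for file_intermediate_data: buffers that outlive one stage.
struct intermediate_data {
  std::vector<std::uint8_t> raw_page_data;     // filled by the load stage
  std::vector<std::uint8_t> decomp_page_data;  // filled by a placeholder "decompress" step
};

class staged_reader {
 public:
  // Stage 1: these buffers would previously have been locals of a single read().
  void load(std::vector<std::uint8_t> bytes)
  {
    _data.raw_page_data    = std::move(bytes);
    _data.decomp_page_data = _data.raw_page_data;  // placeholder for real decompression
  }

  // Stage 2: a separate function consumes what stage 1 produced.
  [[nodiscard]] int decode() const
  {
    assert(!_data.decomp_page_data.empty());
    return std::accumulate(_data.decomp_page_data.begin(), _data.decomp_page_data.end(), 0);
  }

 private:
  intermediate_data _data;  // shared between stages instead of function-local
};
```

Keeping the buffers in one struct also makes their lifetime explicit: they live as long as the reader object, not just for the duration of one call.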
43 changes: 43 additions & 0 deletions cpp/src/io/parquet/reader.cpp
@@ -0,0 +1,43 @@
/*
* Copyright (c) 2022, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include "reader_impl.hpp"

namespace cudf::io::detail::parquet {

reader::reader() = default;

reader::reader(std::vector<std::unique_ptr<datasource>>&& sources,
parquet_reader_options const& options,
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr)
: _impl(std::make_unique<impl>(std::move(sources), options, stream, mr))
{
}

reader::~reader() = default;

table_with_metadata reader::read(parquet_reader_options const& options)
{
// if the user has specified custom row bounds
bool const uses_custom_row_bounds = options.get_num_rows() >= 0 || options.get_skip_rows() != 0;
return _impl->read(options.get_skip_rows(),
options.get_num_rows(),
uses_custom_row_bounds,
options.get_row_groups());
}

} // namespace cudf::io::detail::parquet
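
Reviewer note: `reader::read` treats the row bounds as custom when `num_rows` is non-negative or `skip_rows` is nonzero. The sketch below isolates that predicate. `row_options` is a hypothetical stand-in for the two getters on `parquet_reader_options`, and the reading of a negative `num_rows` as "not set" is an assumption inferred from the `>= 0` check, not something this diff confirms.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the two options consulted by reader::read.
struct row_options {
  std::int64_t skip_rows = 0;   // rows to skip from the start
  std::int64_t num_rows  = -1;  // assumed: a negative value means "not set"
};

// Mirrors the predicate in the diff: any explicit limit or offset counts as custom bounds.
constexpr bool uses_custom_row_bounds(row_options const& opts)
{
  return opts.num_rows >= 0 || opts.skip_rows != 0;
}

// Default-constructed options carry no explicit bounds under the assumption above.
inline bool default_options_are_not_custom()
{
  row_options const defaults{};
  assert(!uses_custom_row_bounds(defaults));
  return !uses_custom_row_bounds(defaults);
}
```

Note that `num_rows == 0` counts as a custom bound under this predicate, so a request for zero rows is distinguished from an unset limit.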