Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] In parquet writer, use column_device_view in EncColumnDesc #6893

Closed
devavret opened this issue Dec 3, 2020 · 3 comments · Fixed by #7097
Closed

[FEA] In parquet writer, use column_device_view in EncColumnDesc #6893

devavret opened this issue Dec 3, 2020 · 3 comments · Fixed by #7097
Assignees
Labels
cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@devavret
Copy link
Contributor

devavret commented Dec 3, 2020

EncColumnDesc currently inherits from stats_column_desc which holds pointers to data and validity mask as well as a cuIO specific dtype which basically repeats the list of datatypes supported by cudf.

struct stats_column_desc {
statistics_dtype stats_dtype; //!< physical data type of column
uint32_t num_rows; //!< number of rows in column
uint32_t num_values; //!< Number of data values in column. Different from num_rows in case of
//!< nested columns
const uint32_t *valid_map_base; //!< base of valid bit map for this column (null if not present)
const void *column_data_base; //!< base ptr to column data
int32_t ts_scale; //!< timestamp scale (>0: multiply by scale, <0: divide by -scale)
};

This is the same information stored by cudf::column_device_view and therefore can be replaced by it/contain it.
This gives us a few benefits:

  1. Automatic application of offset to data and nullmask ptr. (The latter didn't exist; although, it's being added in Fix nullmask offset handling in parquet and orc writer #6889)
  2. Helpful accessors for is_null(i) which de-duplicates functionality.
  3. element<T>() accessor, which allows for the following:
    1. On the fly construction of string_view in case of string columns, thus obviating the need to construct a vector of string_views (or, nvstrdesc_s, as it is right now) at the time of construction of parquet_column_view. This limitation exists because string columns contain two data streams (offsets and chars) while stats_column_desc only has pointer to one data stream. This would also automatically resolve [FEA] Replace nvstrdesc with string_view #5682
    2. Automatic construction of decimal32/64 using the scale stored per column.
    3. Automatic construction of dictionary32 using indexalator.
    4. Future proofing any other data type added that needs a custom accessor.
  4. Encapsulation of nesting offset data
    size_type const *const
    *nesting_offsets; //!< If column is a nested type, contains offset array of each nesting level
    size_type nesting_levels; //!< Number of nesting levels in column. 0 means no nesting.
    which currently incurs a memcpy at the time of parquet_column_view construction.
    while (curr_col.type().id() == type_id::LIST) {
    lists_column_view list_col(curr_col);
    offsets_array.push_back(list_col.offsets().data<size_type>());
    curr_col = list_col.child();
    }
    _offsets_array = offsets_array;

The construction of all these column_device_views might cause many cudaMemcpys if done using column_device_view::create. Since write_parquet and write_chunk are called with a table_view, it'd be better to call table_device_view::create and use a kernel to copy them into EncColumnDesc.

Because of list columns, we'd also want a second column_device_view member in EncColumnDesc so that we can have both the parent column and the leaf column. The leaf column would be the main column for non-list types and the parent column would only be populated for list types.

@devavret devavret added feature request New feature or request Needs Triage Need team to review and classify code quality cuIO cuIO issue labels Dec 3, 2020
@kkraus14 kkraus14 added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Dec 3, 2020
@vuule
Copy link
Contributor

vuule commented Dec 4, 2020

What about other uses of stats_column_desc? Should we replace it everywhere with column_device_view?

@devavret
Copy link
Contributor Author

devavret commented Dec 4, 2020

What about other uses of stats_column_desc? Should we replace it everywhere with column_device_view?

As discussed offline, for now we should move away from EncColumnDesc inheriting from stats_column_desc and have a column_device_view member instead. If/when we resolve #6920, stats_column_desc can be removed.

@jrhemstad
Copy link
Contributor

The construction of all these column_device_views might cause many cudaMemcpys if done using column_device_view::create

The functionality added in #6605 should be able to help with this.

@rapids-bot rapids-bot bot closed this as completed in #7097 Feb 4, 2021
rapids-bot bot pushed a commit that referenced this issue Feb 4, 2021
Closes #6893, closes #6894. Contributes to #5682 

Reduce usage of stats_column_desc members in Parquet writer with column_device_view members.

Authors:
  - Kumar Aatish (@kaatish)

Approvers:
  - David (@davidwendt)
  - Vukasin Milovanovic (@vuule)
  - Devavret Makkar (@devavret)

URL: #7097
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants