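EncColumnDesc currently inherits from stats_column_desc (cudf/cpp/src/io/statistics/column_stats.h, lines 40 to 48 in 8cc23bd), which holds pointers to the data and validity mask as well as a cuIO-specific dtype that basically repeats the list of data types supported by cudf. This is the same information stored by cudf::column_device_view, so stats_column_desc can be replaced by it or contain it. This gives us a few benefits: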
- A helpful is_null(i) accessor, which de-duplicates functionality.
- The element<T>() accessor, which allows for the following:
  - On-the-fly construction of string_view for string columns, which removes the need to build a vector of string_views (or nvstrdesc_s, as it is right now) when constructing parquet_column_view. This limitation exists because string columns contain two data streams (offsets and chars) while stats_column_desc only has a pointer to one data stream. This would also automatically resolve "[FEA] Replace nvstrdesc with string_view" (#5682).
  - Automatic construction of decimal32/64 using the scale stored per column.
  - Automatic construction of dictionary32 using indexalator.
  - Future-proofing for any other data type added later that needs a custom accessor.
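To make the accessor idea concrete, here is a minimal host-only sketch. It is not the cudf column_device_view; toy_column_view, toy_decimal32, and the mask layout (one bit per row, set bit means valid) are assumptions made up for illustration. It shows how a single element<T>() entry point can build a string_view from the offsets and chars streams on demand, and pair a decimal value with the per-column scale, without materializing a separate vector of descriptors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <type_traits>

// Hypothetical decimal value: the stored integer plus the column's scale.
struct toy_decimal32 {
  int32_t value;
  int32_t scale;
};

// Hypothetical stand-in for a column view. NOT the cudf class.
struct toy_column_view {
  const void* data = nullptr;           // fixed-width values, or chars for strings
  const int32_t* offsets = nullptr;     // strings only: (num_rows + 1) entries
  const uint32_t* null_mask = nullptr;  // 1 bit per row, set = valid; nullptr = all valid
  int32_t scale = 0;                    // decimals only, stored once per column

  bool is_null(int i) const {
    return null_mask != nullptr && ((null_mask[i / 32] >> (i % 32)) & 1u) == 0;
  }

  template <typename T>
  T element(int i) const {
    if constexpr (std::is_same_v<T, std::string_view>) {
      // Strings need both streams: offsets delimit row i inside chars.
      assert(offsets != nullptr);
      auto const* chars = static_cast<const char*>(data);
      auto const len = static_cast<std::size_t>(offsets[i + 1] - offsets[i]);
      return std::string_view(chars + offsets[i], len);
    } else if constexpr (std::is_same_v<T, toy_decimal32>) {
      // Decimals combine the stored integer with the per-column scale.
      return toy_decimal32{static_cast<const int32_t*>(data)[i], scale};
    } else {
      // Plain fixed-width types are read directly.
      return static_cast<const T*>(data)[i];
    }
  }
};
```

With this shape, a caller asks for `col.element<std::string_view>(i)` or `col.element<int32_t>(i)` and never needs to know how many underlying data streams the column has.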
The construction of all these column_device_views might cause many cudaMemcpys if done using column_device_view::create. Since write_parquet and write_chunk are called with a table_view, it'd be better to call table_device_view::create and use a kernel to copy them into EncColumnDesc.
Because of list columns, we'd also want a second column_device_view member in EncColumnDesc so that we can have both the parent column and the leaf column. The leaf column would be the main column for non-list types and the parent column would only be populated for list types.
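The leaf/parent arrangement described above might look roughly like the following host-only sketch. Every name here (sketch_column_view, sketch_enc_column_desc, has_parent, make_desc, make_list_desc) is hypothetical and only models the shape of the proposal; the real EncColumnDesc carries many more fields.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical minimal column view. NOT the cudf class.
struct sketch_column_view {
  const void* data = nullptr;        // leaf values
  const int32_t* offsets = nullptr;  // list columns: row i spans [offsets[i], offsets[i+1])
  int32_t size = 0;                  // number of rows in this column
};

// Sketch of a descriptor that holds column views as members
// instead of inheriting from a stats descriptor.
struct sketch_enc_column_desc {
  sketch_column_view leaf_column;    // always populated
  sketch_column_view parent_column;  // populated only for list types
  bool has_parent = false;

  // Rows seen by the writer: list rows for list columns, leaf rows otherwise.
  int32_t num_rows() const { return has_parent ? parent_column.size : leaf_column.size; }
  // Values actually encoded always come from the leaf.
  int32_t num_values() const { return leaf_column.size; }
};

// Non-list column: the leaf is the main column and no parent is set.
inline sketch_enc_column_desc make_desc(const sketch_column_view& leaf) {
  sketch_enc_column_desc desc;
  desc.leaf_column = leaf;
  return desc;
}

// List column: keep both the parent (offsets) and the leaf (values).
inline sketch_enc_column_desc make_list_desc(const sketch_column_view& parent,
                                             const sketch_column_view& leaf) {
  assert(parent.offsets != nullptr);
  sketch_enc_column_desc desc;
  desc.leaf_column = leaf;
  desc.parent_column = parent;
  desc.has_parent = true;
  return desc;
}
```

Keeping the leaf as the always-present member means code paths for non-list types never have to look at the parent at all.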
What about other uses of stats_column_desc? Should we replace it everywhere with column_device_view?
As discussed offline, for now we should move away from EncColumnDesc inheriting from stats_column_desc and have a column_device_view member instead. If/when we resolve #6920, stats_column_desc can be removed.