As part of the drive towards implementing the micro-kernel parquet decoding strategy, we would like to start centralizing the core parquet decoding loop into a generic templated implementation that can be reused. At a high level, all of the various parquet kernels are structured similarly to this:
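A minimal sketch of that common structure (the stream and helper names follow the descriptions below and are illustrative, not the actual cudf identifiers):

```
// Illustrative pseudocode of the shared high-level decode loop.
while (values_remaining > 0) {
  // decode definition/repetition/dictionary level runs from their streams
  def_stream.decode();
  rep_stream.decode();
  dict_stream.decode();

  // produce validity, list offsets, and the source->destination mapping
  decode_general_outputs();

  // kernel-specific final value decode
  PROCESS();
}
```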
The various *_stream.decode() functions are the key bottleneck in decoding parquet data. At the moment, most of our kernels use the older/slower way of decoding these streams. The rle_stream class was developed to do this in a more parallel (and more configurable) way, but only a few kernels use it at the moment because it does not currently handle dictionaries. The work for that is underway and very close to completion (#14950).
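For context, the level streams use Parquet's RLE/bit-packed hybrid encoding. A serial sketch of what decoding one stream involves (the helpers read_uleb128, read_fixed, and unpack_bits are assumptions for illustration):

```cpp
// Serial sketch of Parquet's RLE/bit-packed hybrid level decoding.
// read_uleb128, read_fixed, and unpack_bits are hypothetical helpers.
int out = 0;
while (out < num_values) {
  uint32_t const header = read_uleb128(&cur);  // varint run header
  if (header & 1) {
    // bit-packed run: (header >> 1) groups of 8 values, bit_width bits each
    int const count = (header >> 1) * 8;
    unpack_bits(&cur, bit_width, &levels[out], count);
    out += count;
  } else {
    // RLE run: a single fixed-width value repeated (header >> 1) times
    int const count  = header >> 1;
    uint32_t const v = read_fixed(&cur, bit_width);
    for (int i = 0; i < count; ++i) { levels[out++] = v; }
  }
}
```

Each run's length is only known after reading its header, so this loop is inherently serial across runs; that dependency is what rle_stream was designed to break up and parallelize.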
decode_general_outputs is a function that produces validity information, list offsets, and the mapping of source data (location in the parquet data page) to destination data (location in the final cudf column). The amount of work this function has to do varies greatly with the characteristics of the input data: nullability, presence of lists, etc.
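As a concrete illustration, a serial sketch of the simplest nullable, non-list case might look like this (set_validity_bit and the buffer names are assumptions, not the actual code):

```cpp
// Serial sketch of the nullable, non-list case: a value is physically
// present in the page only where the definition level reaches the maximum.
int num_valid = 0;
for (int i = 0; i < batch_size; ++i) {
  int const dst_row = first_output_row + i;
  bool const valid  = (def_levels[i] == max_def_level);
  set_validity_bit(dst_row, valid);
  // map the num_valid'th physical value in the page to its output row
  if (valid) { src_to_dst[num_valid++] = dst_row; }
}
```

Lists add repetition-level handling on top of this, which is where much of the variable cost comes from.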
PROCESS varies from kernel to kernel; it is essentially the user-provided function that actually does the final data decoding.
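For example, a PROCESS functor for plain fixed-width data might look roughly like this (hypothetical signature; the real interface will differ):

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical PROCESS functor: copy fixed-width values from their page
// positions to their mapped output rows. Signature is illustrative only.
struct decode_fixed_width {
  __device__ void operator()(uint8_t const* page_data,
                             int const*     src_to_dst,
                             uint8_t*       out,
                             int            value_size,
                             int            num_values) const
  {
    // each thread copies one value from its page position to its output row
    for (int i = threadIdx.x; i < num_values; i += blockDim.x) {
      memcpy(out + src_to_dst[i] * value_size,
             page_data + i * value_size,
             value_size);
    }
  }
};
```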
We would like to implement this high-level loop as a templated function that can be tailored to produce multiple, better-optimized kernels based on the key data characteristics. For example:
template <
  // page data characteristics
  bool nullable,
  bool has_lists,
  bool has_dictionary,
  // ... other data characteristics
  // parameters which can be tuned
  int decode_buffer_size,
  int decode_warp_count,
  // ... other tuning parameters
  // user-provided PROCESS functor
  typename ProcessFunc>
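A sketch of how the templated kernel itself might use those parameters (hypothetical and heavily simplified; not the actual implementation):

```cpp
// Hypothetical skeleton showing why the template parameters matter: tuning
// values are compile-time constants, and feature flags let unused state
// (including shared memory) compile away via if constexpr.
template <bool nullable,
          bool has_lists,
          bool has_dictionary,
          int decode_buffer_size,
          int decode_warp_count,
          typename ProcessFunc>
__global__ void decode_page_generic(/* page state, functor, ... */)
{
  constexpr int block_size = decode_warp_count * 32;  // tunable per kernel

  // value scratch space, sized per specialization
  __shared__ int value_buffer[decode_buffer_size];

  if constexpr (has_lists) {
    // only list-handling specializations pay for this shared memory;
    // the discarded branch is never instantiated
    __shared__ int offset_buffer[decode_buffer_size];
    // ... decode repetition levels into offset_buffer ...
  }

  // ... run the high-level loop, invoking ProcessFunc for the final decode ...
}
```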
There are several reasons to do this:
The rle_stream class uses shared memory, so it is a big advantage for kernels that don't need a given feature (say, list decoding) to be able to use less of it, as in the kernel skeleton above where unused features compile away.
It is useful to be able to tune block size per kernel, since different kernels tend to get bottlenecked in different ways (see the launch sketch after this list).
It would allow us to eliminate the old level decoding path.
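On the host side, the specialization (and its block size) could then be chosen per kernel, roughly like this (again hypothetical, building on the decode_page_generic and decode_fixed_width sketches above):

```cpp
// Hypothetical host-side dispatch: each specialization gets its own tuned
// warp count, and the launch block size follows decode_warp_count.
if (has_dict) {
  decode_page_generic<false, false, true, 128, 4, decode_fixed_width>
      <<<num_pages, 4 * 32, 0, stream>>>();
} else {
  decode_page_generic<false, false, false, 128, 8, decode_fixed_width>
      <<<num_pages, 8 * 32, 0, stream>>>();
}
```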
The first candidates for using this would be two new micro-kernels: fixed-width and fixed-width-with-dictionaries (the non-list case for both). We would like to get these in for 24.04, and later on we can start refactoring the larger mass of existing kernels (especially the general-purpose gpuDecodePageData and gpuDecodeStringPageData).
This makes sense to me. Sounds beneficial even if we don't apply the pattern to all kernels.
Could you help me understand the following?
> The rle_stream class uses shared memory, so it is a big advantage for kernels that don't need a given feature (say, list decoding) to be able to use less of it.