[FEA] Implement a templated parquet decoding kernel suitable for reuse in micro-kernel optimization approach. #14953

Open · nvdbaranec opened this issue Feb 1, 2024 · 3 comments

nvdbaranec (Contributor) commented on Feb 1, 2024

As part of the drive towards implementing the micro-kernel parquet decoding strategy, we would like to start centralizing the core parquet decoding loop into a generic, templated implementation that can be reused. At a high level, all of the various parquet decoding kernels are structured similarly to this:

kernel(PageInfo p)
{
    // page setup, bounds checking (for skip_rows/num_rows), etc.
    setup_code();

    while (there are still values to decode in p) {
        def_levels   = def_stream.decode(def_levels);
        rep_levels   = p.has_lists ? rep_stream.decode(rep_levels) : none;
        dict_indices = p.has_dict  ? dict_stream.decode(dict_indices) : none;
        decode_general_outputs(def_levels, rep_levels, dict_indices);

        PROCESS(p, def_levels, rep_levels, dict_indices);
    }
}

The various *_stream.decode() functions are the key bottleneck in decoding parquet data. At the moment, most of our kernels use the older, slower way of decoding these streams. The rle_stream class was developed to do this in a more parallel (and more configurable) way, but only a few kernels use it so far because it does not yet handle dictionaries. That work is underway and very close to completion (#14950).
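For context on why this decode is expensive: the level streams use Parquet's RLE/bit-packed hybrid encoding, in which each run begins with a ULEB128 header whose low bit selects the run type, so a run's length is only known after its header is parsed and runs must be walked sequentially. A minimal serial sketch of the format (illustrative names only, not the cudf implementation):

#include <cstddef>
#include <cstdint>

// Serial sketch of Parquet's RLE/bit-packed hybrid level encoding, shown for
// context only; rle_stream decodes batches of these runs warp-parallel instead.
size_t decode_levels_serial(uint8_t const* cur, uint8_t const* end,
                            int bit_width, int16_t* out, size_t max_out)
{
  size_t count = 0;
  while (cur < end && count < max_out) {
    // each run starts with a ULEB128 header; the low bit selects the run type
    uint64_t header = 0;
    int shift       = 0;
    do {
      header |= uint64_t(*cur & 0x7f) << shift;
      shift += 7;
    } while (*cur++ & 0x80);

    if (header & 1) {
      // bit-packed run: (header >> 1) groups of 8 values, packed LSB-first
      size_t const n = (header >> 1) * 8;
      for (size_t i = 0; i < n && count < max_out; i++) {
        int16_t value = 0;
        for (int b = 0; b < bit_width; b++) {
          uint64_t const bit = i * bit_width + b;
          value |= int16_t((cur[bit >> 3] >> (bit & 7)) & 1) << b;
        }
        out[count++] = value;
      }
      cur += (n * bit_width + 7) / 8;
    } else {
      // RLE run: (header >> 1) repetitions of one little-endian value
      size_t const n = header >> 1;
      int16_t value  = 0;
      for (int b = 0; b < (bit_width + 7) / 8; b++) { value |= int16_t(cur[b]) << (8 * b); }
      cur += (bit_width + 7) / 8;
      for (size_t i = 0; i < n && count < max_out; i++) { out[count++] = value; }
    }
  }
  return count;
}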

decode_general_outputs is a function that produces validity, list offset information, and the mapping of source data (location in the parquet data page) to destination data (location in the final cudf column). The amount of work this function has to do varies greatly with the characteristics of the input data: nullability, presence of lists, etc.
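To make that concrete: under Parquet's Dremel encoding, a leaf value is non-null exactly when its definition level equals the column's maximum definition level, a repetition level of 0 marks the start of a new top-level row, and the page only stores physical values for the non-null slots. A scalar, illustrative-only sketch of that bookkeeping (the real code does this in parallel on the GPU):

#include <cstdint>
#include <vector>

// Illustrative scalar version of the bookkeeping decode_general_outputs performs.
struct general_outputs {
  std::vector<bool> validity;    // one entry per output leaf slot
  std::vector<int> value_index;  // page value index for each slot (-1 if null)
  std::vector<int> row_starts;   // slots where a new top-level row begins
};

general_outputs decode_general_outputs_scalar(std::vector<int16_t> const& def_levels,
                                              std::vector<int16_t> const& rep_levels,
                                              int16_t max_def_level)
{
  general_outputs out;
  int src = 0;  // index of the next physical value stored in the page
  for (size_t i = 0; i < def_levels.size(); i++) {
    // rep_levels is empty for non-list columns: every slot starts a new row
    if (rep_levels.empty() || rep_levels[i] == 0) { out.row_starts.push_back(int(i)); }
    bool const valid = def_levels[i] == max_def_level;
    out.validity.push_back(valid);
    // only fully-defined slots consume a physical value from the page
    out.value_index.push_back(valid ? src++ : -1);
  }
  return out;
}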

PROCESS varies from kernel to kernel; it is essentially the user-provided function that performs the final data decoding.

We would like to implement this high-level loop as a templated function that can be tailored to produce multiple, more optimal kernels based on the key data characteristics. For example:

template <// page data characteristics
          bool nullable,
          bool has_lists,
          bool has_dictionary,
          // etc.

          // parameters which can be tuned
          int decode_buffer_size,
          int decode_warp_count,
          // etc.

          // user-provided PROCESS functor
          ProcessFunc Proc>
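A rough sketch of how the resulting kernel skeleton could look; every type and name below is a hypothetical placeholder, not the final cudf API:

#include <cstdint>

// hypothetical stand-in; the real cudf PageInfo is much richer
struct PageInfo {
  int num_values;
};

// example PROCESS functor: decode plain fixed-width values into the output column
struct fixed_width_copy {
  __device__ void operator()(PageInfo const& p, int16_t const* defs, int16_t const* reps,
                             uint32_t const* dict_indices, int offset, int count)
  {
    // ...copy `count` decoded values starting at `offset` into the cudf column...
  }
};

template <bool nullable,
          bool has_lists,
          bool has_dictionary,
          int decode_buffer_size,
          int decode_warp_count,
          typename ProcessFunc>
__global__ void gpu_decode_page_generic(PageInfo const* pages, ProcessFunc proc)
{
  PageInfo const& p = pages[blockIdx.x];
  // setup_code(): skip_rows/num_rows bounds checks, rle_stream setup, etc.

  // shared memory is reserved only for the features this instantiation uses;
  // a kernel without lists pays one element for rep_buf, not a full buffer
  __shared__ int16_t def_buf[decode_buffer_size];
  __shared__ int16_t rep_buf[has_lists ? decode_buffer_size : 1];
  __shared__ uint32_t dict_buf[has_dictionary ? decode_buffer_size : 1];

  int processed = 0;
  while (processed < p.num_values) {
    int const batch = min(decode_buffer_size, p.num_values - processed);
    // def_stream.decode(def_buf, batch);
    if constexpr (has_lists) { /* rep_stream.decode(rep_buf, batch); */ }
    if constexpr (has_dictionary) { /* dict_stream.decode(dict_buf, batch); */ }
    // decode_general_outputs(def_buf, rep_buf, dict_buf, batch);
    if constexpr (nullable) { /* write validity bits derived from def_buf */ }
    proc(p, def_buf, rep_buf, dict_buf, processed, batch);
    processed += batch;
  }
}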

There are several reasons to do this:

  • The rle_stream class uses shared memory, so it is a big advantage for kernels that don't need a given feature (say, list decoding) to be able to use less of it, as in the sketch above.
  • It is useful to be able to tune the block size per kernel, since different kernels tend to be bottlenecked in different ways (see the launch-bounds sketch after this list).
  • It would allow us to eliminate the old level decoding path.
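On the block-size point, making the warp count a template parameter also lets each instantiation carry its own launch bounds, so the compiler can tune register usage per variant. A small illustration (hypothetical kernel name, reusing the PageInfo stand-in from above):

// each instantiation gets occupancy hints matched to its own tuned block size;
// e.g. a dictionary-heavy variant might use more warps to hide memory latency
template <int decode_warp_count>
__global__ void __launch_bounds__(decode_warp_count * 32)
decode_fixed_width_example(PageInfo const* pages)
{
  // decode body as sketched above
}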

The first candidates for this would be two new micro-kernels: fixed-width and fixed-width-with-dictionaries (the non-list case for both). We would like to get these in for 24.04, and later on we can start refactoring the larger mass of existing kernels (especially the general-purpose gpuDecodePageData and gpuDecodeStringPageData).
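For illustration, host-side dispatch over those first two micro-kernels could look something like the following, reusing the hypothetical gpu_decode_page_generic and fixed_width_copy sketches from above (actual cudf dispatch will certainly differ):

// purely illustrative: pick an instantiation based on page characteristics
void launch_fixed_width_decode(PageInfo const* pages, int num_pages,
                               bool has_dictionary, cudaStream_t stream)
{
  constexpr int buf_size = 128;  // decode_buffer_size: tunable per kernel
  constexpr int warps    = 4;    // decode_warp_count: tunable per kernel
  if (has_dictionary) {
    gpu_decode_page_generic<true, false, true, buf_size, warps, fixed_width_copy>
      <<<num_pages, warps * 32, 0, stream>>>(pages, fixed_width_copy{});
  } else {
    gpu_decode_page_generic<true, false, false, buf_size, warps, fixed_width_copy>
      <<<num_pages, warps * 32, 0, stream>>>(pages, fixed_width_copy{});
  }
}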

nvdbaranec added the feature request, Needs Triage, cuIO, Performance, and tech debt labels on Feb 1, 2024
nvdbaranec self-assigned this on Feb 1, 2024
nvdbaranec (Contributor, Author) commented:
@mattahrens @abellina

nvdbaranec (Contributor, Author) commented:
@etseidl

vuule (Contributor) commented on Feb 6, 2024

This makes sense to me. Sounds beneficial even if we don't apply the pattern to all kernels.
Could you help me understand the following?

> The rle_stream class uses shared memory, so it is a big advantage for kernels that don't need a given feature (say, list decoding) to be able to use less of it.

Is this explaining the benefit over rle_stream?
