[FEA] Implement a templated parquet decoding kernel suitable for reuse in micro-kernel optimization approach. #14953

Open · nvdbaranec opened this issue Feb 1, 2024 · 3 comments

nvdbaranec (Contributor) commented on Feb 1, 2024

As part of the drive towards implementing the micro-kernel parquet decoding strategy, we would like to start centralizing the core parquet decoding loop into a generic, templated implementation that can be reused. At a high level, all of the various parquet decoding kernels are structured similarly to this:

kernel(PageInfo p)
{
    // page setup, bounds checking (for skip_rows/num_rows), etc.
    setup_code();

    while (there are still values to decode in p) {
        def_levels   = def_stream.decode(def_levels);
        rep_levels   = p.has_lists ? rep_stream.decode(rep_levels) : none;
        dict_indices = p.has_dict  ? dict_stream.decode(dict_indices) : none;
        decode_general_outputs(def_levels, rep_levels, dict_indices);

        PROCESS(p, def_levels, rep_levels, dict_indices);
    }
}

The various *_stream.decode() functions are the key bottleneck in decoding parquet data. At the moment, most of our kernels use the older, slower way of decoding these streams. The rle_stream class was developed to do this in a more parallel (and more configurable) way, but only a few kernels use it so far because it does not yet handle dictionaries. That work is underway and very close to completion (#14950).
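For context on why this decode is expensive: the level streams use Parquet's RLE/bit-packed hybrid encoding, in which each run begins with a ULEB128 header whose low bit selects the run type, so a run's length is only known after its header is parsed and runs must be walked sequentially. A minimal serial sketch of the format (illustrative names only, not the cudf implementation):

#include <cstddef>
#include <cstdint>

// Serial sketch of Parquet's RLE/bit-packed hybrid level encoding, shown for
// context only; rle_stream decodes batches of these runs warp-parallel instead.
size_t decode_levels_serial(uint8_t const* cur, uint8_t const* end,
                            int bit_width, int16_t* out, size_t max_out)
{
  size_t count = 0;
  while (cur < end && count < max_out) {
    // each run starts with a ULEB128 header; the low bit selects the run type
    uint64_t header = 0;
    int shift       = 0;
    do {
      header |= uint64_t(*cur & 0x7f) << shift;
      shift += 7;
    } while (*cur++ & 0x80);

    if (header & 1) {
      // bit-packed run: (header >> 1) groups of 8 values, packed LSB-first
      size_t const n = (header >> 1) * 8;
      for (size_t i = 0; i < n && count < max_out; i++) {
        int16_t value = 0;
        for (int b = 0; b < bit_width; b++) {
          uint64_t const bit = i * bit_width + b;
          value |= int16_t((cur[bit >> 3] >> (bit & 7)) & 1) << b;
        }
        out[count++] = value;
      }
      cur += (n * bit_width + 7) / 8;
    } else {
      // RLE run: (header >> 1) repetitions of one little-endian value
      size_t const n = header >> 1;
      int16_t value  = 0;
      for (int b = 0; b < (bit_width + 7) / 8; b++) { value |= int16_t(cur[b]) << (8 * b); }
      cur += (bit_width + 7) / 8;
      for (size_t i = 0; i < n && count < max_out; i++) { out[count++] = value; }
    }
  }
  return count;
}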

decode_general_outputs is a function that produces validity, list offset information, and the mapping of source data (location in the parquet data page) to destination data (location in the final cudf column). The amount of work this function has to do varies greatly with the characteristics of the input data: nullability, presence of lists, etc.
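To make that concrete: under Parquet's Dremel encoding, a leaf value is non-null exactly when its definition level equals the column's maximum definition level, a repetition level of 0 marks the start of a new top-level row, and the page only stores physical values for the non-null slots. A scalar, illustrative-only sketch of that bookkeeping (the real code does this in parallel on the GPU):

#include <cstdint>
#include <vector>

// Illustrative scalar version of the bookkeeping decode_general_outputs performs.
struct general_outputs {
  std::vector<bool> validity;    // one entry per output leaf slot
  std::vector<int> value_index;  // page value index for each slot (-1 if null)
  std::vector<int> row_starts;   // slots where a new top-level row begins
};

general_outputs decode_general_outputs_scalar(std::vector<int16_t> const& def_levels,
                                              std::vector<int16_t> const& rep_levels,
                                              int16_t max_def_level)
{
  general_outputs out;
  int src = 0;  // index of the next physical value stored in the page
  for (size_t i = 0; i < def_levels.size(); i++) {
    // rep_levels is empty for non-list columns: every slot starts a new row
    if (rep_levels.empty() || rep_levels[i] == 0) { out.row_starts.push_back(int(i)); }
    bool const valid = def_levels[i] == max_def_level;
    out.validity.push_back(valid);
    // only fully-defined slots consume a physical value from the page
    out.value_index.push_back(valid ? src++ : -1);
  }
  return out;
}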

PROCESS varies from kernel to kernel; it is essentially the user-provided function that performs the final data decoding.

We would like to implement this high-level loop as a templated function that can be tailored to produce multiple, more optimal kernels based on the key data characteristics. For example:

template <// page data characteristics
          bool nullable,
          bool has_lists,
          bool has_dictionary,
          // etc.

          // parameters which can be tuned
          int decode_buffer_size,
          int decode_warp_count,
          // etc.

          // user-provided PROCESS functor
          ProcessFunc Proc>
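A rough sketch of how the resulting kernel skeleton could look; every type and name below is a hypothetical placeholder, not the final cudf API:

#include <cstdint>

// hypothetical stand-in; the real cudf PageInfo is much richer
struct PageInfo {
  int num_values;
};

// example PROCESS functor: decode plain fixed-width values into the output column
struct fixed_width_copy {
  __device__ void operator()(PageInfo const& p, int16_t const* defs, int16_t const* reps,
                             uint32_t const* dict_indices, int offset, int count)
  {
    // ...copy `count` decoded values starting at `offset` into the cudf column...
  }
};

template <bool nullable,
          bool has_lists,
          bool has_dictionary,
          int decode_buffer_size,
          int decode_warp_count,
          typename ProcessFunc>
__global__ void gpu_decode_page_generic(PageInfo const* pages, ProcessFunc proc)
{
  PageInfo const& p = pages[blockIdx.x];
  // setup_code(): skip_rows/num_rows bounds checks, rle_stream setup, etc.

  // shared memory is reserved only for the features this instantiation uses;
  // a kernel without lists pays one element for rep_buf, not a full buffer
  __shared__ int16_t def_buf[decode_buffer_size];
  __shared__ int16_t rep_buf[has_lists ? decode_buffer_size : 1];
  __shared__ uint32_t dict_buf[has_dictionary ? decode_buffer_size : 1];

  int processed = 0;
  while (processed < p.num_values) {
    int const batch = min(decode_buffer_size, p.num_values - processed);
    // def_stream.decode(def_buf, batch);
    if constexpr (has_lists) { /* rep_stream.decode(rep_buf, batch); */ }
    if constexpr (has_dictionary) { /* dict_stream.decode(dict_buf, batch); */ }
    // decode_general_outputs(def_buf, rep_buf, dict_buf, batch);
    if constexpr (nullable) { /* write validity bits derived from def_buf */ }
    proc(p, def_buf, rep_buf, dict_buf, processed, batch);
    processed += batch;
  }
}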

There are several reasons to do this:

  • The rle_stream class uses shared memory, so it is a big advantage for kernels that don't need a given feature (say, list decoding) to be able to use less of it, as in the sketch above.
  • It is useful to be able to tune the block size per kernel, since different kernels tend to be bottlenecked in different ways (see the launch-bounds sketch after this list).
  • It would allow us to eliminate the old level decoding path.
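On the block-size point, making the warp count a template parameter also lets each instantiation carry its own launch bounds, so the compiler can tune register usage per variant. A small illustration (hypothetical kernel name, reusing the PageInfo stand-in from above):

// each instantiation gets occupancy hints matched to its own tuned block size;
// e.g. a dictionary-heavy variant might use more warps to hide memory latency
template <int decode_warp_count>
__global__ void __launch_bounds__(decode_warp_count * 32)
decode_fixed_width_example(PageInfo const* pages)
{
  // decode body as sketched above
}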

The first candidates for this would be two new micro-kernels: fixed-width and fixed-width-with-dictionaries (the non-list case for both). We would like to get these in for 24.04, and later on we can start refactoring the larger mass of existing kernels (especially the general-purpose gpuDecodePageData and gpuDecodeStringPageData).
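For illustration, host-side dispatch over those first two micro-kernels could look something like the following, reusing the hypothetical gpu_decode_page_generic and fixed_width_copy sketches from above (actual cudf dispatch will certainly differ):

// purely illustrative: pick an instantiation based on page characteristics
void launch_fixed_width_decode(PageInfo const* pages, int num_pages,
                               bool has_dictionary, cudaStream_t stream)
{
  constexpr int buf_size = 128;  // decode_buffer_size: tunable per kernel
  constexpr int warps    = 4;    // decode_warp_count: tunable per kernel
  if (has_dictionary) {
    gpu_decode_page_generic<true, false, true, buf_size, warps, fixed_width_copy>
      <<<num_pages, warps * 32, 0, stream>>>(pages, fixed_width_copy{});
  } else {
    gpu_decode_page_generic<true, false, false, buf_size, warps, fixed_width_copy>
      <<<num_pages, warps * 32, 0, stream>>>(pages, fixed_width_copy{});
  }
}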

nvdbaranec added the feature request, Needs Triage, cuIO, Performance, and tech debt labels on Feb 1, 2024
nvdbaranec self-assigned this on Feb 1, 2024
nvdbaranec (Contributor, Author) commented:
@mattahrens @abellina

nvdbaranec (Contributor, Author) commented:
@etseidl

vuule (Contributor) commented on Feb 6, 2024

This makes sense to me. Sounds beneficial even if we don't apply the pattern to all kernels.
Could you help me understand the following?

> The rle_stream class uses shared memory, so it is a big advantage for kernels that don't need a given feature (say, list decoding) to be able to use less of it.

Is this explaining the benefit over rle_stream?
