[FEA] Improve ORC reader filtering and performance #13882

Open
GregoryKimball opened this issue Aug 15, 2023 · 0 comments
Labels
0 - Backlog In queue waiting for assignment cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

GregoryKimball commented Aug 15, 2023

Background

libcudf includes readers and writers for two popular binary formats for columnar data: Apache Parquet and Apache ORC. These formats were originally introduced in 2013, and both have open source specifications (ORC, PQ) and reference implementations (ORC, PQ) maintained by Apache. ORC also serves as the foundation for Meta’s variant DWRF and their new format "Alpha".

Both formats have hierarchical data layouts, support encoding and compression, include fully-featured type systems, and find widespread use in database systems and data warehousing. Please refer to this paper by Zeng et al. for a detailed comparison of the concepts, features and performance of the Parquet and ORC binary formats. Note that the two formats use "row group" for different things: Parquet files are composed of “row groups” (~128 MB) and “pages” (~1 MB), whereas ORC files are composed of “stripes” (~70 MB) and “row groups” (10K rows).

Some of the differences include:

  • finer granularity in data buffers by default in ORC (better for filtered IO and targeted lookups)
  • finer granularity in bloom filters in ORC (supported at "row group" level in ORC, but not at the "page" level in Parquet)
  • Dremel-encoding for list types in Parquet (faster decoding for >8 levels of nesting)
  • support for ACID transaction tables in ORC datasets (enabling data updates without full re-write)
  • In Parquet the data "page" is also the unit of encoding and compression, whereas in ORC each encoding "stream" and "compression chunk" often includes multiple "row groups".
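
To illustrate why row-group-level bloom filters matter for targeted lookups, here is a minimal sketch of an equality lookup that skips row groups whose filter rules out the key. It is a toy model only: the `BloomFilter` class, the SHA-256 based hashing, and the helper names are assumptions made for this example and do not reflect the ORC on-disk bloom filter layout or any libcudf API.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter (hypothetical; not the ORC bloom filter format)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bitset

    def _positions(self, value):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def build_filters(row_groups):
    """Build one filter per row group, as a writer would at write time."""
    filters = []
    for rows in row_groups:
        bf = BloomFilter()
        for value in rows:
            bf.add(value)
        filters.append(bf)
    return filters


def candidate_row_groups(filters, key):
    """Indices of row groups that cannot be ruled out for `key`."""
    return [i for i, bf in enumerate(filters) if bf.might_contain(key)]
```

A Bloom filter never produces false negatives, so the row group that actually holds the key is always among the candidates; the smaller the row group, the less data a false positive costs.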

Expanding functionality of the ORC reader

The libcudf Parquet reader has gained functionality in key areas, including the chunked reader (release 22.12) to control how much of a table is materialized, and AST-based filtering (release 23.08) to avoid reading row groups that aren’t needed. Filtered IO (including bloom filters) is even more important to ORC users thanks to the fine granularity of ORC row groups (10K rows per row group). We should align our Parquet and ORC reader designs and separate shared utilities from format-specific details wherever possible.
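
As a rough illustration of what "controlling how much of a table is materialized" means, the sketch below splits a table into row ranges that each stay within a byte budget. The function name `chunk_row_ranges` and the per-row size list are hypothetical and exist only for this example; this is not the libcudf chunked reader interface.

```python
def chunk_row_ranges(row_sizes, chunk_limit_bytes):
    """Yield (start, end) row ranges whose total size fits the byte budget.

    row_sizes: estimated decoded size in bytes of each row (assumed known).
    A single row larger than the budget still forms its own chunk, so the
    reader always makes progress.
    """
    start = 0
    used = 0
    for i, size in enumerate(row_sizes):
        # Close the current chunk before adding a row that would overflow it.
        if i > start and used + size > chunk_limit_bytes:
            yield (start, i)
            start, used = i, 0
        used += size
    if start < len(row_sizes):
        yield (start, len(row_sizes))


# Usage: materialize one chunk at a time instead of the whole table.
for first, last in chunk_row_ranges([40, 40, 40, 100, 10], chunk_limit_bytes=100):
    pass  # decode rows [first, last) here
```

Because each chunk is bounded in bytes, both the per-chunk row count and the working memory stay under the caller's limits.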

| Topic | Status | Notes |
|---|---|---|
| Add AST-based stripe filtering to the ORC reader | | #13348 added AST-based row group filtering to the Parquet reader. For this topic, we should accept an AST filter parameter, use it to determine the matching stripes, read only those stripes, and then post-filter the rows in the resulting table. We already have a read_raw_orc_statistics function to support these steps. We may refactor some of the AST + min/max stats tools into shared utilities. Also see issue #12512. |
| Add chunked reader for ORC | | See #12228 about this topic from Spark-RAPIDS. Chunked readers are useful because they allow for partial materialization of tables from their binary representation. #11867 added chunking for Parquet decoding, which means the compressed row groups were fully read and decompressed and then decoded up to a requested size in bytes. (tbd) is extending chunking to include Parquet decompression as well. Chunking helps libcudf applications avoid two limits: the size_type limit on row count and the GPU working memory limit for each worker. |
| Support bloom filters in ORC reader | | See #4410. Because ORC is commonly used for data lookup and filtered IO, supporting bloom filters in reads is especially important for ORC. This feature would allow the caller to specify equality conditions and check them against ORC bloom filters. |
| Support index roundtripping in ORC | | See #8708, a request from cuDF-python to preserve the index when writing and then reading a file. |
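
The stripe filtering steps described for the first topic (select matching stripes from min/max statistics, read only those, then post-filter rows) can be sketched as follows. This is a simplified model with a single integer column and a fixed `value > threshold` predicate; `Stripe`, `make_stripe`, and `filter_greater_than` are hypothetical names for this example, not libcudf types or functions.

```python
from dataclasses import dataclass


@dataclass
class Stripe:
    min_value: int  # column minimum, as recorded in stripe statistics
    max_value: int  # column maximum, as recorded in stripe statistics
    rows: list      # stand-in for the encoded stripe data


def make_stripe(rows):
    """Build a stripe and its min/max statistics from raw rows."""
    return Stripe(min(rows), max(rows), list(rows))


def filter_greater_than(stripes, threshold):
    """Evaluate `value > threshold` with stripe pruning plus a post-filter.

    Returns (number of stripes read, matching rows).
    """
    # Step 1: keep only stripes whose statistics allow a match.
    selected = [s for s in stripes if s.max_value > threshold]
    # Step 2: "read" the selected stripes and post-filter individual rows,
    # because a selected stripe may still contain non-matching rows.
    rows = [r for s in selected for r in s.rows if r > threshold]
    return len(selected), rows
```

The post-filter step is required because statistics can only prove that a stripe has no matches; they cannot prove that every row in a stripe matches.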

Performance optimizations for binary format reading

| Topic | Status | Notes |
|---|---|---|
| Optimize ORC reader performance for list data | #13708 | We observed poor performance with singly nested lists and high row counts. |
| Optimize ORC reader performance for decimal data | | See #13251. We need a parallel algorithm to replace the single-threaded decoding of the variable-width encoded representation. |
| Evaluate multi-kernel decoding in ORC | | See #13622 for experiments with multiple decode kernels, and #13302 for an example of a specialized strings decode kernel. |
| Experiment with pipelining ORC reads | | See #13828 for information about reader pipelining. |
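
For the decimal topic, the sketch below shows why variable-width decoding is naturally serial and one way it could be restructured. It assumes the values are stored as zigzag base-128 varints (least-significant group first, high bit set on every byte except the last of a value), which is our reading of the ORC decimal encoding; all function names are hypothetical, and this is not necessarily the approach proposed in #13251.

```python
def encode_zigzag_varint(value):
    """Encode a signed integer as a zigzag base-128 varint."""
    zz = value * 2 if value >= 0 else -value * 2 - 1
    out = bytearray()
    while True:
        byte = zz & 0x7F
        zz >>= 7
        if zz:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte of this value
            return bytes(out)


def unzigzag(n):
    return (n >> 1) ^ -(n & 1)


def decode_sequential(buf):
    """Single-threaded decode: each value's start depends on the previous one."""
    values, acc, shift = [], 0, 0
    for b in buf:
        acc |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:
            values.append(unzigzag(acc))
            acc, shift = 0, 0
    return values


def decode_by_boundaries(buf):
    """Boundary-first decode: every step maps onto data-parallel primitives."""
    # Step 1: a byte ends a value when its high bit is clear (per-byte test).
    ends = [i for i, b in enumerate(buf) if not b & 0x80]
    # Step 2: each value starts right after the previous value's end.
    starts = [0] + [e + 1 for e in ends[:-1]]

    # Step 3: with boundaries known, values decode independently of each other.
    def decode_one(start, end):
        acc = 0
        for k, b in enumerate(buf[start:end + 1]):
            acc |= (b & 0x7F) << (7 * k)
        return unzigzag(acc)

    return [decode_one(s, e) for s, e in zip(starts, ends)]
```

On a GPU, step 1 would be a per-byte predicate, step 2 a stream compaction, and step 3 one thread (or a small group) per value; the Python lists here only stand in for those primitives.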
@GregoryKimball GregoryKimball added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) labels Aug 15, 2023
@GregoryKimball GregoryKimball moved this to Story Issue in libcudf Aug 15, 2023
@GregoryKimball GregoryKimball changed the title [FEA] Story - Accelerate language model pretraining n/a Aug 16, 2023
@GregoryKimball GregoryKimball changed the title n/a FEA - Issue 13882 Aug 16, 2023
@GregoryKimball GregoryKimball closed this as not planned Won't fix, can't repro, duplicate, stale Aug 16, 2023
@GregoryKimball GregoryKimball removed this from the Language model acceleration milestone Aug 16, 2023
@GregoryKimball GregoryKimball removed the status in libcudf Aug 16, 2023
@GregoryKimball GregoryKimball changed the title FEA - Issue 13882 FEA - Improve ORC reader filtering and performance Sep 10, 2023
@GregoryKimball GregoryKimball moved this to Story Issue in libcudf Sep 10, 2023
@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment cuIO cuIO issue and removed 2 - In Progress Currently a work in progress strings strings issues (C++ and Python) labels Sep 10, 2023
@GregoryKimball GregoryKimball changed the title FEA - Improve ORC reader filtering and performance [FEA] - Improve ORC reader filtering and performance Sep 10, 2023
@GregoryKimball GregoryKimball changed the title [FEA] - Improve ORC reader filtering and performance [FEA] Improve ORC reader filtering and performance Sep 10, 2023