[FEA] Improve ORC reader filtering and performance #13882

GregoryKimball · 2023-08-15T04:13:18Z

Background

libcudf includes readers and writers for two popular binary formats for columnar data: Apache Parquet and Apache ORC. These formats were originally introduced in 2013, and both have open source specifications (ORC, PQ) and reference implementations (ORC, PQ) maintained by Apache. ORC also serves as the foundation for Meta’s variant DWRF and their new format "Alpha".

Both formats have hierarchical data layouts, support encoding and compression, include fully-featured type systems, and find widespread use in database systems and data warehousing. Please refer to this paper by Zeng et al. for a detailed comparison of the concepts, features, and performance of the Parquet and ORC binary formats. Note that Parquet files are composed of “row groups” (~128 MB) and “pages” (~1 MB), whereas ORC files are composed of “stripes” (~70 MB) and “row groups” (10K rows).

Some of the differences include:

- finer granularity in data buffers by default in ORC (better for filtered IO and targeted lookups)
- finer granularity in bloom filters in ORC (supported at "row group" level in ORC, but not at the "page" level in Parquet)
- Dremel-encoding for list types in Parquet (faster decoding for >8 levels of nesting)
- support for ACID transaction tables in ORC datasets (enabling data updates without full re-write)

In Parquet the data "page" is also the unit of encoding and compression, whereas in ORC each encoding "stream" and "compression chunk" often includes multiple "row groups".
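
To make the row-group-level bloom filter point concrete, here is a minimal sketch of how per-row-group filters let a reader rule out row groups for an equality lookup. It is a toy: `ToyBloomFilter`, `build_row_group_filters`, and `candidate_row_groups` are hypothetical names, the SHA-256-based hashing is chosen only for convenience, and nothing here reproduces ORC's on-disk bloom filter layout or its hash functions.

```python
import hashlib


class ToyBloomFilter:
    """A tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, value):
        # Derive k bit positions from salted SHA-256 digests (illustrative only,
        # not the hashing scheme used by ORC).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def build_row_group_filters(column, rows_per_group):
    """Build one filter per row group of `rows_per_group` rows."""
    filters = []
    for start in range(0, len(column), rows_per_group):
        bloom = ToyBloomFilter()
        for value in column[start:start + rows_per_group]:
            bloom.add(value)
        filters.append(bloom)
    return filters


def candidate_row_groups(filters, needle):
    """Indices of row groups that cannot be ruled out for `column == needle`."""
    return [i for i, bloom in enumerate(filters) if bloom.might_contain(needle)]


column = [f"user-{i}" for i in range(40)]
filters = build_row_group_filters(column, rows_per_group=10)
hits = candidate_row_groups(filters, "user-25")
```

The row group that actually holds `"user-25"` (index 2) is always among `hits`; other row groups appear only on a false positive. With 10K-row ORC row groups, each filter covers far fewer values than a filter built over a whole Parquet row group, so a negative answer skips less data per filter but is available at a much finer grain.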

Expanding functionality of the ORC reader

The libcudf Parquet reader has gained functionality in key areas, including the chunked reader (release 22.12) to control how much of a table is materialized, and AST-based filtering (release 23.08) to avoid reading row groups that aren’t needed. Filtered IO (including bloom filters) is even more important to ORC users thanks to the fine granularity of ORC row groups (10k rows per row group). We should align our Parquet and ORC reader designs and separate shared utilities from format-specific details wherever possible.
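
As a rough illustration of what a chunked reader has to decide, the sketch below groups consecutive stripes into chunks that stay under a byte budget and under the row-count limit imposed by a 32-bit `size_type`. The `Stripe` record, its `decoded_bytes` estimate, and `plan_chunks` are hypothetical names for this example and are not part of libcudf; a real reader would also need to split inside a stripe that exceeds a limit by itself.

```python
from dataclasses import dataclass

SIZE_TYPE_MAX = 2**31 - 1  # cudf::size_type is a 32-bit signed integer


@dataclass
class Stripe:
    num_rows: int
    decoded_bytes: int  # hypothetical estimate of the decoded size


def plan_chunks(stripes, byte_limit, row_limit=SIZE_TYPE_MAX):
    """Group consecutive stripe indices into chunks that respect both limits.

    A single stripe larger than a limit still forms its own chunk.
    """
    chunks, current, rows, size = [], [], 0, 0
    for index, stripe in enumerate(stripes):
        over_budget = (size + stripe.decoded_bytes > byte_limit
                       or rows + stripe.num_rows > row_limit)
        if current and over_budget:
            # Close the current chunk before it would exceed a limit.
            chunks.append(current)
            current, rows, size = [], 0, 0
        current.append(index)
        rows += stripe.num_rows
        size += stripe.decoded_bytes
    if current:
        chunks.append(current)
    return chunks


stripes = [
    Stripe(1_000_000, 60),
    Stripe(1_000_000, 70),
    Stripe(500_000, 30),
    Stripe(2_000_000, 90),
]
chunks = plan_chunks(stripes, byte_limit=128)
```

With a budget of 128 (arbitrary units), the plan is `[[0], [1, 2], [3]]`: stripe 1 would push the first chunk past the budget, and stripe 3 would do the same to the second.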

Topic	Status	Notes
Add AST-based stripe filtering to the ORC reader		#13348 added AST-based row group filtering to the Parquet reader. For this topic, we should accept an AST filter parameter, use it to determine the matching stripes, read only those stripes, and then post-filter the rows in the resulting table. We already have a `read_raw_orc_statistics` function to support these steps. We may refactor some of the AST + min/max stats tools to `utilities`. Also see issue #12512.
Add chunked reader for ORC		See #12228 about this topic from Spark-RAPIDS. Chunked readers are useful because they allow for partial materialization of tables from their binary representation. #11867 added chunking for Parquet decoding, which means the compressed row groups were fully read and decompressed and then decoded up to a requested size in bytes. (tbd) is extending chunking to include Parquet decompression as well. Chunking helps libcudf applications avoid two limits: the size_type limit on row count and the GPU working memory limit for each worker
Support bloom filters in ORC reader		See #4410. Due to ORC’s common usage for data lookup and filtered IO, supporting bloom filters in reads is especially important for ORC. This feature would allow the caller to specify equality conditions and check against ORC bloom filters.
Support index roundtripping in ORC		See #8708, a request from cuDF-python to preserve the index when writing+reading a file
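
The stripe-filtering steps described in the first row of the table (select matching stripes from min/max statistics, read only those stripes, then post-filter rows) can be sketched as follows. This is a simplified stand-in, not the libcudf design: it handles a single `column <op> literal` comparison rather than a full AST, computes statistics from in-memory lists instead of calling `read_raw_orc_statistics`, ignores nulls, and uses hypothetical names (`StripeStats`, `stripe_may_match`, `filtered_read`).

```python
import operator
from dataclasses import dataclass


@dataclass
class StripeStats:
    minimum: int
    maximum: int


def stripe_may_match(stats, op, literal):
    """Return False only when min/max statistics prove that no row can match."""
    if op == "==":
        return stats.minimum <= literal <= stats.maximum
    if op == "<":
        return stats.minimum < literal
    if op == "<=":
        return stats.minimum <= literal
    if op == ">":
        return stats.maximum > literal
    if op == ">=":
        return stats.maximum >= literal
    return True  # unknown operator: keep the stripe so results stay correct


ROW_OPS = {
    "==": operator.eq,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def filtered_read(stripes, op, literal):
    """`stripes` is a list of non-empty int lists standing in for one column."""
    # Step 1: statistics (a real file stores these; here they are computed).
    stats = [StripeStats(min(s), max(s)) for s in stripes]
    # Step 2: keep only stripes whose value range can satisfy the predicate.
    selected = [i for i, st in enumerate(stats) if stripe_may_match(st, op, literal)]
    # Step 3: "read" only the selected stripes.
    rows = [value for i in selected for value in stripes[i]]
    # Step 4: post-filter, because a selected stripe may hold non-matching rows.
    keep = ROW_OPS[op]
    return selected, [value for value in rows if keep(value, literal)]


stripes = [[1, 5, 9], [10, 14, 19], [20, 25, 29]]
selected, rows = filtered_read(stripes, ">", 18)
```

For the predicate `> 18`, stripe 0 is skipped because its maximum is 9, stripes 1 and 2 are read, and the post-filter drops 10 and 14 from stripe 1. Statistics can only prove that a stripe has no matches, so the post-filter in step 4 is still required.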

Performance optimizations for binary format reading

Topic	Status	Notes
Optimize ORC reader performance for list data	✅ #13708	We observed poor performance with singly nested lists and high row counts
Optimize ORC reader performance for decimal data		See #13251. We need a parallel algorithm to replace the single-threaded decoding of the variable-width encoded representation
Evaluate multi-kernel decoding in ORC		See #13622 for experiments with multiple decode kernels, and #13302 for an example of a specialized strings decode kernel
Experiment with pipelining ORC reads		See #13828 for information about reader pipelining
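
For the decimal topic in the table above, the following sketch shows one way variable-width decoding can be restructured so that each value decodes independently. It assumes the zigzag, base-128 varint layout that the ORC specification describes for decimal data, and it is not the approach taken in #13251. The function names are hypothetical, and the list comprehensions stand in for what would be parallel steps on a GPU.

```python
def encode_varint(value):
    """Zigzag + base-128 varint encoding of a signed integer (helper for the demo)."""
    zigzag = value * 2 if value >= 0 else -value * 2 - 1
    out = bytearray()
    while True:
        byte = zigzag & 0x7F
        zigzag >>= 7
        if zigzag:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: last byte of this value
            return bytes(out)


def decode_varints_two_pass(data):
    """Decode a stream of varints by first locating value boundaries."""
    # Pass 1: a byte with the high bit clear terminates a value. Each byte can
    # be tested independently of every other byte.
    ends = [i for i, byte in enumerate(data) if not byte & 0x80]
    starts = [0] + [end + 1 for end in ends[:-1]]

    # Pass 2: with boundaries known, every value decodes on its own.
    def decode_one(start, end):
        zigzag = 0
        for shift, byte in enumerate(data[start:end + 1]):
            zigzag |= (byte & 0x7F) << (7 * shift)
        return (zigzag >> 1) ^ -(zigzag & 1)  # undo zigzag

    return [decode_one(start, end) for start, end in zip(starts, ends)]


values = [0, -1, 1, 12345, -98765432101234567890, 300]
stream = b"".join(encode_varint(v) for v in values)
decoded = decode_varints_two_pass(stream)
```

The sequential dependency in a naive decoder comes from not knowing where value `n` starts until value `n - 1` has been consumed. Locating terminator bytes up front removes that dependency; on a GPU the boundary list would typically come from a stream compaction or prefix sum over the terminator flags.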


GregoryKimball added the feature request, 2 - In Progress, libcudf, and strings labels Aug 15, 2023

GregoryKimball added this to the Language model acceleration milestone Aug 15, 2023

GregoryKimball added this to libcudf Aug 15, 2023

GregoryKimball moved this to Story Issue in libcudf Aug 15, 2023

GregoryKimball changed the title ~~[FEA] Story - Accelerate language model pretraining~~ n/a Aug 16, 2023

GregoryKimball changed the title ~~n/a~~ FEA - Issue 13882 Aug 16, 2023

GregoryKimball closed this as not planned Aug 16, 2023

GregoryKimball removed this from the Language model acceleration milestone Aug 16, 2023

GregoryKimball removed the status in libcudf Aug 16, 2023

GregoryKimball changed the title ~~FEA - Issue 13882~~ FEA - Improve ORC reader filtering and performance Sep 10, 2023

GregoryKimball moved this to Story Issue in libcudf Sep 10, 2023

GregoryKimball added the 0 - Backlog and cuIO labels and removed the 2 - In Progress and strings labels Sep 10, 2023

GregoryKimball added this to the ORC continuous improvement milestone Sep 10, 2023

GregoryKimball reopened this Sep 10, 2023

GregoryKimball changed the title ~~FEA - Improve ORC reader filtering and performance~~ [FEA] - Improve ORC reader filtering and performance Sep 10, 2023

GregoryKimball changed the title ~~[FEA] - Improve ORC reader filtering and performance~~ [FEA] Improve ORC reader filtering and performance Sep 10, 2023
