Selective decoding of a subset (e.g. columns or row groups) of parquet metadata #5855

alamb · 2024-06-07T12:37:09Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

As part of #5853 we are considering ways to improve parquet metadata decoding speed in the context of files with wide schemas (e.g. 10,000 columns)

One common observation is that many queries only require a small subset of the columns, but because of how standard thrift decoders are implemented, they must decode the entire metadata even if only a subset of the columns is needed

Due to the Apache Thrift variable length encoding, the decoder likely still requires scanning the entire metadata, but there is no need to create rust structs for fields that will not be read.

Simply skipping such fields I think would likely result in substantial savings. Some evidence for this is @jhorstmann's prototype that avoids copying structs off the stack results in 2x performance improvements. See #5775 (comment)

Thus we could likely optimize the decoding of metadata for large schemas even more by selectively decoding only the fields needed. This idea also described at a high level here: https://medium.com/pinterest-engineering/improving-data-processing-efficiency-using-partial-deserialization-of-thrift-16bc3a4a38b4

Describe the solution you'd like
Implement some sort of projection pushdown when decoding metadata. Perhaps we could add a projection argument to this API https://docs.rs/parquet/latest/parquet/file/footer/fn.decode_metadata.html

Describe alternatives you've considered

Additional context

XiangpengHao · 2024-06-07T13:34:23Z

A low-hanging fruit would be selectively converting thrift row groups to parquet-rs row groups. Currently, we build all row group metadata, which takes 30% of the decoding time (as will shown in the #5770 blog post).

arrow-rs/parquet/src/file/footer.rs

Lines 99 to 101 in 087f34b

    
           for rg in t_file_metadata.row_groups { 
        
               row_groups.push(RowGroupMetaData::from_thrift(schema_descr.clone(), rg)?); 
        
           }

If we can filter out the row group early based on the column stats, we can easily skip those row groups.

But this requires changing the decode API, and potentially how we open a parquet file... which is not clear to me

arrow-rs/parquet/src/file/footer.rs

Line 54 in 087f34b

    
           pub fn parse_metadata<R: ChunkReader>(chunk_reader: &R) -> Result<ParquetMetaData> {

alamb · 2024-06-07T14:18:38Z

A low-hanging fruit would be selectively decoding row groups.

Nice -- I updated the title to reflect this

XiangpengHao · 2024-06-07T14:26:39Z

I thought about an alternative (but similar) approach to Pinterest's solution -- instead of decoding and building in-memory structs along the way, we can decouple it and make it two passes.

In the first pass, decode the thrift metadata but do not build the in-memory structures (i.e., no memory allocation, etc). Instead, we only track the location of the important structures. Specifically, instead of building the column chunk structs as we see it, we track the location of the column chunk (offset to the buffer).

In the second pass -- when we actually need to read a column chunk -- we build the in-memory data structure on demand, using the offset tracked in the first pass.

This approach has the advantage of selective decoding (faster, lower memory consumption, etc.) and does not need to change the decoding API (unlike the Pinterest approach). However, it is suboptimal if we actually need to decode the entire metadata, in which case the first pass is pure overhead. Presuming that machine learning workloads (or, in general, wide tables) present high selectivity, we should still save quite a lot.

tustvold · 2024-06-07T14:43:54Z

I think a good first step would be to analyse where the current overheads actually lie, the Pinterest article says the major benefit is avoiding allocations. Whilst #5777 doesn't avoid all allocations, it does eliminate the vast majority of them. If this isn't moving the needle, it might suggest the bottleneck is something different.

IMO the two pass solution you describe is not hugely dissimilar from what that PR is doing.

XiangpengHao · 2024-06-07T14:56:59Z

Below is the flamegraph of decoding parquet metadata and allocation itself does not show up as the bottleneck.

The term "allocation" is ambiguous, it can refer to allocation operations, it can also mean excessive memory footprint (and the efforts to set them up), and I believe the later is the bottleneck.

More formally, I believe the time is spent on two types of tasks: (1) decoding, interpreting the thrift data, which can be SIMD acclerated (2) setup the in-memory sturcture, i.e., inflating the 10MB metadata into 100MB of in-memory representation, which is solved by skipping columns/row groups.

tustvold · 2024-06-07T15:06:22Z

I'm curious how you intend to SIMD accelerate varint decoding, its a case study in being unfriendly to that... But optimising that aspect I would see as more obviously going to improve decode speed.

Regarding the linked flamegraph I'm not sure I'm seeing memory overheads dominating, there isn't much time being spent in allocation/memcpy routines within decode_metadata, nor are we getting decode throughput at a level where I would expect memory throughput bottlenecks to be relevant <1Gpbs.

My reading of that flamegraph is the inlined varint decoding is dominating the runtime of that benchmark

XiangpengHao · 2024-06-07T15:26:17Z

On SIMD decoding varint: https://github.com/as-com/varint-simd

Memory overhead does not show in flamegraph as it is implicit (caused by cpu memory stalls rather than function calls). It is mostly based on speculations... In #5854 (comment) I discussed a simple trick that improves the performance by up to 30%, by reducing the memory footprint and reduce implicit memory movement.

On memory bandwidth: it takes 300ms to decode 100k columns, which allocates 600MB memory, that is ~2GB per second.
Getting rid of/sharing small allocations won't help much because the majority of the memory consumption comes from ColumnChunk which is currently 420 bytes -> 420B * 100k * 10 row group = 420MB.

jhorstmann · 2024-06-07T15:32:45Z

I did not see varint decoding as a bottleneck in my benchmarks. I experimented with using BMI2 instructions, but that still requires at least one branch to check whether we can read 8 bytes at a time and fallback to sequential code if not, and for numbers larger than 8*7 bits one or two more branches. That does not make the code much smaller anymore, and assuming mostly small integers and good branch prediction there seems to be no big improvement.

I think most of the unaccounted time in the flamegraph is related to moving of data on the stack, which rust still does not optimize that well.

tustvold · 2024-06-07T15:45:43Z

Fair enough, I guess we're already much faster than I realised 😅

I think this is definitely an interesting avenue to explore, however, I remain a little hesitant about the complexity vs real-world performance benefit. If we're already at the point where memory throughput is the bottleneck, what hope does any IO subsystem realistically have of keeping up, thrift compression isn't that good. Then again maybe I'm just being a curmudgeon about how practical a 100k column file with 10 row groups even is... It'd be 1TB if each column chunk is 1MB.

alamb · 2024-06-07T17:57:32Z

To be clear, I don't have a need personally (and I don't think InfluxData has a technical need) at the moment to actually invest the engineering time to make the thrift decoding faster.

Instead, my goal of filing these tickets is to leave sufficient information / analysis for anyone for whom it is important (e.g. machine learning use cases, potential users of a "Parquet V3" ,etc) that they could undertake the work if it was actually critical to their workload

I tend to agree with @tustvold that there are likely only a few real world senarios where the system bottleneck is parquet decoding, though I am sure we can come up with them

alamb · 2024-06-17T15:40:38Z

It turns out that someone on discord found out there is a TODO in the code: https://discord.com/channels/885562378132000778/885562378132000781/1252058490512478300

That seems to hint at the ideas in this ticket

arrow-rs/parquet/src/file/footer.rs

Line 92 in 0d1511c

// TODO: row group filtering

alamb added enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate labels Jun 7, 2024

alamb mentioned this issue Jun 7, 2024

[EPIC] A collection of items to improve speed of parquet metadata encoding #5853

Open

4 tasks

alamb changed the title ~~Implement selective decoding of a subset of columns of parquet metadata~~ Implement selective decoding of a subset (e.g. columns or row groups) of parquet metadata Jun 7, 2024

alamb mentioned this issue Jul 4, 2024

API for encoding/decoding ParquetMetadata with more control #6002

Closed

alamb changed the title ~~Implement selective decoding of a subset (e.g. columns or row groups) of parquet metadata~~ Selective decoding of a subset (e.g. columns or row groups) of parquet metadata Aug 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Selective decoding of a subset (e.g. columns or row groups) of parquet metadata #5855

Selective decoding of a subset (e.g. columns or row groups) of parquet metadata #5855

alamb commented Jun 7, 2024

XiangpengHao commented Jun 7, 2024 •

edited

Loading

alamb commented Jun 7, 2024

XiangpengHao commented Jun 7, 2024 •

edited

Loading

tustvold commented Jun 7, 2024 •

edited

Loading

XiangpengHao commented Jun 7, 2024

tustvold commented Jun 7, 2024 •

edited

Loading

XiangpengHao commented Jun 7, 2024

jhorstmann commented Jun 7, 2024

tustvold commented Jun 7, 2024 •

edited

Loading

alamb commented Jun 7, 2024

alamb commented Jun 17, 2024

Selective decoding of a subset (e.g. columns or row groups) of parquet metadata #5855

Selective decoding of a subset (e.g. columns or row groups) of parquet metadata #5855

Comments

alamb commented Jun 7, 2024

XiangpengHao commented Jun 7, 2024 • edited Loading

alamb commented Jun 7, 2024

XiangpengHao commented Jun 7, 2024 • edited Loading

tustvold commented Jun 7, 2024 • edited Loading

XiangpengHao commented Jun 7, 2024

tustvold commented Jun 7, 2024 • edited Loading

XiangpengHao commented Jun 7, 2024

jhorstmann commented Jun 7, 2024

tustvold commented Jun 7, 2024 • edited Loading

alamb commented Jun 7, 2024

alamb commented Jun 17, 2024

XiangpengHao commented Jun 7, 2024 •

edited

Loading

XiangpengHao commented Jun 7, 2024 •

edited

Loading

tustvold commented Jun 7, 2024 •

edited

Loading

tustvold commented Jun 7, 2024 •

edited

Loading

tustvold commented Jun 7, 2024 •

edited

Loading