Codecov Report
```diff
@@            Coverage Diff             @@
##             main     #532      +/-   ##
==========================================
- Coverage   81.28%   81.26%   -0.02%
==========================================
  Files         380      380
  Lines       23432    23376      -56
==========================================
- Hits        19046    18997      -49
+ Misses       4386     4379       -7
```
Do you think we could parallelize decoding the pages (I believe it was the pages, correct me if I am wrong), similar to what was done for writing? Maybe accept a closure that takes the pages and returns them decoded?
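For illustration, a minimal sketch of what such a closure-based API could look like. All names here (`read_with`, `CompressedPage`, `Page`) are hypothetical, not this crate's API, and the parallelism uses rayon:

```rust
use rayon::prelude::*;

// Stand-ins for the real page types (hypothetical).
struct CompressedPage(Vec<u8>);
struct Page(Vec<u8>);

fn decompress(page: CompressedPage) -> Page {
    Page(page.0) // stand-in for real decompression
}

// The reader accepts a closure mapping compressed pages to decoded ones,
// so the caller chooses serial or parallel decoding.
fn read_with<F>(pages: Vec<CompressedPage>, decode: F) -> Vec<Page>
where
    F: Fn(Vec<CompressedPage>) -> Vec<Page>,
{
    decode(pages)
}

fn main() {
    let pages = vec![CompressedPage(vec![0; 16]), CompressedPage(vec![1; 16])];
    // opt into parallel decoding with rayon
    let decoded = read_with(pages, |ps| ps.into_par_iter().map(decompress).collect());
    assert_eq!(decoded.len(), 2);
}
```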
Force-pushed from 9e992f2 to b755944
After jorgecarleitao/parquet2#63 we get a nice flamegraph composed of 3 main blocks:
@jorgecarleitao is the flame graph taken with or without the
Sorry, what do you mean? There is no
You're right. I suppose the
Force-pushed from b755944 to 56c98c7
In the csv-parser, I circumvent this by doing the simd utf8 check at the end on the whole buffer.
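For reference, a minimal sketch of that whole-buffer check (the simdutf8 call is the same one used in the loop below; the function and buffer names are assumed):

```rust
// validate the whole byte buffer once, instead of per field
fn validate_whole(buffer: &[u8]) -> Result<(), simdutf8::basic::Utf8Error> {
    simdutf8::basic::from_utf8(buffer).map(|_| ())
}
```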
Is that enough though? Individually invalid UTF-8 strings might be made valid UTF-8 on the whole buffer (I am not sure though)?
Yeah, come to think of it. 🤔
I do not think it is sufficient, but I can't find an example. Asked SO for guidance.
The answer seems to be no:

```rust
fn main() {
    let a = "π"; // valid utf8
    let a = a.as_bytes(); // [207, 128]
    // the first byte alone is an incomplete, hence invalid, utf8 sequence,
    // even though the whole buffer is valid
    assert!(std::str::from_utf8(&a[..1]).is_err());
}
```
I did a local benchmark with the csv-parser, and found that delaying validation and then verifying in a short loop had a lot of wins.

- Best case: no simd validation
- Immediate validation
- Delayed validation: short loop
Example of the code used:

```rust
let mut is_valid = true;
if delay_utf8_validation(v.encoding, v.ignore_errors) {
    let mut start = 0usize;
    for &end in &v.offsets[1..] {
        // SAFETY: offsets produced by the parser are within `v.data`
        let slice = unsafe { v.data.get_unchecked(start..end as usize) };
        start = end as usize;
        is_valid &= simdutf8::basic::from_utf8(slice).is_ok();
    }
    if !is_valid {
        return Err(PolarsError::ComputeError("invalid utf8 data in csv".into()));
    }
}
```
This PR simplifies the code to read `parquet`, making it a bit more future-proof and opening the door to improving write performance by re-using buffers (improvements upstream). I do not observe differences in performance (vs main) in the following parquet configurations:
It generates flamegraphs that imo are quite optimized:
This corresponds to reading an f64 column from a single row group with 1M rows, with a page size of 1MB (the default in pyarrow).
(same but for a utf8 column (column index 2))
The majority of the time is spent deserializing the data to arrow, which means that the main gains to be had continue to be on that front.
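To make the pipeline shape concrete, here is a schematic sketch of the three stages visible in the flamegraphs (IO, decompression, deserialization). All types and functions are stand-ins, not arrow2's actual API:

```rust
// Stand-in types (hypothetical): the real ones live in parquet2/arrow2.
struct CompressedDataPage(Vec<u8>);
struct DataPage(Vec<u8>);
struct Float64Array(Vec<f64>);

fn next_page() -> Option<CompressedDataPage> {
    None // stage 1: IO — read the next compressed page of the column chunk
}

fn decompress(page: CompressedDataPage) -> DataPage {
    DataPage(page.0) // stage 2: decompression
}

fn deserialize(_pages: Vec<DataPage>) -> Float64Array {
    // stage 3: deserialization to arrow — the block that dominates the flamegraph
    Float64Array(Vec::new())
}

fn main() {
    let mut pages = Vec::new();
    while let Some(page) = next_page() {
        pages.push(decompress(page));
    }
    let _array = deserialize(pages);
}
```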
Backward incompatible:

- Uses `FallibleStreamingIterator` instead of `StreamingIterator` (of `Result<Page>`). As before, we re-export these APIs in `io::parquet::read`.
- Pages must now be decompressed before being passed to `RowGroupIterator` (i.e. in parallelizing). This is now enforced by the type system (`DataPage` vs `CompressedDataPage`), so that we do not get it wrong.
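For users migrating, the consumption pattern changes roughly as below (a sketch against the fallible-streaming-iterator crate, which defines this trait; the page handling is schematic):

```rust
use fallible_streaming_iterator::FallibleStreamingIterator;

// Before: a StreamingIterator over Result<Page>, so each item had to be unwrapped.
// After: a FallibleStreamingIterator over Page — the error moves into `next` itself.
fn consume<I, E>(mut pages: I) -> Result<(), E>
where
    I: FallibleStreamingIterator<Error = E>,
{
    // `next` advances and returns Ok(Some(&item)), Ok(None) when exhausted,
    // or Err(e) if producing the next item failed
    while let Some(page) = pages.next()? {
        let _ = page; // decode/deserialize the page here
    }
    Ok(())
}
```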