Codecov Report
```diff
@@            Coverage Diff             @@
##             main     #532      +/-   ##
==========================================
- Coverage   81.28%   81.26%   -0.02%
==========================================
  Files         380      380
  Lines       23432    23376      -56
==========================================
- Hits        19046    18997      -49
+ Misses       4386     4379       -7
```
Do you think we could parallelize decoding the pages (I believe it was the pages, correct me if I am wrong), similar to what was done for writing? Maybe accept a closure that takes the pages and returns them decoded?
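For illustration, a minimal sketch of what such a closure-based API could look like. All names here (`read_with`, `CompressedPage`, `Page`) are hypothetical, not this crate's API, and the parallelism uses rayon:

```rust
use rayon::prelude::*;

// Stand-ins for the real page types (hypothetical).
struct CompressedPage(Vec<u8>);
struct Page(Vec<u8>);

fn decompress(page: CompressedPage) -> Page {
    Page(page.0) // stand-in for real decompression
}

// The reader accepts a closure mapping compressed pages to decoded ones,
// so the caller chooses serial or parallel decoding.
fn read_with<F>(pages: Vec<CompressedPage>, decode: F) -> Vec<Page>
where
    F: Fn(Vec<CompressedPage>) -> Vec<Page>,
{
    decode(pages)
}

fn main() {
    let pages = vec![CompressedPage(vec![0; 16]), CompressedPage(vec![1; 16])];
    // opt into parallel decoding with rayon
    let decoded = read_with(pages, |ps| ps.into_par_iter().map(decompress).collect());
    assert_eq!(decoded.len(), 2);
}
```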
Force-pushed from 9e992f2 to b755944
After jorgecarleitao/parquet2#63 we get a nice flamegraph composed of 3 main blocks:
@jorgecarleitao is the flame graph taken with or without the
Sorry, what do you mean? There is no
You're right. I suppose the
Force-pushed from b755944 to 56c98c7
In the csv-parser, I circumvent this by doing the simd utf8 check at the end on the whole buffer.
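For reference, a minimal sketch of that whole-buffer check (the simdutf8 call is the same one used in the loop below; the function and buffer names are assumed):

```rust
// validate the whole byte buffer once, instead of per field
fn validate_whole(buffer: &[u8]) -> Result<(), simdutf8::basic::Utf8Error> {
    simdutf8::basic::from_utf8(buffer).map(|_| ())
}
```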
Is that enough though? Individually invalid UTF-8 strings might be made valid UTF-8 on the whole buffer (I am not sure though)?
Yeah, come to think of it. 🤔
I do not think it is sufficient, but I can't find an example. Asked SO for guidance.
The answer seems to be no:

```rust
fn main() {
    let a = "π"; // valid utf8
    let a = a.as_bytes(); // [207, 128]
    // the first byte alone is an incomplete, hence invalid, utf8 sequence,
    // even though the whole buffer is valid
    assert!(std::str::from_utf8(&a[..1]).is_err());
}
```
I did a local benchmark with the csv-parser, and found that delaying validation and then verifying in a short loop had a lot of wins.

- Best case: no simd validation
- Immediate validation
- Delayed validation: short loop
Example of the code used:

```rust
let mut is_valid = true;
if delay_utf8_validation(v.encoding, v.ignore_errors) {
    let mut start = 0usize;
    for &end in &v.offsets[1..] {
        // SAFETY: offsets produced by the parser are within `v.data`
        let slice = unsafe { v.data.get_unchecked(start..end as usize) };
        start = end as usize;
        is_valid &= simdutf8::basic::from_utf8(slice).is_ok();
    }
    if !is_valid {
        return Err(PolarsError::ComputeError("invalid utf8 data in csv".into()));
    }
}
```
This PR simplifies the code to read `parquet`, making it a bit more future-proof and opening the door to improving write performance by re-using buffers (improvements upstream). I do not observe differences in performance (vs main) in the following parquet configurations:
It generates flamegraphs that imo are quite optimized:
This corresponds to reading an f64 column from a single row group with 1M rows, with a page size of 1MB (the default in pyarrow).
(same but for a utf8 column (column index 2))
The majority of the time is spent deserializing the data to arrow, which means that the main gains to be had continue to be on that front.
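To make the pipeline shape concrete, here is a schematic sketch of the three stages visible in the flamegraphs (IO, decompression, deserialization). All types and functions are stand-ins, not arrow2's actual API:

```rust
// Stand-in types (hypothetical): the real ones live in parquet2/arrow2.
struct CompressedDataPage(Vec<u8>);
struct DataPage(Vec<u8>);
struct Float64Array(Vec<f64>);

fn next_page() -> Option<CompressedDataPage> {
    None // stage 1: IO — read the next compressed page of the column chunk
}

fn decompress(page: CompressedDataPage) -> DataPage {
    DataPage(page.0) // stage 2: decompression
}

fn deserialize(_pages: Vec<DataPage>) -> Float64Array {
    // stage 3: deserialization to arrow — the block that dominates the flamegraph
    Float64Array(Vec::new())
}

fn main() {
    let mut pages = Vec::new();
    while let Some(page) = next_page() {
        pages.push(decompress(page));
    }
    let _array = deserialize(pages);
}
```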
Backward incompatible:

- Uses `FallibleStreamingIterator` instead of `StreamingIterator` (of `Result<Page>`). As before, we re-export these APIs in `io::parquet::read`.
- Pages must now be decompressed before being passed to `RowGroupIterator` (i.e. in parallelizing). This is now enforced by the type system (`DataPage` vs `CompressedDataPage`), so that we do not get it wrong.
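For users migrating, the consumption pattern changes roughly as below (a sketch against the fallible-streaming-iterator crate, which defines this trait; the page handling is schematic):

```rust
use fallible_streaming_iterator::FallibleStreamingIterator;

// Before: a StreamingIterator over Result<Page>, so each item had to be unwrapped.
// After: a FallibleStreamingIterator over Page — the error moves into `next` itself.
fn consume<I, E>(mut pages: I) -> Result<(), E>
where
    I: FallibleStreamingIterator<Error = E>,
{
    // `next` advances and returns Ok(Some(&item)), Ok(None) when exhausted,
    // or Err(e) if producing the next item failed
    while let Some(page) = pages.next()? {
        let _ = page; // decode/deserialize the page here
    }
    Ok(())
}
```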