Simplify parquet arrow RecordReader #1021

Merged: 4 commits, Dec 13, 2021

Contributor

@tustvold tustvold commented Dec 10, 2021

Which issue does this PR close?

Closes #1020. Related to #171 (better performance reading dictionary encoded strings)

Rationale for this change

See ticket

What changes are included in this PR?

This alters RecordReader to remove some shared mutable state, along with the concept of being in the middle of a record.

Are there any user-facing changes?

No

@github-actions github-actions bot added the parquet Changes to the parquet crate label Dec 10, 2021
@codecov-commenter

codecov-commenter commented Dec 10, 2021

Codecov Report

Merging #1021 (cd0f759) into master (e0abda2) will decrease coverage by 0.00%.
The diff coverage is 82.60%.

@@            Coverage Diff             @@
##           master    #1021      +/-   ##
==========================================
- Coverage   82.31%   82.30%   -0.01%     
==========================================
  Files         168      168              
  Lines       49031    49026       -5     
==========================================
- Hits        40359    40350       -9     
- Misses       8672     8676       +4     
Impacted Files Coverage Δ
parquet/src/arrow/record_reader.rs 92.77% <82.60%> (-0.96%) ⬇️
parquet/src/encodings/encoding.rs 93.52% <0.00%> (-0.20%) ⬇️
arrow/src/array/transform/mod.rs 85.10% <0.00%> (-0.14%) ⬇️
parquet_derive/src/parquet_field.rs 66.21% <0.00%> (ø)
arrow/src/datatypes/datatype.rs 66.38% <0.00%> (+0.42%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@alamb
Contributor

alamb commented Dec 10, 2021

Filed #1022 to track CI failure in "nightly" builds

@alamb
Contributor

alamb commented Dec 10, 2021

I fixed the nightly failures in #1023 -- will merge to this PR to get that to pass too

@alamb
Contributor

alamb commented Dec 10, 2021

I think we should run the parquet performance benchmark for this change -- I will do so

Contributor

@alamb alamb left a comment

I read the code carefully and it looks good to me. I am running the benchmarks on a GCP machine and will report the numbers shortly.

@@ -75,9 +73,7 @@ impl<T: DataType> RecordReader<T> {
column_desc: column_schema,
num_records: 0,
num_values: 0,
values_seen: 0,
Contributor

These fields look like they have been here since the initial implementation by @liurenjie1024 in apache/arrow#4292

@alamb
Contributor

alamb commented Dec 10, 2021

My performance tests showed no significant performance difference

  1. tustvold/simplify-record-reader @ 290b24f
  2. apache/master @ e0abda2

Test command

cargo bench -p parquet --bench arrow_array_reader --features=test_common -- --save-baseline <name>

Result:

alamb@instance-1:/data/arrow-rs$ critcmp master1 simplify-record-reader1
group                                                                                  master1                                simplify-record-reader1
-----                                                                                  -------                                -----------------------
arrow_array_reader/read Int32Array, dictionary encoded, mandatory, no NULLs - new      1.00    109.2±0.33µs        ? ?/sec    1.00    109.0±0.31µs        ? ?/sec
arrow_array_reader/read Int32Array, dictionary encoded, mandatory, no NULLs - old      1.00     37.5±0.15µs        ? ?/sec    1.00     37.6±0.19µs        ? ?/sec
arrow_array_reader/read Int32Array, dictionary encoded, optional, half NULLs - new     1.02    279.6±0.59µs        ? ?/sec    1.00    275.0±1.73µs        ? ?/sec
arrow_array_reader/read Int32Array, dictionary encoded, optional, half NULLs - old     1.00    258.4±0.74µs        ? ?/sec    1.12    290.2±1.68µs        ? ?/sec
arrow_array_reader/read Int32Array, dictionary encoded, optional, no NULLs - new       1.01    132.6±0.39µs        ? ?/sec    1.00    130.9±0.66µs        ? ?/sec
arrow_array_reader/read Int32Array, dictionary encoded, optional, no NULLs - old       1.00    126.4±0.74µs        ? ?/sec    1.02    128.9±0.57µs        ? ?/sec
arrow_array_reader/read Int32Array, plain encoded, mandatory, no NULLs - new           1.00      3.6±0.18µs        ? ?/sec    1.03      3.7±0.23µs        ? ?/sec
arrow_array_reader/read Int32Array, plain encoded, mandatory, no NULLs - old           1.00      5.8±0.40µs        ? ?/sec    1.01      5.8±0.41µs        ? ?/sec
arrow_array_reader/read Int32Array, plain encoded, optional, half NULLs - new          1.03    225.6±0.93µs        ? ?/sec    1.00    219.8±1.14µs        ? ?/sec
arrow_array_reader/read Int32Array, plain encoded, optional, half NULLs - old          1.00    242.1±0.84µs        ? ?/sec    1.13    272.6±1.12µs        ? ?/sec
arrow_array_reader/read Int32Array, plain encoded, optional, no NULLs - new            1.05     26.8±0.32µs        ? ?/sec    1.00     25.4±0.32µs        ? ?/sec
arrow_array_reader/read Int32Array, plain encoded, optional, no NULLs - old            1.00     96.0±0.86µs        ? ?/sec    1.02     98.2±1.41µs        ? ?/sec
arrow_array_reader/read StringArray, dictionary encoded, mandatory, no NULLs - new     1.00    155.6±1.06µs        ? ?/sec    1.01    157.1±1.18µs        ? ?/sec
arrow_array_reader/read StringArray, dictionary encoded, mandatory, no NULLs - old     1.00   1201.5±3.71µs        ? ?/sec    1.00   1197.0±4.59µs        ? ?/sec
arrow_array_reader/read StringArray, dictionary encoded, optional, half NULLs - new    1.00    358.7±1.41µs        ? ?/sec    1.00    358.4±2.82µs        ? ?/sec
arrow_array_reader/read StringArray, dictionary encoded, optional, half NULLs - old    1.01   1086.9±3.57µs        ? ?/sec    1.00   1080.2±3.87µs        ? ?/sec
arrow_array_reader/read StringArray, dictionary encoded, optional, no NULLs - new      1.01    181.7±1.17µs        ? ?/sec    1.00    179.4±1.03µs        ? ?/sec
arrow_array_reader/read StringArray, dictionary encoded, optional, no NULLs - old      1.01   1273.7±7.95µs        ? ?/sec    1.00   1265.2±8.85µs        ? ?/sec
arrow_array_reader/read StringArray, plain encoded, mandatory, no NULLs - new          1.00    176.6±0.96µs        ? ?/sec    1.01    177.6±1.47µs        ? ?/sec
arrow_array_reader/read StringArray, plain encoded, mandatory, no NULLs - old          1.00   1377.9±7.25µs        ? ?/sec    1.02   1399.2±5.47µs        ? ?/sec
arrow_array_reader/read StringArray, plain encoded, optional, half NULLs - new         1.00    380.6±1.63µs        ? ?/sec    1.00    380.0±2.58µs        ? ?/sec
arrow_array_reader/read StringArray, plain encoded, optional, half NULLs - old         1.00   1179.4±4.56µs        ? ?/sec    1.00   1180.2±4.77µs        ? ?/sec
arrow_array_reader/read StringArray, plain encoded, optional, no NULLs - new           1.00    206.2±1.45µs        ? ?/sec    1.00    205.2±1.94µs        ? ?/sec
arrow_array_reader/read StringArray, plain encoded, optional, no NULLs - old           1.00  1452.5±17.75µs        ? ?/sec    1.00   1445.8±5.62µs        ? ?/sec

@alamb
Contributor

alamb commented Dec 10, 2021

It looks like this PR needs some clippy appeasement: https://github.com/apache/arrow-rs/runs/4485244206?check_suite_focus=true

But otherwise looks good from my perspective

Contributor

@alamb alamb left a comment

This looks like a nice simplification @tustvold 👍 I didn't see any discernable performance difference.

cc @nevi-me @andygrove @sunchao

@alamb alamb changed the title from "Simplify record reader" to "Simplify parquet arrow RecordReader" Dec 10, 2021
let (record_count, value_count) =
self.count_records(num_records - records_read);

self.num_records += record_count;
Member

nit: maybe we can update this only once before returning from the method?

Contributor Author

I think this would leave RecordReader in a strange state if read_one_batch returned an error, as self.num_values would have been updated and not self.num? I can't pull self.num_values out to match as it is used by count_records.

let mut end_of_last_record = self.num_values;

for current in self.num_values..self.values_written {
if buf[current] == 0 && current != end_of_last_record {
Member

Hmm, what if you haven't finished the current repeated list, and it continues to the next batch? seems we'll return here and count as if the repeated list has been read completely (since we'll increment the records_read here)?

Contributor Author

@tustvold tustvold Dec 13, 2021

what if you haven't finished the current repeated list

I'm not sure I follow: buf[current] == 0 implies we've reached the end of the list. Perhaps it would be clearer if the second condition were current != self.num_values, since it's only false on the first iteration? 🤔

Contributor Author

Updated
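
The counting logic discussed in this thread can be sketched in isolation. This is a simplified, standalone illustration and not the code from the PR: the free function `count_records`, its `start` parameter (standing in for `self.num_values`), and the plain `&[i16]` slice are assumptions made for the example. A repetition level of 0 marks the start of a new record, so a record is only known to be complete once the next 0 is seen.

```rust
/// Counts complete records in `rep_levels[start..]` (hypothetical helper,
/// loosely modelled on the loop quoted in the review).
///
/// Returns `(records, values)`, where `values` is the number of levels that
/// belong to the counted records. A trailing record is not counted, because
/// without a following 0 there is no way to tell that it is finished.
fn count_records(rep_levels: &[i16], start: usize, records_to_read: usize) -> (usize, usize) {
    if records_to_read == 0 {
        return (0, 0);
    }

    let mut records_read = 0;
    let mut end_of_last_record = start;

    for current in start..rep_levels.len() {
        // A 0 at `start` opens the first record; every later 0 closes the
        // previous record and opens the next one.
        if rep_levels[current] == 0 && current != start {
            records_read += 1;
            end_of_last_record = current;

            if records_read == records_to_read {
                break;
            }
        }
    }

    (records_read, end_of_last_record - start)
}
```

For the levels `[0, 1, 1, 0, 1, 0, 0, 1]` starting at 0, this reports three complete records spanning six values; the last two levels stay buffered until more data (or the end of the column chunk) shows that the record is finished.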

Member

Ah sorry, my bad. yea this looks OK. I think the downside is we could potentially read a batch of repLevels multiple times if, say, the repLevels are all non-zero values.

Member

It's also strange that we initialize the repLevels to be the min batch size but keep growing it as we read more batches, until it hits the total number of levels for the entire column chunk.

Contributor Author

@tustvold tustvold Dec 13, 2021

Users of RecordReader call read_records and then call consume_rep_levels and friends to split data out. The result being it should only buffer a little bit more than the batch_size passed to read_records.

I agree this API is not particularly intuitive; I created #1032 in part because I felt these APIs were clearly not designed for external consumption. I believe the funkiness arises because ArrayReader wants to be able to stitch together multiple column chunks from different row groups (i.e. PageReader) into the same RecordBatch.

Member

Thanks for the context. Yea I think consume_rep_levels and the friends are for assembling complex records like array, list and map. It'd be nice if we can simplify the APIs.
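
A rough sketch of why the buffer stays close to the batch size when callers consume after each read. The `LevelBuffer` type and its `consume` method are hypothetical stand-ins and do not reproduce the real `consume_rep_levels` signature; the idea shown is only that the levels of completed records are handed to the caller while the partial trailing record stays buffered.

```rust
/// Hypothetical buffer of repetition levels (not the crate's actual type).
struct LevelBuffer {
    /// All buffered levels, complete records first.
    levels: Vec<i16>,
    /// Number of leading levels that belong to complete records.
    num_values: usize,
}

impl LevelBuffer {
    /// Returns the levels of all complete records and keeps only the
    /// remainder (the start of a record that is not yet known to be complete).
    fn consume(&mut self) -> Vec<i16> {
        let remainder = self.levels.split_off(self.num_values);
        let consumed = std::mem::replace(&mut self.levels, remainder);
        self.num_values = 0;
        consumed
    }
}
```

After each `consume`, only the unfinished tail remains, so under this model the buffer never holds much more than one batch plus one partial record.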

@tustvold
Contributor Author

Further context for this PR can be found in #1041 as it was what motivated me to juggle the logic a bit, so that I could traitify it

}

if (records_read >= num_records) || end_of_column {
if end_of_column {
// Since page reader contains complete records, if we reached end of a
Member

@sunchao sunchao Dec 13, 2021

I'm not sure if this is true though. Take parquet-mr as example, this is true for the latest version but in versions before 1.11.0, it seems there is no such guarantee: https://github.com/apache/parquet-mr/blob/apache-parquet-1.10.1/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterV1.java#L106, and a repeated list could span multiple pages.

Contributor Author

See comment below, page reader is a column chunk. So this is effectively saying that a record can't be split across row groups, which I think is guaranteed?
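
To illustrate the end-of-column handling under discussion, here is a minimal sketch of a read loop. Everything in it is an assumption made for the example: `RecordCounter` is a hypothetical, stripped-down stand-in for `RecordReader`, and the `source` iterator of level batches stands in for one column chunk. When the source is exhausted, the buffered tail is treated as one final record, which relies on a record not being split across row groups.

```rust
/// Hypothetical, simplified reader state (not the real `RecordReader`).
struct RecordCounter {
    rep_levels: Vec<i16>, // buffered repetition levels
    num_values: usize,    // levels assigned to complete records
    num_records: usize,   // complete records seen so far
}

impl RecordCounter {
    fn new() -> Self {
        Self { rep_levels: Vec::new(), num_values: 0, num_records: 0 }
    }

    /// Counts up to `limit` complete records after `num_values`.
    fn count_records(&self, limit: usize) -> (usize, usize) {
        let (mut records, mut end) = (0, self.num_values);
        // Start one past `num_values`: the level there opens the first record.
        for i in (self.num_values + 1)..self.rep_levels.len() {
            if records == limit {
                break;
            }
            if self.rep_levels[i] == 0 {
                records += 1;
                end = i;
            }
        }
        (records, end - self.num_values)
    }

    /// Reads up to `num_records` records, pulling level batches from `source`.
    fn read_records<I: Iterator<Item = Vec<i16>>>(&mut self, num_records: usize, source: &mut I) -> usize {
        let mut records_read = 0;
        loop {
            let (record_count, value_count) = self.count_records(num_records - records_read);
            // State is updated on every iteration so it stays consistent
            // even if a later step were to fail.
            self.num_records += record_count;
            self.num_values += value_count;
            records_read += record_count;
            if records_read >= num_records {
                break;
            }
            match source.next() {
                Some(batch) => self.rep_levels.extend(batch),
                None => {
                    // End of the column chunk: any buffered tail is one
                    // final, complete record.
                    if self.num_values < self.rep_levels.len() {
                        self.num_records += 1;
                        self.num_values = self.rep_levels.len();
                        records_read += 1;
                    }
                    break;
                }
            }
        }
        records_read
    }
}
```

Note that in this sketch a list may still span several batches from `source`; only the end of the whole source is taken as proof that the last record is complete.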

}

if (records_read >= num_records) || end_of_column {
if end_of_column {
Member

I'm wondering if this should be called end_of_page since read_records consumes at most a page? a new page is set in ArrayReader.next_batch.

Contributor Author

Ehehe, PageReader is actually a column chunk... So the end of a PageReader is the end of a row group, not the end of a page. Confusingly PageIterator is an iterator of PageReader which are themselves iterators of Page 😆
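
The nesting described here can be pictured with a small type sketch. The types below (`Page`, `PageReaderSketch`, `PageIteratorSketch`) and the helper functions are hypothetical simplifications for illustration; the real traits in the parquet crate have different signatures and return `Result` values.

```rust
/// A single data page, reduced to its value count for this sketch.
struct Page {
    num_values: usize,
}

/// Stand-in for `PageReader`: yields the pages of ONE column chunk,
/// i.e. one column within one row group.
struct PageReaderSketch {
    pages: std::vec::IntoIter<Page>,
}

impl Iterator for PageReaderSketch {
    type Item = Page;
    fn next(&mut self) -> Option<Page> {
        self.pages.next()
    }
}

/// Stand-in for `PageIterator`: yields one page reader per row group.
struct PageIteratorSketch {
    chunks: std::vec::IntoIter<PageReaderSketch>,
}

impl Iterator for PageIteratorSketch {
    type Item = PageReaderSketch;
    fn next(&mut self) -> Option<PageReaderSketch> {
        self.chunks.next()
    }
}

/// Builds a column from per-row-group lists of page sizes.
fn column_from(row_groups: Vec<Vec<usize>>) -> PageIteratorSketch {
    let chunks: Vec<PageReaderSketch> = row_groups
        .into_iter()
        .map(|sizes| {
            let pages: Vec<Page> = sizes.into_iter().map(|num_values| Page { num_values }).collect();
            PageReaderSketch { pages: pages.into_iter() }
        })
        .collect();
    PageIteratorSketch { chunks: chunks.into_iter() }
}

/// Sums the values in each row group; exhausting the inner iterator
/// corresponds to the end of a row group, not the end of a page.
fn values_per_row_group(column: PageIteratorSketch) -> Vec<usize> {
    column
        .map(|chunk| chunk.map(|page| page.num_values).sum::<usize>())
        .collect()
}
```

With two row groups holding pages of sizes `[3, 2]` and `[4]`, the outer iterator yields two readers and the per-row-group totals are 5 and 4.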

Member

Ah got it, thanks 🤦 . It all makes sense now!

Member

@sunchao sunchao left a comment

LGTM

@sunchao sunchao merged commit 07660c6 into apache:master Dec 13, 2021
@sunchao
Member

sunchao commented Dec 13, 2021

Merged, thanks!

Successfully merging this pull request may close these issues.

Record Reader Incomplete State