Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allocate enough bytes when writing booleans in parquet writer #658

Merged
merged 2 commits into from
Aug 8, 2021

Conversation

bjchambers
Copy link
Contributor

Which issue does this PR close?

Closes #657.

Rationale for this change

Without this change, writing a batch with more than 2048 boolean values may fail to extend the bit vector enough.

What changes are included in this PR?

  • A test demonstrating the failure
  • A change to allocate enough bytes for the boolean values being written

Are there any user-facing changes?

No

@github-actions github-actions bot added the parquet Changes to the parquet crate label Aug 5, 2021
@codecov-commenter
Copy link

codecov-commenter commented Aug 5, 2021

Codecov Report

Merging #658 (9d0621c) into master (6bf1988) will decrease coverage by 0.07%.
The diff coverage is 93.75%.

❗ Current head 9d0621c differs from pull request most recent head 2652193. Consider uploading reports for the commit 2652193 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master     #658      +/-   ##
==========================================
- Coverage   82.50%   82.43%   -0.08%     
==========================================
  Files         168      168              
  Lines       47249    47265      +16     
==========================================
- Hits        38984    38961      -23     
- Misses       8265     8304      +39     
Impacted Files Coverage Δ
parquet/src/data_type.rs 77.17% <83.33%> (+0.63%) ⬆️
parquet/src/arrow/arrow_writer.rs 98.06% <100.00%> (+0.02%) ⬆️
arrow/src/array/transform/boolean.rs 76.92% <0.00%> (-7.70%) ⬇️
arrow/src/array/transform/utils.rs 95.00% <0.00%> (-5.00%) ⬇️
arrow/src/array/equal_json.rs 85.21% <0.00%> (-3.48%) ⬇️
arrow/src/tensor.rs 85.00% <0.00%> (-2.50%) ⬇️
parquet/src/column/page.rs 97.36% <0.00%> (-1.32%) ⬇️
arrow/src/array/equal/utils.rs 74.00% <0.00%> (-1.00%) ⬇️
parquet/src/record/api.rs 91.60% <0.00%> (-0.88%) ⬇️
parquet/src/file/statistics.rs 93.80% <0.00%> (-0.83%) ⬇️
... and 17 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6bf1988...2652193. Read the comment docs.

@alamb alamb requested a review from sunchao August 5, 2021 12:25
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @bjchambers 👍 . This looks good to me.

I also double checked the test case covers the bug. Without the code changes, it fails like this:

---- arrow::arrow_writer::tests::bool_large_single_column stdout ----
thread 'arrow::arrow_writer::tests::bool_large_single_column' panicked at 'called `Result::unwrap()` on an `Err` value: EOF("unable to put boolean value")', parquet/src/arrow/arrow_writer.rs:1224:39

I would like at least one other person to review this change too (perhaps @sunchao or @nevi-me ) before we merge it

parquet/src/data_type.rs Show resolved Hide resolved
Copy link
Contributor

@nevi-me nevi-me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@nevi-me nevi-me merged commit 75432ed into apache:master Aug 8, 2021
@alamb alamb changed the title allocate enough bytes when writing booleans allocate enough bytes when writing booleans in parquet writer Aug 8, 2021
alamb pushed a commit that referenced this pull request Aug 8, 2021
* allocate enough bytes when writing booleans

* round up to nearest multiple of 256
alamb added a commit that referenced this pull request Aug 9, 2021
* allocate enough bytes when writing booleans

* round up to nearest multiple of 256

Co-authored-by: Ben Chambers <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Writing Parquet with a boolean column fails
4 participants