Closes #2937: Fix string read bug for large Parquet files #2938

Merged
merged 1 commit into from
Jan 31, 2024

bmcdonald3
Contributor

When the number of rows in a Parquet string column exceeds the batch size, reads can sometimes return incorrect results. The readBatch function uses the definition level differently for strings than it does for other types, which causes false positives when checking whether a string was actually read. To resolve this, I am reverting the batch size reading for strings. This will reduce performance, but since we are planning to rework string reads anyway, it isn't that big of a loss.

Closes #2937
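
For anyone unfamiliar with definition levels, here is a toy sketch of one general way a batched read of a nullable string column can go wrong. It is not the Arkouda or Arrow code, and it is not necessarily the exact mechanism behind #2937. `ToyStringColumn`, `read_batched_naive`, and `read_one_at_a_time` are hypothetical names made up for illustration. The assumption is that a batch returns one definition level per row but only the non-null values, so a caller that equates "levels read" with "strings read" gets false positives.

```python
from typing import List, Optional, Tuple


class ToyStringColumn:
    """Toy stand-in for a string column reader (hypothetical, not the Arrow API)."""

    def __init__(self, rows: List[Optional[str]]):
        self.rows = rows
        self.pos = 0

    def read_batch(self, batch_size: int) -> Tuple[int, List[int], List[str]]:
        """Return (levels_read, def_levels, values) for up to batch_size rows.

        def_levels has one entry per row (1 = present, 0 = null), while
        values holds only the non-null strings, so the lengths can differ.
        """
        chunk = self.rows[self.pos:self.pos + batch_size]
        self.pos += len(chunk)
        def_levels = [0 if r is None else 1 for r in chunk]
        values = [r for r in chunk if r is not None]
        return len(chunk), def_levels, values


def read_batched_naive(rows: List[Optional[str]], batch_size: int) -> List[str]:
    """Hypothetical buggy caller: assumes every level read is a string read."""
    col = ToyStringColumn(rows)
    out: List[str] = []
    while True:
        levels_read, _def_levels, values = col.read_batch(batch_size)
        if levels_read == 0:
            break
        # False positive: the i-th level is paired with the i-th value,
        # even though nulls contribute a level but no value.
        for i in range(levels_read):
            out.append(values[i] if i < len(values) else "")
    return out


def read_one_at_a_time(rows: List[Optional[str]]) -> List[str]:
    """Slower but unambiguous: one row per call, checked against its level."""
    col = ToyStringColumn(rows)
    out: List[str] = []
    while True:
        levels_read, def_levels, values = col.read_batch(1)
        if levels_read == 0:
            break
        out.append(values[0] if def_levels[0] == 1 else "")
    return out
```

In this toy model the two readers agree when there are no nulls and diverge as soon as a null precedes a value inside a single batch. Reading one row per call avoids the ambiguity at the cost of more calls, which is the same kind of tradeoff this PR accepts.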

@stress-tess stress-tess added this pull request to the merge queue Jan 31, 2024
Merged via the queue into Bears-R-Us:master with commit d682e55 Jan 31, 2024
12 checks passed
Successfully merging this pull request may close these issues.

String read parquet columns sometimes gives incorrect results when reading large files