
Read Parquet byte sizes in batches, rather than individually #3769

Closed
bmcdonald3 opened this issue Sep 11, 2024 · 0 comments · Fixed by #3770
To improve performance, this change switches from reading the byte size of each string in a Parquet file one at a time to reading them in batches, which results in a significant performance improvement.
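
For context, here is a minimal sketch of what the batched approach looks like with the Apache Parquet C++ API. The names (`totalStringBytes`, `kBatchSize`) and the batch size are illustrative assumptions, not the code from the linked PR:

```cpp
// Sketch: compute total string bytes for a column by reading ByteArray
// headers in batches via ReadBatch, rather than one value at a time.
#include <parquet/api/reader.h>

#include <cstdint>
#include <memory>
#include <string>
#include <vector>

int64_t totalStringBytes(const std::string& path, int column) {
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile(path);

  constexpr int64_t kBatchSize = 8192;  // values fetched per ReadBatch call
  std::vector<parquet::ByteArray> values(kBatchSize);
  std::vector<int16_t> def_levels(kBatchSize);

  int64_t total = 0;
  for (int rg = 0; rg < reader->metadata()->num_row_groups(); rg++) {
    std::shared_ptr<parquet::ColumnReader> col =
        reader->RowGroup(rg)->Column(column);
    auto* ba_reader = static_cast<parquet::ByteArrayReader*>(col.get());

    while (ba_reader->HasNext()) {
      int64_t values_read = 0;
      // One call pulls up to kBatchSize string headers (pointer + length)
      // at once, instead of a ReadBatch(1, ...) call per string.
      ba_reader->ReadBatch(kBatchSize, def_levels.data(),
                           /*rep_levels=*/nullptr, values.data(),
                           &values_read);
      for (int64_t i = 0; i < values_read; i++) {
        total += values[i].len;  // only the length is needed here
      }
    }
  }
  return total;
}
```

The slow path is the same loop with a batch size of 1, which pays the per-call overhead once per string; batching amortizes that cost across thousands of values.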

Here are results from a Cray XC with a Lustre filesystem:

Batch byte calculation:

| test | time (s) |
| --- | --- |
| single-file | 4.981 |
| fixed-single | 4.037 |
| scaled-five | 2.135 |
| fixed-scaled-five | 1.744 |
| five | 10.525 |
| fixed-five | 9.068 |
| scaled-ten | 1.094 |
| fixed-scaled-ten | 0.971 |
| ten | 11.532 |
| fixed-ten | 9.747 |

Old (individual byte calculation):

| test | time (s) |
| --- | --- |
| single-file | 7.907 |
| fixed-single | 4.026 |
| scaled-five | 4.021 |
| fixed-scaled-five | 1.754 |
| five | 17.076 |
| fixed-five | 4.997 |
| scaled-ten | 1.782 |
| fixed-scaled-ten | 0.978 |
| ten | 17.802 |
| fixed-ten | 9.499 |

Note that none of the "fixed" rows are affected, since this change only touches the byte-size calculation, which is skipped when the fixed-length optimization is used.
