
Read Parquet byte sizes in batches, rather than individually #3769

Closed
bmcdonald3 opened this issue Sep 11, 2024 · 0 comments · Fixed by #3770
To improve performance, this change switches from reading the byte size of each string in a Parquet file one at a time to reading them in batches, which results in a significant performance improvement.
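
For context, here is a minimal sketch of what the batched approach looks like with the Apache Parquet C++ API. The names (`totalStringBytes`, `kBatchSize`) and the batch size are illustrative assumptions, not the code from the linked PR:

```cpp
// Sketch: compute total string bytes for a column by reading ByteArray
// headers in batches via ReadBatch, rather than one value at a time.
#include <parquet/api/reader.h>

#include <cstdint>
#include <memory>
#include <string>
#include <vector>

int64_t totalStringBytes(const std::string& path, int column) {
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile(path);

  constexpr int64_t kBatchSize = 8192;  // values fetched per ReadBatch call
  std::vector<parquet::ByteArray> values(kBatchSize);
  std::vector<int16_t> def_levels(kBatchSize);

  int64_t total = 0;
  for (int rg = 0; rg < reader->metadata()->num_row_groups(); rg++) {
    std::shared_ptr<parquet::ColumnReader> col =
        reader->RowGroup(rg)->Column(column);
    auto* ba_reader = static_cast<parquet::ByteArrayReader*>(col.get());

    while (ba_reader->HasNext()) {
      int64_t values_read = 0;
      // One call pulls up to kBatchSize string headers (pointer + length)
      // at once, instead of a ReadBatch(1, ...) call per string.
      ba_reader->ReadBatch(kBatchSize, def_levels.data(),
                           /*rep_levels=*/nullptr, values.data(),
                           &values_read);
      for (int64_t i = 0; i < values_read; i++) {
        total += values[i].len;  // only the length is needed here
      }
    }
  }
  return total;
}
```

The slow path is the same loop with a batch size of 1, which pays the per-call overhead once per string; batching amortizes that cost across thousands of values.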

Here are results from a Cray XC with a Lustre filesystem:

Batch byte calculation:

| test | time (s) |
| --- | --- |
| single-file | 4.981 |
| fixed-single | 4.037 |
| scaled-five | 2.135 |
| fixed-scaled-five | 1.744 |
| five | 10.525 |
| fixed-five | 9.068 |
| scaled-ten | 1.094 |
| fixed-scaled-ten | 0.971 |
| ten | 11.532 |
| fixed-ten | 9.747 |

Old (individual byte calculation):

| test | time (s) |
| --- | --- |
| single-file | 7.907 |
| fixed-single | 4.026 |
| scaled-five | 4.021 |
| fixed-scaled-five | 1.754 |
| five | 17.076 |
| fixed-five | 4.997 |
| scaled-ten | 1.782 |
| fixed-scaled-ten | 0.978 |
| ten | 17.802 |
| fixed-ten | 9.499 |

Note that none of the "fixed" rows are affected, since this change only touches the byte-size calculation, which is skipped when the fixed-length optimization is used.
