TL;DR: the primary pain point here is huge row groups (in terms of total uncompressed byte size) - writing the PageIndex or reducing row group sizes, perhaps both, would help a lot.
Basically, the defaults in pyarrow (and most parquet implementations) for row group sizes (1 million rows per row group) are predicated on assumptions about what a typical parquet file looks like: lots of numerics, booleans, and relatively short strings amenable to dictionary, RLE, and delta encoding. Wide text datasets are very much not typical, and the default row group size gets you ~2GB per row group (and nearly 4GB uncompressed, just for the text column).
The simplest thing to do would be to default to 100k for the row_group_size parameter - more or less the inflection point of this benchmark by DuckDB (size overhead is about 0.5%).
Setting write_page_index to true should help a great deal (arguably much more than smaller row groups), as readers can use it to refine reads down to individual data pages (it's not unusual for point lookups to touch only 0.1% of a file).