TL;DR: the primary pain point here is huge row groups (in terms of total uncompressed byte size) - writing the PageIndex or reducing row group sizes, perhaps both, would help a lot.
Basically, the defaults in pyarrow (and most parquet implementations) for row group sizes (1 million rows per row group) are predicated on assumptions about what a typical parquet file looks like: lots of numerics, booleans, and relatively short strings amenable to dictionary, RLE, and delta encoding. Wide text datasets are very much not typical, and the default row group size gets you ~2GB per row group (and nearly 4GB uncompressed, just for the text column).
The simplest thing to do would be to default to 100k for the row_group_size parameter - more or less the inflection point of this benchmark by DuckDB (size overhead is about 0.5%).
Setting write_page_index to true should help a great deal (arguably much more than smaller row groups), as readers can use it to refine reads down to individual data pages (it's not unusual for point lookups to touch only 0.1% of a file).