Is your feature request related to a problem? Please describe.
We should warn on parquet files that contain row groups larger than recommended, and include actionable links and information for our customers.
To add some detail to this issue: NVT already raises a warning like the following.
/nvtabular/nvtabular/io/parquet.py:75: UserWarning: Row group size 4017728134 is bigger than requested part_size 1000000000
f"Row group size {rg_byte_size_0} is bigger than requested part_size
However, it gives no recommendation on how to choose an appropriate row group size for the parquet files relative to the configured NVT dataset part size (e.g. nvt.Dataset(TRAIN_DIR, engine="parquet", part_size="1000MB")). It would be nice to include code examples showing how to rewrite a set of parquet files with the desired row group size using PyArrow, Pandas, or cuDF (e.g. with Pandas: df.to_parquet("filename.parquet", row_group_size=500, engine="pyarrow")); see the sketch below.
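Something like the following minimal sketch could be linked from the warning. The 10_000-row target and the file names are illustrative placeholders, not recommendations, and it assumes PyArrow is installed:

```python
# Hypothetical example: rewrite an existing parquet file with smaller row groups
# so that each group fits within the configured NVT part_size.
import pandas as pd
import pyarrow.parquet as pq

# Option 1: round-trip through Pandas (extra kwargs are forwarded to PyArrow).
df = pd.read_parquet("filename.parquet")
df.to_parquet("filename_small_rg.parquet", engine="pyarrow", row_group_size=10_000)

# Option 2: stay in PyArrow and skip the Pandas conversion.
table = pq.read_table("filename.parquet")
pq.write_table(table, "filename_small_rg.parquet", row_group_size=10_000)
```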
It would also be useful for the documentation to recommend how to set the NVT dataset part_size based on how much GPU memory is reserved for the Dask cluster (LocalCUDACluster(device_memory_limit=?)) and for RMM (rmm.reinitialize(pool_allocator=True, initial_pool_size=?)); a sketch of how these pieces fit together follows.
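A hedged sketch of how these settings might be wired together. The 24 GB device memory limit, the 8 GiB RMM pool, the single-worker cluster, and the idea that part_size should be a small fraction of the per-worker device budget are illustrative assumptions, not official NVTabular guidance:

```python
# Hypothetical configuration sketch: all memory figures and paths are placeholders.
import rmm
import nvtabular as nvt
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

TRAIN_DIR = "/path/to/train_parquet/"   # placeholder dataset location

# Give each Dask worker a device-memory budget below the physical GPU size
# so spilling starts before an out-of-memory error.
cluster = LocalCUDACluster(n_workers=1, device_memory_limit="24GB")
client = Client(cluster)

# Pre-allocate an RMM pool on each worker to reduce allocation overhead.
client.run(rmm.reinitialize, pool_allocator=True, initial_pool_size=8 * 1024**3)

# Pick part_size so several partitions fit inside device_memory_limit;
# row groups in the parquet files must not exceed this value, or the warning fires.
dataset = nvt.Dataset(TRAIN_DIR, engine="parquet", part_size="1000MB")
```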