
[FEA] Warn on parquet row group sizes out of recommended bounds #424

Closed
benfred opened this issue Nov 10, 2020 · 2 comments · Fixed by #455

benfred (Member) commented Nov 10, 2020

Is your feature request related to a problem? Please describe.
We should warn on parquet files that contain row groups bigger than recommended, with actionable links and information for our customers.

gabrielspmoreira (Member) commented

To give more detail on this issue, NVT does already raise a warning like the following:

/nvtabular/nvtabular/io/parquet.py:75: UserWarning: Row group size 4017728134 is bigger than requested part_size 1000000000

But it gives no recommendation on how to properly define the row group size of the parquet files with respect to the configured NVT dataset part size (e.g. nvt.Dataset(TRAIN_DIR, engine="parquet", part_size="1000MB")). It would be nice to include code examples showing how to rewrite a set of parquet files with the desired row group size using PyArrow, Pandas, or cuDF (e.g. using Pandas: df.to_parquet("filename.parquet", row_group_size=500, engine="pyarrow")); see the sketch below.
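
As a rough illustration (the file names and the row_group_size values below are made up, not NVTabular recommendations), a parquet file with oversized row groups could be rewritten like this:

```python
# Minimal sketch (not NVTabular code): rewrite a parquet file with a smaller
# row group size. File names and row_group_size are illustrative; pick a
# value so that each row group stays well below the configured part_size.
import pandas as pd
import pyarrow.parquet as pq

# pandas: row_group_size (max rows per row group) is forwarded to pyarrow.
df = pd.read_parquet("large_row_groups.parquet")
df.to_parquet("small_row_groups.parquet", engine="pyarrow", row_group_size=500_000)

# pyarrow directly: same effect without converting to a pandas DataFrame.
table = pq.read_table("large_row_groups.parquet")
pq.write_table(table, "small_row_groups_arrow.parquet", row_group_size=500_000)
```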

It would also be useful for the documentation to recommend how to set the NVT dataset part_size based on how much GPU memory is reserved for the Dask cluster (LocalCUDACluster(device_memory_limit=?)) and for RMM (rmm.reinitialize(pool_allocator=True, initial_pool_size=?)); a sketch follows.
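
For example, a minimal single-GPU setup along those lines might look like the following. All sizes are illustrative, not tuned recommendations, and TRAIN_DIR is assumed to point at the parquet dataset; the idea is simply to keep part_size a small fraction of the GPU memory reserved for Dask/RMM so several parts fit on the device at once.

```python
# Sketch: choosing part_size relative to the GPU memory given to Dask and RMM.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import nvtabular as nvt

cluster = LocalCUDACluster(
    device_memory_limit="12GB",  # illustrative: device memory Dask may use per GPU
    rmm_pool_size="12GB",        # illustrative: RMM pool pre-allocated per GPU
)
# Single-process equivalent of the RMM pool setup mentioned above:
# rmm.reinitialize(pool_allocator=True, initial_pool_size=12_000_000_000)
client = Client(cluster)

# part_size kept well below device_memory_limit (roughly 1/10th here).
dataset = nvt.Dataset(TRAIN_DIR, engine="parquet", part_size="1000MB")
```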

EvenOldridge (Member) commented

On the documentation side, absolutely. We should also throw a better warning along the lines of:

'To achieve optimal performance the row_group_size of the parquet files should be in the x to y range. For more information about data prep visit: .'

And then we need a page on how to effectively prepare data for NVTabular. @gabrielspmoreira and @bschifferer, can you please work on the docs page?
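
A hedged sketch of how such a warning could be emitted (the recommended byte range, the helper name, and the docs URL placeholder are all assumptions, not the actual fix that landed in #455):

```python
# Sketch only: warn when a parquet row group exceeds part_size or falls
# outside a recommended byte range. Bounds and URL are placeholders.
import warnings

RECOMMENDED_ROW_GROUP_BYTES = (32_000_000, 128_000_000)  # hypothetical x..y bounds


def warn_on_row_group_size(rg_byte_size_0, part_size):
    low, high = RECOMMENDED_ROW_GROUP_BYTES
    if rg_byte_size_0 > part_size:
        warnings.warn(
            f"Row group size {rg_byte_size_0} is bigger than requested "
            f"part_size {part_size}."
        )
    if not low <= rg_byte_size_0 <= high:
        warnings.warn(
            f"To achieve optimal performance the row_group_size of the parquet "
            f"files should be in the {low} to {high} byte range. "
            "For more information about data prep visit: <docs URL>."
        )
```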
