
[FEA] Warn on parquet row group sizes out of recommended bounds #424

Closed
benfred opened this issue Nov 10, 2020 · 2 comments · Fixed by #455

benfred (Member) commented Nov 10, 2020

Is your feature request related to a problem? Please describe.
We should warn on parquet files that contain row groups bigger than recommended, with actionable links and information for our customers.

gabrielspmoreira (Member) commented

To give more detail on this issue, NVT does already raise a warning like the following:

/nvtabular/nvtabular/io/parquet.py:75: UserWarning: Row group size 4017728134 is bigger than requested part_size 1000000000

But it gives no recommendation on how to properly define the row group size of the parquet files with respect to the configured NVT dataset part size (e.g. nvt.Dataset(TRAIN_DIR, engine="parquet", part_size="1000MB")). It would be nice to include code examples showing how to rewrite a set of parquet files with the desired row group size using PyArrow, Pandas, or cuDF (e.g. using Pandas: df.to_parquet("filename.parquet", row_group_size=500, engine="pyarrow")); see the sketch below.
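
As a rough illustration (the file names and the row_group_size values below are made up, not NVTabular recommendations), a parquet file with oversized row groups could be rewritten like this:

```python
# Minimal sketch (not NVTabular code): rewrite a parquet file with a smaller
# row group size. File names and row_group_size are illustrative; pick a
# value so that each row group stays well below the configured part_size.
import pandas as pd
import pyarrow.parquet as pq

# pandas: row_group_size (max rows per row group) is forwarded to pyarrow.
df = pd.read_parquet("large_row_groups.parquet")
df.to_parquet("small_row_groups.parquet", engine="pyarrow", row_group_size=500_000)

# pyarrow directly: same effect without converting to a pandas DataFrame.
table = pq.read_table("large_row_groups.parquet")
pq.write_table(table, "small_row_groups_arrow.parquet", row_group_size=500_000)
```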

It would also be useful for the documentation to recommend how to set the NVT dataset part_size based on how much GPU memory is reserved for the Dask cluster (LocalCUDACluster(device_memory_limit=?)) and for RMM (rmm.reinitialize(pool_allocator=True, initial_pool_size=?)); a sketch follows.
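
For example, a minimal single-GPU setup along those lines might look like the following. All sizes are illustrative, not tuned recommendations, and TRAIN_DIR is assumed to point at the parquet dataset; the idea is simply to keep part_size a small fraction of the GPU memory reserved for Dask/RMM so several parts fit on the device at once.

```python
# Sketch: choosing part_size relative to the GPU memory given to Dask and RMM.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import nvtabular as nvt

cluster = LocalCUDACluster(
    device_memory_limit="12GB",  # illustrative: device memory Dask may use per GPU
    rmm_pool_size="12GB",        # illustrative: RMM pool pre-allocated per GPU
)
# Single-process equivalent of the RMM pool setup mentioned above:
# rmm.reinitialize(pool_allocator=True, initial_pool_size=12_000_000_000)
client = Client(cluster)

# part_size kept well below device_memory_limit (roughly 1/10th here).
dataset = nvt.Dataset(TRAIN_DIR, engine="parquet", part_size="1000MB")
```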

EvenOldridge (Member) commented

On the documentation side, absolutely. We should also throw a better warning along the lines of:

'To achieve optimal performance the row_group_size of the parquet files should be in the x to y range. For more information about data prep visit: .'

And then we need a page on how to effectively prepare data for NVTabular. @gabrielspmoreira and @bschifferer, can you please work on the docs page?
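
A hedged sketch of how such a warning could be emitted (the recommended byte range, the helper name, and the docs URL placeholder are all assumptions, not the actual fix that landed in #455):

```python
# Sketch only: warn when a parquet row group exceeds part_size or falls
# outside a recommended byte range. Bounds and URL are placeholders.
import warnings

RECOMMENDED_ROW_GROUP_BYTES = (32_000_000, 128_000_000)  # hypothetical x..y bounds


def warn_on_row_group_size(rg_byte_size_0, part_size):
    low, high = RECOMMENDED_ROW_GROUP_BYTES
    if rg_byte_size_0 > part_size:
        warnings.warn(
            f"Row group size {rg_byte_size_0} is bigger than requested "
            f"part_size {part_size}."
        )
    if not low <= rg_byte_size_0 <= high:
        warnings.warn(
            f"To achieve optimal performance the row_group_size of the parquet "
            f"files should be in the {low} to {high} byte range. "
            "For more information about data prep visit: <docs URL>."
        )
```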
