
[FEA] Improve scaling of data generation in NDS-H-cpp benchmarks #16987

Closed
GregoryKimball opened this issue Oct 3, 2024 · 2 comments · Fixed by #17039
Assignees
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.
Milestone

Comments


GregoryKimball commented Oct 3, 2024

Is your feature request related to a problem? Please describe.
In the NDS-H-cpp benchmarks, data generation has a larger memory footprint than query execution. This limits us to roughly SF10 on H100 GPUs, perhaps as much as 10x smaller than the scale factor we could reach with pre-generated files.

Describe the solution you'd like
There are a few solutions we could use:

  • stage parquet files in host buffers instead of device buffers (in write_to_parquet_device_buffer)
  • add some way to use pre-generated files with the benchmarks (via nvbench axis? env var?)
  • use managed memory for data generation so we don't OOM
  • only generate the columns needed by the query, because most queries only use a smaller subset of columns
  • something else?
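As a hedged sketch of the pre-generated-files option, the benchmark could consult an environment variable before falling back to on-the-fly generation. Note that `NDS_H_DATA_DIR` and `pregenerated_data_dir` are hypothetical names invented for illustration, not existing benchmark flags:

```cpp
#include <cstdlib>
#include <optional>
#include <string>

// Hypothetical: if NDS_H_DATA_DIR is set, return the directory holding
// pre-generated parquet files; otherwise signal that the benchmark should
// generate data on the fly.
std::optional<std::string> pregenerated_data_dir() {
  if (const char* dir = std::getenv("NDS_H_DATA_DIR")) {
    return std::string{dir};
  }
  return std::nullopt;
}
```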

Additional context
On A100, we can run query sizes up to SF100 or so, but the generator only goes to ~SF10.

@GregoryKimball GregoryKimball added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. labels Oct 3, 2024
@GregoryKimball GregoryKimball added this to the Benchmarking milestone Oct 3, 2024

karthikeyann commented Oct 3, 2024

  • stage parquet files in host buffers instead of device buffers (in write_to_parquet_device_buffer)
The parquet buffers are created on the host and then explicitly copied to the device; this could be modified.
  • add some way to use pre-generated files with the benchmarks (via nvbench axis? env var?)
Originally the data generator lived in examples, so the idea was to generate files once and reuse them. We later moved the code into benchmarks and switched to generating data on the fly.
  • use managed memory for data generation so we don't OOM
✔ By specifying --rmm_mode managed when running the benchmark, the entire benchmark (both data generation and the query) runs with managed memory. We could consider using managed memory for data generation only; that would require adding options to these query benchmarks.
  • only generate the columns needed by the query, because most queries only use a smaller subset of columns
✅ This is already supported at the table level by passing a list of table names to generate_parquet_data_sources(scale_factor, table_names, dest_sources). Individual columns within a table are not selectable today; since generating one column often depends on other columns, we would end up creating extra columns anyway.
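For context on the table-level selection, here is a minimal sketch of how a caller might build the table_names list for generate_parquet_data_sources. The query-to-table mapping follows the standard TPC-H query definitions, and tables_for_query is a hypothetical helper, not an existing benchmark function:

```cpp
#include <string>
#include <vector>

// Hypothetical helper mapping an NDS-H query number to the tables it scans,
// suitable for building the table_names argument of
// generate_parquet_data_sources(scale_factor, table_names, dest_sources).
std::vector<std::string> tables_for_query(int query) {
  switch (query) {
    case 6: return {"lineitem"};  // Q6 scans only lineitem
    case 5:  // Q5 joins six tables
      return {"customer", "orders", "lineitem", "supplier", "nation", "region"};
    default: return {};  // empty list: fall back to generating everything
  }
}
```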

We could create a managed memory resource for data generation, use it, and destroy it after writing the parquet data to host, then use that result for the queries. But keep in mind that the host-to-device transfer would then be included in the benchmarked time as part of the scan (parquet read). We would also update the API to accept cuio_source_sink_pair.
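The create-use-restore pattern described here can be sketched with an RAII guard. This is a simplified stand-in: a plain pointer models the current device memory resource, where real code would use RMM's get_current_device_resource() / set_current_device_resource() with an rmm::mr::managed_memory_resource:

```cpp
// Simplified model of an RMM memory resource (the real types live in rmm).
struct memory_resource { const char* name; };

memory_resource async_mr{"cuda_async"};
memory_resource managed_mr{"managed"};
memory_resource* current_mr = &async_mr;  // stand-in for the global device MR

// RAII guard: installs the managed MR for the data-generation scope, then
// restores the previous MR so the timed queries run on the original resource.
struct scoped_managed_mr {
  memory_resource* saved;
  scoped_managed_mr() : saved{current_mr} { current_mr = &managed_mr; }
  ~scoped_managed_mr() { current_mr = saved; }
};
```

With this pattern, data generation happens inside a `scoped_managed_mr` scope and the guard's destructor restores the query-time resource automatically.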

@karthikeyann karthikeyann self-assigned this Oct 3, 2024
GregoryKimball (Author) commented

Thank you @karthikeyann for your comments.

  • I like the idea of staging parquet data in host buffers because this is a common pattern for Spark-RAPIDS and "in-memory" databases. I understand that the extra HtoD copying time will show up in the runtime.
  • If we use managed memory for the data generation, and then switch back to the rmm_mode MR for the query, that seems like a good pattern.

In the end I would like to be able to run SF100 with CUDA async MR on A100. If the data gen uses managed MR and the timed queries use async MR, that would work great.

rapids-bot bot pushed a commit that referenced this issue Oct 23, 2024
Fixes #16987
Use managed memory to generate the parquet data, and write parquet data to host buffer.
Replace use of parquet_device_buffer with cuio_source_sink_pair

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Tianyu Liu (https://github.com/kingcrimsontianyu)
  - Muhammad Haseeb (https://github.com/mhaseeb123)

URL: #17039