[DOC] Published benchmarks showing performance comparison v/s Pandas #12295
Comments
I am curious how you would envision drawing the line between the two here. The microbenchmarks are designed to benchmark individual APIs. I would not consider a larger-scale, workflow-level benchmark a good fit for that suite (consider a script that loads a pair of parquet files, does some preprocessing of each, then joins them, and so on). However, any benchmark of a single API would fit well in the benchmarking suite (e.g. ...).
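For concreteness, below is a minimal sketch of the kind of single-API benchmark meant here, written in the pytest-benchmark style. The fixture, data shape, and the choice of `groupby.sum` are illustrative assumptions, not code from the actual cuDF benchmark suite:

```python
# Sketch of a single-API microbenchmark (illustrative only).
import cudf
import numpy as np
import pytest


@pytest.fixture
def df():
    # Hypothetical fixture: a DataFrame with one integer key and one float column.
    rng = np.random.default_rng(seed=0)
    n = 1_000_000
    return cudf.DataFrame({"key": rng.integers(0, 100, n), "val": rng.random(n)})


def test_groupby_sum(benchmark, df):
    # Benchmarks exactly one API call; no multi-step workflow.
    benchmark(lambda: df.groupby("key").sum())
```

A workflow-level script (read two parquet files, preprocess, join, ...) would, by contrast, exercise many APIs at once, which makes regressions hard to attribute to any single one.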
The aim of this notebook is to produce plots showing speedups over Pandas for a small number of "representative" APIs, as an initial "pitch" to the Pandas user. Ideally we'd publish those plots as part of https://rapids.ai and/or the cuDF docs. The goal of a benchmarking suite, by contrast, should be to cover as many APIs as possible, run with as many arguments and input data types as possible, so as to enable finding and fixing performance regressions. I agree that many of the benchmarks we include in the notebook could also be added to the benchmarking suite, but my hope is that this notebook remains largely static while the benchmarking suite continues to grow, so that the overlap becomes less and less significant. I also agree that we can find ways to publish the results of the existing suite, but for "home page" benchmark plots whose code users can download, reproduce, and lightly modify on their own machines, a Jupyter notebook is the best fit. cc @beckernick, curious if you have any thoughts here as well.
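As a rough illustration of the kind of plot described above, here is a hedged sketch that times a couple of operations in both pandas and cuDF and draws a speedup bar chart with matplotlib. The `time_it` helper, the chosen operations, and the data size are assumptions for illustration, not the contents of the actual notebook:

```python
# Illustrative only: real speedups depend on data size, dtypes, and hardware,
# and a careful measurement would also warm up the GPU before timing.
import time

import cudf
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def time_it(fn, repeat=3):
    # Best-of-N wall-clock timing (a simplifying assumption).
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


n = 10_000_000
pdf = pd.DataFrame({"key": np.random.randint(0, 100, n), "val": np.random.rand(n)})
gdf = cudf.from_pandas(pdf)

ops = {
    "groupby.sum": (lambda: pdf.groupby("key").sum(), lambda: gdf.groupby("key").sum()),
    "sort_values": (lambda: pdf.sort_values("val"), lambda: gdf.sort_values("val")),
}
speedups = {name: time_it(cpu) / time_it(gpu) for name, (cpu, gpu) in ops.items()}

plt.bar(list(speedups), list(speedups.values()))
plt.ylabel("Speedup over pandas (x)")
plt.title("cuDF vs. pandas (illustrative)")
plt.show()
```

Because the notebook is meant to stay small and largely static, a handful of such cells would be enough; exhaustive coverage remains the job of the benchmarking suite.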
If the goal is to allow users to download the notebook and play with it themselves, then I'm on board with that. If the benchmarks end up being a strict subset of the microbenchmarks, though, we should consider using something like jupyter's ...
My main concern is ensuring that we don't end up maintaining two sets of benchmarks.
Resolves: #12295

This PR introduces a notebook of benchmarks that users can run after downloading it. The notebook also generates the graphs that will appear in the cuDF Python docs.

Authors:
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Ray Douglass (https://github.com/raydouglass)

URL: #12595
As part of our documentation, it would be helpful to publish plots showing speedup versus Pandas for commonly used APIs.
Note that this is a distinct ask from our "micro" benchmarks, which are more for developer use and identifying performance regressions.
As a starting point, we could include the following:
- `csv` and `parquet` files

Ideally, the code for generating the benchmarks/plots would be available as a notebook for ease of reproducibility.
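As a sketch of what the reader benchmarks could look like, assuming `data.csv` and `data.parquet` files already exist locally (the paths and the `best_of` helper are placeholders, not the published notebook code):

```python
# Placeholder sketch: compares pandas and cuDF readers on local files.
import time

import cudf
import pandas as pd


def best_of(fn, repeat=3):
    # Best-of-N wall-clock timing; a real benchmark would also warm up the GPU.
    times = []
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return min(times)


for reader in ("read_csv", "read_parquet"):
    path = "data." + reader.split("_")[1]  # "data.csv" / "data.parquet"
    cpu = best_of(lambda: getattr(pd, reader)(path))
    gpu = best_of(lambda: getattr(cudf, reader)(path))
    print(f"{reader}: {cpu / gpu:.1f}x speedup over pandas")
```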