
[DOC] Published benchmarks showing performance comparison v/s Pandas #12295

Closed
shwina opened this issue Dec 2, 2022 · 3 comments · Fixed by #12595
Labels: doc (Documentation), Python (Affects Python cuDF API)

Comments
shwina (Contributor) commented Dec 2, 2022

As part of our documentation, it would be helpful to publish plots showing speedup versus Pandas for commonly used APIs.

Note that this is a distinct ask from our "micro" benchmarks, which are more for developer use and identifying performance regressions.

As a starting point, we could include the following:

  • Reading and writing CSV and Parquet files
  • Groupby-aggregation and grouped window operations
  • Joins
  • String operations
  • User-defined functions (numeric, string, and grouped)

Ideally, the code for generating benchmarks/plots would be available as a notebook for ease of reproducibility.
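
As a rough illustration of the kind of comparison such a notebook might contain, here is a minimal sketch that times one representative operation in pandas and cuDF. The data size, column names, and helper function are illustrative assumptions, not taken from any eventual notebook:

```python
import time

import numpy as np
import pandas as pd
import cudf

n = 10_000_000  # illustrative size; large enough for the GPU to matter
pdf = pd.DataFrame({
    "key": np.random.randint(0, 1_000, n),
    "val": np.random.rand(n),
})
gdf = cudf.from_pandas(pdf)

def elapsed(fn):
    # Crude wall-clock timing; a real notebook would also repeat runs
    # to reduce noise.
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

# Warm-up run so one-time initialization (CUDA context, JIT) is not
# counted against cuDF.
gdf.groupby("key").val.mean()

cpu = elapsed(lambda: pdf.groupby("key").val.mean())
gpu = elapsed(lambda: gdf.groupby("key").val.mean())
print(f"groupby-mean speedup: {cpu / gpu:.1f}x")
```

The same pattern extends naturally to the other items in the list above (I/O, joins, string ops, UDFs), with the timings collected into a small table and plotted.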

shwina added the doc (Documentation) and Needs Triage (Need team to review and classify) labels on Dec 2, 2022
shwina added the Python (Affects Python cuDF API) label and removed the Needs Triage (Need team to review and classify) label on Dec 2, 2022
vyasr (Contributor) commented Jan 30, 2023

Note that this is a distinct ask from our "micro" benchmarks, which are more for developer use and identifying performance regressions.

I am curious how you envision drawing the line between the two here. The microbenchmarks are designed to benchmark individual APIs. I would not consider a larger-scale, workflow-level benchmark a good fit for that suite (consider a script that loads a pair of Parquet files, does some preprocessing of each, then joins them, then... etc.). However, any benchmark of a single API would fit well with the benchmarking suite (e.g. df.groupby(...).apply(...)), and it sounds like that is the sort of benchmark intended for publication as well. If we anticipate significant overlap, I think we would be better off adding those benchmarks to the existing suite and then finding a way to publish the results.
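
For reference, a single-API benchmark of that shape might look roughly like the following under pytest-benchmark; the fixture name, data shape, and test naming here are illustrative, not taken from the actual cuDF benchmarking suite:

```python
import numpy as np
import pytest

import cudf

@pytest.fixture
def grouped_df():
    # Illustrative data; the real suite has its own dataset fixtures.
    n = 1_000_000
    rng = np.random.default_rng(0)
    return cudf.DataFrame({
        "key": rng.integers(0, 100, n),
        "val": rng.random(n),
    })

def test_groupby_apply(benchmark, grouped_df):
    # pytest-benchmark's `benchmark` fixture calls the function repeatedly
    # and records timing statistics for it.
    benchmark(lambda: grouped_df.groupby("key").apply(lambda g: g["val"].sum()))
```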

shwina (Contributor, Author) commented Jan 30, 2023

The aim of this notebook is to produce plots showing speedups over Pandas for a small number of "representative" APIs, as an initial "pitch" to the Pandas user. Ideally, we'd publish those plots as part of https://rapids.ai and/or the cuDF docs.

The goal of a benchmarking suite should be to cover as many APIs as possible, run with as many arguments and input data types as possible, so as to enable finding and fixing performance regressions.

I agree that many of the benchmarks we include in the notebook can also be added to the benchmarking suite, but my hope is that this notebook remains largely static while the benchmarking suite continues to grow, so that the overlap becomes less and less significant.

I also agree that we can find ways to publish the results of the existing suite, but for "home page" benchmark plots that we want users to be able to download the code for, reproduce, and make small modifications to on their own machines, a Jupyter notebook is the best fit.

cc: @beckernick curious if you have any thoughts here as well

vyasr (Contributor) commented Jan 30, 2023

If the goal is to allow users to download the notebook and play with it themselves, then I'm on board with that. If the benchmarks end up being a strict subset of the microbenchmarks, though, we should consider using something like Jupyter's %load magic to pull the benchmarks directly from the benchmark files (although we may need more fine-grained control than that, in which case something like nbconvert might be appropriate). I think what's worth doing mostly hinges on how accurate this statement ends up being:

my hope is that this notebook remains largely static

My main concern is ensuring that we don't end up maintaining two sets of benchmarks.
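
A rough sketch of the %load idea, assuming a purely illustrative path rather than a confirmed location in the repository:

```python
# In a notebook cell; the path below is an illustrative placeholder.
%load python/cudf/benchmarks/API/bench_dataframe.py
# Executing the cell once makes IPython replace the %load line with the
# file's contents, which can then be run (or lightly edited) in place.
```

Keeping the notebook sourced from the benchmark files this way would mean there is only one copy of each benchmark to maintain.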

rapids-bot (bot) pushed a commit that referenced this issue on Mar 8, 2023
Resolves: #12295 

This PR introduces a notebook of benchmarks that users can download and run themselves. The notebook also generates the graphs that will appear in the cuDF Python docs.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Ray Douglass (https://github.com/raydouglass)

URL: #12595