
[DOC] Published benchmarks showing performance comparison v/s Pandas #12295

Closed
shwina opened this issue Dec 2, 2022 · 3 comments · Fixed by #12595
Labels: doc (Documentation), Python (Affects Python cuDF API)

Comments
shwina (Contributor) commented Dec 2, 2022

As part of our documentation, it would be helpful to publish plots showing speedup versus Pandas for commonly used APIs.

Note that this is a distinct ask from our "micro" benchmarks, which are more for developer use and identifying performance regressions.

As a starting point, we could include the following:

  • Reading and writing CSV and Parquet files
  • Groupby-aggregation and grouped window operations
  • Joins
  • String operations
  • User-defined functions (numeric, string, and grouped)

Ideally, the code for generating benchmarks/plots would be available as a notebook for ease of reproducibility.
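
As a rough illustration of the kind of comparison such a notebook might contain, here is a minimal sketch that times one representative operation in pandas and cuDF. The data size, column names, and helper function are illustrative assumptions, not taken from any eventual notebook:

```python
import time

import numpy as np
import pandas as pd
import cudf

n = 10_000_000  # illustrative size; large enough for the GPU to matter
pdf = pd.DataFrame({
    "key": np.random.randint(0, 1_000, n),
    "val": np.random.rand(n),
})
gdf = cudf.from_pandas(pdf)

def elapsed(fn):
    # Crude wall-clock timing; a real notebook would also repeat runs
    # to reduce noise.
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

# Warm-up run so one-time initialization (CUDA context, JIT) is not
# counted against cuDF.
gdf.groupby("key").val.mean()

cpu = elapsed(lambda: pdf.groupby("key").val.mean())
gpu = elapsed(lambda: gdf.groupby("key").val.mean())
print(f"groupby-mean speedup: {cpu / gpu:.1f}x")
```

The same pattern extends naturally to the other items in the list above (I/O, joins, string ops, UDFs), with the timings collected into a small table and plotted.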

shwina added the doc (Documentation) and Needs Triage (Need team to review and classify) labels on Dec 2, 2022
shwina added the Python (Affects Python cuDF API) label and removed the Needs Triage (Need team to review and classify) label on Dec 2, 2022
vyasr (Contributor) commented Jan 30, 2023

Note that this is a distinct ask from our "micro" benchmarks, which are more for developer use and identifying performance regressions.

I am curious how you envision drawing the line between the two here. The microbenchmarks are designed to benchmark individual APIs. I would not consider a larger-scale, workflow-level benchmark a good fit for that suite (consider a script that loads a pair of Parquet files, does some preprocessing of each, then joins them, then... etc.). However, any benchmark of a single API would fit well with the benchmarking suite (e.g. df.groupby(...).apply(...)), and it sounds like that is the sort of benchmark intended for publication as well. If we anticipate significant overlap, I think we would be better off adding those benchmarks to the existing suite and then finding a way to publish the results.
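
For reference, a single-API benchmark of that shape might look roughly like the following under pytest-benchmark; the fixture name, data shape, and test naming here are illustrative, not taken from the actual cuDF benchmarking suite:

```python
import numpy as np
import pytest

import cudf

@pytest.fixture
def grouped_df():
    # Illustrative data; the real suite has its own dataset fixtures.
    n = 1_000_000
    rng = np.random.default_rng(0)
    return cudf.DataFrame({
        "key": rng.integers(0, 100, n),
        "val": rng.random(n),
    })

def test_groupby_apply(benchmark, grouped_df):
    # pytest-benchmark's `benchmark` fixture calls the function repeatedly
    # and records timing statistics for it.
    benchmark(lambda: grouped_df.groupby("key").apply(lambda g: g["val"].sum()))
```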

shwina (Contributor, Author) commented Jan 30, 2023

The aim of this notebook is to produce plots showing speedups over Pandas for a small number of "representative" APIs, as an initial "pitch" to the Pandas user. Ideally, we'd publish those plots as part of https://rapids.ai and/or the cuDF docs.

The goal of a benchmarking suite should be to cover as many APIs as possible, run with as many arguments and input data types as possible, so as to enable finding and fixing performance regressions.

I agree that many of the benchmarks we include in the notebook can also be added to the benchmarking suite, but my hope is that this notebook remains largely static while the benchmarking suite continues to grow, so that the overlap becomes less and less significant.

I also agree that we can find ways to publish the results of the existing suite, but for "home page" benchmark plots that we want users to be able to download the code for, reproduce, and make small modifications to on their own machines, a Jupyter notebook is the best fit.

cc: @beckernick curious if you have any thoughts here as well

vyasr (Contributor) commented Jan 30, 2023

If the goal is to allow users to download the notebook and play with it themselves, then I'm on board with that. If the benchmarks end up being a strict subset of the microbenchmarks, though, we should consider using something like Jupyter's %load magic to pull the benchmarks directly from the benchmark files (although we may need more fine-grained control than that, in which case something like nbconvert might be appropriate). I think what's worth doing mostly hinges on how accurate this statement ends up being:

my hope is that this notebook remains largely static

My main concern is ensuring that we don't end up maintaining two sets of benchmarks.
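
A rough sketch of the %load idea, assuming a purely illustrative path rather than a confirmed location in the repository:

```python
# In a notebook cell; the path below is an illustrative placeholder.
%load python/cudf/benchmarks/API/bench_dataframe.py
# Executing the cell once makes IPython replace the %load line with the
# file's contents, which can then be run (or lightly edited) in place.
```

Keeping the notebook sourced from the benchmark files this way would mean there is only one copy of each benchmark to maintain.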

rapids-bot (bot) pushed a commit that referenced this issue on Mar 8, 2023
Resolves: #12295 

This PR introduces a notebook of benchmarks that users can download and run themselves. The notebook also generates the graphs that will appear in the cuDF Python docs.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Ray Douglass (https://github.com/raydouglass)

URL: #12595