
[REVIEW] Add performance benchmarks to user facing docs #12595

Merged: 33 commits merged into rapidsai:branch-23.04 on Mar 8, 2023

Conversation

@galipremsagar (Contributor) commented on Jan 23, 2023

Description

Resolves: #12295

This PR introduces a notebook of benchmarks that users can run by downloading the notebook. The notebook also generates graphs that will appear in the cuDF Python docs.
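For readers skimming this PR, here is a minimal sketch of the kind of timing comparison the notebook performs. The time_it helper and the data below are illustrative assumptions, not the notebook's actual code.

import timeit

import cudf
import pandas as pd


def time_it(func, number=10):
    # Average wall-clock seconds per call over `number` runs.
    return timeit.timeit(func, number=number) / number


pdf = pd.DataFrame({"a": range(1_000_000)})
gdf = cudf.from_pandas(pdf)

pandas_time = time_it(lambda: pdf["a"].sum())
cudf_time = time_it(lambda: gdf["a"].sum())
print(f"speedup: {pandas_time / cudf_time:.1f}x")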

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@galipremsagar added the improvement (Improvement/enhancement to an existing function) and non-breaking (Non-breaking change) labels on Jan 23, 2023
@galipremsagar self-assigned this on Jan 23, 2023
@review-notebook-app

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.



@@ -0,0 +1,1059 @@
{
@galipremsagar (Contributor, Author) commented on Jan 24, 2023

The speedups for subsequent runs of UDFs are insane! Is it okay to showcase this in the notebook, @shwina?



@codecov (bot) commented on Jan 24, 2023

Codecov Report

❗ No coverage uploaded for pull request base (branch-23.04@5b62562).
Patch has no changes to coverable lines.

❗ Current head df0fc64 differs from pull request most recent head 1962902. Consider uploading reports for the commit 1962902 to get more accurate results

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-23.04   #12595   +/-   ##
===============================================
  Coverage                ?   85.73%           
===============================================
  Files                   ?      155           
  Lines                   ?    24889           
  Branches                ?        0           
===============================================
  Hits                    ?    21339           
  Misses                  ?     3550           
  Partials                ?        0           


☔ View full report at Codecov.

@GregoryKimball (Contributor) left a comment

Thank you Prem for starting this work. This is very important!

Comment on lines 40 to 42
" \"numbers\": np.random.randint(-1000, 1000, 10_000_000, dtype='int64'),\n",
" \"business\": np.random.choice([\"McD\", \"Buckees\", \"Walmart\", \"Costco\"], size=10_000_000)\n",
"})"
Contributor

Why not use cupy here for the gdf?

@galipremsagar (Contributor, Author)

Primarily because cupy doesn't support str types yet.
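For context, a sketch of how the construction could mix CuPy for the numeric column with NumPy for the string column. The sizes follow the snippet above; CuPy has no string dtype, so the string column still has to be generated on the host.

import cupy as cp
import numpy as np
import cudf

gdf = cudf.DataFrame({
    # Numeric data can be generated directly on the GPU with CuPy...
    "numbers": cp.random.randint(-1000, 1000, 10_000_000, dtype="int64"),
    # ...but CuPy has no string dtype, so strings come from NumPy on the host.
    "business": np.random.choice(["McD", "Buckees", "Walmart", "Costco"], size=10_000_000),
})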

"metadata": {},
"outputs": [],
"source": [
"pandas_read_parquet = time_it(lambda : pd.read_parquet(\"pandas.parquet\"))"
Contributor

For the reader functions, we need to consider cache clearing. I always use this command when benchmarking file systems: os.system("/sbin/sysctl vm.drop_caches=3")

Also, we don't know where the file is actually going for the user that runs this! It could be a local drive, it could be a network drive, it could be a virtual drive that is actually a network drive. Generally faster drives allow our readers to show faster speedups. We may want to compare host buffers and files, and provide some analysis based on the comparison.

All that said, for the purposes of this notebook I think your approach is the best one. We could include more documentation about how IO works. We could also add a drive read speed test, and report performance relative to that. I'm looking forward to discussing this more with you.
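A rough sketch of what the cache-clearing read benchmark suggested above could look like. It requires root on Linux; the file name follows the snippet under review, and the cold/warm comparison is illustrative only.

import os
import timeit

import pandas as pd


def drop_page_cache():
    # Flush dirty pages, then ask the kernel to drop clean page/dentry/inode caches.
    os.system("sync")
    os.system("/sbin/sysctl vm.drop_caches=3")


drop_page_cache()
cold_read = timeit.timeit(lambda: pd.read_parquet("pandas.parquet"), number=1)
warm_read = timeit.timeit(lambda: pd.read_parquet("pandas.parquet"), number=1)
print(f"cold read: {cold_read:.2f}s, warm read: {warm_read:.2f}s")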

@galipremsagar changed the base branch from branch-23.02 to branch-23.04 on February 8, 2023 21:50
@galipremsagar marked this pull request as ready for review on February 8, 2023 22:50
@galipremsagar changed the title from "[WIP] Add performance benchmarks to user facing docs" to "[REVIEW] Add performance benchmarks to user facing docs" on Feb 8, 2023
@@ -0,0 +1,1699 @@
{
@bdice (Contributor) commented on Feb 8, 2023

Line #1.    pandas_upper = timeit.timeit(lambda : pd_series.str.upper(), number=20)

Can we write all these benchmarks in a more reusable style? Maybe track the results in a dictionary?

def bench(pdf, gdf, func, **kwargs):
    # timeit needs a callable, so wrap the call in a lambda rather than
    # evaluating func(...) eagerly.
    pdf_time = timeit.timeit(lambda: func(pdf), **kwargs)
    gdf_time = timeit.timeit(lambda: func(gdf), **kwargs)
    return pdf_time, gdf_time

upper = bench(pd_series, gd_series, lambda df: df.str.upper(), number=20)
contains = bench(pd_series, gd_series, lambda df: df.str.contains(r"[0-9][a-z]"), number=20)




@galipremsagar (Contributor, Author)

Done 👍

@exactlyallan (Member)

@galipremsagar looking to run the notebook this week - think that's possible?

@bdice (Contributor) commented on Feb 24, 2023

Let's add a symlink in the notebooks directory.

@github-actions bot added the Python (Affects Python cuDF API) label on Feb 24, 2023
@galipremsagar requested a review from a team as a code owner on February 24, 2023 20:29
@github-actions bot removed the Python (Affects Python cuDF API) label on Feb 24, 2023
@galipremsagar (Contributor, Author)

> Let's add a symlink in the notebooks directory.

Done 👍

@github-actions bot added the Python (Affects Python cuDF API) label on Feb 24, 2023
@@ -0,0 +1,1568 @@
{
@bdice (Contributor) commented on Feb 24, 2023

Line #2.    gdf

Combine with the above cell. No need to show both pdf and gdf outputs. Readers will trust that's correct.



@@ -0,0 +1,1568 @@
{
@bdice (Contributor) commented on Feb 24, 2023

Line #9.        gd_obj : cuDF object

Let's name this cudf_obj. Match the module name (pd, cudf).



Contributor

Ah, I see Bradley has the same idea.

@@ -0,0 +1,1568 @@
{
@bdice (Contributor) commented on Feb 24, 2023

Line #1.    pdf = pdf.head(100_000_000)

Why did we create 300M rows, only to trim it to 100M? It's a bit confusing to define num_rows and then not benchmark data with that number of rows. Let's use 100M or 300M everywhere.

I see below that we redefine num_rows = 1_000_000 with a new dataset. That's okay - but within the benchmarks of a given dataset, we should be consistent so that it doesn't look like we're cherry-picking data sizes to show the best speedup (especially for smaller datasets like 1M or 100M instead of 300M).



@@ -0,0 +1,1568 @@
{
@bdice (Contributor) commented on Feb 24, 2023

Line #5.    _ = gc.collect()

If you're trying to avoid showing the result, just write gc.collect(); with a semicolon to prevent displaying the output. Assigning the result looks odd.



@@ -0,0 +1,1568 @@
{
@bdice (Contributor) commented on Feb 24, 2023

Line #1.    gd_series = cudf.from_pandas(pd_series)

Combine so all this is in one cell:

num_rows = 300_000_000

pd_series = ...

gd_series = ...



@@ -0,0 +1,1568 @@
{
@bdice (Contributor) commented on Feb 24, 2023

Line #1.    gdf_age = cudf.from_pandas(pdf_age)

Combine this cell with the cell above.



@@ -0,0 +1,1568 @@
{
@bdice (Contributor) commented on Feb 24, 2023

Line #1.    gd_series = cudf.from_pandas(pd_series)

Combine with the cell above.



@@ -0,0 +1,1568 @@
{
@bdice (Contributor) commented on Feb 24, 2023

Line #3.    pdf['key'] = np.random.randint(0,2,size)

Add spaces after the commas.

Separate issue: We should enable the nbqa pre-commit hook with black/isort/etc. for our repo... example: https://github.com/glotzerlab/signac-examples/blob/6c0f1efdaf87d361f29d8e035025a87fccf4f57d/.pre-commit-config.yaml#L40-L52



@@ -0,0 +1,1568 @@
{
@bdice (Contributor) commented on Feb 24, 2023

Line #1.    performance_df

Combine with previous cell.



@@ -0,0 +1,1568 @@
{
@bdice (Contributor) commented on Feb 24, 2023

Line #1.    !lscpu

FYI: This benchmarks a CPU released in 2019 against the H100 GPU made available in 2022-2023. I saw that the website benchmarks Allan collected used an AMD EPYC 7642, which is also a 2019-2020 era CPU, with an A100 (released in May 2020). That's a fairer comparison.



Contributor

Particularly since a "good" implementation should be bandwidth-bound, and the max single-socket spec-sheet bandwidth on one of these chips is 140 GB/s (ballpark). Single-core STREAM will probably top out at 20 GB/s.

@galipremsagar mentioned this pull request on Feb 24, 2023
rapids-bot bot pushed a commit that referenced this pull request Feb 24, 2023
This enables `black` and `isort` linters for ipynb notebooks via [nbqa](https://github.com/nbQA-dev/nbQA). I propose this change to avoid manually linting notebooks like #12595. cc: @galipremsagar

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #12848
@exactlyallan (Member)

@bdice and @galipremsagar, think we could wrap this up so the site can launch?

@bdice (Contributor) left a comment

Approving for now, so that the notebook can be linked from the new rapids.ai site. We will address further review comments in a follow-up PR. (Discussed offline with @galipremsagar and @exactlyallan)

@github-actions bot added the ci label on Mar 8, 2023
@@ -0,0 +1,1647 @@
{
@wence- (Contributor) commented on Mar 8, 2023

In the different sections can we have a little bit of setup? Possibly with some linking to user-guide docs as well?



@@ -0,0 +1,1647 @@
{
@wence- (Contributor) commented on Mar 8, 2023

Line #2.    gdf

gdf is a somewhat impenetrable name (if the pandas dataframe is called pdf, why is the cudf dataframe not called cdf?). If this is didactic, perhaps use pandas_df and cudf_df throughout?



@@ -0,0 +1,1647 @@
{
@wence- (Contributor) commented on Mar 8, 2023

Line #9.    plt.show()

Is it possible to also show absolute time as a separate subplot (or a second y-axis on the right)? If pandas takes only half a second, the speedup is still impressive, but it is less motivating than going from 30 seconds to a fraction of a second.
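A sketch of one way to do this; the performance_df layout and the numbers below are placeholders for illustration, not measured results.

import matplotlib.pyplot as plt
import pandas as pd

# Placeholder numbers purely for illustration, not measured results.
performance_df = pd.DataFrame(
    {"pandas (s)": [30.0, 0.5], "cudf (s)": [1.0, 0.05]},
    index=["read_parquet", "str.upper"],
)
speedup = performance_df["pandas (s)"] / performance_df["cudf (s)"]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
speedup.plot.bar(ax=ax1, title="Speedup (x)")
performance_df.plot.bar(ax=ax2, title="Absolute time (s)", logy=True)
plt.tight_layout()
plt.show()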



@@ -0,0 +1,1647 @@
{
@wence- (Contributor) commented on Mar 8, 2023

Line #6.    _ = gc.collect()

Although I think I know what you're doing, this is kind of an odd thing to have to do at all. It's not the kind of thing you would necessarily write as part of a normal analysis I think.

Does it make sense to instead define the benchmarks inside functions that set up the data, run, and collate results. Then the dataframes don't escape the scope of the function call and will get collected automatically.

In particular, the requirement to do gc.collect() (presumably to clean out cycles) is an anti-pattern that readers of this notebook may well cargo-cult.
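A sketch of the function-scoped alternative described above (the operation and names are illustrative): the DataFrames never escape the function, so they are freed when it returns and no explicit gc.collect() is needed.

import timeit

import cudf
import numpy as np
import pandas as pd


def bench_sum(num_rows, number=10):
    # Data lives only inside this function, so it is collected automatically
    # once the function returns; no gc.collect() required.
    pdf = pd.DataFrame({"x": np.random.randint(0, 1000, num_rows)})
    gdf = cudf.from_pandas(pdf)
    pandas_time = timeit.timeit(lambda: pdf["x"].sum(), number=number)
    cudf_time = timeit.timeit(lambda: gdf["x"].sum(), number=number)
    return pandas_time, cudf_time


pandas_time, cudf_time = bench_sum(10_000_000)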



@@ -0,0 +1,1647 @@
{
@wence- (Contributor) commented on Mar 8, 2023

Line #5.    )

It would be nice if we were able to make:

  1. strings that have less baggage associated with them
  2. Pick some distribution of lengths that is perhaps realistic. One could either have a normal distribution, or else perhaps a heavy-tailed log-normal. Here's some recent discussion on statistical models of sentence length: https://aclanthology.org/W19-5710.pdf
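For example, a sketch of drawing string lengths from a heavy-tailed log-normal distribution; the parameters and sizes are illustrative, not tuned to the paper linked above.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Heavy-tailed length distribution, clipped to a sane range.
lengths = rng.lognormal(mean=3.0, sigma=0.5, size=100_000).astype(int).clip(1, 200)
alphabet = np.array(list("abcdefghijklmnopqrstuvwxyz "))
pd_series = pd.Series(["".join(rng.choice(alphabet, size=n)) for n in lengths])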


@@ -0,0 +1,1647 @@
{
@wence- (Contributor) commented on Mar 8, 2023

Line #1.    num_rows = 10_000_000

The size of the data keeps on changing. Maybe there is a good reason for this, but if there is, I think it should be spelled out. If not, then it looks odd.



@@ -0,0 +1,1647 @@
{
@wence- (Contributor) commented on Mar 8, 2023

Line #5.            return 1

An idiomatic way of writing this function would be `return int(row.isupper())`. Does that _work_?
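For reference, the two spellings are equivalent in plain pandas (sketch below, with made-up data); whether cudf's string UDF support accepts the int(...) form inside apply is exactly the open question here.

import pandas as pd

s = pd.Series(["ABC", "abc", "Abc"])


def is_upper_verbose(row):
    if row.isupper():
        return 1
    return 0


def is_upper_idiomatic(row):
    return int(row.isupper())


# Both forms produce the same 0/1 result on a pandas Series.
assert s.apply(is_upper_verbose).equals(s.apply(is_upper_idiomatic))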



@@ -0,0 +1,1647 @@
{
@wence- (Contributor) commented on Mar 8, 2023

Line #5.    pd_series

Same comments on the strings apply here as before.



@@ -0,0 +1,1647 @@
{
@wence- (Contributor) commented on Mar 8, 2023

As mentioned, if we collate the individual runs then we can use a box-and-whiskers or violin plot to show all of this data in one plot, I think.

There's also a question of what the fairest comparison is. I think, if possible, we should also show the "first run" cost when loading the jitted code from the disk cache (I think that's a thing?). You might run the same workflow many times from scratch, but if the cache persists, that's the number you care about.
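A sketch of collating per-run timings and plotting them as box-and-whiskers; the operations and repeat counts are illustrative only.

import timeit

import matplotlib.pyplot as plt
import pandas as pd

s = pd.Series(range(1_000_000))

# One list of per-run timings per benchmark, instead of a single averaged number.
runs = {
    "sum": timeit.repeat(lambda: s.sum(), number=1, repeat=10),
    "sort_values": timeit.repeat(lambda: s.sort_values(), number=1, repeat=10),
}

fig, ax = plt.subplots()
ax.boxplot(list(runs.values()), labels=list(runs.keys()))
ax.set_ylabel("time per run (s)")
plt.show()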



@@ -0,0 +1,1647 @@
{
@wence- (Contributor) commented on Mar 8, 2023

Line #5.    plt.show()

Does this plot add anything that the (single-entry) table does not?



@galipremsagar requested a review from a team as a code owner on March 8, 2023 19:02
@galipremsagar (Contributor, Author)

/merge

@rapids-bot bot merged commit 553162c into rapidsai:branch-23.04 on Mar 8, 2023
Labels
improvement (Improvement/enhancement to an existing function), non-breaking (Non-breaking change), Python (Affects Python cuDF API)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[DOC] Published benchmarks showing performance comparison v/s Pandas
7 participants