Add communications bandwidth to benchmarks #938
Conversation
Two new utility functions are added for printing benchmark results, avoiding the need to rearrange individual whitespace or separator lengths whenever a new, longer row is added, and removing the need to count spaces by hand given the code's indentation.
The existing throughput value only gives an overview of the total data processed for the complete workflow, with no insight into communications; the new bandwidth value provides that. Additionally, the common peer-to-peer bandwidth computation has been moved into a utility function.
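As a rough idea of what such print helpers could look like: the names print_key_value and print_separator match the calls visible in the diff below, but the column width, separator length, and exact formatting here are assumptions, not the actual implementation.

def print_key_value(key, value, key_length=25):
    """Print a `key | value` row with a fixed-width key column (width is an assumption)."""
    print(f"{key: <{key_length}} | {value}")

def print_separator(separator="-", length=80):
    """Print a horizontal rule made of `separator` characters (length is an assumption)."""
    print(separator * length)

# Hypothetical usage:
print_separator(separator="=")
print_key_value(key="Throughput", value="1.07 GiB/s +/- 209.63 MiB/s")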
Codecov Report
@@            Coverage Diff             @@
##           branch-22.08    #938   +/- ##
==============================================
  Coverage              ?    0.00%
==============================================
  Files                 ?       16
  Lines                 ?     2092
  Branches              ?        0
==============================================
  Hits                  ?        0
  Misses                ?     2092
  Partials              ?        0

Continue to review full report at Codecov.
key="Bandwidth", | ||
value=f"{format_bytes(bandwidths_all.mean())}/s +/- " | ||
f"{format_bytes(bandwidths_all.std())}/s", | ||
) |
FWIW, and this is just the same as before, the arithmetic mean is generally the wrong thing for rate-like data; the more correct option is the harmonic mean, len(xs)/sum(1/xs).
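As a back-of-the-envelope illustration (hypothetical numbers, not from this PR): two transfers of 9 GiB each, one at 1 GiB/s and one at 9 GiB/s, take 10 s in total, i.e. 1.8 GiB/s overall, which is what the harmonic mean reports, whereas the arithmetic mean says 5 GiB/s.

import numpy as np

rates = np.array([1.0, 9.0])                   # GiB/s, hypothetical per-transfer rates
arithmetic = rates.mean()                      # 5.0 GiB/s -- overstates the overall rate
harmonic = len(rates) / np.sum(1.0 / rates)    # 1.8 GiB/s
# Cross-check: 18 GiB moved in 9 s + 1 s = 10 s is indeed 1.8 GiB/s.
print(arithmetic, harmonic, 18.0 / np.sum(9.0 / rates))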
Yeah, I was actually wondering whether that was correct but didn't spend the time investigating the right way to compute it. A harmonic mean does indeed make sense. However, it looks depressing 🙁
Bandwidth | 10.18 GiB/s +/- 8.16 GiB/s
BandwidthH | 3.47 GiB/s +/- 1.25 GiB/s
We seemingly go down from 10.18 GiB/s to 3.47 GiB/s; of course the former is incorrect anyway, but it is sad.
I have now replaced the mean and standard deviation in 268f893. I am not sure whether the standard deviation is correct, can you double-check?
Standard deviations of inverse random variables are a huge can of worms :(, but as per this stats.stackexchange post there are a few options. Suppose we have our rates x_1, ..., x_n, and let y_i = 1/x_i with mean ȳ and variance Var(y). Then the variance of the harmonic mean H = 1/ȳ is approximately Var(H) ≈ Var(y) / (n * ȳ^4).
The above-linked post shows that this is a good estimate for uniformly distributed data at least.
So I think we can calculate the harmonic mean and standard deviation as
import numpy as np

def hmean(data):
    """Harmonic mean of `data`."""
    return 1 / np.mean(1 / data)

def hstd(data):
    """Approximate standard deviation of the harmonic mean of `data`."""
    rmean = np.mean(1 / data)
    rvar = np.var(1 / data)
    return np.sqrt(rvar / (len(data) * rmean ** 4))
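Hypothetical usage with the per-transfer bandwidths collected by the benchmark (the values below are made up):

from dask.utils import format_bytes

bandwidths_all = np.asarray([8.5e9, 12.1e9, 9.7e9, 15.3e9, 7.9e9])  # bytes/s, made-up
print(f"{format_bytes(hmean(bandwidths_all))}/s +/- {format_bytes(hstd(bandwidths_all))}/s")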
Thanks, I saw that yesterday but wasn't totally sure it would necessarily be correct either. I've made the changes in d120cd3, thanks for the code as well!
Minor queries
)
print_key_value(
    key="Throughput (Comms only)",
    value=f"{format_bytes(bandwidths_all.mean())} +/- "
Whereas this bandwidth measure is the mean of all the p2p data movement observed by dask?
Correct.
print_key_value(
    key="Throughput (Comms only)",
    value=f"{format_bytes(bandwidths_all.mean())} +/- "
    f"{format_bytes(bandwidths_all.std())}",
If args.backend == "explicit-comms", I think these will all be empty (since dask doesn't record p2p transfers in those cases), c.f. lines 329 and onwards below.
Doesn't seem to be the case, although it's possible this is only for normal message passing (e.g., heartbeats):
$ DASK_DISTRIBUTED__COMM__COMPRESSION=None OPENBLAS_NUM_THREADS=1 python dask_cuda/benchmarks/local_cudf_merge.py -p ucx -d 0,1,2,3,4,5,6,7 -c 1_000_000 --runs 30 -b explicit-comms
...
Throughput | 1.07 GiB/s +/- 209.63 MiB/s
Bandwidth | 103.27 MiB/s +/- 90.80 MiB/s
So this must be background messaging since I don't think that the bandwidth can be lower than the throughput if we're actually measuring data movement of the computation.
It can. Consider, for example, a problem that processes 10GB of data in one second with 8 workers, i.e., 10GB/s throughput. Communications could take 90% of the total time at a rate of 100MB/s in each direction; with 8 workers that still keeps the bandwidth at only 1600MB/s. Therefore, there's no direct relationship between the two, not in this benchmark at least. You could also have a problem with no communication whatsoever, which would still give some throughput, but the bandwidth would be 0.
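A back-of-the-envelope version of those numbers (purely illustrative):

data_processed = 10_000                        # MB processed by the whole workflow
elapsed = 1.0                                  # seconds
throughput = data_processed / elapsed          # 10000 MB/s, i.e. 10 GB/s
per_direction = 100                            # MB/s on each link
aggregate_bandwidth = per_direction * 2 * 8    # 2 directions x 8 workers = 1600 MB/s
print(throughput, aggregate_bandwidth)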
Despite the above, I think you are right that we need to review explicit-comms anyway; over time this benchmark has grown organically, possibly without always considering the differences posed by explicit-comms.
It can. Consider, for example, a problem that processes 10GB of data in one second with 8 workers, i.e., 10GB/s throughput. Communications could take 90% of the total time at a rate of 100MB/s in each direction; with 8 workers that still keeps the bandwidth at only 1600MB/s. Therefore, there's no direct relationship between the two, not in this benchmark at least. You could also have a problem with no communication whatsoever, which would still give some throughput, but the bandwidth would be 0.
Thanks, I wasn't thinking very hard; as you note, a more extreme example is one where all the data shuffling is local to each worker (high throughput but tiny bandwidth).
Despite the above, I think you are right that we need to review explicit-comms anyway; over time this benchmark has grown organically, possibly without always considering the differences posed by explicit-comms.
Perhaps let's discuss in #940 to pin down exactly what we want to be measuring (since that changes some of the things that are measured/saved)
print_separator(separator="=") | ||
print_key_value( | ||
key="Throughput", | ||
value=f"{format_bytes(t_p.mean())} +/- {format_bytes(t_p.std()) }", | ||
value=f"{format_bytes(hmean(t_p))}/s +/- {format_bytes(hstd(t_p))}/s", |
I had a check to compare with how standard errors are propagated for reciprocals, and these formulae are consistent with that approach. In this case data_processed is the same for every run and only the times vary, so we can compute the mean throughput as data_processed / np.mean(times) and the standard deviation as data_processed * np.std(times) / np.mean(times)**2. But these are just special cases of the harmonic approach, so I'm happy.
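For what it's worth, a quick numerical check of the mean identity, with made-up run times and reusing the hmean helper from the earlier comment:

import numpy as np

data_processed = 4e9                      # bytes, identical for every run (hypothetical)
times = np.array([1.9, 2.1, 2.0, 2.3])    # seconds, made-up
throughputs = data_processed / times
# With a constant numerator, the harmonic mean of the per-run throughputs
# reduces to data_processed / mean(times).
assert np.isclose(hmean(throughputs), data_processed / np.mean(times))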
nitpicks, but otherwise fine
dask_cuda/benchmarks/local_cupy.py
t = format_time(took)
print_key_value(key=f"{t}", value=f"{npartitions}")
times = []
for idx, (took, npartitions) in enumerate(took_list):
idx is unused.
Done in dc1a587
t = format_time(took)
print_key_value(key=f"{t}", value=f"{npartitions}")
times = []
for idx, (took, npartitions) in enumerate(took_list):
idx is unused.
Done in dc1a587
times = np.asarray(times)
bandwidths_all = np.asarray(bandwidths_all)
print_separator(separator="=")
print_key_value(
I think we can deduce throughput for these operations as well, but it would need to be done for each op in turn, and we'd need to pick a cache model. For example, does (A + A.T).sum() move A.nbytes of data or 2*A.nbytes (or more)?
The amount of data that gets moved (if by "move" you mean transferring between workers) in the cuDF benchmarks is irrelevant for the throughput; instead, the throughput is based on data_processed, which is a sum of the DataFrame sizes:

dask-cuda/dask_cuda/benchmarks/local_cudf_merge.py, lines 169 to 170 in a2e3223:

data_processed = len(ddf_base) * sum([t.itemsize for t in ddf_base.dtypes])
data_processed += len(ddf_other) * sum([t.itemsize for t in ddf_other.dtypes])
In the same manner, I believe the equivalent data_processed for the CuPy benchmarks would be the sum of the sizes of the input arrays, e.g., x.nbytes for

x = rs.random((args.size, args.size), chunks=args.chunk_size).persist()

and x.nbytes + y.nbytes for

dask-cuda/dask_cuda/benchmarks/local_cupy.py, lines 44 to 45 in a2e3223:

x = rs.random((args.size, args.size), chunks=args.chunk_size).persist()
y = rs.random((args.size, args.size), chunks=args.chunk_size).persist()
IOW, A.T is not an input and thus doesn't get accounted for in data_processed. Does the above make sense?
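A sketch of that accounting with stand-in inputs (the array sizes here are arbitrary, not the benchmark defaults, and the variable names are illustrative):

import dask.array as da

x = da.random.random((8192, 8192), chunks=2048)
y = da.random.random((8192, 8192), chunks=2048)
# Count only the bytes of the inputs; intermediates such as x.T are not included.
data_processed_transpose_sum = x.nbytes            # e.g. for (x + x.T).sum()
data_processed_two_inputs = x.nbytes + y.nbytes    # e.g. for ops that read both x and y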
I meant data movement in the local processing, but I guess this is a slightly philosophical argument. I was thinking from the point of view of "what is the streaming data cost?": e.g., if I just compute B = A.T then I need to stream A and stream the output to B, so the data movement is 2*A.nbytes (on CPUs at least it would normally be 3*A.nbytes, since caches are usually write-allocate, unless we hint that the output can be done with a non-temporal store). This kind of metric tells us something about the performance on the hardware.
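To make the two accounting choices concrete (sizes are illustrative, not from the benchmark):

import numpy as np

A = np.random.random((4096, 4096))
input_bytes = A.nbytes               # "input size" accounting used for data_processed
streaming_bytes = 2 * A.nbytes       # B = A.T: read A once, write B once
write_allocate_bytes = 3 * A.nbytes  # typical CPU caches also read B's lines before writing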
But I agree that a throughput of "this is how much data I started with, how fast can I process it?" is also a good metric, and is generally the one we actually care about since that's the wallclock time question.
In conclusion: happy to go with sum of input array sizes.
Yeah, I agree with your statement on data movement, but I think this can get impractical quite fast when we need to compute it for various different operations. Since you agree, let's go with the input array sizes for now, which is now addressed in dc1a587.
Based on @wence-'s review; probably good to address Lawrence's comments before merging in.
@wence- I think all your suggestions/requests have been addressed, do you think we're good to merge?
Yes, go ahead.
@gpucibot merge
The existing throughput value only gives an overview of the total data processed for the complete workflow, with no insight into communications; the new bandwidth value provides that. Additionally, the common peer-to-peer bandwidth computation has been moved into a utility function.
Depends on #937