
Benchmark utility to perform diff of output from benchmark runs, allowing for precision differences #782

Merged: andygrove merged 27 commits into NVIDIA:branch-0.3 from andygrove:diff-results, Sep 21, 2020

Conversation

@andygrove (Contributor) commented Sep 16, 2020

This PR adds a convenience method to BenchUtils for comparing the results from two benchmark runs (e.g. verifying that GPU and CPU results match).

This is a simple implementation that pulls results down to the driver for comparison, so it will only work for data sets that fit in the driver's memory, but that is likely adequate for most of the benchmarks we are currently running. For example, TPCx-BB q5 at 10TB scale produces ~600MB of output.

If data needs sorting before comparison, this is delegated to Spark before collecting the results.

Example usage from spark-shell:

scala> val cpu = spark.read.parquet("/data/andygrove/q5-cpu")
scala> val gpu = spark.read.parquet("/data/andygrove/q5-gpu")
scala> import com.nvidia.spark.rapids.tests.common._
scala> BenchUtils.compareResults(cpu, gpu, ignoreOrdering=true, epsilon=0.0)
Collecting rows from DataFrame
Collected 989754 rows in 7.701 seconds                                               
Collecting rows from DataFrame
Collected 989754 rows in 2.325 seconds                                               
Results match
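
To illustrate the "allowing for precision differences" part of the title, here is a minimal, hypothetical sketch of an epsilon-tolerant value comparison. The name approxEqual is illustrative only, not the actual BenchUtils internals:

    // Hypothetical helper, for illustration only: floating-point values may
    // differ by up to `epsilon`; all other values must be exactly equal.
    def approxEqual(a: Any, b: Any, epsilon: Double): Boolean = (a, b) match {
      case (x: Double, y: Double) => math.abs(x - y) <= epsilon
      case (x: Float, y: Float)   => math.abs(x - y) <= epsilon
      case (x, y)                 => x == y
    }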

@andygrove added the "test" (Only impacts tests) label Sep 16, 2020
@andygrove added this to the Sep 14 - Sep 25 milestone Sep 16, 2020
@andygrove self-assigned this Sep 16, 2020
@andygrove requested a review from abellina September 16, 2020 18:37
@@ -0,0 +1,48 @@
package com.nvidia.spark.rapids.tests.common
Member:
The copyright and license header is missing. A build should have flagged this.

def compareResults(df1: DataFrame, df2: DataFrame, epsilon: Double = 0.00001): Unit = {
  val result1: Seq[Seq[Any]] = collectAndSort(df1)
  val result2: Seq[Seq[Any]] = collectAndSort(df2)
Member:
What if ordering of the results is a significant part of the query output? There are cases where we need to preserve order when comparing. For example, if I run a query that collects the top 100, a comparison that sees the same 100 elements in a different order would declare victory when it should not.

Minimally this needs to account for unordered vs. ordered compare.

@andygrove changed the title from "[WIP] Benchmark utility to perform diff of output from benchmark runs, allowing for precision differences" to "Benchmark utility to perform diff of output from benchmark runs, allowing for precision differences" Sep 16, 2020
}
}

private def collectResults(df: DataFrame, ignoreOrdering: Boolean): Seq[Seq[Any]] = {
Collaborator:
Considering this is a benchmarking utility, should we also provide an iterator in case the collected results are too big?

Contributor (author):
Spark's DataFrame.collect() loads the full result set into the driver and returns an Array. If (when) we need to handle comparisons of larger results I think we would need an alternate approach like converting the results to single files, downloading them, and using Scala code (without Spark) to perform the diff.

Collaborator:
It is a relatively minor change to support this with an iterator; we have that support in the Python unit tests. The only issue is that you cannot compare the sizes of the results ahead of time, because you don't know them yet. But I agree that until we hit a situation where we need it, there is no point in doing it. Also, it is very slow to do a comparison like that single-threaded. It might be better to truncate floating-point values and do a few anti-joins to see if there is anything in the left that is not in the right, and vice versa. That might not handle duplicate rows, so we might need something extra there, but it would scale a lot better.
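
A rough sketch of that anti-join idea (illustrative only, not part of this PR; the name distributedDiff and the rounding scale are assumptions). It rounds floating-point columns, then counts rows present on only one side in each direction:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, round}
    import org.apache.spark.sql.types.{DoubleType, FloatType}

    // Sketch of the scalable comparison idea discussed above.
    def distributedDiff(df1: DataFrame, df2: DataFrame, scale: Int = 5): (Long, Long) = {
      // Round floating-point columns so tiny precision differences
      // do not register as mismatches.
      def normalize(df: DataFrame): DataFrame =
        df.schema.fields.foldLeft(df) { (d, f) =>
          f.dataType match {
            case DoubleType | FloatType => d.withColumn(f.name, round(col(f.name), scale))
            case _ => d
          }
        }
      val (a, b) = (normalize(df1), normalize(df2))
      // Anti-joins in both directions: rows only in the left, rows only in the
      // right. Note: except() is set-based, so duplicate rows would need
      // extra handling (e.g. grouping with counts), as noted above.
      (a.except(b).count(), b.except(a).count())
    }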

Contributor (author):
I'd like to know more about the iterator approach. I did look at the toLocalIterator method, but it seems to load one partition at a time, so I don't think it would work when we need ordering across the whole result set. Could you point me to an example where we do this in the Python tests?

Collaborator:
It loads one partition of the result at a time, so it does preserve ordering. It is just rather slow, because we are doing the comparison single-threaded.
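
A minimal sketch of that pattern (illustrative names; epsilon handling omitted for brevity): sort on the cluster so the ordering is deterministic, then stream the rows through the driver one partition at a time instead of collecting everything.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col

    // Sketch: compare two DataFrames row by row without a full collect().
    // toLocalIterator pulls one partition at a time, preserving sort order.
    def iterativeCompare(df1: DataFrame, df2: DataFrame): Boolean = {
      val it1 = df1.sort(df1.columns.map(col): _*).toLocalIterator()
      val it2 = df2.sort(df2.columns.map(col): _*).toLocalIterator()
      while (it1.hasNext && it2.hasNext) {
        if (it1.next() != it2.next()) {
          return false
        }
      }
      // Both iterators must be exhausted, otherwise the row counts differ.
      !it1.hasNext && !it2.hasNext
    }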

Collaborator:
Check out mortgage_test.py.

Contributor (author):
I added the option to use an iterator and have been testing it out; it seems to work well.

val t1 = System.currentTimeMillis()
val rows = if (ignoreOrdering) {
  // let Spark do the sorting
  df.sort(df.columns.map(col): _*).collect()
Collaborator:
The downside to this is if sort is broken: is this forcing a CPU sort, for instance?

Collaborator:
The only known issue with GPU sort vs. CPU sort is #84, which I don't think we will ever run into outside of artificially generated data sets, which is why we have not pushed to fix it.

Collaborator:
The downside to this is if sort is broken: is this forcing a CPU sort, for instance?

The job config is outside of the tool. To verify the results, I think we should let standard CPU Spark do it. Nothing stops you from running this with the plugin, so it's up to the user at this point.
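
For example, assuming the RAPIDS plugin is installed in the session, the comparison's sort can be forced onto the CPU by disabling the plugin first (spark.rapids.sql.enabled is the plugin's enable/disable config):

    scala> // disable the RAPIDS plugin so the verification sort runs on CPU Spark
    scala> spark.conf.set("spark.rapids.sql.enabled", "false")
    scala> BenchUtils.compareResults(cpu, gpu, ignoreOrdering=true, epsilon=0.0001)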

Collaborator:
I guess we have other tests for correctness, so ignore my comment; and hopefully the GPU sort should be faster.

@tgravescs (Collaborator):

If this is useful by itself it might be nice to document it in the README; otherwise it can wait until more stuff is in place.

* precision.
*
* This is only suitable for data sets that can fit in the driver's memory.

Collaborator:

Could we also add a note that this utility may add a sort and performs a collect? For benchmark timings to be accurate in isolation, it is probably best to run the benchmark with its output going to a file and then compare that output after timing has stopped, or to use this to compare results for a single run after the timings for the other runs have been collected.
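
A sketch of that write-then-compare flow in spark-shell terms (paths are illustrative): the benchmark writes its output to Parquet while being timed, and the diff runs as a separate, untimed step afterwards:

    scala> // untimed verification step, after the timed runs have written their output
    scala> val cpu = spark.read.parquet("/data/benchmarks/q5-cpu")
    scala> val gpu = spark.read.parquet("/data/benchmarks/q5-gpu")
    scala> BenchUtils.compareResults(cpu, gpu, ignoreOrdering=true, epsilon=0.0001)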

@andygrove (Contributor, author):

If this is useful by itself it might be nice to document it in the README; otherwise it can wait until more stuff is in place.

I agree. I filed #794 for writing benchmark documentation.

@andygrove (Contributor, author):

build

@andygrove (Contributor, author):

build

@andygrove (Contributor, author):

build

@abellina merged commit fdf731c into NVIDIA:branch-0.3 Sep 21, 2020
@abellina (Collaborator):

I merged this too quickly, sorry. If anyone has pending concerns, please let me know; I can revert or help fix issues.

@sameerz added the "benchmark" (Benchmarking, benchmarking tools) label and removed the "test" (Only impacts tests) label Oct 4, 2020
sperlingxx pushed a commit to sperlingxx/spark-rapids that referenced this pull request Nov 20, 2020
Benchmark utility to perform diff of output from benchmark runs, allowing for precision differences (NVIDIA#782)

* Benchmark automation POC

Signed-off-by: Andy Grove <[email protected]>

* Default to collecting all results

Signed-off-by: Andy Grove <[email protected]>

* address feedback

* make results action configurable, allowing for results to be written to parquet/csv

* fix typo in javadoc

Signed-off-by: Andy Grove <[email protected]>

* remove unused imports

Signed-off-by: Andy Grove <[email protected]>

* Remove cold/hot run loops since they were causing confusion

Signed-off-by: Andy Grove <[email protected]>

* gc between runs

Signed-off-by: Andy Grove <[email protected]>

* Make gc optional

Signed-off-by: Andy Grove <[email protected]>

* update test

Signed-off-by: Andy Grove <[email protected]>

* Provide specific benchmark methods for collect vs write to CSV or Parquet

Signed-off-by: Andy Grove <[email protected]>

* provide convenience methods to run benchmarks and store action in json

Signed-off-by: Andy Grove <[email protected]>

* Add utility method to perform a diff of the data collected from two DataFrames

Signed-off-by: Andy Grove <[email protected]>

* Add missing license header and remove unused import

Signed-off-by: Andy Grove <[email protected]>

* Provide option to ignore ordering of results

Signed-off-by: Andy Grove <[email protected]>

* revert change to compare method

Signed-off-by: Andy Grove <[email protected]>

* remove rmm_log.txt

Signed-off-by: Andy Grove <[email protected]>

* optimize sorting

* let spark sort the data before collecting it

Signed-off-by: Andy Grove <[email protected]>

* fix message

Signed-off-by: Andy Grove <[email protected]>

* Improved documentation

Signed-off-by: Andy Grove <[email protected]>

* bug fix and optimization to non-iterator case

Signed-off-by: Andy Grove <[email protected]>

* fix typo in javadoc

Signed-off-by: Andy Grove <[email protected]>

* fail fast if row counts do not match

Signed-off-by: Andy Grove <[email protected]>

* remove redundant logic

Signed-off-by: Andy Grove <[email protected]>

* scalastyle

Signed-off-by: Andy Grove <[email protected]>
@andygrove deleted the diff-results branch December 17, 2020 15:27
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
Benchmark utility to perform diff of output from benchmark runs, allowing for precision differences (NVIDIA#782)

nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
Benchmark utility to perform diff of output from benchmark runs, allowing for precision differences (NVIDIA#782)
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023
[auto-merge] bot-auto-merge-branch-22.12 to branch-23.02 [skip ci] [bot]