Benchmark utility to perform diff of output from benchmark runs, allowing for precision differences #782
Conversation
integration_tests/src/main/scala/com/nvidia/spark/rapids/tests/common/BenchUtils.scala
```diff
@@ -0,0 +1,48 @@
+package com.nvidia.spark.rapids.tests.common
```
copyright and license header are missing. A build should have flagged this.
```scala
def compareResults(df1: DataFrame, df2: DataFrame, epsilon: Double = 0.00001): Unit = {
  val result1: Seq[Seq[Any]] = collectAndSort(df1)
  val result2: Seq[Seq[Any]] = collectAndSort(df2)
```
What if ordering of the results is a significant part of the query output? There are cases where we need to preserve order when comparing. For example, if I run a query that collects the top 100 rows, getting the same 100 elements in the wrong order would declare victory when it should not.

Minimally this needs to account for unordered vs. ordered compares.
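To make the concern concrete, a tiny hypothetical case:

```scala
// Hypothetical: an ORDER BY ... LIMIT 3 query whose answer depends on order.
val expected = Seq(Seq("c", 3), Seq("b", 2), Seq("a", 1)) // correct: descending
val actual   = Seq(Seq("a", 1), Seq("b", 2), Seq("c", 3)) // wrong order, same rows
// A compare that sorts both sides before checking would report these as equal,
// even though the query's ordering requirement was violated.
```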
```scala
  }
}

private def collectResults(df: DataFrame, ignoreOrdering: Boolean): Seq[Seq[Any]] = {
```
considering this is a benchmarking utility, should we also provide an iterator in case the collected results are too big?
Spark's `DataFrame.collect()` loads the full result set into the driver and returns an `Array`. If (when) we need to handle comparisons of larger results, I think we would need an alternate approach, like converting the results to single files, downloading them, and using Scala code (without Spark) to perform the diff.
It is a relatively minor change to be able to do this with an iterator; we have that support in the Python unit tests. The only issue is that you cannot compare the size of the results ahead of time, because you don't know it yet. But I agree that until we hit a situation where we need it there is no point in doing it. Also, it is very slow to try to do a comparison like that single threaded. It might be better to truncate floating-point values and do a few anti-joins to see if there is something in the left that is not in the right, and vice versa. This might not handle duplicate rows, so we might need something there too, but it would scale a lot better.
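A rough sketch of that anti-join idea (illustrative only; this is not what the PR implements, and as noted above it would not catch differing duplicate-row counts):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, round}
import org.apache.spark.sql.types.{DoubleType, FloatType}

// Round floating-point columns so tiny precision differences don't show up
// as mismatches, then anti-join in both directions to find unmatched rows.
def antiJoinDiff(df1: DataFrame, df2: DataFrame, scale: Int = 5): (Long, Long) = {
  def truncated(df: DataFrame): DataFrame =
    df.schema.fields.foldLeft(df) { (d, f) =>
      f.dataType match {
        case DoubleType | FloatType => d.withColumn(f.name, round(col(f.name), scale))
        case _ => d
      }
    }
  val (a, b) = (truncated(df1), truncated(df2))
  val onlyInLeft  = a.join(b, a.columns.toSeq, "left_anti").count()
  val onlyInRight = b.join(a, b.columns.toSeq, "left_anti").count()
  (onlyInLeft, onlyInRight) // both zero => same distinct rows (modulo duplicates)
}
```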
I'd like to know more about the iterator approach. I did look at the `toLocalIterator` method, but this seems to load one partition at a time, so I don't think that would work when we need ordering across the result set. Could you point me to an example where we do this in the Python tests?
It loads one partition of the result at a time, so it does preserve ordering. It is just rather slow because we are doing the comparison single threaded.
Check out `mortgage_test.py`.
I added the option to use an iterator and have been testing this out and it seems to work well.
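A sketch of what the iterator-based path might look like (illustrative, not the PR's exact code; it assumes both DataFrames are already sorted consistently):

```scala
import org.apache.spark.sql.DataFrame
import scala.collection.JavaConverters._

// toLocalIterator() fetches one partition at a time, so the driver never
// holds the whole result set; the comparison itself runs single threaded.
def iteratorsMatch(df1: DataFrame, df2: DataFrame): Boolean = {
  val it1 = df1.toLocalIterator().asScala
  val it2 = df2.toLocalIterator().asScala
  while (it1.hasNext && it2.hasNext) {
    if (it1.next() != it2.next()) {
      return false
    }
  }
  !it1.hasNext && !it2.hasNext // also require equal row counts
}
```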
```scala
val t1 = System.currentTimeMillis()
val rows = if (ignoreOrdering) {
  // let Spark do the sorting
  df.sort(df.columns.map(col): _*).collect()
```
The downside to this is if the sort itself is broken. Is this forcing a CPU sort, for instance?
The only known issue with GPU sort vs CPU sort is #84, which I don't think we will ever run into outside of artificially generated data sets, which is why we have not pushed to fix it.
> the downside to this is if sort is broken, is this forcing cpu sort for instance?

The job config is outside of the tool. I think that to verify the results we should let standard CPU Spark do the sort; nothing stops you from running this with the plugin, so it's up to the user at this point.
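For instance (a sketch, assuming the RAPIDS plugin is on the classpath), the verification step could be pinned to CPU Spark from the same session; `spark.rapids.sql.enabled` is the plugin's runtime on/off switch:

```scala
// Disable the plugin's SQL acceleration so the comparison's sort runs on
// standard CPU Spark, then re-enable it afterwards.
spark.conf.set("spark.rapids.sql.enabled", "false")
// ... run the comparison here ...
spark.conf.set("spark.rapids.sql.enabled", "true")
```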
I guess we have other tests for correctness, so ignore my comment, and hopefully the GPU sort will be faster.

If this is useful by itself it might be nice to document it in the README; otherwise it can wait until more stuff is in place.
```scala
 * precision.
 *
 * This is only suitable for data sets that can fit in the driver's memory.
```
Could we also add a note about how this can add a sort and does a collect? For benchmark results to be accurate in isolation, it is probably best to run the benchmark with the output going to a file and have this compare the results of that output after timing has stopped, or to use this to compare the results of a single run after timings have been collected for the other runs.
I agree. I filed #794 for writing benchmark documentation.
build
build
build
I merged this too quickly, sorry. If someone had pending concerns please let me know; I can revert or help fix issues.
Benchmark utility to perform diff of output from benchmark runs, allowing for precision differences (NVIDIA#782)

* Benchmark automation POC
* Default to collecting all results
* address feedback
* make results action configurable, allowing for results to be written to parquet/csv
* fix typo in javadoc
* remove unused imports
* Remove cold/hot run loops since they were causing confusion
* gc between runs
* Make gc optional
* update test
* Provide specific benchmark methods for collect vs write to CSV or Parquet
* provide convenience methods to run benchmarks and store action in json
* Add utility method to perform a diff of the data collected from two DataFrames
* Add missing license header and remove unused import
* Provide option to ignore ordering of results
* revert change to compare method
* remove rmm_log.txt
* optimize sorting
* let spark sort the data before collecting it
* fix message
* Improved documentation
* bug fix and optimization to non-iterator case
* fix typo in javadoc
* fail fast if row counts do not match
* remove redundant logic
* scalastyle
This PR adds a convenience method to `BenchUtils` for comparing the results from two benchmark runs (e.g. verifying that GPU and CPU results match). This is a simple implementation that pulls results down to the driver for comparison, so it will only work for data sets that can fit in the driver's memory, but this is likely adequate for most of the benchmarks we are currently running. For example, TPCx-BB q5 at 10TB scale produces ~600MB of output.
If data needs sorting before comparison, this is delegated to Spark before collecting the results.
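A minimal sketch of the kind of epsilon-tolerant cell comparison the title refers to (illustrative names; not necessarily the PR's exact code):

```scala
// Illustrative only: compare two cells, tolerating small floating-point
// differences and falling back to exact equality for other types.
def cellsMatch(a: Any, b: Any, epsilon: Double = 0.00001): Boolean = (a, b) match {
  case (x: Double, y: Double) => math.abs(x - y) <= epsilon
  case (x: Float, y: Float)   => math.abs(x.toDouble - y.toDouble) <= epsilon
  case _                      => a == b
}
```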
Example usage from spark-shell:
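(The PR's original example is not included in this excerpt; the following is a hypothetical sketch, assuming the integration-tests jar is on the spark-shell classpath and that result files for the two runs exist at the placeholder paths.)

```scala
// Hypothetical session; the paths and epsilon value are placeholders.
import com.nvidia.spark.rapids.tests.common.BenchUtils

val cpu = spark.read.parquet("/tmp/tpcxbb-q5-cpu")
val gpu = spark.read.parquet("/tmp/tpcxbb-q5-gpu")

// Compare the two result sets, tolerating small floating-point differences
BenchUtils.compareResults(cpu, gpu, epsilon = 0.00001)
```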