fix duplicate counted metrics like op time for GpuCoalesceBatches #11062

binmahone · 2024-06-14T06:48:25Z

We observed that the op time metric for GpuCoalesceBatches is often larger than expected.
After some digging, I found that in some cases a GpuMetric is counted more than once. (An example test case is included in the PR.)

I tried to fix the duplicate counting in GpuCoalesceBatches, but I have no idea how many other duplicates exist in other operators. So I refined NvtxWithMetrics, MetricRange and GpuMetric#ns a little to avoid counting elapsed time more than once.
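To make the failure mode concrete, here is a minimal sketch of how nested timing ranges on the same metric can inflate the reported time. `NaiveMetric`, `DoubleCountDemo` and the fake clock are hypothetical names invented for this illustration; they are not the plugin's classes, and the real code path in GpuCoalesceBatches may differ.

```scala
// Hypothetical stand-in for a timing metric; NOT the plugin's GpuMetric.
final class NaiveMetric {
  private var total = 0L
  def value: Long = total
  def add(v: Long): Unit = total += v

  // Times f with the supplied clock and always adds the elapsed time.
  def ns[T](clock: () => Long)(f: => T): T = {
    val start = clock()
    try f finally add(clock() - start)
  }
}

object DoubleCountDemo {
  def run(): Long = {
    // Fake clock (an assumption for determinism): every read advances time by 10.
    var now = 0L
    val clock = () => { now += 10; now }
    val opTime = new NaiveMetric
    opTime.ns(clock) {     // outer range, e.g. around an operator's iterator
      opTime.ns(clock) {   // inner range on the same metric, e.g. in a helper
        ()
      }
    }
    opTime.value
  }
}
```

With this fake clock the outer range spans 30 units, but `DoubleCountDemo.run()` returns 40 because the inner 10 units are added a second time.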

closes #11063


Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

binmahone · 2024-06-14T07:04:29Z

build

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuExec.scala

revans2 · 2024-06-14T13:42:02Z

sql-plugin/src/main/scala/com/nvidia/spark/rapids/NvtxWithMetrics.scala

@@ -27,31 +29,75 @@ object NvtxWithMetrics {
  }
 }

+object ThreadLocalMetrics {
+  val addressOrdering: Ordering[GpuMetric] = Ordering.by(System.identityHashCode(_))


Would it be simpler to have each GpuMetric track itself? This feels like we are adding in a lot of overhead with the ThreadLocal values.

sealed abstract class GpuMetric extends Serializable {
  def value: Long
  def set(v: Long): Unit
  def +=(v: Long): Unit
  def add(v: Long): Unit

  private var isTimerActive = false

  final def tryActivateTimer(): Boolean = {
    if (!isTimerActive) {
      isTimerActive = true
      true
    } else {
      // output a warning if this is not NoopMetric
      false
    }
  }

  final def deactivateTimer(duration: Long): Unit = {
    if (isTimerActive) {
      isTimerActive = false
      add(duration)
    }
  }

  final def ns[T](f: => T): T = {
    if (tryActivateTimer()) {
      val start = System.nanoTime()
      try {
        f
      } finally {
        deactivateTimer(System.nanoTime() - start)
      }
    } else {
      // the timer is already active for this metric: run f without timing it again
      f
    }
  }
}

I should add that this is mostly about code complexity. I don't think the performance difference will be significant.
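For anyone who wants to try the self-tracking idea in isolation, here is a small runnable sketch. `GuardedMetric`, `GuardedDemo` and the injected fake clock are assumptions made for this example only and are not part of the plugin. The sketch also assumes a metric instance is timed from one thread at a time, since the flag is a plain `var`.

```scala
// Hypothetical metric that ignores re-entrant timing on the same instance.
// The constructor-injected clock is an assumption to keep the demo deterministic.
final class GuardedMetric(clock: () => Long) {
  private var total = 0L
  private var timerActive = false
  def value: Long = total

  def ns[T](f: => T): T = {
    if (timerActive) {
      f // an enclosing range is already timing this metric: do not time again
    } else {
      timerActive = true
      val start = clock()
      try f finally {
        timerActive = false
        total += clock() - start
      }
    }
  }
}

object GuardedDemo {
  def run(): Long = {
    // Fake clock: every read advances time by 10.
    var now = 0L
    val clock = () => { now += 10; now }
    val opTime = new GuardedMetric(clock)
    opTime.ns { opTime.ns { () } } // nested ranges on the same metric
    opTime.value
  }
}
```

Here only the outermost range reads the clock, so `GuardedDemo.run()` returns 10, the elapsed time of the outer range, and the nested call contributes nothing extra.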

Hi Bobby, I refined the PR accordingly. The ThreadLocal has been removed.

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

binmahone · 2024-06-17T02:30:11Z

build

* with call site print, not good because some test cases by design will dup
  Signed-off-by: Hongbin Ma (Mahone) <[email protected]>
* done
  Signed-off-by: Hongbin Ma (Mahone) <[email protected]>
* add file
  Signed-off-by: Hongbin Ma (Mahone) <[email protected]>
* fix comiple
  Signed-off-by: Hongbin Ma (Mahone) <[email protected]>
* address review comments
  Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

---------
Signed-off-by: Hongbin Ma (Mahone) <[email protected]>


binmahone added 3 commits June 14, 2024 14:14

with call site print, not good because some test cases by design will dup

dc2004e

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

done

0ba8e9b

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

add file

8b3326e

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

binmahone changed the title ~~240614 fixing metric dup~~ fix duplicate counted metrics like op time for GpuCoalesceBatches Jun 14, 2024

binmahone requested review from firestarman, revans2 and HaoYang670 and removed request for firestarman and revans2 June 14, 2024 07:02

fix comiple

441f9c0

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

firestarman reviewed Jun 14, 2024


sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuExec.scala (outdated)

revans2 reviewed Jun 14, 2024


address review comments

8be7ddf

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

This was referenced Jun 17, 2024

#11062 for liyuan branch #11071

Closed

#11062 for nvliyuan:0612-base-local nvliyuan/yuali-spark-rapids#16

Merged

revans2 approved these changes Jun 24, 2024


wjxiz1992 merged commit 7a8690f into NVIDIA:branch-24.08 Jun 25, 2024
45 checks passed

sameerz added the bug Something isn't working label Jun 25, 2024
