Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix duplicate counted metrics like op time for GpuCoalesceBatches #11062

Merged
merged 5 commits into from
Jun 25, 2024

Conversation

binmahone
Copy link
Collaborator

@binmahone binmahone commented Jun 14, 2024

We observed that op time metrics for GpuCoalesceBatches is often larger than expected.
After some digging I found that in some cases GpuMetrics will be duplicately counted. (a test case example is shown in the PR)

I tried to to fix the duplicate issue GpuCoalesceBatches, but I have no idea how many other duplicates are there for other operators. So I refined NvtxWithMetrics, MetricRange and GpuMetric#ns a little bit to avoid duplicate counting elapsed time.

closes #11063

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>
Signed-off-by: Hongbin Ma (Mahone) <[email protected]>
@binmahone binmahone changed the title 240614 fixing metric dup fix duplicate counted metrics like op time for GpuCoalesceBatches Jun 14, 2024
@binmahone binmahone requested review from firestarman, revans2 and HaoYang670 and removed request for firestarman and revans2 June 14, 2024 07:02
Signed-off-by: Hongbin Ma (Mahone) <[email protected]>
@binmahone
Copy link
Collaborator Author

build

@@ -27,31 +29,75 @@ object NvtxWithMetrics {
}
}

object ThreadLocalMetrics {
val addressOrdering: Ordering[GpuMetric] = Ordering.by(System.identityHashCode(_))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be simpler to have each GpuMetric track itself? This feels like we are adding in a lot of overhead with the ThreadLocal values.

sealed abstract class GpuMetric extends Serializable {
  def value: Long
  def set(v: Long): Unit
  def +=(v: Long): Unit
  def add(v: Long): Unit

  private var isTimerActive = false

  final def tryActivateTimer(): Bool = {
    if (!isTimerActive) {
      isTimerActive = true
      true
    } else {
      // output a warning if this is not NoopMetric
      false
    }
  }

  final def deactivateTimer(duration: Long): Unit = {
    if (isActive) {
      isActive = false
      add(duration)
    }
  }

  final def ns[T](f: => T): T = {
     if (tryActivateTimer()) {
      val start = System.nanoTime()
      try {
        f
      } finally {
        deactivateTimer(System.nanoTime() - start)
      }
    }
  }
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should add that this is mostly about code complexity. I don't think the performance difference will be significant.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Bobby, refined the PR accordingly. Threadlocal removed

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>
@binmahone
Copy link
Collaborator Author

build

wjxiz1992 pushed a commit to nvliyuan/yuali-spark-rapids that referenced this pull request Jun 17, 2024
* with call site print, not good because some test cases by design will dup

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

* done

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

* add file

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

* fix comiple

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

* address review comments

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

---------

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>
nvliyuan pushed a commit to nvliyuan/yuali-spark-rapids that referenced this pull request Jun 17, 2024
* with call site print, not good because some test cases by design will dup

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

* done

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

* add file

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

* fix comiple

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

* address review comments

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

---------

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>
@wjxiz1992 wjxiz1992 merged commit 7a8690f into NVIDIA:branch-24.08 Jun 25, 2024
45 checks passed
@sameerz sameerz added the bug Something isn't working label Jun 25, 2024
SurajAralihalli pushed a commit to SurajAralihalli/spark-rapids that referenced this pull request Jul 12, 2024
…IDIA#11062)

* with call site print, not good because some test cases by design will dup

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

* done

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

* add file

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

* fix comiple

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

* address review comments

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>

---------

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] op time for GpuCoalesceBatches is more than actual
5 participants