[DevX] Commits comparison via dashboard is misleading #7986

Open
guangy10 opened this issue Jan 27, 2025 · 9 comments · Fixed by pytorch/test-infra#6234
Assignees
Labels
enhancement Not as big of a feature, but technically not a bug. Should be easy to fix · module: benchmark Issues related to the benchmark infrastructure · triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@guangy10
Contributor

🐛 Describe the bug

When comparing perf between commits via the ExecuTorch dashboard, the data being displayed can be misleading.

For example, when viewing the perf changes for Llama-3.2 with the fb16 config between e00eaea98f from 1/20 and 576108322a from 1/26 (dashboard), users will see lots of data points drop to 0, which gives the impression that those jobs failed, as shown in the attached image.

Image

However, if we take a deeper look into commit 576108322a, we will notice that only apple-perf was triggered from that commit. In comparison, both android-perf and apple-perf were triggered from the base commit e00eaea98f. It's expected that Android benchmark jobs and Apple benchmark jobs can be scheduled independently, since they are managed by different GitHub workflows, but this is the root cause of seeing so much data drop to 0 when comparing perf between commits via the ExecuTorch dashboard.

I think there are two ways to address this issue:

  1. Hide the items that aren't triggered from the dashboard. Note that we do want to display the data if a config does run but fails; the dashboard should catch that as a real regression (see the sketch after this list).
  2. Make android-perf and apple-perf run in sync. This would make comparing commits on main consistent; however, when comparing commits from on-demand runs, which typically only run the selected model and config, we will hit this issue again.
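
One way to picture option 1 is a small filter over the raw records before they are rendered. This is only a sketch with a hypothetical BenchmarkRecord shape; the field names are assumptions, not the actual dashboard database schema:

```ts
// Hypothetical record shape; field names are illustrative, not the real schema.
interface BenchmarkRecord {
  commit: string;
  model: string;        // e.g. "llama-3.2"
  config: string;       // e.g. "fb16"
  value: number | null; // null when the job ran but reported nothing
}

// Option 1: only keep model+config pairs that have data on BOTH commits,
// so an untriggered workflow no longer shows up as a drop to 0.
function filterComparable(
  records: BenchmarkRecord[],
  baseCommit: string,
  newCommit: string,
): BenchmarkRecord[] {
  const key = (r: BenchmarkRecord) => `${r.model}+${r.config}`;
  const onBase = new Set(records.filter((r) => r.commit === baseCommit).map(key));
  const onNew = new Set(records.filter((r) => r.commit === newCommit).map(key));
  return records.filter((r) => onBase.has(key(r)) && onNew.has(key(r)));
}
```

With something like this in place, a model+config that was only triggered on one of the two commits would simply disappear from the comparison instead of being plotted as 0.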

Versions

trunk

@guangy10 guangy10 added enhancement and module: benchmark labels Jan 27, 2025
@guangy10
Contributor Author

Subscribe @digantdesai @kimishpatel

@manuelcandales manuelcandales added the triaged label Jan 28, 2025
@guangy10
Contributor Author

A more alarming example can be found here (dashboard), which gave users the mistaken impression that the infrastructure was completely down.

@huydhn
Contributor

huydhn commented Jan 29, 2025

I think showing both Apple and Android perf brings more trouble than it's worth. This can be fixed easily by choosing either Apple or Android at a time. I guess we don't need to compare perf across Apple and Android here

@guangy10
Contributor Author

guangy10 commented Jan 30, 2025

I think showing both Apple and Android perf brings more trouble than it's worth. This can be fixed easily by choosing either Apple or Android at a time. I guess we don't need to compare perf across Apple and Android here

@huydhn No, we don't need to compare perf across Apple and Android here, but we do have some commits where both Apple and Android are triggered. So for option 1, I think we probably don't need to know what gets run on which commit when pulling the data from the database. If the data point for a specific model+config doesn't exist on both commits, hide it from the dashboard. If it runs on both commits but only passes on one, then depending on whether the failure is on the base commit or not, display it as a legitimate new success/failure, as sketched below.
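
A minimal sketch of that display rule, assuming hypothetical ran/passed fields rather than whatever the real data exposes:

```ts
// Assumed per-commit result for one model+config; "ran"/"passed" are placeholders.
interface ConfigResult {
  ran: boolean;
  passed: boolean;
}

type DisplayDecision = "hide" | "new failure" | "new success" | "compare";

// Decide how one model+config should show up when comparing base vs. new commit.
function classify(base?: ConfigResult, next?: ConfigResult): DisplayDecision {
  // Not triggered on one (or both) commits: hide instead of rendering a 0.
  if (!base?.ran || !next?.ran) return "hide";
  // Ran on both but only passed on one: surface it as a real change.
  if (base.passed && !next.passed) return "new failure";
  if (!base.passed && next.passed) return "new success";
  // Otherwise show the normal side-by-side comparison.
  return "compare";
}
```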

Updated: After thinking about it more carefully, I do recognize there are concrete needs that require viewing both iOS and Android models on the same board.

  • For ET developers, delegates like xnnpack are supported on both platforms, so viewing the same model+config on the same board would help developers quickly identify platform-specific issues, e.g. some kernels running slow only on iOS.
  • For managers, they may want to quote/share the dashboard in a view where they can see everything by default. It will help align some work with external vendors. For example, given the same model, if the QNN (Android) delegate is running much slower than CoreML (Apple), we may want to signal it to the vendors and motivate them to improve the perf.

@cbilgin @kimishpatel @digantdesai feel free to add more about how you want to utilize the dashboard for perf tracking

@huydhn
Contributor

huydhn commented Jan 30, 2025

What do you think about this new view to fix the issue https://torchci-git-fork-huydhn-tweak-et-dashboard-fbopensource.vercel.app/benchmark/llms?repoName=pytorch%2Fexecutorch? It's from this change pytorch/test-infra#6234 and I think it helps remove most of the confusion.

If the data point for a specific model+config doesn't exist on both commits, hide it from the dashboard. If it runs on both commits but only passes on one, depending on whether it's on base or not, display it as a legitimate new success/failure.

Yup, I could also add this if needed, but probably in a separate PR later.

@digantdesai
Contributor

Thinking out loud, I guess I like option (1) for "normal experiments". That way users see what they care about, assuming users know what they want to run on.

For cron runs we can do (2), treat missing data as failures, and somehow filter out those runs for better cross-backend analysis?

@kimishpatel
Contributor

What do you think about this new view to fix the issue https://torchci-git-fork-huydhn-tweak-et-dashboard-fbopensource.vercel.app/benchmark/llms?repoName=pytorch%2Fexecutorch? It's from this change pytorch/test-infra#6234 and I think it helps remove most of the confusion.

If the data point for a specific model+config doesn't exist on both commits, hide it from the dashboard. If it runs on both commits but only passes on one, depending on whether it's on base or not, display it as a legitimate new success/failure.

Yup, I could also add this if needed, but probably in a separate PR later.

@huydhn I don't follow what changed. I still see numbers dropping to zero.

Image

In general I agree with both of the points @guangy10 made. I have run into similar issues where I didn't understand what "0" meant.

Also agree that for user-requested runs, failures should be explicitly marked as such.

With respect to continuous cron jobs, I think we should have workflows that keep the dashboard up-to-date as much as possible. Otherwise, my worry is that we won't have complete data: we could have a few commits where jobs continue to fail and we don't catch a regression (which we would have caught if the job succeeded). I recognize this is not always possible, but:

  • if we have failures on more than one or two commits, we should definitely trigger an alert (see the sketch after this list).
  • And definitely filter out such commits from the dashboard charts that managers, not just devs, may look at.
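
A minimal sketch of that alert, assuming a hypothetical CommitStatus shape and a threshold of two consecutive commits without data:

```ts
// Flag a model+config that has no data for N consecutive commits on main.
// The data shape and threshold below are assumptions for illustration only.
interface CommitStatus {
  commit: string;
  hasData: boolean; // the benchmark job ran and reported a number
}

function shouldAlert(history: CommitStatus[], threshold = 2): boolean {
  let missingStreak = 0;
  // history is assumed to be ordered from oldest to newest commit
  for (const c of history) {
    missingStreak = c.hasData ? 0 : missingStreak + 1;
    if (missingStreak >= threshold) return true;
  }
  return false;
}

// Example with hypothetical commits: two consecutive gaps trigger the alert.
console.log(
  shouldAlert([
    { commit: "aaaaaaa", hasData: true },
    { commit: "bbbbbbb", hasData: false },
    { commit: "ccccccc", hasData: false },
  ]),
); // -> true
```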

@guangy10
Contributor Author

guangy10 commented Feb 1, 2025

Oh it’s likely closed by mistake. The merged PR only addressed part of #7986 as stated in its summary, which adds an option to filter by platform. I think Huy mentioned there will be a separate PR to address what is described in this issue.

@guangy10 guangy10 reopened this Feb 1, 2025
@huydhn
Contributor

huydhn commented Feb 1, 2025

Yeah, pytorch/test-infra#6234 wasn't meant to close this. After chatting with @guangy10, we decided to land pytorch/test-infra#6234 first because it's a useful tweak, while the fix will be in a follow-up PR.

Thanks @guangy10 for reopening this.
