[DevX] Commits comparison via dashboard is misleading #7986

Open
guangy10 opened this issue Jan 27, 2025 · 9 comments · Fixed by pytorch/test-infra#6234
Assignees
Labels
enhancement Not as big of a feature, but technically not a bug. Should be easy to fix · module: benchmark Issues related to the benchmark infrastructure · triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@guangy10
Contributor

🐛 Describe the bug

When comparing perf between commits via the ExecuTorch dashboard, the data being displayed can be misleading.

For example, when viewing the perf changes for Llama-3.2 with the fb16 config between e00eaea98f from 1/20 and 576108322a from 1/26 (dashboard), users will see lots of data points drop to 0, which gives the impression that those jobs failed, as shown in the attached image.

Image

However, if we take a deeper look into commit 576108322a, we will notice that only apple-perf was triggered from that commit. In comparison, both android-perf and apple-perf were triggered from the base commit e00eaea98f. It's expected that Android benchmark jobs and Apple benchmark jobs can be scheduled independently, since they are managed by different GitHub workflows, but this is the root cause of seeing so much data drop to 0 when comparing perf between commits via the ExecuTorch dashboard.

I think there are two ways to address this issue:

  1. Hide the items that aren't triggered from the dashboard. Note that we do want to display the data if a config does run but fails; the dashboard should catch that as a real regression (see the sketch after this list).
  2. Make android-perf and apple-perf run in sync. This would make comparing commits on main consistent; however, when comparing commits from on-demand runs, which typically only run the selected model and config, we will hit this issue again.
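
One way to picture option 1 is a small filter over the raw records before they are rendered. This is only a sketch with a hypothetical BenchmarkRecord shape; the field names are assumptions, not the actual dashboard database schema:

```ts
// Hypothetical record shape; field names are illustrative, not the real schema.
interface BenchmarkRecord {
  commit: string;
  model: string;        // e.g. "llama-3.2"
  config: string;       // e.g. "fb16"
  value: number | null; // null when the job ran but reported nothing
}

// Option 1: only keep model+config pairs that have data on BOTH commits,
// so an untriggered workflow no longer shows up as a drop to 0.
function filterComparable(
  records: BenchmarkRecord[],
  baseCommit: string,
  newCommit: string,
): BenchmarkRecord[] {
  const key = (r: BenchmarkRecord) => `${r.model}+${r.config}`;
  const onBase = new Set(records.filter((r) => r.commit === baseCommit).map(key));
  const onNew = new Set(records.filter((r) => r.commit === newCommit).map(key));
  return records.filter((r) => onBase.has(key(r)) && onNew.has(key(r)));
}
```

With something like this in place, a model+config that was only triggered on one of the two commits would simply disappear from the comparison instead of being plotted as 0.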

Versions

trunk

@guangy10 guangy10 added enhancement and module: benchmark labels Jan 27, 2025
@guangy10
Contributor Author

Subscribe @digantdesai @kimishpatel

@manuelcandales manuelcandales added the triaged label Jan 28, 2025
@guangy10
Contributor Author

A more alarming example can be found here (dashboard), which gave users the mistaken impression that the infrastructure was completely down.

@huydhn
Contributor

huydhn commented Jan 29, 2025

I think showing both Apple and Android perf brings more trouble than it's worth. This can be fixed easily by choosing either Apple or Android at a time. I guess we don't need to compare perf across Apple and Android here

@guangy10
Contributor Author

guangy10 commented Jan 30, 2025

I think showing both Apple and Android perf brings more trouble than it's worth. This can be fixed easily by choosing either Apple or Android at a time. I guess we don't need to compare perf across Apple and Android here

@huydhn No, we don't need to compare perf across Apple and Android here, but we do have some commits where both Apple and Android are triggered. So for option 1, I think we probably don't need to know what gets run on which commit when pulling the data from the database. If the data point for a specific model+config doesn't exist on both commits, hide it from the dashboard. If it runs on both commits but only passes on one, then depending on whether the failure is on the base commit or not, display it as a legitimate new success/failure, as sketched below.
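
A minimal sketch of that display rule, assuming hypothetical ran/passed fields rather than whatever the real data exposes:

```ts
// Assumed per-commit result for one model+config; "ran"/"passed" are placeholders.
interface ConfigResult {
  ran: boolean;
  passed: boolean;
}

type DisplayDecision = "hide" | "new failure" | "new success" | "compare";

// Decide how one model+config should show up when comparing base vs. new commit.
function classify(base?: ConfigResult, next?: ConfigResult): DisplayDecision {
  // Not triggered on one (or both) commits: hide instead of rendering a 0.
  if (!base?.ran || !next?.ran) return "hide";
  // Ran on both but only passed on one: surface it as a real change.
  if (base.passed && !next.passed) return "new failure";
  if (!base.passed && next.passed) return "new success";
  // Otherwise show the normal side-by-side comparison.
  return "compare";
}
```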

Updated: After thinking about it more carefully, I do recognize there are concrete needs that require viewing both iOS and Android models on the same board.

  • For ET developers, delegates like xnnpack are supported on both platforms, so viewing the same model+config on the same board would help developers quickly identify platform-specific issues, e.g. some kernels running slow only on iOS.
  • For managers, they may want to quote/share the dashboard in a view where they can see everything by default. It will help align some work with external vendors. For example, given the same model, if the QNN (Android) delegate is running much slower than CoreML (Apple), we may want to signal it to the vendors and motivate them to improve the perf.

@cbilgin @kimishpatel @digantdesai feel free to add more about how you want to utilize the dashboard for perf tracking

@huydhn
Contributor

huydhn commented Jan 30, 2025

What do you think about this new view to fix the issue https://torchci-git-fork-huydhn-tweak-et-dashboard-fbopensource.vercel.app/benchmark/llms?repoName=pytorch%2Fexecutorch? It's from this change pytorch/test-infra#6234 and I think it helps remove most of the confusion.

If the data point for a specific model+config doesn't exist on both commits, hide it from the dashboard. If it runs on both commits but only passes on one, depending on whether it's on base or not, display it as a legitimate new success/failure.

Yup, I could also add this if needed, but probably in a separate PR later.

@digantdesai
Contributor

Thinking out loud, I guess I like option (1) for "normal experiments". That way users see what they care about, assuming users know what they want to run on.

For cron runs we can do (2), treat missing data as failures, and somehow filter out those runs for better cross-backend analysis?

@kimishpatel
Contributor

What do you think about this new view to fix the issue https://torchci-git-fork-huydhn-tweak-et-dashboard-fbopensource.vercel.app/benchmark/llms?repoName=pytorch%2Fexecutorch? It's from this change pytorch/test-infra#6234 and I think it helps remove most of the confusion.

If the data point for a specific model+config doesn't exist on both commits, hide it from the dashboard. If it runs on both commits but only passes on one, depending on whether it's on base or not, display it as a legitimate new success/failure.

Yup, I could also add this if needed, but probably in a separate PR later.

@huydhn I don't follow what changed. I still see numbers dropping to zero.

Image

In general I agree with both of the points @guangy10 made. I have run into similar issues where I didn't understand what "0" meant.

Also agree that for user-requested runs, failures should be explicitly marked as such.

With respect to continuous cron jobs, I think we should have workflows that keep the dashboard up-to-date as much as possible. Otherwise, my worry is that we won't have complete data: we could have a few commits where jobs continue to fail and we don't catch a regression (which we would have caught if the job succeeded). I recognize this is not always possible, but:

  • if we have failures on more than one or two commits, we should definitely trigger an alert (see the sketch after this list).
  • And definitely filter out such commits from the dashboard charts that managers, not just devs, may look at.
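
A minimal sketch of that alert, assuming a hypothetical CommitStatus shape and a threshold of two consecutive commits without data:

```ts
// Flag a model+config that has no data for N consecutive commits on main.
// The data shape and threshold below are assumptions for illustration only.
interface CommitStatus {
  commit: string;
  hasData: boolean; // the benchmark job ran and reported a number
}

function shouldAlert(history: CommitStatus[], threshold = 2): boolean {
  let missingStreak = 0;
  // history is assumed to be ordered from oldest to newest commit
  for (const c of history) {
    missingStreak = c.hasData ? 0 : missingStreak + 1;
    if (missingStreak >= threshold) return true;
  }
  return false;
}

// Example with hypothetical commits: two consecutive gaps trigger the alert.
console.log(
  shouldAlert([
    { commit: "aaaaaaa", hasData: true },
    { commit: "bbbbbbb", hasData: false },
    { commit: "ccccccc", hasData: false },
  ]),
); // -> true
```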

@guangy10
Contributor Author

guangy10 commented Feb 1, 2025

Oh it’s likely closed by mistake. The merged PR only addressed part of #7986 as stated in its summary, which adds an option to filter by platform. I think Huy mentioned there will be a separate PR to address what is described in this issue.

@guangy10 guangy10 reopened this Feb 1, 2025
@huydhn
Contributor

huydhn commented Feb 1, 2025

Yeah, pytorch/test-infra#6234 wasn't meant to close this. After chatting with @guangy10, we decided to land pytorch/test-infra#6234 first because it's a useful tweak, while the fix will be in a follow-up PR.

Thanks @guangy10 for reopening this.
