[DevX] Commits comparison via dashboard is misleading #7986
Comments
@digantdesai @kimishpatel
A more alarming example can be found here (dashboard), where it gave users the mistaken impression that the infrastructure was completely down.
I think showing both Apple and Android perf brings more trouble than it's worth. This can be fixed easily by choosing only Apple or Android at a time. I guess we don't need to compare perf across Apple and Android here.
@huydhn Updated: after thinking about it more carefully, I do recognize there are concrete needs that require viewing both iOS and Android models on the same board.
@cbilgin @kimishpatel @digantdesai feel free to add more about how you want to utilize the dashboard for perf tracking.
What do you think about this new view to fix the issue https://torchci-git-fork-huydhn-tweak-et-dashboard-fbopensource.vercel.app/benchmark/llms?repoName=pytorch%2Fexecutorch? It's from this change pytorch/test-infra#6234 and I think it helps remove most of the confusion.
Yup, I could also add this if needed, but probably in a separate PR later.
Thinking out loud, I guess I like option (1) for "normal experiments". That way users see what they care about, assuming they know what they want to run on. For cron runs we can do (2), treat missing data as failures, and somehow filter out those runs for better cross-backend analysis?
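To make the distinction concrete, here is a minimal TypeScript sketch of the idea in (2): instead of letting an absent benchmark record default to 0, the comparison marks it as missing (for user-requested runs) or failed (for cron runs), so the dashboard can render or filter it accordingly. All type and function names here are hypothetical, not the actual torchci code.

```ts
// Hypothetical sketch, not the actual torchci implementation.
type BenchmarkRow = {
  model: string;
  backend: "android" | "apple";
  metric: string;
  value: number;
};

type ComparisonCell =
  | { kind: "value"; base: number; head: number } // both commits have data
  | { kind: "missing" }                           // job never ran on the head commit
  | { kind: "failed" };                           // cron run: treat missing as a failure

function compareCommits(
  base: BenchmarkRow[],
  head: BenchmarkRow[],
  isCronRun: boolean
): Map<string, ComparisonCell> {
  const key = (r: BenchmarkRow) => `${r.model}/${r.backend}/${r.metric}`;
  const headByKey = new Map(head.map((r) => [key(r), r] as const));
  const cells = new Map<string, ComparisonCell>();
  for (const b of base) {
    const h = headByKey.get(key(b));
    if (h !== undefined) {
      cells.set(key(b), { kind: "value", base: b.value, head: h.value });
    } else {
      // Surfacing the absence explicitly avoids rendering a fake 0,
      // which reads as a perf collapse or an infra outage.
      cells.set(key(b), isCronRun ? { kind: "failed" } : { kind: "missing" });
    }
  }
  return cells;
}
```

Runs containing "failed" cells could then be excluded from cross-backend analysis, matching the filtering idea above.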
@huydhn I don't follow what changed. I still see numbers dropping to zero. In general I agree with both of the points @guangy10 made. I have run into similar issues where I didn't understand what "0" meant. I also agree that for user-requested runs, failures should be explicitly marked as such. With respect to continuous cron jobs, I think we should have workflows that keep the dashboard as up to date as possible. Otherwise my worry is that we won't have complete data, and we could have a few commits where jobs continue to fail and we don't catch a regression (which we would have caught if the job had succeeded). I recognize this is not always possible, but
Oh, it's likely closed by mistake. The merged PR only addressed part of #7986, as stated in its summary: it adds an option to filter by platform. I think Huy mentioned there will be a separate PR to address what is described in this issue.
Yeah, pytorch/test-infra#6234 wasn't meant to close this. After chatting with @guangy10, we decided to land pytorch/test-infra#6234 first because it's a useful tweak, while the fix will be in a follow-up PR. Thanks @guangy10 for reopening this.
🐛 Describe the bug
When comparing perf between commits via the ExecuTorch dashboard, the data being displayed can be misleading.
For example, to view the perf changes for Llama-3.2 with the fb16 config between `e00eaea98f` from 1/20 and `576108322a` from 1/26 (dashboard), users will see lots of data dropping to 0, which gives the impression that those jobs failed, as shown in the attached image.
However, if we take a closer look at commit `576108322a`, we will notice that only `apple-perf` was triggered from that commit. By comparison, both `android-perf` and `apple-perf` were triggered from the base commit `e00eaea98f`. It's expected that Android benchmark jobs and Apple benchmark jobs can be scheduled independently, as they are managed by different GitHub workflows, but this is the root cause of seeing so much data drop to 0 when comparing perf between commits via the ExecuTorch dashboard. I think there are two ways to address this issue:
- Trigger both `android-perf` and `apple-perf` together. This could make it consistent when comparing commits on main; however, when comparing commits from on-demand runs, which typically run only the selected model and config, we will hit this issue again (a sketch of how those misleading 0s arise follows below).
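For illustration, here is a minimal sketch of how the misleading 0s arise, assuming the comparison view joins the two commits' results by key and defaults absent entries to 0. The names and data are hypothetical, invented for this example.

```ts
// Hypothetical illustration of the root cause, not the actual torchci code.
type Row = { model: string; backend: "android" | "apple"; value: number };

function naiveJoin(base: Row[], head: Row[]): Array<[string, number, number]> {
  const key = (r: Row) => `${r.model}/${r.backend}`;
  const headByKey = new Map(head.map((r) => [key(r), r.value] as const));
  // Defaulting an absent entry to 0 is what makes the chart misleading:
  // "job never ran" becomes indistinguishable from "perf regressed to 0".
  return base.map((r) => [key(r), r.value, headByKey.get(key(r)) ?? 0]);
}

// Example: android-perf ran on the base commit but not on the head commit.
const base: Row[] = [
  { model: "llama-3.2", backend: "android", value: 42.1 },
  { model: "llama-3.2", backend: "apple", value: 39.8 },
];
const head: Row[] = [{ model: "llama-3.2", backend: "apple", value: 40.2 }];

console.log(naiveJoin(base, head));
// [["llama-3.2/android", 42.1, 0], ["llama-3.2/apple", 39.8, 40.2]]
// The android row looks like a total regression even though no job failed.
```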
Versions

trunk