Increases benchmark warm-up and decreases the threshold value #268
Conversation
A threshold of 0.5 still allows a 50% performance regression, right? That's still not going to be low enough for our needs. How did you select 150 for warm-ups/iterations? Are you seeing diminishing returns in terms of reduced variance as you increase the warm-ups/iterations, or does the regression detection workflow just start taking too long?
If 0.5 doesn't satisfy our needs, we need to consider other improvements as well. There is an error when running 200+ warm-ups - actions/runner-images#6680. The GitHub runner only allocates fixed resources (e.g., memory) to jobs, and running more than 200 times makes the job cancel automatically. Additionally, the relative-difference results are usually < 0.3, but sometimes one sample file (real_worlds_data_1) fails with a 0.5+ perf regression (and if the workflow fails, it's always this file). One possible solution to enable more iterations and warm-ups is using a larger runner - https://docs.github.com/en/actions/using-github-hosted-runners/using-larger-runners
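For illustration, here is a minimal sketch of the kind of relative-difference check being discussed. The baseline/candidate numbers and the helper name are placeholders, and the 0.2 threshold is simply the value this PR eventually settles on; this is not the regression-detection workflow's actual code.

```python
# Hypothetical sketch of a relative-difference threshold check; all values are placeholders.
BASELINE_NS = 1_000_000    # mean time per operation on the target (main) commit
CANDIDATE_NS = 1_150_000   # mean time per operation on the PR commit
THRESHOLD = 0.2            # fail if the PR is more than 20% slower

def relative_difference(baseline: float, candidate: float) -> float:
    """Relative slowdown of the candidate vs. the baseline (positive = regression)."""
    return (candidate - baseline) / baseline

diff = relative_difference(BASELINE_NS, CANDIDATE_NS)
if diff > THRESHOLD:
    raise SystemExit(f"Performance regression detected: {diff:.2%} > {THRESHOLD:.0%}")
print(f"OK: relative difference {diff:.2%} is within the {THRESHOLD:.0%} threshold")
```

Increasing warm-ups/iterations only reduces the variance of the two measured means; the check itself stays this simple.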
Found a small memory leak in each
Confirmed that
Fixed the memory leak for read and it worked - 9858b44, but it still failed for large execution counts. Investigating.
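As an aside for anyone reproducing this kind of diagnosis: a generic way to confirm per-iteration memory growth is to compare tracemalloc snapshots across repeated runs. This is not the project's actual tooling, and `run_benchmark_once` below is a hypothetical stand-in for whatever the benchmark executes each iteration.

```python
import tracemalloc

def run_benchmark_once():
    # Hypothetical stand-in for a single benchmark iteration
    # (e.g., reading one of the sample Ion files).
    return sum(range(10_000))

tracemalloc.start()
before = tracemalloc.take_snapshot()
for _ in range(200):
    run_benchmark_once()
after = tracemalloc.take_snapshot()

# Call sites whose retained memory grew the most across 200 iterations;
# steady growth here usually points at the leak.
for stat in after.compare_to(before, "lineno")[:5]:
    print(stat)
```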
@@ -12,7 +12,7 @@ jobs:
     name: Detect Regression
     needs: PR-Content-Check
     if: ${{ needs.PR-Content-Check.outputs.result == 'pass' }}
-    runs-on: ubuntu-latest
+    runs-on: macos-latest # ubuntu-latest
In general, we should only use macOS runners if we need to test specifically that something works correctly on macOS or we're distributing a macOS-specific binary.
What's the rationale for running on macOS here?
I'm trying to run more warm-ups and iterations for the benchmark-cli command in order to generate a consistent threshold value that helps us identify performance regressions. However, a command with 150+ executions hits the resource limits of the GHA Linux runner.
Which resource limit? Is it causing the job to time out?
One of the workflows runs into this issue - actions/runner-images#6680
I found another issue that causes the workflow to generate an incorrect threshold value. After fixing it, the threshold value seems more stable. Ideally, though, we will eventually benchmark performance on all popular platforms.
I changed the target commit to the main branch so we don't have to create a new PR for that. In addition, I
This PR
(1) Addressed a memory leak issue - link
(2) Experimented with the GHA runners' resource limits and switched to macOS for 1k iterations and warm-ups - link
(3) Added a method for Ion binary/text conversion - link (see the conversion sketch right after this list)
(4) Fixed a comparison result mismatch issue by sorting the combination list - link (see the sorting sketch at the end of this description)
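As a hedged illustration of the binary/text conversion in item (3), here is a round trip using ion-python's amazon.ion.simpleion module. This may not be the exact method the PR adds; the sample value is made up.

```python
from amazon.ion import simpleion

# A made-up Ion text value; parse it, re-encode as binary Ion, and read it back.
text_ion = '{ name: "example", values: [1, 2, 3] }'

value = simpleion.loads(text_ion)
binary_ion = simpleion.dumps(value, binary=True)   # bytes, starting with the Ion BVM (E0 01 00 EA)
round_tripped = simpleion.loads(binary_ion)

print(simpleion.dumps(round_tripped, binary=False))  # back to Ion text
```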
The threshold is set to 0.2 for now; we will work on improving it in the future. The latest performance detection run failed due to variance.
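And to illustrate the fix in item (4): if parameter combinations are generated in a non-deterministic order, the baseline and candidate results can get paired up incorrectly, so sorting both combination lists makes the pairing stable. The parameter names below are illustrative, not the benchmark's real options.

```python
from itertools import product

# Illustrative parameter axes; the real benchmark options will differ.
formats = {"ion_binary", "ion_text"}   # set iteration order is not guaranteed
apis = {"load_dump", "streaming"}

# Sorting both combination lists gives an identical ordering on both sides, so
# index i in the baseline results always matches index i in the candidate results.
baseline_combos = sorted(product(formats, apis))
candidate_combos = sorted(product(formats, apis))

for base, cand in zip(baseline_combos, candidate_combos):
    assert base == cand
```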
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.