Add benchmark suite to CI #15696
Conversation
(force-pushed from ad7192a to 49fedd5)
(force-pushed from 49fedd5 to 4e0559a)
.github/workflows/benchmark.yml (outdated)
    sed -i 's/secret_key = ""/secret_key = "${{ secrets.PHP_VERSION_BENCHMARK_AWS_SECRET_KEY }}"/g' ./php-version-benchmarks/build/infrastructure/config/aws.tfvars
    cp ./php-version-benchmarks/config/php/master.ini.dist ./php-version-benchmarks/config/php/master1.ini
    YESTERDAY="$(date -d "-2 day 13:00" '+%Y-%m-%d')"
Intel's benchmarks years ago used a single baseline commit: https://externals.io/message/100532 They used a specific PHP 7.0 commit. I'm not sure what the best choice would be for us, but it may be useful to have a clear understanding of the performance of a PHP version since its development started (e.g. the baseline could be the first commit after the latest minor PHP version was branched).
@arnaud-lb What do you think about this possibility? Should we add a specific commit (i.e. the first commit after PHP 8.4 is branched) to the benchmark? This way, we would compare the latest commit with the baseline and yesterday's commit.
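For illustration, one way to pin down such a baseline could be the commit at which the release branch diverged from master; a hypothetical sketch (the branch names are assumptions, not code from this PR):

```sh
# Hypothetical: use the commit where PHP-8.4 was branched off master
# as the fixed baseline for comparisons.
BASELINE_SHA="$(git merge-base origin/master origin/PHP-8.4)"
```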
I've just added it, because I think it will come in handy if we also have a single fixed commit against which we can compare changes.
Great work!
I think that comparing two commits every run is a good idea, for the reasons you specified. Do you plan to also export some visualizations? It would be nice to have a visualization of the delta between the two commits, and another one with the absolute result of master. Could this be integrated with https://nielsdos.github.io/php-benchmark-visualisation/? (cc @nielsdos)
.github/workflows/benchmark.yml (outdated)
    run: |
      set -e
      cp -r "php-version-benchmarks/tmp/php_master/" "php-version-benchmarks/tmp/php_master_jit"
You may want to preserve attributes such as timestamps and permissions with `-a`. `git clone "php-version-benchmarks/tmp/php_master/" "php-version-benchmarks/tmp/php_master_jit"` would work, too.
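For comparison, the attribute-preserving variant suggested above would be a one-flag change (a sketch, not the exact workflow code):

```sh
# -a preserves timestamps, permissions, and other attributes, unlike plain -r
cp -a "php-version-benchmarks/tmp/php_master/" "php-version-benchmarks/tmp/php_master_jit"
```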
Thanks for the info, I didn't know this was possible. Even though I cannot think of a situation in which these attributes would matter for the benchmark, I'm fine with the change, as it only improves things.
.github/workflows/benchmark.yml (outdated)
    cp ./php-version-benchmarks/config/php/master.ini.dist ./php-version-benchmarks/config/php/master1.ini
    YESTERDAY="$(date -d "-2 day 13:00" '+%Y-%m-%d')"
    YESTERDAY_SHA="$(cd ./php-version-benchmarks/tmp/php_master/ && git --no-pager log --until="$YESTERDAY 23:59:59" -n 1 --pretty='%H')"
A potential issue is that if this particular commit is broken or if the job is temporarily broken, we will lose some data points. E.g. if we have this:
- commit3 // master
- commit2
- commit1 // YESTERDAY_SHA
If commit3 is broken, and commit2 has a performance regression, we might not notice it.
Could we store the commit hash of the last successful benchmark in kocsismate/php-version-benchmark-results, and use that as a base?
This would also make it less likely to benchmark a commit in the middle of a merged branch.
Interesting idea; I agree with both the concern and the suggestion.
Another related concern is that it is currently not possible to automatically detect issues with the tests. Of course, totally broken PHP versions won't be able to run any tests, so I guess the benchmark will fail. However, if there are only smaller issues (e.g. warnings, notices, or other kinds of errors), then different code may be executed than otherwise, making the results unreliable again.
Fortunately, I've already added support for manual examination of the test outputs, because I print the response of the first request to stdout, so one can have a look at it (I usually do a brief check). This should be extended with automatic checks for error messages, and if any is found, the benchmark should be stopped.
(tangentially related: I can only hope that we will be able to use the benchmark during the development phase of PHP 9.0)
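A minimal sketch of such an automatic check, assuming the first response is captured in a file (the file name and the error patterns are assumptions):

```sh
# Abort the benchmark as soon as the captured response contains PHP error output.
if grep -Eq 'Fatal error:|Warning:|Notice:|Deprecated:' response.txt; then
    echo "PHP error detected in the test output, aborting benchmark" >&2
    exit 1
fi
```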
The benchmark now verifies the test outputs: kocsismate/php-version-benchmarks@142e907 So the only missing part is to parse the commit hash of the last successful build (which can be retrieved from the database.tsv file).
(side note: I guess I'll also have to make the database store the results per year too, just in case)
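A minimal sketch of how that base commit could be parsed, assuming database.tsv stores one run per row with the commit hash in the first column (the column layout is an assumption):

```sh
# Use the commit hash of the last successful run as the comparison base.
BASE_SHA="$(tail -n 1 database.tsv | cut -f 1)"
```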
The per-year database storage is also implemented: kocsismate/php-version-benchmarks@0e84b9e
.github/workflows/benchmark.yml (outdated)
echo "Merging, can't proceed" | ||
exit 1 | ||
fi | ||
git add . |
This is creating one directory in the repository root every day. I'm not sure, but could this become an issue after some time due to the number of entries in the root?
Interesting observation! I agree; this could become a problem in the future. Do you think it's enough to put the results into a single "year" directory? This way, a maximum of only ~365 files would be in a directory, assuming that we continue to run the benchmark once per day. I could also imagine a year/month/result directory structure if we really want to future-proof the layout.
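A minimal sketch of the year-based layout (the path format is illustrative, not the actual scheme):

```sh
# Store each run under its year, e.g. 2024/2024-08-30_0030/
RESULT_DIR="$(date '+%Y')/$(date '+%Y-%m-%d_%H%M')"
mkdir -p "$RESULT_DIR"
```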
Yes, this seems reasonable and probably enough.
Implemented the suggestion: kocsismate/php-version-benchmarks@cc0090b
@arnaud-lb I can integrate it into the visualisation website once merged. I'm a bit busy with real life at the moment though, so it may take some time.
I added support for gathering the instruction count via Valgrind (kocsismate/php-version-benchmarks@f55c32d), similarly to how Ilija's benchmark does it, but the results indeed didn't correlate with the exact wall-time performance :( So I disabled this metric by default (kocsismate/php-version-benchmarks@3f31401).
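For reference, a minimal sketch of how an instruction count can be gathered with Valgrind (the benchmarked script name is a placeholder, not the actual harness):

```sh
# Run the workload under callgrind; the "summary:" line of the output file
# holds the total instruction count (Ir).
valgrind --tool=callgrind --callgrind-out-file=callgrind.out php bench.php
grep '^summary:' callgrind.out
```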
I ran two benchmarks after my review fixes: php/real-time-benchmark-data@cbf7425?short_path=44a0fb3#diff-44a0fb3a1464ed3a62f884a3b2ad83ca08f7c97f3c37e7e3f497d4cf1b249e4c and php/real-time-benchmark-data@bcc0143?short_path=e15a4d6#diff-e15a4d62fc9d8b702c8346c9aa07e4f827b1eea540e21e4785ea4cd77acebe78 The first time, the last commit from the day before yesterday was the baseline, because the database was missing. The second time, the previously benchmarked commit was compared to Niels' latest commit :)
(force-pushed from a8819e8 to a996c8d)
The commit is now the latest one. This should be changed later to the one after PHP 8.4 is branched.
(force-pushed from a996c8d to 23fcb24)
@iluuu1994 Can I ask you to have a look at this PR?
Unless there's any concern, I plan to merge this after branching has happened.
@kocsismate Sorry, I didn't have much time. You should definitely move the data repo to the php org before merging. I don't have much else to say; this is completely isolated and shouldn't cause issues for anything else.
I would also be happy if we could somehow distinguish the terminology for the two benchmarks. I don't mind renaming the existing benchmark, either.
For what it's worth, until a couple of years ago the PHP on Windows QA team ran performance tests for all releases (QA and GA) in a controlled environment. Unfortunately, that was shut down, so it's nice to have similar performance testing available again (and for PHP, Linux is more relevant than Windows anyway).
Yeah, I also tried to come up with some sensible naming, but I couldn't decide. Do you think using "Valgrind benchmark" for yours and "performance benchmark" for mine would make sense?
"Valgrind" or "instruction count" for the existing one, and "real-time" for the new one, would make the most sense to me.
I'd go with "valgrind" because it's much shorter than "instruction count", but I'd say it's your call to choose the name. Personally, I'm fine with "real-time" for the new benchmark.
Maybe call it iCount (unless that is already a trademark). ;)
Not that I don't appreciate the humor, but I think we should stick with something more standard 😄 If the intention is to shorten it, many tools just call it IC.
That was tongue-in-cheek. I don't see why "instruction count" would be too long, anyway.
"Instruction count benchmark" sounds a bit long, while "Valgrind benchmark" is easier to say, at least for me. But I really don't mind which name it gets.
@iluuu1994 I don't have enough permissions to transfer it. Could you please grant me the privilege to create repositories in the php org?
This PR integrates https://github.com/kocsismate/php-version-benchmarks/ into the CI as a nightly job running every day at 12:30 AM. Roughly, the following happens: the benchmark suite spins up an AWS EC2 instance via Terraform, runs the tests according to the configuration, and then commits the results to the https://github.com/kocsismate/php-version-benchmark-results repository.
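A hedged sketch of that flow (the script and directory names are illustrative, not the actual commands of the workflow):

```sh
# Provision the benchmark machine, run the suite, publish the results, tear down.
terraform apply -auto-approve          # spin up the tuned EC2 instance
./run-benchmarks.sh                    # execute the tests per the configuration
git -C php-version-benchmark-results add . \
    && git -C php-version-benchmark-results commit -m "Results $(date '+%Y-%m-%d')" \
    && git -C php-version-benchmark-results push
terraform destroy -auto-approve        # clean up the instance
```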
In order to have results that are as stable as possible, the CPU, kernel, and other settings of the AWS instance are fine-tuned. Customizing the CPU is only supported by metal instances among recent instance types according to https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/processor_state_control.html, so in the end, a c7i.metal-24xl instance is used in the eu-west-1 region.
The benchmark suite compares the performance of the latest commit of the master branch at the time the benchmark runs with the last commit of master from the day before yesterday. I.e. if the benchmark runs tomorrow morning at 12:30 AM, then the performance of the latest commit will be benchmarked against the last commit pushed yesterday. This makes it possible to spot outstanding regressions (or progressions) in time; the end goal is to send notifications in case of any significant changes, for further analysis. The reason why the benchmark is also run for previous commits (even though they may have already been measured the day before) is to make the results less sensitive to changes in the environment or the benchmark suite itself: if AWS upgrades the OS, or if the code under test is modified, then the numbers will likely be affected, and the previous results would be invalidated.
According to the first two test runs on a metal instance, the following results were measured (without using the dedicated instance feature for now):
As can be seen, the median results of the two runs differ by less than 0.001 seconds for most tests. This is very promising, as it suggests that the environment is stable and the results are precise. To get a better understanding of the actual accuracy of the results, I ran the benchmark comparing the very same commits to each other, in the hope that the performance diff would be very close to 0. Here are the results: indeed, the real-world tests show a maximum relative difference of 0.08%, while the synthetic ones vary by at most 0.35%.