Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add benchmark suite to CI #15696

Merged
merged 5 commits into from
Sep 25, 2024
Merged

Add benchmark suite to CI #15696

merged 5 commits into from
Sep 25, 2024

Conversation

kocsismate
Copy link
Member

@kocsismate kocsismate commented Sep 1, 2024

This PR integrates https://github.com/kocsismate/php-version-benchmarks/ into the CI as a nightly job running every day at 12:30 AM. Roughly, the following happens: the benchmark suite spins up an AWS EC2 instance via Terraform, runs the tests according to the configuration, and then the results are committed to the https://github.com/kocsismate/php-version-benchmark-results repository.

In order to have as stable results as possible, the CPU, kernel and other settings of the AWS instance are fine-tuned:

Customizing the CPU is only supported by metal instances among recent instance types according to https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/processor_state_control.html, so at last, a c7i.metal-24xl instance is used in the eu-west-1 region.

The benchmark suite compares the performance of the latest commit of the master branch in the time when the benchmark runs with the last commit of master from the day before yesterday. I.e. if the benchmark runs tomorrow morning at 12:30 AM, then the performance of the latest commit will be benchmarked against the last commit pushed yesterday. This makes it possible to spot outstanding regressions (or progressions) in time. Actually, the end goal is to send notifications in case of any significant changes for further analyzation. The reason why the benchmark is run for previous commits as well (while they may have already been measured the day before) is to make the results less sensitive for changes in the environment or the benchmark suite itself. I.e.: if AWS upgrades the OS, or if the code under test is modified, then the numbers will likely be affected, and the previous results will be invalidated).

According to the first two test runs on a metal instance, the following results were measured (without using the dedicated instance feature for now):

As it can be seen, the difference between the median results of the two runs have less than 0.001 sec difference in case of most tests. This is very promising as it suggests that the environment is stable and the results are precise. To have a better understanding of the actual accuracy of the results, I ran the benchmark comparing the very same commits to each other in the hope that the performance diff would be very close to 0. Here are the results:

Indeed, real word tests show a maximum of relative difference of 0.08%, while the synthetic ones vary by a maximum of 0.35%.

@kocsismate kocsismate force-pushed the ci-benchmark branch 22 times, most recently from ad7192a to 49fedd5 Compare September 3, 2024 20:03
@kocsismate kocsismate marked this pull request as ready for review September 3, 2024 20:03
@kocsismate kocsismate requested a review from arnaud-lb September 3, 2024 20:04
@kocsismate kocsismate changed the title Add version benchmark to CI Add experimental benchmark to CI Sep 3, 2024
@kocsismate kocsismate requested a review from dstogov September 4, 2024 06:55
@kocsismate kocsismate changed the title Add experimental benchmark to CI Add benchmark suite to CI Sep 4, 2024
sed -i 's/secret_key = ""/secret_key = "${{ secrets.PHP_VERSION_BENCHMARK_AWS_SECRET_KEY }}"/g' ./php-version-benchmarks/build/infrastructure/config/aws.tfvars

cp ./php-version-benchmarks/config/php/master.ini.dist ./php-version-benchmarks/config/php/master1.ini
YESTERDAY="$(date -d "-2 day 13:00" '+%Y-%m-%d')"
Copy link
Member Author

@kocsismate kocsismate Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intel's benchmarks years ago used a single baseline commit: https://externals.io/message/100532 They used a specific PHP 7.0 commit. I'm not sure what the best would be for us, but it may be useful to have a clear understanding about the performance of a PHP version since its development started (e.g. the baseline could be the first commit after the latest minor PHP version was branched|)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arnaud-lb What do you think about this possibility? Should we add a specific commit (i.e. the first commit after PHP 8.4 is branched) to the benchmark? This way, we would compare the latest commit with the baseline and yesterday's commit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just added it, because I think it will come handy if we also have a single fixed commit against which we can compare changes.

Copy link
Member

@arnaud-lb arnaud-lb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work!

I think that comparing two commits every run is a good idea, for the reasons you specified. Do you plan to also export some visualizations? It would be nice to have a visualization of the delta between the two commits, and another one with the absolute result of master. Could this be integrated with https://nielsdos.github.io/php-benchmark-visualisation/? (cc @nielsdos)

.github/workflows/benchmark.yml Outdated Show resolved Hide resolved
run: |
set -e

cp -r "php-version-benchmarks/tmp/php_master/" "php-version-benchmarks/tmp/php_master_jit"
Copy link
Member

@arnaud-lb arnaud-lb Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may want to preserve attributes such as timestamps and permissions with -a.
git clone "php-version-benchmarks/tmp/php_master/" "php-version-benchmarks/tmp/php_master_jit" would work, too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the info, I didn't know this is possible. Even though I cannot think of a situation when these attributes would be important for the benchmark, I'm fine with the change as it only improves things.


cp ./php-version-benchmarks/config/php/master.ini.dist ./php-version-benchmarks/config/php/master1.ini
YESTERDAY="$(date -d "-2 day 13:00" '+%Y-%m-%d')"
YESTERDAY_SHA="$(cd ./php-version-benchmarks/tmp/php_master/ && git --no-pager log --until="$YESTERDAY 23:59:59" -n 1 --pretty='%H')"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A potential issue is that if this particular commit is broken or if the job is temporarily broken, we will lose some data points. E.g. if we have this:

- commit3 // master
- commit2
- commit1 // YESTERDAY_SHA

If commit3 is broken, and commit2 has a performance regression, we might not notice it.

Could we store the commit hash of the last successful benchmark in kocsismate/php-version-benchmark-results, and use that as a base?

This would also make it less likely to benchmark a commit in the middle of a merged branch.

Copy link
Member Author

@kocsismate kocsismate Sep 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting idea, and I also agree with the concern as well as the suggestion.

Another related concern is that it is currently not possible to automatically detect issues with the tests. Of course, totally broken PHP versions won't be able to run any tests, so I guess the benchmark will fail. However, if there are only smaller issues (e.g. warnings, notices or there are other kind of error) then it's possible that a different code is executed than otherwise, making the results unreliable again.

Fortunately, I've already added support for manual examination of the test outputs because I printed the response of the first request to the stdout, so one could have a look at them (I usually do a brief check). This should be extended with automatic checks for error messages, and if any is found, the benchmark should be stopped.

(tangentially related topic: I can just really hope that we will be able to use the benchmark during the development phase of PHP 9.0)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The benchmark now verifies the test outputs: kocsismate/php-version-benchmarks@142e907 So the only missing part is to parse the commit hash of the last successful build (which can be retrieved from the database.tsv file).

(side note: I guess I'll also have to make the database store the results per yer too, just in case)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The per year database storage is also implemented: kocsismate/php-version-benchmarks@0e84b9e

.github/workflows/benchmark.yml Outdated Show resolved Hide resolved
.github/workflows/benchmark.yml Outdated Show resolved Hide resolved
echo "Merging, can't proceed"
exit 1
fi
git add .
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is creating one directory in the root every day. I'm not sure, but could this become an issue after some time due to the amount of entries in the root?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting observation! I can agree with you, this is at least a threat in the future. Do you think it's enough to put the results into a single "year" directory? This way a maximum of only ~365 files would be in a directory, assuming that we will continue to run the benchmark once per day. I could also imagine a year/month/result directory structure if we want to really future proof the structure....

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this seems reasonable and probably enough

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented the suggestion: kocsismate/php-version-benchmarks@cc0090b

.github/workflows/benchmark.yml Outdated Show resolved Hide resolved
@nielsdos
Copy link
Member

nielsdos commented Sep 5, 2024

@arnaud-lb I can integrate it into the visualisation website once merged. I'm a bit busy with real life atm though so may take some time.

@kocsismate
Copy link
Member Author

I added support for gathering the instruction count via Valgrind (kocsismate/php-version-benchmarks@f55c32d), similarly how Ilija's benchmark does it, but the result didn't correlate to the exact wall-time performance indeed :( So I made this metric disabled by default (kocsismate/php-version-benchmarks@3f31401).

@kocsismate
Copy link
Member Author

kocsismate commented Sep 11, 2024

I ran two benchmarks after my review fixes: php/real-time-benchmark-data@cbf7425?short_path=44a0fb3#diff-44a0fb3a1464ed3a62f884a3b2ad83ca08f7c97f3c37e7e3f497d4cf1b249e4c and php/real-time-benchmark-data@bcc0143?short_path=e15a4d6#diff-e15a4d62fc9d8b702c8346c9aa07e4f827b1eea540e21e4785ea4cd77acebe78

The first time, the last commit the day before yesterday was the baseline because the database was missing. The second time, the previously benchmarked commit was compared to Niels' latest commit :)

@kocsismate kocsismate force-pushed the ci-benchmark branch 2 times, most recently from a8819e8 to a996c8d Compare September 12, 2024 09:15
The commit is now the latest one. This should be changed later for the one after PHP 8.4 is branched.
@TimWolla TimWolla removed their request for review September 15, 2024 13:18
@kocsismate
Copy link
Member Author

@iluuu1994 Can I ask you to have a look at this PR?

@kocsismate
Copy link
Member Author

Unless there's any concern, I plan to merge this after branching happened

@iluuu1994
Copy link
Member

@kocsismate Sorry, I didn't have much time. You should definitely move the data repo to the php org before merging. I don't have much else to say, this is completely isolated and shouldn't cause issues for anything else.

@iluuu1994
Copy link
Member

I would also be happy if we could distinguish the terminology for the two benchmarks somehow. I don't mind renaming the existing benchmark too.

@cmb69
Copy link
Member

cmb69 commented Sep 24, 2024

For what it's worth, until a couple of years ago the PHP on Windows QA team did performance tests for all releases (QA and GA) in a controlled environment. Unfortunately, that had been shut down, so it's nice to have a similar performance testing available again (and for PHP, Linux is more relevant than Windows anyway).

@kocsismate
Copy link
Member Author

I would also be happy if we could distinguish the terminology for the two benchmarks somehow. I don't mind renaming the existing benchmark too.

Yeah, I also tried to come up with some sensible naming, but I couldn't decide.. But would you think using "Valgrind benchmark" for yours and "Performance benchmark" for mine make sense?

@iluuu1994
Copy link
Member

"Valgrind" or "instruction count" for the existing, and "real-time" would make most sense to me.

@kocsismate
Copy link
Member Author

"Valgrind" or "instruction count" for the existing, and "real-time" would make most sense to me.

I'd go with "valgrind" because it's much shorter than "instruction count", but I'd say it's your call to choose the name. Personally, I'm fine with "real-time" for the new benchmark.

@cmb69
Copy link
Member

cmb69 commented Sep 24, 2024

Maybe call it iCount (unless that is already a trademark). ;)

@iluuu1994
Copy link
Member

Not that I don't appreciate the humor, but I think we should stick with something more standard 😄 If the intention is to shorten it, many tools just call it IC.

@cmb69
Copy link
Member

cmb69 commented Sep 24, 2024

That was tongue-in-cheek. I don't see why "instruction count" would be too long, anyway.

@kocsismate
Copy link
Member Author

I don't see why "instruction count" would be too long, anyway.

"Instruction count benchmark" sounds a bit long, while "Valgrind benchmark" is easier to say, at least for me. But I really don't mind which name it gets.

@kocsismate
Copy link
Member Author

kocsismate commented Sep 24, 2024

You should definitely move the data repo to the php org before merging.

@iluuu1994 I don't have enough permissions to transfer it. Could you please grant me privilege to create repositories in the php org?

@kocsismate kocsismate merged commit 2448a01 into php:master Sep 25, 2024
10 checks passed
@kocsismate kocsismate deleted the ci-benchmark branch September 25, 2024 13:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants