Configuration of Measurement Processes
The configuration of performance measurement processes is a challenging task and has been widely researched in computer science. This page aims to give practitioners a starting point.
When executing a performance measurement, especially of durations below a millisecond, various non-deterministic effects shape the result, including the inaccuracy of the time measurement method, Just-in-Time (JIT) compilation, garbage collection, thread scheduling, and memory fragmentation. Therefore, the performance measurement needs to be repeated, on at least two levels:
- Inside one VM, the measurement needs to be repeated until warmup is finished (e.g. until JIT compilation has completed), i.e. until the steady state is reached.
- The VM start itself needs to be repeated, since warmup may end in different steady states. Performance measurement tools provide the environment for executing these measurements; the concrete configuration is specific to the use case and is left to the user (in this field, a software developer or performance engineer). A minimal sketch of both repetition levels follows this list.
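The following is a minimal, hypothetical sketch of the inner repetition level in Java; the class name, method names, and parameter values are illustrative and not taken from any specific tool. Warmup iterations are executed but discarded, and only the subsequent measurement iterations are recorded; the outer level, restarting the VM, must be driven by an external harness that launches the process repeatedly.

```java
// Hypothetical sketch of the inner repetition level of a performance
// measurement. All names and parameter values are illustrative.
public class InnerRepetition {

    // Stand-in for the code under test, e.g. creation and addition of integers.
    static long workload() {
        long sum = 0;
        for (int i = 0; i < 300; i++) {
            sum += i;
        }
        return sum;
    }

    // One iteration: execute the workload `repetitions` times and return the duration.
    static long runIteration(int repetitions) {
        long start = System.nanoTime();
        long blackhole = 0;
        for (int r = 0; r < repetitions; r++) {
            blackhole += workload();
        }
        long duration = System.nanoTime() - start;
        if (blackhole == 42) {
            System.out.println(blackhole); // consume the result to hinder dead-code elimination
        }
        return duration;
    }

    public static void main(String[] args) {
        final int warmupIterations = 5;      // discarded: wait for JIT compilation to finish
        final int measurementIterations = 5; // recorded for statistical analysis
        final int repetitions = 100_000;

        for (int i = 0; i < warmupIterations; i++) {
            runIteration(repetitions);
        }
        for (int i = 0; i < measurementIterations; i++) {
            System.out.println("Duration (ns): " + runIteration(repetitions));
        }
        // Outer level: an external harness must start this VM repeatedly
        // (e.g. 30+ times) and aggregate the printed durations.
    }
}
```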
By defining artificial workload pairs (e.g. creation and addition of 300/(300+d) integers, or reservation of 20/(20+d) blocks), we evaluated when a performance change can be measured. In summary: a performance change can be measured if the relative change is at least half of the standard deviation of the VM measurements. That standard deviation can be decreased by increasing the warmup and the number of iterations inside a VM. Depending on the relation between the relative change that should (at least) be measurable and the standard deviation of the measurements, more or fewer VM executions are needed. Some practitioners recommend using at least 30 VM starts; to measure a performance change of 0.3 % (e.g. the change between 300 and 301 additions), 400 VM starts, 5 warmup and 5 measurement iterations, and 100,000 repetitions are required.
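As a numeric sketch of this rule of thumb, assuming the criterion is taken literally as "relative change ≥ half the relative standard deviation of the VM measurements" (the standard deviation value below is a made-up example):

```java
// Illustrative check of the rule of thumb: a performance change is
// measurable if the relative change is at least half of the (relative)
// standard deviation of the VM measurements.
public class Detectability {

    static boolean isMeasurable(double relativeChange, double relativeStandardDeviation) {
        return relativeChange >= relativeStandardDeviation / 2;
    }

    public static void main(String[] args) {
        // 300 vs. 301 additions: relative change of 1/300 ≈ 0.33 %.
        double relativeChange = 1.0 / 300;
        // Hypothetical relative standard deviation of the VM measurements: 0.5 %.
        double relativeStandardDeviation = 0.005;
        // 0.33 % >= 0.25 %, so the change is (just barely) measurable.
        System.out.println(isMeasurable(relativeChange, relativeStandardDeviation));
    }
}
```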
If you want to try this with other artificial workload pairs, have a look at the precision-experiments repository.
Our basic approach for measuring the performance of unit tests is described in:
- Reichelt, David Georg, and Stefan Kühne. "How to Detect Performance Changes in Software History: Performance Analysis of Software System Versions." Companion of the 2018 ACM/SPEC International Conference on Performance Engineering. 2018.
If you would like to dig deeper into this topic, consider the following publications:
- Georges, Andy, Dries Buytaert, and Lieven Eeckhout. "Statistically rigorous Java performance evaluation." ACM SIGPLAN Notices 42.10 (2007): 57-76.
- Kalibera, Tomas, and Richard Jones. "Rigorous benchmarking in reasonable time." Proceedings of the 2013 international symposium on memory management. 2013.
- Barrett, Edd, et al. "Virtual machine warmup blows hot and cold." Proceedings of the ACM on Programming Languages 1.OOPSLA (2017): 1-27.