Duet Benchmarking: Improving Measurement Accuracy in the Cloud

Lubomír Bulej, Vojtěch Horký, Petr Tůma, François Farquet, Aleksandar Prokopec
ICPE 2020


We investigate the duet measurement procedure, which helps improve the accuracy of performance comparison experiments conducted on shared machines by executing the measured artifacts in parallel and evaluating their relative performance together, rather than individually. Specifically, we analyze the behavior of the procedure in multiple cloud environments and use experimental evidence to answer multiple research questions concerning the assumption underlying the procedure. We demonstrate improvements in accuracy ranging from 2.3× to 12.5× (5.03× on average) for the tested ScalaBench (and DaCapo) workloads, and from 23.8× to 82.4× (37.4× on average) for the SPEC CPU 2017 workloads.
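The intuition behind running the artifacts in parallel can be illustrated with a small simulation. The sketch below is a toy model, not the procedure or the evaluation from the paper: the base times, the noise levels, and the function name `simulate_ratios` are all hypothetical. It also assumes that interference scales the times of both artifacts by the same factor when they run together. Under that assumption the shared factor cancels in the relative comparison, so the duet ratios spread far less than ratios built from measurements taken at different times.

```python
import random
import statistics


def simulate_ratios(pairs=2000, seed=1):
    """Return (duet_ratios, sequential_ratios) of B time over A time (toy model)."""
    rng = random.Random(seed)
    base_a, base_b = 100.0, 105.0  # hypothetical times in ms; B is 5% slower than A
    duet, sequential = [], []
    for _ in range(pairs):
        # Small per-artifact noise, present in both setups.
        noise_a = rng.gauss(1.0, 0.01)
        noise_b = rng.gauss(1.0, 0.01)

        # Duet: A and B run in parallel, so (by assumption) the same
        # interference factor slows both of them down.
        shared = rng.uniform(1.0, 1.5)
        duet.append((base_b * shared * noise_b) / (base_a * shared * noise_a))

        # Sequential: A and B run at different times and each sees its own,
        # unrelated interference factor.
        interference_a = rng.uniform(1.0, 1.5)
        interference_b = rng.uniform(1.0, 1.5)
        sequential.append(
            (base_b * interference_b * noise_b) / (base_a * interference_a * noise_a)
        )
    return duet, sequential


if __name__ == "__main__":
    duet, sequential = simulate_ratios()
    for name, ratios in (("duet", duet), ("sequential", sequential)):
        print(
            "%-10s median ratio %.3f, stdev %.3f"
            % (name, statistics.median(ratios), statistics.stdev(ratios))
        )
```

Both setups estimate the same relative slowdown, but the duet ratios vary only with the per-artifact noise, while the sequential ratios also carry the full variation of the interference. How well real cloud interference matches the equal-impact assumption is an empirical question that this toy model does not address.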


