SCIENCE CHINA Information Sciences, Volume 61, Issue 9: 092110(2018) https://doi.org/10.1007/s11432-017-9292-9

Asymmetric virtual machine replication for low latency and high available service

  • Received: Jul 16, 2017
  • Accepted: Nov 14, 2017
  • Published: Jun 20, 2018


Providing fault tolerance support to client-to-server applications is critical in data center and cloud computing environments. Virtualization provides a direct way of achieving high availability by encapsulating the protected applications in a virtual machine (VM) and periodically checkpointing the entire VM state to the backup replica. However, existing VM replication solutions suffer from either excessive checkpointing overhead and network latency or unnecessary CPU resource consumption in the backup replica. In this study, we examine the composition of output packets and consider that the replication system maintains external consistency if the pre-released packets originate from already synchronized states. Furthermore, we transform the active-active primary and slave VM combination into an active-semiactive one by reducing the number of active virtual CPUs (vCPUs) in the slave VM. The former optimization improves performance in read-mostly client-to-server networked applications, whereas the latter relieves the problem of double scheduling in the slave host. Accordingly, we propose the COLO+ system, which is built on COLO and is a non-stop service solution with coarse-grained lock-stepping VMs for client-to-server systems. The two plus signs represent the two optimizations. Experimental results using COLO+ implemented on KVM and Linux show that it achieves nearly native VM performance under read-mostly workloads, as well as lower scheduling overhead in the backup replica.
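
The external-consistency condition stated in the abstract can be illustrated with a small sketch. This is not the COLO+ implementation. It assumes, purely for illustration, that each outgoing packet carries a hypothetical `source_epoch` tag giving the newest checkpoint epoch of any guest state it was derived from, and that the backup acknowledges checkpoints by epoch number. The names `Packet`, `OutputGate`, `submit`, and `checkpoint_acked` are invented for this example. The sketch also ignores packet ordering, which a real network path would need to preserve.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Packet:
    payload: bytes
    # Hypothetical tag: newest checkpoint epoch of the state this packet depends on.
    source_epoch: int


@dataclass
class OutputGate:
    # Newest epoch whose state the backup has acknowledged (assumed bookkeeping).
    synced_epoch: int = 0
    # Packets that depend on state the backup has not yet received.
    held: List[Packet] = field(default_factory=list)

    def submit(self, pkt: Packet) -> List[Packet]:
        """Return the packets that may be released to the client right now."""
        if pkt.source_epoch <= self.synced_epoch:
            # Derived only from synchronized state: releasing early is safe.
            return [pkt]
        # Depends on unsynchronized state: hold until a later checkpoint.
        self.held.append(pkt)
        return []

    def checkpoint_acked(self, epoch: int) -> List[Packet]:
        """Record an acknowledged checkpoint and release packets it covers."""
        self.synced_epoch = max(self.synced_epoch, epoch)
        ready = [p for p in self.held if p.source_epoch <= self.synced_epoch]
        self.held = [p for p in self.held if p.source_epoch > self.synced_epoch]
        return ready
```

Under these assumptions, a reply to a read-only request touches no state newer than the last checkpoint and passes through `submit` without waiting, which matches the intuition for why read-mostly workloads benefit. A reply that reflects a fresh write stays buffered until `checkpoint_acked` covers its epoch.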



Copyright 2019 Science China Press Co., Ltd. All rights reserved.