Reducing fate sharing in software systems via fine-grained checkpoint and restore
[Abstract] The client-server architecture is widely used in complex software systems, but it is susceptible to fate sharing between its internal components. Fate sharing is harmful because a malfunction in the server can cause incorrect behavior in the client. We have narrowed down the primary cause of such fate sharing to state spill, a phenomenon in which a server entity holds client-specific state after serving the client's request, tightly binding the two entities' fates. The problem is exacerbated when multiple clients use the same server, because it effectively binds all of the clients' fates together as well.
In this work, we propose DRILL, a solution that mitigates the effects of fate sharing in server-like entities by using fine-grained checkpoint and restore (C/R) techniques to reduce state spill. We describe the design of DRILL in the context of Android system services, entities that control most system resources and act as middlemen between the kernel and applications, because they are a representative example of server entities that suffer from fate sharing due to state spill. DRILL attaches its C/R module to a system service to checkpoint and restore the client state held inside the service on a per-object basis. This module is service-agnostic and non-intrusive, so it is generic and can be used for many system services without much modification. A special bookkeeping service preserves each service's checkpointed states in an external storage area so that it can resend them to a new service instance after a crash.
To demonstrate the effectiveness of our approach, we implement DRILL on a Google Nexus 5 phone running Android 6.0.1 and apply our C/R technique to two different classes of Android system services. We show that DRILL can successfully restore a failed system service to its pre-crash state, keeping the application unaware of any service crash because its fate is decoupled from that of the service. Our results indicate that the performance overhead and service downtime of our approach are affordable and that the limitations of the DRILL design do not prevent it from being applied to systems beyond Android.
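The per-object checkpoint and restore flow described in the abstract can be illustrated with a minimal Python sketch. It is not DRILL's implementation: the names `Bookkeeper`, `CounterService`, `checkpoint`, and `restore_all` are hypothetical, an in-memory dictionary stands in for the external storage area, and a toy counter stands in for an Android system service.

```python
import copy


class Bookkeeper:
    """Hypothetical external store that outlives any single service instance."""

    def __init__(self):
        # Maps (service_name, client_id) to that client's checkpointed state.
        self._store = {}

    def checkpoint(self, service_name, client_id, state):
        # Deep-copy so later mutation inside the service cannot alter the saved copy.
        self._store[(service_name, client_id)] = copy.deepcopy(state)

    def restore_all(self, service_name):
        # Return a copy of every client's saved state for one service.
        return {
            client_id: copy.deepcopy(state)
            for (name, client_id), state in self._store.items()
            if name == service_name
        }


class CounterService:
    """Toy server that spills per-client state: it remembers each client's count."""

    NAME = "counter"

    def __init__(self, bookkeeper):
        self._bookkeeper = bookkeeper
        # A fresh instance first asks the bookkeeper for any pre-crash state.
        self._per_client = bookkeeper.restore_all(self.NAME)

    def increment(self, client_id):
        state = self._per_client.setdefault(client_id, {"count": 0})
        state["count"] += 1
        # Checkpoint only the object that changed, not the whole service.
        self._bookkeeper.checkpoint(self.NAME, client_id, state)
        return state["count"]


bookkeeper = Bookkeeper()
service = CounterService(bookkeeper)
service.increment("app_a")
service.increment("app_a")
service.increment("app_b")

# Simulate a service crash: the instance and its in-memory state are lost.
del service

# A new instance restores each client's state, so clients see no discontinuity.
service = CounterService(bookkeeper)
resumed = service.increment("app_a")
print(resumed)  # continues from the checkpointed count
```

In this sketch the client `app_a` keeps counting from where it left off even though the service instance was replaced, which is the decoupling of fates the abstract describes. Checkpointing one client's object per request, instead of snapshotting the whole service, is what makes the approach fine-grained.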
[Publication date] [Publisher] Rice University
[Validity level] systems [Subject classification]
[Keywords] [Timeliness]