SCS Undergraduate Thesis Topics
|Hassan Rom||Greg Ganger||Robust Detection & Recovery from Service Disruptions in Distributed Systems|
Distributed systems are complex to design, build, and debug. Components crash because of software bugs such as unchecked array bounds, logic errors, and unchecked return codes, and they hang because of deadlocks and resource leaks. Rather than relying on making every service failure-proof, this work focuses on detecting failed components and restarting them. With the concepts and benefits of restarting in mind, I will present a "watchdog" service for detecting and recovering from failed components in a distributed system. As a case study, I will describe the design and implementation of a watchdog service for a distributed storage system called Ursa Minor. I will also highlight experiences from that work and novel extensions.
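
To make the detect-and-restart idea concrete, here is a minimal sketch of one common way such a watchdog could work: components send periodic heartbeats, and any component silent for longer than a timeout is presumed crashed or hung and is restarted. This is an illustration only, not the Ursa Minor design, which the abstract does not detail. The `Watchdog` class, the `restart` hook, the timeout value, and the component names are all hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Watchdog:
    """Toy heartbeat watchdog (hypothetical; not the Ursa Minor implementation)."""
    timeout: float                               # seconds of silence before a component is presumed failed
    restart: Callable[[str], None]               # assumed hook that restarts a component by name
    clock: Callable[[], float] = time.monotonic  # injectable so the logic can be tested without sleeping
    last_seen: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, name: str) -> None:
        """Record that component `name` reported in just now."""
        self.last_seen[name] = self.clock()

    def check(self) -> List[str]:
        """Restart every component whose last heartbeat is older than `timeout`."""
        now = self.clock()
        failed = [n for n, seen in self.last_seen.items() if now - seen > self.timeout]
        for name in failed:
            self.restart(name)
            # Give the restarted component a fresh grace period before judging it again.
            self.last_seen[name] = now
        return failed


if __name__ == "__main__":
    fake_now = [0.0]  # simulated clock, advanced by hand
    wd = Watchdog(timeout=5.0,
                  restart=lambda name: print("restarting", name),
                  clock=lambda: fake_now[0])
    wd.heartbeat("metadata-service")  # hypothetical component names
    wd.heartbeat("storage-node")
    fake_now[0] = 4.0
    wd.heartbeat("storage-node")      # still alive
    fake_now[0] = 7.0
    wd.check()                        # only metadata-service has been silent for more than 5 s
```

In this sketch a hang is caught the same way as a crash, provided the heartbeat is sent from the component's main work loop rather than from a separate thread that could keep running after the loop deadlocks.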