SCS Undergraduate Thesis Topics

2006-2007
Student: Hassan Rom
Advisor(s): Greg Ganger
Thesis Topic: Robust Detection & Recovery from Service Disruptions in Distributed Systems

Distributed systems are complex to design, build, and debug. Components crash because of software bugs such as unchecked array bounds, logic errors, and unchecked return codes, and they hang because of deadlocks and resource leaks. Rather than relying on failure-proofing the services, this work focuses on detecting failed components and restarting them. With the concepts and benefits of restarting in mind, I will present a "watchdog" service for detecting and recovering from failed components in a distributed system. As a case study, I will describe the design and implementation of a watchdog service for Ursa Minor, a distributed storage system. I will also highlight experiences from that work and novel extensions.
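
To make the detect-and-restart idea concrete, here is a minimal sketch of one common way a watchdog can work: components report heartbeats, and any component that stays silent longer than a timeout is treated as crashed or hung and is restarted. This is an illustration only and is not taken from the thesis or from Ursa Minor. The class name `Watchdog`, the `restart` callback, the `timeout` value, and the injectable `clock` are all hypothetical.

```python
import time
from typing import Callable, Dict, List


class Watchdog:
    """Hypothetical heartbeat-based watchdog (illustrative, not Ursa Minor's design)."""

    def __init__(self, timeout: float, restart: Callable[[str], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout = timeout      # seconds of silence tolerated before restart
        self.restart = restart      # assumed callback that restarts a component by name
        self.clock = clock          # injectable so the logic can be tested without sleeping
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, name: str) -> None:
        """Record that the named component is alive right now."""
        self.last_seen[name] = self.clock()

    def check(self) -> List[str]:
        """Restart every component whose last heartbeat is older than the timeout."""
        now = self.clock()
        failed = [n for n, seen in self.last_seen.items()
                  if now - seen > self.timeout]
        for name in failed:
            self.restart(name)
            # Give the restarted component a fresh window to report in.
            self.last_seen[name] = now
        return failed
```

In use, each component would call `heartbeat` periodically and a separate loop would call `check`. A crashed component stops sending heartbeats, and a hung one does too as long as the heartbeat comes from the same thread that does the work, so a single timeout covers both failure modes named above.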

