In Partial Fulfillment of the Requirements for the Degree of
Master of Science
Will defend his thesis
Emerging platforms for large-scale parallel and distributed computing are increasingly heterogeneous, and support for resistance to failure is becoming more important. This research addresses these challenges in the context of volunteer PC grids. Executing communicating parallel codes on volunteer PCs is difficult because the environment is characterized by system and network heterogeneity, frequent failures, data corruption, and malicious hosts. Our approach employs a programming model based on one-sided Put/Get calls to an abstract global shared space that works seamlessly even when processes are replicated for fault tolerance. The key challenge in this research is ensuring continuous forward progress of the application despite frequent failures. A suite of techniques makes this possible: node selection to minimize the chance of failure during execution, redundant processes to provide application-level failure resistance, checkpointing to allow application rollback, and heartbeat monitoring with hot spare nodes for quick restarts. These well-known techniques are customized for a highly heterogeneous and fault-prone environment and integrated into the execution environment. A Replica Exchange Molecular Dynamics code executing on a volunteer pool drives the evaluation. The pool includes hosts on a university campus as well as hosts distributed around the world. Results are presented that show the effect of redundancy, node selection, and checkpointing on the quality of long-lived execution in a volunteer environment. The research demonstrates a practical approach to employing idle PCs for latency-tolerant communicating parallel applications and develops methods and results in resilient computing that apply not only to desktop PC grids but also to clouds and heterogeneous clusters.
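As a rough illustration of why redundant processes help forward progress, the sketch below estimates the chance that an application survives an interval without rollback. It is not the model used in the thesis. It assumes that each host fails independently with the same probability during the interval, and that a logical process is lost only when all of its replicas fail. The function names and parameters are hypothetical.

```python
def replica_survival(p_fail: float, replicas: int) -> float:
    """Probability that at least one replica of a logical process survives.

    Assumption (not from the thesis): hosts fail independently, each with
    probability p_fail over the interval of interest.
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be in [0, 1]")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The process is lost only if every replica fails.
    return 1.0 - p_fail ** replicas


def application_survival(p_fail: float, replicas: int, processes: int) -> float:
    """Probability that every logical process keeps at least one live replica."""
    if processes < 1:
        raise ValueError("processes must be at least 1")
    # Under the independence assumption, per-process survival multiplies.
    return replica_survival(p_fail, replicas) ** processes


if __name__ == "__main__":
    # Compare no redundancy with two-way and three-way replication
    # for a hypothetical 16-process run and a 10% per-host failure chance.
    for r in (1, 2, 3):
        print(r, application_survival(0.1, r, 16))
```

Under these assumptions, the survival probability rises quickly with the replication degree, which is the intuition behind combining redundancy with node selection (lowering the per-host failure probability) and checkpointing (bounding the work lost when all replicas of a process do fail).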