University of Houston

Department of Computer Science

In Partial Fulfillment of the Requirements for the Degree of
Master of Science

Hien Xuan Nguyen

Will defend his thesis

An Execution Environment for Robust Parallel Computing on Volunteer PC Grids

Abstract

Emerging platforms for large-scale parallel and distributed computing are increasingly heterogeneous, and support for failure resistance is becoming increasingly important. This research addresses these challenges in the context of volunteer PC grids. Executing communicating parallel codes on volunteer PCs is difficult because the environment is characterized by system and network heterogeneity, frequent failures, data corruption, and malicious hosts. Our approach employs a programming model based on one-sided Put/Get calls to an abstract global shared space that works seamlessly even when processes are replicated for fault tolerance. The key challenge in this research is ensuring continuous forward application progress despite frequent failures. A suite of techniques is employed to make this possible: node selection to minimize the chance of failure during execution, redundant processes to provide application-level failure resistance, checkpointing to allow application rollback, and heartbeat monitoring with hot spare nodes for quick restarts. These well-known techniques are customized for a highly heterogeneous and fault-prone environment and integrated into the execution environment. A Replica Exchange Molecular Dynamics code executing on a volunteer pool is used to drive the evaluation. The volunteer pool includes hosts on a university campus as well as hosts distributed around the world. Results are presented that show the effectiveness of redundancy, node selection, and checkpointing on the quality of long-lived execution in a volunteer environment. The research demonstrates a practical approach to employing idle PCs for latency-tolerant communicating parallel applications, and develops methods and results in resilient computing that apply not only to desktop PC grids but also to clouds and heterogeneous clusters.

Date: Tuesday, November 29, 2011
Time: 11:30 AM
Place: 550-PGH

Faculty, students, and the general public are invited.
Advisor: Prof. Jaspal Subhlok