In Partial Fulfillment of the Requirements for the Degree of
Master of Science
Will defend his thesis
The ever-growing computational requirements of scientific and engineering applications continue to drive new developments in High Performance Computing (HPC). These developments face a challenge, however: the high failure rates of current parallel computers put the scalability of HPC applications at risk.
Many applications are presently designed to run for days or longer on even the fastest computers, despite the high probability of failure that comes with growing system size and complexity. This unreliability makes fault-tolerance mechanisms necessary to ensure that applications complete correctly and on time. However, existing techniques, although highly optimized, do not meet the challenges posed by large-scale machines.
This thesis explores failures in parallel computing systems and how they affect HPC applications. In addition, we discuss potential enhancements to the OpenSHMEM specification that would enable the development of fault-tolerant applications. OpenSHMEM is an effort to create a standard API for developing portable and highly scalable Partitioned Global Address Space (PGAS) applications.
Date: Thursday, April 19, 2012
Time: 03:00 PM
Place: 550-PGH
Faculty, students, and the general public are invited.
Advisor: Dr. Barbara Chapman