ISSN : 1796-203X
Volume : 1    Issue : 8    Date : December 2006

Symmetric Active/Active High Availability for High-Performance Computing System Services
Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He
Page(s): 43-54

This work aims to pave the way for high availability in high-performance computing (HPC) by
focusing on efficient redundancy strategies for head and service nodes. These nodes represent
single points of failure and control for an entire HPC system: when one fails, the system remains
inaccessible and unmanageable until it is repaired. The presented approach introduces two distinct
replication methods, internal and external, for providing symmetric active/active high availability
across multiple redundant head and service nodes. The nodes run in virtual synchrony and rely on an
existing process group communication system for service group membership management and reliable,
totally ordered message delivery. Results from a prototype implementation that offers symmetric
active/active replication for HPC job and resource management using external replication show that
the highest level of availability can be provided with an acceptable performance trade-off.

Index Terms
high-performance computing, high availability, virtual synchrony, group communication