JOURNAL OF COMPUTERS (JCP)
ISSN: 1796-203X
Volume: 1    Issue: 6    Date: September 2006

Fault Tolerance in a Multi-Layered DRE System: A Case Study
Paul Rubel, Joseph Loyall, Richard Schantz and Matthew Gillen
Page(s): 43-52


Abstract
Dynamic resource management is a crucial part of the infrastructure for emerging distributed
real-time embedded (DRE) systems, responsible for keeping mission-critical applications operating
and allocating the resources they need to meet their requirements. The resource manager must
therefore be fault-tolerant and operate nearly continuously. This paper describes
our efforts to develop a fault-tolerant multi-layer dynamic resource management capability and the
challenges we encountered, some due to the fault tolerance requirements we needed to meet and
others due to characteristics of the resource management software. The challenges include the
need for extremely rapid recovery; supporting the characteristics of component middleware,
including peer-to-peer communication and multi-tiered calling semantics; supporting multiple
languages; and the co-existence of replicated and non-replicated elements. Making our multi-layer
dynamic resource manager fault-tolerant required overcoming all of these challenges
simultaneously, which made it a significant fault tolerance research problem.

Index Terms
fault tolerance, multi-layer dynamic resource management, component middleware, distributed
real-time embedded systems