Distributed computing systems such as MapReduce\[1\] and Dryad\[2\] provide fault tolerance so that the system can survive process crashes. This is particularly useful for computations that take a long time to finish. Hama, a framework for massive scientific computations based on the BSP\[3\] model, also needs this feature so that developers and users of the framework can benefit from it. This page describes the direction in which Hama GroomServer fault tolerance is intended to work.
Task Failure
The GroomServer spawns the execution of each task, so that the failure of a task does not bring down the GroomServer itself. The following steps are performed in the scenario of a task failure.
GroomServer Failure
NodeMonitor: a component that monitors the health of GroomServers.
NodeManager: a component that collects metrics information and, when the NodeMonitor requests it, reports the status of the GroomServer it runs on.
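
The interaction between the two components can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not Hama's actual implementation: the class `NodeMonitorSketch`, the `Status` fields, the `report()` and `poll()` methods, and the timeout-based failure rule are all hypothetical names and choices made for this example. Timestamps are passed in explicitly so that the behaviour is deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NodeMonitorSketch {

    /** Hypothetical status report; the fields are assumptions, not Hama's real metrics. */
    static final class Status {
        final String groomName;
        final long freeMemoryBytes;

        Status(String groomName, long freeMemoryBytes) {
            this.groomName = groomName;
            this.freeMemoryBytes = freeMemoryBytes;
        }
    }

    /** Hypothetical NodeManager: runs beside a GroomServer and answers status requests. */
    interface NodeManager {
        /** Returns the current status, or null if the GroomServer cannot be reached. */
        Status report();
    }

    /** Hypothetical NodeMonitor: polls NodeManagers and flags GroomServers that stay silent. */
    static final class NodeMonitor {
        private final long timeoutMillis;
        private final Map<String, NodeManager> managers = new HashMap<>();
        private final Map<String, Long> lastSeen = new HashMap<>();

        NodeMonitor(long timeoutMillis) {
            this.timeoutMillis = timeoutMillis;
        }

        void register(String groomName, NodeManager manager, long nowMillis) {
            managers.put(groomName, manager);
            lastSeen.put(groomName, nowMillis);
        }

        /** Polls every NodeManager once and returns the GroomServers considered failed. */
        List<String> poll(long nowMillis) {
            List<String> failed = new ArrayList<>();
            for (Map.Entry<String, NodeManager> entry : managers.entrySet()) {
                String groomName = entry.getKey();
                Status status = entry.getValue().report();
                if (status != null) {
                    // A report arrived: remember when this GroomServer was last healthy.
                    lastSeen.put(groomName, nowMillis);
                } else if (nowMillis - lastSeen.get(groomName) > timeoutMillis) {
                    // No report for longer than the timeout: treat the GroomServer as failed.
                    failed.add(groomName);
                }
            }
            return failed;
        }
    }

    public static void main(String[] args) {
        NodeMonitor monitor = new NodeMonitor(3000);
        monitor.register("groom1", () -> new Status("groom1", 1024L), 0);
        monitor.register("groom2", () -> null, 0); // simulates an unreachable GroomServer

        System.out.println(monitor.poll(1000)); // prints "[]" (still within the timeout)
        System.out.println(monitor.poll(5000)); // prints "[groom2]"
    }
}
```

In this sketch a single missed report is not treated as a failure. A GroomServer is flagged only after it has been silent for longer than the timeout, which is one common way to avoid reacting to transient delays.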
\[1\]. MapReduce: simplified data processing on large clusters. http://portal.acm.org/citation.cfm?id=1327492
\[2\]. Dryad: distributed data-parallel programs from sequential building blocks. http://portal.acm.org/citation.cfm?id=1273005
\[3\]. Bulk Synchronous Parallel Computing -- A Paradigm for Transportable Software. http://portal.acm.org/citation.cfm?id=798134