Distributed computing system such as MapReduce, and Dryad provide fault tolerance feature to help the system survive over the process crash. It is particular useful when computation requires to finish its execution in long time. Hama, based on the BSP model, is a framework for massive scientific computations, which also requires this feature so that developers and users who exploit this framework can benefit from it. This page serves for providing information on direction how Hama GroomServer fault tolerance would work.
In general, a system designed to deal with failures usually need to apply techniques including unit of mitigation, redundancy, fault detection, fault recovery, and so on.
Unit of mitigation: GroomServer(s)/ BSPMaster
Redundant units: GroomServer(s)
Fault detection: System monitor, heartbeat.
Fault recovery: Fail over
Whilst executing a task, the task will periodically ping its parent GroomServer.
If the GroomServer does not receive ping from the child (with timeout), it checks if child jvm is running; for instance, execute jps to identify child's status.
- If task rescheduled reaches the limit, the whole job fails.
Dispatch task(s) to GroomServer(s).
Dispatch task(s) to designed GroomServer(s).
. Dryad: distributed data-parallel programs from sequential building blocks. http://portal.acm.org/citation.cfm?id=1273005
. Bulk Synchronous Parallel Computing -- A Paradigm for Transportable Software. http://portal.acm.org/citation.cfm?id=798134
. Patterns for Fault Tolerant Software. http://portal.acm.org/citation.cfm?id=1557393
. Supervisor Behaviour. http://www.erlang.org/doc/design_principles/sup_princ.html
. Extensible Resource Management For Cluster Computing. http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=603418