Distributed computing systems such as MapReduce [1] and Dryad [2] provide fault tolerance so that the system can survive process crashes. This is particularly useful when a computation takes a long time to finish. Hama, a framework for massive scientific computations based on the BSP [3] model, also requires this feature so that developers and users of the framework can benefit from it. This page describes the direction in which Hama GroomServer fault tolerance would work.

Literature Review

In general, a system designed to deal with failures is largely based on three concepts: the unit of mitigation, redundancy, and the fault observer [4].

Unit of mitigation: the architecture defines the basic unit that performs the functions of the system according to its requirements.

Redundancy: redundant units are provided so that a spare can take over from a failed one.

Fault observer: fault observers are designed to detect a fault or error at an early stage, so that other strategies, such as error recovery, can be employed to correct the problem.
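
As an illustration of the fault observer concept, the following is a minimal sketch that assumes faults are detected through missed heartbeats. Heartbeating is only one common detection technique and is an assumption here; this page does not state how Hama detects faults. The class name HeartbeatObserver, its methods, and the unit names are hypothetical and are not part of the Hama API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical fault observer: flags units whose last heartbeat is overdue. */
class HeartbeatObserver {
    private final long timeoutMillis;
    // Last time (in milliseconds) each unit reported in.
    private final Map<String, Long> lastSeen = new HashMap<>();

    HeartbeatObserver(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Records that the given unit reported in at time nowMillis. */
    synchronized void recordHeartbeat(String unitId, long nowMillis) {
        lastSeen.put(unitId, nowMillis);
    }

    /** Returns, in sorted order, the units that have been silent for longer than the timeout. */
    synchronized List<String> suspectedFailures(long nowMillis) {
        List<String> suspects = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastSeen.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                suspects.add(entry.getKey());
            }
        }
        Collections.sort(suspects);
        return suspects;
    }

    public static void main(String[] args) {
        HeartbeatObserver observer = new HeartbeatObserver(3000);
        observer.recordHeartbeat("task-1", 0);
        observer.recordHeartbeat("task-2", 0);
        observer.recordHeartbeat("task-1", 4000); // task-1 reports again; task-2 stays silent
        // Prints [task-2]: only task-2 has exceeded the 3000 ms timeout at time 5000.
        System.out.println(observer.suspectedFailures(5000));
    }
}
```

Timestamps are passed in explicitly rather than read from the system clock so that the behaviour is easy to reason about. A recovery strategy would then act on the returned list, for example by rescheduling the affected units on redundant ones.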

Architecture

Task Failure

Task execution is spawned from the GroomServer, so that the failure of a task does not bring down the GroomServer. The following steps are performed in the scenario of a task failure.

...

[3] Bulk Synchronous Parallel Computing -- A Paradigm for Transportable Software. http://portal.acm.org/citation.cfm?id=798134

[4] Patterns for Fault Tolerant Software. http://portal.acm.org/citation.cfm?id=1557393

[5] Supervisor Behaviour. http://www.erlang.org/doc/design_principles/sup_princ.html

[6] Extensible Resource Management For Cluster Computing. http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=603418