GroomServerFaultTolerance (Draft)

Introduction

Distributed computing systems such as Hadoop [1] and Dryad [2] provide fault tolerance so that the system can survive process crashes. This is particularly useful when a computation takes a long time to finish. Hama, a framework for massive scientific computations based on the BSP [3] model, also needs this feature so that the developers and users who rely on the framework can benefit from it. This page outlines how fault tolerance for the Hama GroomServer is expected to work.

Literature Review

Architecture

Glossary

NodeManager

Failure Detector

Supervisor behaviour

References

[1] Hadoop. http://hadoop.apache.org/

[2] Dryad: distributed data-parallel programs from sequential building blocks. http://portal.acm.org/citation.cfm?id=1273005

[3] Bulk Synchronous Parallel Computing -- A Paradigm for Transportable Software. http://portal.acm.org/citation.cfm?id=798134