GroomServerFaultTolerance (Draft)
Introduction
Distributed computing system such as Hadoop[1], and Dryad[2] provide fault tolerance feature to help the system survive over the process crash. It is particular useful when computation requires to finish its execution in long time. Hama, based on the BSP[3] model, is a framework for massive scientific computations, which also requires this feature so that developers and users who exploit this framework can benefit from it. This page serves for providing information on direction how Hama GroomServer fault tolerance would work.
Literature Review
Architecture
- NodeMaanger embedded in the GroomServer periodically sends heartbeat to NodeMonitor in BSPMaster. // Can't attach diagram
- One of GroomServers fails, indicating BSPMaster loses heartbeat from a particular GroomServer. // Can't attach diagram
- NodeMonitor collects metrics information, including CPU, memory, tasks, etc., from healthy NodeManagers. // Can't attach diagram
- Dispatch task(s) to GroomServer(s). // Can't attach diagram
- NodeMonitor notifies TaskScheduler the failure of GroomServers; and move failure GroomServer to black list (will move back when the failed GroomServer restarts).
2. TaskScheduler searches node list looking for GroomServer(s) whose workload is not heavy (which GroomServer to go is corresponded to policy).
3. Update task(s) JobInProgress by assigning failed tasks to the GroomServer found in previous step.
4. Dispatch task(s) to designed GroomServer(s).
Glossary
NodeMonitor: a component monitors the healthy of GroomServers.
NodeManager: a component that collects metrics information whilst NodeMonitor requests to report status of the GroomServer it runs on.
References
[1]. Hadoop. http://hadoop.apache.org/
[2]. Dryad: distributed data-parallel programs from sequential building blocks. http://portal.acm.org/citation.cfm?id=1273005
[3]. Bulk Synchronous Parallel Computing – A Paradigm for Transportable Software. http://portal.acm.org/citation.cfm?id=798134