...
- While executing, a task periodically pings its parent GroomServer.
- If the GroomServer does not receive a ping from the child within the timeout, it checks whether the child JVM is still running; for instance, by executing jps to identify the child's status.
- GroomServer reports the failure back to NodeMonitor. NodeMonitor notifies TaskScheduler of the task failure.
- TaskScheduler updates JobInProgress.
- TaskScheduler searches for an appropriate GroomServer and reschedules the task to it.
- If the number of times a task has been rescheduled reaches the limit, the whole job fails.
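The task-failure steps above can be sketched roughly as follows. This is a minimal illustration in Python, not Hama's actual code: the names `ping_timed_out`, `schedule`, `on_task_failure`, and `max_attempts` are hypothetical, the "appropriate GroomServer" search is reduced to "first GroomServer other than the one that failed", and failing the job when no other GroomServer exists is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


def ping_timed_out(last_ping: float, now: float, timeout: float) -> bool:
    """GroomServer side: True when the child task has been silent longer than timeout."""
    return (now - last_ping) > timeout


@dataclass
class JobInProgress:
    job_id: str
    max_attempts: int  # assumed per-task attempt limit
    attempts: Dict[str, int] = field(default_factory=dict)   # task_id -> attempts so far
    placement: Dict[str, str] = field(default_factory=dict)  # task_id -> GroomServer name
    failed: bool = False


class TaskScheduler:
    def __init__(self, grooms: List[str]):
        self.grooms = list(grooms)

    def schedule(self, job: JobInProgress, task_id: str, groom: str) -> None:
        # Record one more attempt and where the task now runs.
        job.attempts[task_id] = job.attempts.get(task_id, 0) + 1
        job.placement[task_id] = groom

    def on_task_failure(self, job: JobInProgress, task_id: str) -> Optional[str]:
        """Reschedule a failed task, or fail the whole job once the limit is reached."""
        if job.attempts.get(task_id, 0) >= job.max_attempts:
            job.failed = True
            job.placement.pop(task_id, None)
            return None
        failed_groom = job.placement.get(task_id)
        # Placeholder policy (assumption): any GroomServer except the one that failed.
        candidates = [g for g in self.grooms if g != failed_groom]
        if not candidates:
            job.failed = True  # assumption: nowhere left to run the task
            return None
        target = candidates[0]
        self.schedule(job, task_id, target)
        return target
```

In this sketch the attempt counter lives in the job record, so updating JobInProgress and enforcing the limit happen in one place; a real scheduler would likely consult workload metrics when choosing the target.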
...
- The NodeManager embedded in each GroomServer periodically sends a heartbeat to the NodeMonitor in BSPMaster (Hama-370).
- One of the GroomServers fails, which BSPMaster observes as a lost heartbeat from that particular GroomServer.
- NodeMonitor (Hama-363) collects metrics information, including CPU, memory, tasks, etc., from the healthy NodeManagers.
- Dispatch task(s) to GroomServer(s).
- NodeMonitor notifies TaskScheduler of the GroomServer failure and moves the failed GroomServer to the blacklist (it is moved back when the failed GroomServer restarts).
- TaskScheduler searches the node list for GroomServer(s) whose workload is not heavy (which GroomServer is chosen depends on the policy).
- Update the JobInProgress of the affected task(s) by assigning the failed tasks to the GroomServer(s) found in the previous step.
- Dispatch the task(s) to the designated GroomServer(s).
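The GroomServer-failure steps above might look like the following sketch. It is illustrative Python under stated assumptions rather than Hama's implementation: `NodeMonitor` here is a simplified stand-in, `heartbeat_timeout`, `detect_failures`, `healthy`, and `reassign` are hypothetical names, workload is approximated by the number of assigned tasks, and a restarted GroomServer is assumed to leave the blacklist as soon as its heartbeat is seen again.

```python
from typing import Dict, List, Set


class NodeMonitor:
    def __init__(self, heartbeat_timeout: float):
        self.heartbeat_timeout = heartbeat_timeout
        self.last_heartbeat: Dict[str, float] = {}  # GroomServer name -> last heartbeat time
        self.blacklist: Set[str] = set()

    def on_heartbeat(self, groom: str, now: float) -> None:
        self.last_heartbeat[groom] = now
        # Assumption: a fresh heartbeat means the GroomServer restarted.
        self.blacklist.discard(groom)

    def detect_failures(self, now: float) -> List[str]:
        """Blacklist and return GroomServers whose heartbeat is overdue."""
        failed = [g for g, t in self.last_heartbeat.items()
                  if g not in self.blacklist and now - t > self.heartbeat_timeout]
        self.blacklist.update(failed)
        return sorted(failed)

    def healthy(self) -> List[str]:
        return sorted(g for g in self.last_heartbeat if g not in self.blacklist)


def reassign(tasks_by_groom: Dict[str, List[str]],
             failed_groom: str,
             healthy: List[str]) -> Dict[str, str]:
    """Move every task of the failed GroomServer to the least-loaded healthy one."""
    if not healthy:
        raise RuntimeError("no healthy GroomServer available")
    moved: Dict[str, str] = {}
    for task in tasks_by_groom.pop(failed_groom, []):
        # Placeholder policy (assumption): fewest assigned tasks wins.
        target = min(healthy, key=lambda g: len(tasks_by_groom.get(g, [])))
        tasks_by_groom.setdefault(target, []).append(task)
        moved[task] = target
    return moved
```

The mapping returned by `reassign` corresponds to the JobInProgress update; dispatching each task to its new GroomServer would follow from it. Any other load metric (CPU, memory) could replace the task count in the `min` key.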
...