...

  1. While executing, a task periodically pings its parent GroomServer.
  2. If the GroomServer does not receive a ping from the child within the timeout, it checks whether the child JVM is still running, for instance by executing jps to identify the child's status.
  3. GroomServer reports the failure back to NodeMonitor. NodeMonitor notifies TaskScheduler of the task failure.
  4. TaskScheduler updates JobInProgress.
  5. TaskScheduler searches for an appropriate GroomServer and reschedules the task to it.
  6. If the number of times the task has been rescheduled reaches the limit, the whole job fails.
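
The task-failure steps above can be sketched as follows. This is only an illustrative model, not Hama code: the names `PingWatchdog`, `Rescheduler`, `JobFailed`, and `max_reschedules` are hypothetical, the JVM liveness check (for example one built on jps) is passed in as a callable, and choosing the first other GroomServer stands in for whatever "appropriate" means under the real policy.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


class JobFailed(Exception):
    """Raised when a task cannot be rescheduled any more (hypothetical)."""


@dataclass
class PingWatchdog:
    """GroomServer side: remembers when each child task last pinged."""
    timeout: float
    last_ping: Dict[str, float] = field(default_factory=dict)

    def ping(self, task_id: str, now: float) -> None:
        self.last_ping[task_id] = now

    def failed_tasks(self, now: float,
                     is_jvm_alive: Callable[[str], bool]) -> List[str]:
        # Step 2: only tasks whose ping is overdue get the liveness check.
        overdue = [t for t, seen in self.last_ping.items()
                   if now - seen > self.timeout]
        # Assumption: a task is reported as failed only if its JVM is gone.
        return sorted(t for t in overdue if not is_jvm_alive(t))


@dataclass
class Rescheduler:
    """TaskScheduler side: moves a failed task and enforces the limit."""
    max_reschedules: int
    reschedules: Dict[str, int] = field(default_factory=dict)

    def reschedule(self, task_id: str, failed_groom: str,
                   grooms: List[str]) -> str:
        used = self.reschedules.get(task_id, 0)
        if used >= self.max_reschedules:
            # Step 6: the limit is reached, so the whole job fails.
            raise JobFailed(f"{task_id} reached the reschedule limit")
        candidates = [g for g in grooms if g != failed_groom]
        if not candidates:
            # Assumption: no other GroomServer is treated as a job failure.
            raise JobFailed(f"no other GroomServer available for {task_id}")
        self.reschedules[task_id] = used + 1
        return candidates[0]  # placeholder for the real selection policy


watchdog = PingWatchdog(timeout=3.0)
watchdog.ping("task_0", now=0.0)
watchdog.ping("task_1", now=4.0)
failed = watchdog.failed_tasks(now=5.0, is_jvm_alive=lambda task_id: False)

scheduler = Rescheduler(max_reschedules=1)
target = scheduler.reschedule(failed[0], "groom_a", ["groom_a", "groom_b"])
```

In this run only `task_0` is overdue, so it alone is checked and reported, and it is moved to `groom_b`. A second failure of the same task would raise `JobFailed` because `max_reschedules` is 1.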

...

  1. The NodeManager embedded in each GroomServer periodically sends a heartbeat to the NodeMonitor in BSPMaster (Hama-370).
  2. One of the GroomServers fails, which is indicated by BSPMaster losing the heartbeat from that GroomServer.
  3. NodeMonitor (Hama-363) collects metrics information, including CPU, memory, tasks, etc., from the healthy NodeManagers.
  4. Dispatch task(s) to GroomServer(s).
    1. NodeMonitor notifies TaskScheduler of the GroomServer failure and moves the failed GroomServer to the blacklist (it is moved back when the failed GroomServer restarts).
    2. TaskScheduler searches the node list for GroomServer(s) whose workload is not heavy (which GroomServer is chosen depends on the policy).
    3. Update the JobInProgress of the task(s) by assigning the failed tasks to the GroomServer(s) found in the previous step.
    4. Dispatch the task(s) to the designated GroomServer(s).
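
The GroomServer-failure steps above can be modeled roughly as below. This is a sketch under assumptions rather than Hama's implementation: the `NodeMonitor` class here is a simplified stand-in with hypothetical fields, `reassign` is a hypothetical helper, and "workload is not heavy" is approximated by the number of tasks currently assigned, whereas a real policy could also weigh the CPU and memory metrics.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class NodeMonitor:
    """Simplified model: tracks heartbeats and the blacklist."""
    heartbeat_timeout: float
    last_heartbeat: Dict[str, float] = field(default_factory=dict)
    blacklist: Set[str] = field(default_factory=set)

    def heartbeat(self, groom: str, now: float) -> None:
        self.last_heartbeat[groom] = now
        # A restarted GroomServer that reports again leaves the blacklist.
        self.blacklist.discard(groom)

    def detect_failures(self, now: float) -> List[str]:
        # A GroomServer fails when its heartbeat is older than the timeout.
        failed = sorted(g for g, seen in self.last_heartbeat.items()
                        if g not in self.blacklist
                        and now - seen > self.heartbeat_timeout)
        self.blacklist.update(failed)
        return failed

    def healthy(self) -> List[str]:
        return sorted(g for g in self.last_heartbeat
                      if g not in self.blacklist)


def reassign(failed_groom: str, assignments: Dict[str, List[str]],
             healthy: List[str]) -> Dict[str, str]:
    """Move each task of the failed GroomServer to the least-loaded one."""
    moved: Dict[str, str] = {}
    if not healthy:
        return moved  # nothing to dispatch to; tasks stay where they are
    for task in assignments.pop(failed_groom, []):
        # Assumed policy: fewest assigned tasks wins, ties by list order.
        target = min(healthy, key=lambda g: len(assignments.get(g, [])))
        assignments.setdefault(target, []).append(task)
        moved[task] = target
    return moved


monitor = NodeMonitor(heartbeat_timeout=10.0)
for groom in ("groom_a", "groom_b", "groom_c"):
    monitor.heartbeat(groom, now=0.0)
monitor.heartbeat("groom_a", now=12.0)
monitor.heartbeat("groom_c", now=12.0)

failed_grooms = monitor.detect_failures(now=15.0)
assignments = {"groom_a": ["t1", "t2"], "groom_b": ["t3", "t4"], "groom_c": []}
moved = reassign(failed_grooms[0], assignments, monitor.healthy())
```

Here `groom_b` misses its heartbeat and is blacklisted, and both of its tasks go to `groom_c`, which has the lighter load in this example. Calling `monitor.heartbeat("groom_b", now=20.0)` afterwards would take it off the blacklist again.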

...