...
- Check local disk:
  - Check whether the local disk is writable.
  - Delete files under the local directory (bsp.local.dir).
- Clear state table:
  - Clean up tasks (Map[TaskAttemptID -> TaskInProgress])
  - Initialize job (Map[BSPJobID -> RunningJob])
  - Clean up running tasks (Map[TaskAttemptID -> TaskInProgress])
- Configure the maximum number of tasks; the default is 3.
- Start HTTP server: An embedded HTTP service.
- Start task report server: Handles communication between the GroomServer and its spawned child tasks. See TaskRunner.BspChildRunner.
- Start worker server: An RPC service that listens for instructions from the master.
- Register with BSPMaster: The GroomServer enrolls itself with BSPMaster by sending its GroomServerStatus.
- Start message dispatcher (Instructor): (TODO: refactor needed)
- Start monitor service: A process that exports metrics, task status, and similar information.
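The local-disk step at the top of this list can be sketched as follows. This is a minimal illustration, not the GroomServer code: the class `LocalDirPreparer` and its method names are hypothetical, and it assumes that "delete files under local dir" means removing the directory's contents while keeping the directory itself.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical helper; the name is not taken from the Hama source.
class LocalDirPreparer {

    // Returns true if the path is an existing directory the process can write to.
    static boolean isWritable(Path localDir) {
        return Files.isDirectory(localDir) && Files.isWritable(localDir);
    }

    // Deletes everything under localDir but keeps localDir itself (assumption).
    static void clearContents(Path localDir) throws IOException {
        try (Stream<Path> paths = Files.walk(localDir)) {
            paths.filter(p -> !p.equals(localDir))
                 .sorted(Comparator.reverseOrder()) // children before their parents
                 .forEach(p -> {
                     try {
                         Files.delete(p);
                     } catch (IOException e) {
                         throw new UncheckedIOException(e);
                     }
                 });
        }
    }
}
```

A caller would pass the path configured under bsp.local.dir, check `isWritable` first, and only then call `clearContents`.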
State Diagram
The GroomServer is in one of the following states:
- NORMAL: Everything works fine.
- STALE: Entered when a DiskErrorException is thrown.
- DENIED: Indicates a failure to establish a connection to BSPMaster.
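The three states map naturally onto an enum. The sketch below is illustrative only: `GroomState`, `GroomStateTracker`, and the two callback methods are hypothetical names, and `java.io.IOException` stands in for DiskErrorException, whose class is not shown on this page.

```java
import java.io.IOException;

// Hypothetical enum mirroring the three states listed above.
enum GroomState { NORMAL, STALE, DENIED }

// Hypothetical holder that maps the two failure causes to states.
class GroomStateTracker {
    private GroomState state = GroomState.NORMAL;

    // Called when a disk check fails; IOException stands in for DiskErrorException.
    void onDiskError(IOException e) {
        state = GroomState.STALE;
    }

    // Called when the connection to BSPMaster cannot be established.
    void onMasterConnectionFailure() {
        state = GroomState.DENIED;
    }

    GroomState current() {
        return state;
    }
}
```

The page does not say whether a GroomServer can return to NORMAL, so the sketch deliberately leaves out any recovery transition.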
Task Management
The GroomServer
- receives instructions from BSPMasters.
- spawns one or more tasks as separate JVM processes, in which the tasks are then executed.
- monitors spawned processes via ping; when a task
  - is out of contact (failed or crashed): launches a new process and restarts the task, with the maximum number of attempts set to 3.
  - exceeds the maximum number of attempts: updates the task status and notifies BSPMasters.
- sends heartbeat to BSPMasters
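The restart policy described above can be sketched as a bounded retry loop. This is a sketch under assumptions, not the actual implementation: `TaskSupervisor` and `TaskOutcome` are hypothetical names, the `launch` callback stands in for spawning a child JVM and waiting for its result, and the limit of 3 is read as three launches in total, including the first.

```java
import java.util.function.BooleanSupplier;

// Hypothetical outcome that would be reported back to the master.
enum TaskOutcome { SUCCEEDED, FAILED }

// Hypothetical supervisor for a single task.
class TaskSupervisor {
    static final int MAX_ATTEMPTS = 3; // value taken from the text above

    private int attemptsUsed = 0;

    // Runs the task, relaunching it while attempts remain.
    // `launch` returns true if the task completed and false if it
    // went out of contact (failed or crashed).
    TaskOutcome supervise(BooleanSupplier launch) {
        while (attemptsUsed < MAX_ATTEMPTS) {
            attemptsUsed++;
            if (launch.getAsBoolean()) {
                return TaskOutcome.SUCCEEDED;
            }
        }
        // Attempts exhausted: this is where the status update and
        // the notification to the master would happen.
        return TaskOutcome.FAILED;
    }

    int attemptsUsed() {
        return attemptsUsed;
    }
}
```

Keeping the attempt counter inside the supervisor means the decision to give up is made locally, and the master only hears about the final outcome.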
Scenario
Normal Case
- A GroomServer forks a new task.
- The spawned task sends an ack with metrics to the GroomServer.
- Upon reception of task's ack, the GroomServer exports metrics to monitor system.
Failure Case
- A task failure event happens.
- The corresponding GroomServer detects the task failure.
- The GroomServer tries to restart the task.
- If the restart fails, the GroomServer reports this by marking the task as failed.
Components
- Registrator: Registers with GroomManager.
- TaskManager: Performs task management.
  - Launch a task.
  - Stop a task.
  - Kill a task.
  - Resume a task.
- Monitor: Reports GroomServer-related and task-related information.
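The four TaskManager operations could be expressed as a small interface. Everything below is an assumption made for illustration: the interface shape, the `TaskStatus` values, the use of a plain `String` as the task identifier, and the in-memory implementation are not taken from the Hama source, and the page does not define how "stop" differs from "kill" (here, a stopped task can be resumed and a killed one cannot).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical task states; not defined on this page.
enum TaskStatus { RUNNING, STOPPED, KILLED }

// Hypothetical interface covering the four operations listed above.
interface TaskManager {
    void launch(String taskId);
    void stop(String taskId);
    void kill(String taskId);
    void resume(String taskId);
    TaskStatus statusOf(String taskId);
}

// In-memory stand-in that only tracks status; it spawns no processes.
class InMemoryTaskManager implements TaskManager {
    private final Map<String, TaskStatus> tasks = new HashMap<>();

    public void launch(String taskId) {
        tasks.put(taskId, TaskStatus.RUNNING);
    }

    public void stop(String taskId) {
        tasks.replace(taskId, TaskStatus.RUNNING, TaskStatus.STOPPED);
    }

    public void kill(String taskId) {
        if (tasks.containsKey(taskId)) {
            tasks.put(taskId, TaskStatus.KILLED);
        }
    }

    // Only a stopped task can be resumed (assumption).
    public void resume(String taskId) {
        tasks.replace(taskId, TaskStatus.STOPPED, TaskStatus.RUNNING);
    }

    public TaskStatus statusOf(String taskId) {
        return tasks.get(taskId); // null if the task is unknown
    }
}
```

A Monitor component could read `statusOf` to report task-related information without needing access to the processes themselves.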