Introduction

GroomServer is a process whose main responsibility is to manage BSP tasks, which it does by spawning new Java processes. This fault isolation limits the damage when a task process fails. In addition to task management, GroomServer collaborates with BSPMaster so that jobs execute correctly. The work that GroomServer performs includes:

  1. Check local disk:
    1. Check that the local disk is writable.
    2. Delete files under the local directory (bsp.local.dir).
  2. Clear state table:
    1. Clean up tasks (Map[TaskAttemptID -> TaskInProgress]).
    2. Initialize jobs (Map[BSPJobID -> RunningJob]).
    3. Clean up running tasks (Map[TaskAttemptID -> TaskInProgress]).
    4. Configure the maximum number of tasks (default: 3).
  3. Start HTTP server: an embedded HTTP service.
  4. Start task report server: handles communication between the GroomServer and its spawned child tasks. See TaskRunner.BspChildRunner.
  5. Start worker server: an RPC service that listens for instructions from BSPMaster.
  6. Register with BSPMaster: the GroomServer enrolls itself with BSPMaster using its GroomServerStatus.
  7. Start message dispatcher (Instructor): (TODO: refactor needed)
  8. Start monitor service: a process that exports metrics, task status, and similar information.
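The first step in the list (checking that the local directory is writable, then emptying it) can be sketched as follows. This is a minimal illustration, not the GroomServer's actual code: the class `LocalDirCheck` and its methods are hypothetical names, and the sketch uses only `java.nio.file` rather than any Hama or Hadoop helper.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

/** Hypothetical helper for illustration; not part of the Hama code base. */
final class LocalDirCheck {
    private LocalDirCheck() {}

    /** Returns true if a probe file can be created and deleted inside dir. */
    static boolean isWritable(Path dir) {
        try {
            Files.createDirectories(dir);
            Path probe = Files.createTempFile(dir, "probe", ".tmp");
            Files.delete(probe);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    /** Deletes everything below dir but keeps dir itself. */
    static void clean(Path dir) throws IOException {
        try (Stream<Path> paths = Files.walk(dir)) {
            // Reverse order visits children before their parent directories.
            paths.sorted(Comparator.reverseOrder())
                 .filter(p -> !p.equals(dir))
                 .forEach(p -> {
                     try {
                         Files.delete(p);
                     } catch (IOException e) {
                         throw new UncheckedIOException(e);
                     }
                 });
        }
    }
}
```

A caller would pass the directory configured under bsp.local.dir to both methods and treat a `false` result from `isWritable` as a disk error.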

Task Management

The GroomServer

  • receives instructions from BSPMaster.
  • spawns one or more tasks as separate JVM processes, in which the tasks are then executed.
  • monitors spawned processes via ping; when a task is
    • out of contact (failed or crashed): launches a new process and restarts the task, with the maximum number of attempts set to 3
    • over the maximum number of attempts: updates the task status and notifies BSPMaster
  • sends heartbeats to BSPMaster.
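The restart rule above can be sketched as a small retry loop. This is an illustration under stated assumptions: `RestartPolicy` and `Outcome` are hypothetical names, each call to the supplied `attempt` stands in for launching a fresh child process, and the limit of 3 is read here as 3 attempts in total (the text does not say whether the first launch counts).

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of restart-with-limit; names are not Hama's API. */
final class RestartPolicy {
    enum Outcome { SUCCEEDED, FAILED }

    private final int maxAttempts;

    RestartPolicy(int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.maxAttempts = maxAttempts;
    }

    /**
     * Calls attempt until it returns true or maxAttempts calls were made.
     * A FAILED outcome is the point at which the task would be marked
     * failed and the master notified.
     */
    Outcome run(BooleanSupplier attempt) {
        for (int i = 1; i <= maxAttempts; i++) {
            if (attempt.getAsBoolean()) {
                return Outcome.SUCCEEDED;
            }
        }
        return Outcome.FAILED;
    }
}
```

With `new RestartPolicy(3)`, a task that succeeds on its second launch yields `SUCCEEDED` after two calls, while a task that never succeeds is launched three times and yields `FAILED`.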

Scenario

Normal Case

[Figure: normal case; image not available]

Failure Case

[Figure: failure case; image not available]

  • A task failure event occurs.
  • The corresponding GroomServer detects the task failure.
  • The GroomServer tries to restart the task.
  • If the restart fails, the GroomServer reports this by marking the task as failed.

Components

...

  • Launch a task.
  • Stop a task.
  • Kill a task.
  • Resume a task.

periodically reports task status back to BSPMaster.

State Diagram

GroomServer state includes

  • NORMAL: Everything works fine.
  • STALE: Entered when a DiskErrorException is thrown.
  • DENIED: The GroomServer failed to establish a connection to BSPMaster.
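The three states and their triggers can be sketched as an enum plus a classifier. Everything here is an assumption made for illustration: `GroomStateSketch` and `classify` are hypothetical names, the nested `DiskErrorException` is a local stand-in rather than the class the GroomServer actually uses, and `java.net.ConnectException` is chosen only to represent a failed connection to BSPMaster.

```java
import java.io.IOException;
import java.net.ConnectException;

/** Hypothetical sketch of the state rules; not the GroomServer source. */
final class GroomStateSketch {
    enum State { NORMAL, STALE, DENIED }

    /** Local stand-in for a disk error; defined here so the sketch compiles alone. */
    static class DiskErrorException extends IOException {
        DiskErrorException(String message) {
            super(message);
        }
    }

    private GroomStateSketch() {}

    /** Maps the failure seen during startup (or null for none) to a state. */
    static State classify(Exception failure) {
        if (failure == null) {
            return State.NORMAL;
        }
        if (failure instanceof DiskErrorException) {
            return State.STALE;
        }
        if (failure instanceof ConnectException) {
            return State.DENIED;
        }
        // The text names only two failure causes; anything else is out of scope.
        throw new IllegalArgumentException("unclassified failure: " + failure);
    }
}
```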

[Figure: GroomServer state diagram; image not available]

...