
Introduction

The main responsibilities of BSPMaster are described on the Architecture page.

Services

The BSPMaster is a collection of services performing different tasks, including:

  • masterServer: An RPC server.
  • instructor: Asynchronous message dispatcher.
  • taskScheduler: A task scheduling service.
  • infoServer: An HTTP server.
  • supervisor: (TODO: move to Monitor?)
  • systemDirCleaner: Cleans up the system directory (default /tmp/hadoop/bsp/system) on HDFS.
  • syncClient: BSPMaster ZooKeeper client (TODO: curator?)
  • timer service: TODO
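
One way to picture "a collection of services" is a small lifecycle container that starts its members in registration order and stops them in reverse. The sketch below is illustrative only: MasterService, ServiceGroup, and LoggingService are hypothetical names, not types from BSPMaster.java, and the start/stop ordering is an assumption rather than documented behaviour.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical lifecycle contract; not an interface from the Hama code base.
interface MasterService {
    String name();
    void start();
    void stop();
}

// Hypothetical container that owns the services listed above.
final class ServiceGroup {
    private final List<MasterService> registered = new ArrayList<>();
    private final List<MasterService> started = new ArrayList<>();

    void register(MasterService service) {
        registered.add(service);
    }

    // Starts services in registration order and remembers which ones came up.
    void startAll() {
        for (MasterService service : registered) {
            service.start();
            started.add(service);
        }
    }

    // Stops only the services that were started, in reverse order (assumed policy).
    void stopAll() {
        for (int i = started.size() - 1; i >= 0; i--) {
            started.get(i).stop();
        }
        started.clear();
    }
}

// Stand-in service that records its lifecycle calls in a shared log.
final class LoggingService implements MasterService {
    private final String name;
    private final List<String> log;

    LoggingService(String name, List<String> log) {
        this.name = name;
        this.log = log;
    }

    @Override public String name() { return name; }
    @Override public void start() { log.add("start " + name); }
    @Override public void stop() { log.add("stop " + name); }
}
```

Registering a LoggingService named masterServer and another named taskScheduler, then calling startAll() followed by stopAll(), records the two starts in order and the two stops in reverse order.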


State

A BSPMaster node is in one of the following states:

  • INITIALIZING
  • RUNNING
  • FAILED
  • SHUTTING DOWN
  • STOPPED
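
The states listed above map naturally onto a Java enum. The following is a minimal sketch, not the definition used in BSPMaster.java: the names MasterState and MasterStateHolder are hypothetical, and starting in INITIALIZING is an assumption. No transition rules are encoded because this page does not spell them out.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical enum mirroring the five states listed on this page.
enum MasterState {
    INITIALIZING("INITIALIZING"),
    RUNNING("RUNNING"),
    FAILED("FAILED"),
    SHUTTING_DOWN("SHUTTING DOWN"),
    STOPPED("STOPPED");

    private final String label;

    MasterState(String label) {
        this.label = label;
    }

    // Display name as written in the list above.
    String label() {
        return label;
    }
}

// Hypothetical thread-safe holder for the master's current state.
final class MasterStateHolder {
    // Assumption: a freshly created master begins in INITIALIZING.
    private final AtomicReference<MasterState> current =
            new AtomicReference<>(MasterState.INITIALIZING);

    MasterState get() {
        return current.get();
    }

    void set(MasterState next) {
        current.set(next);
    }
}
```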

BSPMaster State

Scenario

  • Restart
    • When a task is reported as failed on a groom server, restart the job by re-running all tasks from the latest checkpoint that is universally available. Re-running only the failed task is not enough, because the universally available checkpoint may be more than one step behind the current superstep, which can cause a deadlock between the live tasks and the restarted one during the sync phase. For example, suppose the latest universally available checkpoint is at the 6th superstep and the computation is currently running from the 7th to the 8th superstep. If one task fails, the system migrates it to another machine and resumes it from the 6th superstep checkpoint, while the other tasks keep running until they hit the barrier sync at the 8th superstep. A deadlock arises when the resumed task hits the barrier sync at the 7th superstep, because no other task is at the 7th superstep. Restarting all tasks is one proposed solution to the task failure issue. More complicated logic could be applied, but for now only the simpler approach may be implemented.
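
The restart scenario above can be sketched in a few lines. This is an illustration under stated assumptions, not code from BSPMaster.java: RestartPlanner, its methods, and the task names are hypothetical. It assumes that the universally available checkpoint is the minimum over each task's latest checkpointed superstep, and that a barrier completes only when every task waits at the same superstep.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper illustrating the restart-all policy.
final class RestartPlanner {

    // Assumption: the latest checkpoint every task has is the minimum of the per-task latest checkpoints.
    static long latestUniversalCheckpoint(Map<String, Long> latestCheckpointByTask) {
        if (latestCheckpointByTask.isEmpty()) {
            throw new IllegalArgumentException("no tasks");
        }
        return Collections.min(latestCheckpointByTask.values());
    }

    // Restart-all policy: every task resumes from the universally available checkpoint.
    static Map<String, Long> planRestart(Map<String, Long> latestCheckpointByTask) {
        long resumeFrom = latestUniversalCheckpoint(latestCheckpointByTask);
        Map<String, Long> plan = new LinkedHashMap<>();
        for (String task : latestCheckpointByTask.keySet()) {
            plan.put(task, resumeFrom);
        }
        return plan;
    }

    // A barrier can complete only if all tasks are waiting at the same superstep.
    static boolean barrierCanComplete(Map<String, Long> barrierSuperstepByTask) {
        return new HashSet<>(barrierSuperstepByTask.values()).size() <= 1;
    }

    public static void main(String[] args) {
        // Latest checkpointed superstep per task; task-1 lags, so superstep 6 is the universal one.
        Map<String, Long> checkpoints = new LinkedHashMap<>();
        checkpoints.put("task-0", 7L);
        checkpoints.put("task-1", 6L);
        checkpoints.put("task-2", 7L);

        // Restarting only task-1 from checkpoint 6: it waits at barrier 7, the others at barrier 8.
        Map<String, Long> partialRestart = new LinkedHashMap<>();
        partialRestart.put("task-0", 8L);
        partialRestart.put("task-1", 7L);
        partialRestart.put("task-2", 8L);

        System.out.println("universal checkpoint: " + latestUniversalCheckpoint(checkpoints));
        System.out.println("partial restart can sync: " + barrierCanComplete(partialRestart));
        System.out.println("restart-all plan: " + planRestart(checkpoints));
    }
}
```

Under the restart-all plan every task resumes from the same superstep, so all of them reach the next barrier together. The cost is that work already done beyond the checkpoint is recomputed by tasks that did not fail.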

Source

BSPMaster.java

BSPMaster (last edited 2014-04-20 06:30:04 by ChiaHungLin)