Abstract

CloudStack HA Service provides high availability to virtual machines (VMs) managed by CloudStack. It works with CloudStack orchestration to detect VM and hypervisor host failures and to restart the affected VMs. CloudStack HA is designed to operate within one zone and is confined to host and VM failures only. It does not offer availability over network, storage, or complete zone failures.

Design Principles

CloudStack HA is designed to ensure that HA does not corrupt the VMs it protects. Workloads requiring HA are not run on ephemeral VMs, and therefore the data on these VMs must not be corrupted by the HA process. CloudStack HA prioritizes VM integrity over VM availability. In conditions where CloudStack HA cannot safely restart a VM, it asks the administrator to be the arbitrator of whether it is safe to restart the VM.

CloudStack HA is designed to work with multiple technologies. It abstracts these functionalities into detection, investigation, and fencing steps and provides plugin interfaces so that new modules can be written to incorporate different technologies. The MTTR (mean time to recovery) for a VM in CloudStack HA depends on the technologies selected.

HA Phases

CloudStack HA consists of three different phases: Detection, Restart, and Manual Intervention. Detection deals with the different ways to detect possible VM and hypervisor host failures. Restart is the actual process of restarting the failed VMs. System administrators can manually intervene when Restart is unable to safely restart a VM.

Detection

VMs that have failed due to a VM failure or a hypervisor host failure can generally be safely restarted elsewhere. CloudStack HA provides detection for both of these cases.

Detecting VM Failures

CloudStack HA continuously monitors VM state changes on the hypervisor host to detect VMs that have failed there. This change detection is performed at a configurable interval (sync.interval), which defaults to 1 minute. Upon detecting such a change, CloudStack HA immediately proceeds to restart the VM.
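
As a rough sketch of this detection loop (hypothetical class and method names, not CloudStack's actual code), the logic amounts to periodically comparing the states reported by the hypervisor host against the states CloudStack expects:

```java
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the periodic VM state sync. Real CloudStack code differs.
public class VmStateSync {
    enum VmState { RUNNING, STOPPED }

    interface HostConnection {                    // stands in for the agent connection
        Map<String, VmState> reportVmStates();    // states as seen on the hypervisor
    }

    interface HaScheduler {                       // stands in for CloudStack HA
        void scheduleRestart(String vmName);
    }

    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    public void start(HostConnection host, Map<String, VmState> expected, HaScheduler ha,
                      long syncIntervalSeconds) {  // corresponds to sync.interval (default 60s)
        timer.scheduleAtFixedRate(() -> {
            Map<String, VmState> reported = host.reportVmStates();
            for (Map.Entry<String, VmState> e : expected.entrySet()) {
                // A VM we expect to be Running but the host reports as Stopped (or missing)
                // has failed on the host; hand it to HA for an immediate restart.
                if (e.getValue() == VmState.RUNNING
                        && reported.getOrDefault(e.getKey(), VmState.STOPPED) == VmState.STOPPED) {
                    ha.scheduleRestart(e.getKey());
                }
            }
        }, 0, syncIntervalSeconds, TimeUnit.SECONDS);
    }
}
```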

During CloudStack's VM orchestration, it is possible for VM operations to become stuck at the hypervisor's control plane. The reasons vary: the connection to the control plane may have been lost, the control plane may be busy, or the control plane may be corrupted and need to be restarted. When this happens, CloudStack does not know for sure whether a VM has been started or stopped, because it cannot confirm the operation's success or failure through the hypervisor's control plane. At the same time, it cannot simply reset the state of the VM to allow further operations. For example, when starting a VM, the command may have been sent to hypervisor A but no response received for a long time. If the VM's state is reset to Stopped, the VM can then be started on hypervisor B, but if hypervisor A actually started the VM, the two VM instances can write to the same disk and corrupt the VM. Under these situations, CloudStack leaves the VM in a transition state (Starting/Stopping/Migrating) and informs CloudStack HA to restart the VM only after ensuring that the VM cannot be corrupted.
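
The safety argument can be made concrete with a small sketch (hypothetical types, not the real orchestration code): when the outcome of a start command is unknown, the VM keeps its transition state and is handed to HA instead of being reset to Stopped:

```java
// Hypothetical sketch: what the orchestrator does when a start command's outcome is unknown.
public class StartOutcomeHandler {
    enum VmState { STOPPED, STARTING, RUNNING }

    interface HaScheduler { void scheduleHa(String vmId); }

    static class Vm {
        String id;
        VmState state;
    }

    public void onStartTimeout(Vm vm, HaScheduler ha) {
        // The start command was sent to hypervisor A but no answer came back.
        // Resetting to Stopped here would allow a second start on hypervisor B,
        // and two instances writing the same disk would corrupt the VM.
        // So the VM stays in Starting, and HA is asked to restart it only after
        // it has made sure the VM cannot be corrupted (investigation/fencing).
        assert vm.state == VmState.STARTING;
        ha.scheduleHa(vm.id);
    }
}
```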

Detecting Hypervisor Host Failures

CloudStack HA checks for failures of the hypervisor host. Upon detecting a hypervisor host failure, the VMs on that hypervisor host are restarted. This detection must distinguish between a lost connection to a hypervisor host and an actual failure of the hypervisor host. Differences in capabilities between hypervisor types, and even between versions of the same hypervisor type, can make proper detection difficult.

CloudStack HA achieves this by performing an application-level ping between CloudStack and the hypervisor host. When the application ping falls behind or cannot be performed, CloudStack sends a CheckHealthCommand to the hypervisor host. If this command cannot be performed or takes too long, CloudStack HA then invokes host investigators to check on the status of the hypervisor host in question.
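
A minimal sketch of this escalation, with hypothetical interface names standing in for the agent connection and the investigators, looks like the following:

```java
// Hypothetical sketch of the escalation path when the application ping falls behind.
public class HostHealthEscalation {
    enum HostStatus { UP, DOWN, UNKNOWN }

    interface Agent {
        boolean pingIsCurrent();               // application ping received in time?
        boolean checkHealth(long timeoutMs);   // stands in for CheckHealthCommand
    }

    interface HostInvestigator {
        HostStatus investigate();              // e.g. ping test, hypervisor-specific check
    }

    public HostStatus evaluate(Agent agent, Iterable<HostInvestigator> investigators) {
        if (agent.pingIsCurrent()) {
            return HostStatus.UP;
        }
        if (agent.checkHealth(60_000L)) {      // direct health check still answers in time
            return HostStatus.UP;
        }
        // Only now are the host investigators consulted.
        for (HostInvestigator investigator : investigators) {
            HostStatus status = investigator.investigate();
            if (status != HostStatus.UNKNOWN) {
                return status;
            }
        }
        return HostStatus.UNKNOWN;
    }
}
```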

Currently, two types of host investigators have been implemented: network ping investigation and hypervisor-specific investigation. Network ping investigation sends a PingTestCommand to a neighboring hypervisor host, which carries out a network ping and a network arping on the IP address of the hypervisor host in question. If the hypervisor host responds to the ping or arping, the host is considered to be alive and no HA is performed. If there is no response, the investigator returns that it cannot determine the status of the host. Note that it cannot return that the host is down, because ping and arping responses may be blocked, and their absence does not accurately indicate that the host is down.
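
The key property of this investigation is its asymmetry: a response proves the host is alive, but no response proves nothing. A hedged sketch (hypothetical names) of that decision:

```java
// Hypothetical sketch of the network ping investigation carried out by a neighboring host.
public class NetworkPingInvestigation {
    enum Result { ALIVE, UNKNOWN }   // note: never DOWN

    interface NeighborHost {         // stands in for the host receiving a PingTestCommand
        boolean ping(String ip);
        boolean arping(String ip);
    }

    public Result investigate(NeighborHost neighbor, String hostIp) {
        if (neighbor.ping(hostIp) || neighbor.arping(hostIp)) {
            return Result.ALIVE;     // any response proves the host is still up
        }
        // No response is not proof of death: ICMP/ARP replies may simply be blocked,
        // so the investigator can only say that it cannot determine the status.
        return Result.UNKNOWN;
    }
}
```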

If the network ping investigation cannot determine the status of the host, CloudStack HA then relies on the hypervisor-specific investigation. For VMware, there is no such investigation because the hypervisor host handles its own HA. For XenServer and KVM, CloudStack HA deploys a monitoring script that writes the current timestamp to a heartbeat file on shared storage. If the timestamp cannot be written, the hypervisor host self-fences by rebooting itself. For these two hypervisors, CloudStack HA sends a CheckOnHostCommand to a neighboring hypervisor host that shares the same storage. The neighbor then checks the heartbeat file on shared storage to see whether the heartbeat is still being written. If the heartbeat is still being written, the neighbor reports that the host in question is still alive. If the heartbeat file's timestamp lags behind by more than an acceptable timeout value, the neighbor reports that the host in question is down, and HA is started on the VMs on that host.
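
Assuming the monitoring script writes an epoch timestamp into the heartbeat file, the neighbor's check can be sketched as follows (illustrative code, not the actual XenServer/KVM scripts):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the heartbeat check a neighboring host performs on shared storage.
// Assumes the monitoring script writes an epoch-seconds timestamp into the heartbeat file.
public class HeartbeatCheck {
    enum Result { ALIVE, DOWN }

    public Result checkOnHost(Path heartbeatFile, long timeoutSeconds) throws IOException {
        long lastBeat = Long.parseLong(
                Files.readString(heartbeatFile, StandardCharsets.UTF_8).trim());
        long lagSeconds = (System.currentTimeMillis() / 1000L) - lastBeat;

        if (lagSeconds <= timeoutSeconds) {
            return Result.ALIVE;   // heartbeat is still being written; the host is up
        }
        // The heartbeat has lagged past the acceptable timeout. Because the monitoring
        // script self-fences (reboots) the host when it cannot write the heartbeat,
        // the host can safely be reported as down and HA started on its VMs.
        return Result.DOWN;
    }
}
```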

Restart Process

CloudStack's HighAvailabilityManager provides an infrastructure for implementing a high availability process. It is important to distinguish between what HighAvailabilityManager does and does not do. HighAvailabilityManager does not contain the code that magically finds out whether a virtual machine is alive or that fences it off within a CloudStack deployment. CloudStack works with too many different types of hypervisors, storage, and other resources for that to be implemented effectively in a single block of code. HighAvailabilityManager's job is to provide a well-defined process within which other components provide such capabilities.

HighAvailabilityManager defines three steps for its process.

  1. Investigation – Determines whether the VM is up, down, or in an unknown state
  2. Fencing – Fences the VM from using the network or storage
  3. Start – Starts the VM the normal way.

For the Investigation and Fencing steps, HighAvailabilityManager relies on adapters to provide the functionality, as different deployments may have different ways to perform those steps depending on the physical equipment deployed. There may be multiple adapters for each step. Adapters can be added, removed, and configured within components.xml. By the time step 3, Start, is performed, the VM is in the Stopped state and a normal start is performed. HighAvailabilityManager does not contain code that starts a VM in any special way.
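
The shape of these plugin contracts can be sketched as follows; the interface and method names are illustrative and do not match CloudStack's actual Java interfaces:

```java
// Hypothetical sketch of the adapter interfaces behind the three-step process.
// CloudStack's real interfaces differ in detail; this only shows the shape of the contract.
public class HaAdapters {
    enum VmInvestigationResult { RUNNING, STOPPED, UNKNOWN }

    // Step 1: Investigation adapters decide whether the VM is up, down, or unknown.
    interface Investigator {
        VmInvestigationResult isVmAlive(String vmId);
    }

    // Step 2: Fencing adapters cut the VM off from network/storage; true only if fenced.
    interface Fencer {
        boolean fenceOff(String vmId);
    }

    // Step 3 is just a normal VM start once the VM is known (or made) safe to start.
    interface VmStarter {
        void start(String vmId);
    }
}
```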

Each Investigator can return one of three states: it can report that the VM is Stopped or Running, and the third state is Unknown. It is important to distinguish between Stopped and Unknown. First, an Investigator should report Unknown when it does not know how to handle that particular type of VM or the particular environment. For example, if an Investigator is written to interface with vCenter, it should report Unknown for VMs running on XenServer. Second, the Investigator must know that a VM is Stopped in order to report Stopped. For example, suppose you wrote an Investigator that performs a network ping on a VM to determine if it is Running. If the VM responds and the MAC address is as expected, then it should report the VM as Running. However, if there is no response, that does not mean the VM is Stopped, because a VM may not respond due to a network disconnect or ICMP being disabled on the VM.

Each Fencer is responsible for making sure that the VM has been fenced off so that it cannot corrupt its own disk. It returns true if the VM is fenced and false if it is not fenced or the Fencer is unable to perform the task. The HighAvailabilityManager maintains a work queue in the database where it records which step of the HA process each VM is at.

Once the work has been scheduled, the HighAvailabilityManager worker threads work through the queue. On each work item, HighAvailabilityManager first determines whether the state of the VM has changed since the work was scheduled. If it has changed, HighAvailabilityManager cancels the work. If not, it checks whether the VM requires investigation. Investigation is not required when the host has been determined to be Down or an agent reports that the VM is Stopped. If investigation is required, HighAvailabilityManager asks each Investigator to investigate the VM until one Investigator returns that the VM is Running or Stopped. If the VM is found to be Running, HighAvailabilityManager checks whether the host the VM is on is Up. If the host is Up, it cancels the work item; if the host is Down, it reschedules the work item. If the VM is found to be Stopped, the VM is restarted. If none of the Investigators can determine the state of the VM, HighAvailabilityManager moves on to ask the Fencers to fence off the VM. If one of the Fencers reports that it was able to fence off the VM, the VM is restarted; if none of them can, the work item is rescheduled. Once the VM is successfully started, the work item is completed. If the VM start fails with an error that HighAvailabilityManager understands, the work is rescheduled. If the error is unknown, the work item is cancelled and an alert is filed.
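
The following sketch condenses that flow into code. The types and method names are hypothetical and several details of the real HighAvailabilityManager are omitted, but the decision order is the one described above:

```java
import java.util.List;

// Hypothetical sketch of how a worker thread processes one HA work item.
// Names are illustrative; the real HighAvailabilityManager implementation differs in detail.
public class HaWorker {
    enum VmInvestigationResult { RUNNING, STOPPED, UNKNOWN }

    interface Investigator { VmInvestigationResult investigate(String vmId); }
    interface Fencer { boolean fenceOff(String vmId); }
    interface VmStarter { boolean start(String vmId); }   // true if the normal start succeeds

    interface WorkItem {
        String vmId();
        boolean vmStateChangedSinceScheduled();
        boolean requiresInvestigation();   // false if host is Down or agent reported Stopped
        boolean hostIsUp();
        void cancel();
        void reschedule();
        void complete();
    }

    public void process(WorkItem work, List<Investigator> investigators,
                        List<Fencer> fencers, VmStarter starter) {
        if (work.vmStateChangedSinceScheduled()) {
            work.cancel();
            return;
        }

        VmInvestigationResult result = VmInvestigationResult.STOPPED;
        if (work.requiresInvestigation()) {
            result = VmInvestigationResult.UNKNOWN;
            for (Investigator inv : investigators) {
                result = inv.investigate(work.vmId());
                if (result != VmInvestigationResult.UNKNOWN) {
                    break;
                }
            }
        }

        if (result == VmInvestigationResult.RUNNING) {
            // Running on an Up host: nothing to do. Running but host Down: check again later.
            if (work.hostIsUp()) {
                work.cancel();
            } else {
                work.reschedule();
            }
            return;
        }

        if (result == VmInvestigationResult.UNKNOWN) {
            // Nobody could decide, so the VM must be fenced off before it can be restarted.
            boolean fenced = false;
            for (Fencer fencer : fencers) {
                if (fencer.fenceOff(work.vmId())) {
                    fenced = true;
                    break;
                }
            }
            if (!fenced) {
                work.reschedule();
                return;
            }
        }

        // VM is known Stopped or has been fenced: perform a normal start.
        if (starter.start(work.vmId())) {
            work.complete();
        } else {
            work.reschedule();   // the real code also distinguishes known errors (reschedule)
                                 // from unknown errors (cancel and file an alert)
        }
    }
}
```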

HighAvailabilityManager work items can be inspected in the op_ha_work table. A cleanup thread cleans up cancelled or completed items in the background. As HighAvailabilityManager completes each step, it writes the current step into this table. The steps are Scheduled, Investigating, Fencing, Restarting, Done, Cancelled, and Error.

There is a special condition called rolling death that HighAvailabilityManager handles specifically. Under certain conditions, restarting a VM repeatedly causes further failures due to bugs or other problems. HighAvailabilityManager stops HAing a VM if that same VM has already been HAed a certain number of times within a certain period of time.
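
One simple way to implement such a guard is a sliding window over recent restart times, as in this illustrative sketch (the thresholds and their mapping to the global settings are assumptions):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of a rolling-death guard: stop HAing a VM that has already been
// HAed too many times within a time window. Thresholds here are illustrative only.
public class RollingDeathGuard {
    private final int maxRestarts;          // e.g. max.retries
    private final long windowMillis;        // e.g. derived from time.between.failures
    private final Deque<Long> restartTimes = new ArrayDeque<>();

    public RollingDeathGuard(int maxRestarts, long windowMillis) {
        this.maxRestarts = maxRestarts;
        this.windowMillis = windowMillis;
    }

    /** Returns true if another HA restart is still allowed for this VM. */
    public synchronized boolean allowRestart(long nowMillis) {
        // Drop restarts that fall outside the sliding window.
        while (!restartTimes.isEmpty() && nowMillis - restartTimes.peekFirst() > windowMillis) {
            restartTimes.removeFirst();
        }
        if (restartTimes.size() >= maxRestarts) {
            return false;   // rolling death detected: give up and leave it to the admin
        }
        restartTimes.addLast(nowMillis);
        return true;
    }
}
```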

The following is a flow chart for how HighAvailabilityManager works through its process.

Investigator

The need for Investigators arises from the fact that the technologies in the environments CloudStack is deployed into change rapidly. Depending on what new technology is available, Investigators can be written quickly and plugged into the HA process to help determine whether a VM is Running or Stopped. The following are keys to the implementation of an Investigator.

  • Do not handle VMs or environments the Investigator is not written to handle. Be very specific. Return Unknown if not sure.
  • Do not treat false negatives as meaning the VM is down.
  • Do not treat false positives as meaning the VM is running.

The following is an example Investigator.

UserVmDomRInvestigator

UserVmDomRInvestigator sends a command to the domR VM, a virtual machine that CloudStack starts to provide network services for the user VM. It utilizes the fact that the domR VM is on the same network as the user VM and asks it to arp-ping the user VM's IP address. If the IP address responds to the arp-ping, the Investigator returns Running. However, if the IP address does not respond, it returns Unknown. It never returns Stopped, because there is no way to tell from the network standpoint that a VM is truly stopped.
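
Its decision logic reduces to a few lines; the sketch below uses hypothetical names for the command sent to the domR VM:

```java
// Hypothetical sketch of the UserVmDomRInvestigator decision logic.
public class DomRArpPingInvestigator {
    enum Result { RUNNING, UNKNOWN }   // Stopped is never returned

    interface DomR {                   // stands in for the command sent to the domR VM
        boolean arpPing(String userVmIp);
    }

    public Result investigate(DomR domR, String userVmIp) {
        if (domR.arpPing(userVmIp)) {
            return Result.RUNNING;     // the user VM answered on its own network
        }
        // No answer does not prove the VM is stopped; the network path may be the problem.
        return Result.UNKNOWN;
    }
}
```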

Fencer

A Fencer, like an Investigator, provides a way for new technologies to fence off a VM. The following are keys to the implementation of a Fencer.

  • Do not handle VMs or environments the Fencer is not written to handle. Be specific about what you can actually fence off.
  • Return true only if fencing has actually happened.

The following are example Fencers.

XenServerFencer

XenServerFencer depends on the ability of XenServer to self-fence on storage disconnect. Each XenServer writes a heartbeat to shared storage. If it is unable to write the heartbeat, the XenServer self-fences by rebooting. XenServerFencer verifies that the heartbeat has fallen behind and that the XenServer has therefore self-fenced, so the VM is fenced off from writing to its disks.

RecreateFencer

RecreateFencer works on VMs whose disks are re-creatable because the data on them is either not useful or can be recreated on reboot. It returns that these VMs are fenced because a new disk can be created for such a VM on every restart, so no disk corruption can occur.
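
A sketch of that idea (hypothetical Volume interface, not CloudStack's actual storage types):

```java
import java.util.List;

// Hypothetical sketch of the RecreateFencer idea: a VM whose disks can all be recreated
// on restart does not need traditional fencing, because a fresh disk is used each time.
public class RecreateFencerSketch {
    interface Volume {
        boolean isRecreatable();   // data is disposable or rebuilt on reboot
        void markForRecreation();  // a new disk will be created at the next start
    }

    /** Returns true (fenced) only if every disk of the VM can safely be recreated. */
    public boolean fenceOff(List<Volume> vmVolumes) {
        for (Volume volume : vmVolumes) {
            if (!volume.isRecreatable()) {
                return false;      // this fencer cannot handle VMs with persistent disks
            }
        }
        for (Volume volume : vmVolumes) {
            volume.markForRecreation();
        }
        return true;               // disk corruption is impossible with brand-new disks
    }
}
```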

Configuration

HighAvailabilityManager is configurable via the following variables.

global setting | description | default value
stop.retry.interval | The time in seconds between retries to stop or destroy a VM. | 600
restart.retry.interval | The time in seconds between retries to restart a VM. | 600
time.between.cleanup | The time in seconds to wait before the cleanup thread runs for the different HA worker threads. The cleanup thread finds all the work items that were successful and are now ready to be purged from the database (table: op_ha_work). | 86400
time.between.failures | The time in seconds before trying to clean up all the VMs that are registered for an HA event, were successful, and are now ready to be purged. | 3600
max.retries | The number of times to retry work items of the different work types: Migration (migrate VMs off a host), Destroy (destroy a VM), Stop (stop a VM for storage pool migration purposes), CheckStop (check whether a VM has been stopped), ForceStop (force a VM to stop even if its state does not allow it), and HA (restart a VM). | 5
time.to.sleep | The time in seconds to sleep if no work items are found. | 60
ha.workers | The number of high-availability worker threads to spin off to do the processing. | 5
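
For illustration only, these settings could be consumed roughly as follows, falling back to the documented defaults when a value is absent (this is not CloudStack's actual configuration code):

```java
import java.util.Map;

// Hypothetical sketch of reading the HA global settings with their documented defaults.
public class HaConfig {
    final long restartRetryIntervalSec;
    final long timeToSleepSec;
    final int maxRetries;
    final int haWorkers;

    public HaConfig(Map<String, String> globalSettings) {
        restartRetryIntervalSec = parseLong(globalSettings.get("restart.retry.interval"), 600);
        timeToSleepSec = parseLong(globalSettings.get("time.to.sleep"), 60);
        maxRetries = (int) parseLong(globalSettings.get("max.retries"), 5);
        haWorkers = (int) parseLong(globalSettings.get("ha.workers"), 5);
    }

    private static long parseLong(String value, long defaultValue) {
        try {
            return value == null ? defaultValue : Long.parseLong(value.trim());
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }
}
```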

Manual Intervention

As part of the CloudStack HA design, CloudStack HA must not restart a VM if it cannot be sure that it can do so safely. In those cases, the VM stays in a transitional state, and the system administrator can manually intervene by checking to make sure the VM, or the host the VM is on, is shut down, and then calling CloudStack to stop the VM with the forced stop flag set to true. When called in this manner, CloudStack attempts to stop the VM but continues to release resources even if some of the operations against the hypervisor host cannot be completed. After doing this, CloudStack HA stops any attempt to restart the VM, and the VM can be restarted by the administrator.

Putting it together

The above sequence diagram shows the normal application ping and storage heartbeat being sent by hypervisors 1 and 2. At some point, the application ping falls behind, and the monitor informs CloudStack that the application ping has fallen behind. CloudStack then launches an investigation. On the first timeout, CloudStack was still able to contact the host, so it leaves the host's status unchanged. On the second timeout, CloudStack was no longer able to contact the host, so it asks the investigators to check on the host. The first investigator, a PingTestInvestigator, was inconclusive because it timed out on the network ping. The second investigator, the XenServerInvestigator, finds that the storage heartbeat is no longer being written. At this stage, CloudStack hands off to the HA manager to HA all of the VMs on that host.

The above sequence diagram shows HA Manager's investigation process.
