Support access to a (hypervisor) host’s out-of-band management interface (e.g. IPMI, iLO, DRAC) to manage host power operations (on, off, etc.) and to query the current power state. (Note: this feature applies to hypervisor hosts.)

 Bug Reference

CLOUDSTACK-9299 (ASF JIRA)

...

| Version | Author / Reviewer | Date |
|---|---|---|
| 1.0 | Rohit Yadav / ACS-dev community | 13/March/2016 |

  

 

Use case

The following are some common use cases for host out-of-band management:
  • Restarting stalled/failed (hypervisor) hosts
  • Powering off hosts (under-utilised hosts with no VMs)
  • Powering on hosts for provisioning and to increase capacity (for hosts that are already added in CloudStack)
  • Allowing system administrators to see the current power state of the host

Feature Specifications

Given the wide range of system management interfaces (e.g. IPMI, iLO, DRAC), the out-of-band management service will separate the general power management functions from their implementation using plugins. The service uses a finite state machine (FSM) to transition a host's out-of-band management power state. The valid out-of-band management power states of a host are:

  • On: The host is powered on
  • Off: The host is powered off
  • Unknown: The host's power status cannot be determined
  • Disabled: Out-of-band management is not enabled for the host

The FSM defaults to UNKNOWN and transitions based on the power state retrieved. If the power status cannot be retrieved (e.g. no provider available, not configured, unable to connect to the system management interface), the state remains UNKNOWN. The power management service will provide the following operations to effect changes to this state machine:

  • ON: hard power on a host resulting in the ON state
  • OFF: hard power off a host resulting in the OFF state
  • CYCLE: hard power off and then power on a host resulting in the ON state
  • RESET: warm boot of a host resulting in the ON state
  • SOFT: soft shutdown via ACPI of a host resulting in the OFF state
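The states and operations listed above can be summarised in a small lookup table. The following is only an illustrative sketch: the names `PowerState`, `PowerOperation` and `expected_state` are hypothetical and are not taken from the CloudStack code base.

```python
from enum import Enum


class PowerState(Enum):
    """Out-of-band management power states of a host."""
    ON = "On"
    OFF = "Off"
    UNKNOWN = "Unknown"
    DISABLED = "Disabled"


class PowerOperation(Enum):
    """Power operations offered by the power management service."""
    ON = "on"
    OFF = "off"
    CYCLE = "cycle"
    RESET = "reset"
    SOFT = "soft"


# State the FSM is expected to reach once an operation succeeds,
# following the list of operations above.
RESULTING_STATE = {
    PowerOperation.ON: PowerState.ON,
    PowerOperation.OFF: PowerState.OFF,
    PowerOperation.CYCLE: PowerState.ON,
    PowerOperation.RESET: PowerState.ON,
    PowerOperation.SOFT: PowerState.OFF,
}


def expected_state(operation: PowerOperation) -> PowerState:
    """Return the power state a successful operation should result in."""
    return RESULTING_STATE[operation]
```

For example, `expected_state(PowerOperation.CYCLE)` evaluates to `PowerState.ON`.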

The Out-of-band Management service also needs to account for the following failure scenarios:

  • The power status of the host is changed outside of CloudStack (e.g. a datacenter technician manually powers down a host)
  • Connectivity to the host's system management interface is lost
  • The credentials of the out-of-band management interface were changed without updating them in CloudStack; the admin should be notified via alerts/email

The Management Server will regularly poll the system management interface of each host to refresh the power state and verify credentials. A background sync thread will schedule STATUS check jobs and update each host's out-of-band management power state using the FSM. Each host's out-of-band management will be owned by one management server, which will be responsible for performing sync checks on it. In an environment with multiple management servers, the load of these background sync checks will therefore be shared among them. To protect against slow responses from the out-of-band management interface, all out-of-band management operations should be bounded by a timeout. The timeout will be configured globally with a per-cluster override. Finally, the power management FSM state transitions work as follows:

  • ON->OFF: The out-of-band management power state sync daemon detects that the host was turned off
  • OFF->ON: The out-of-band management power state sync daemon detects that the host was turned on
  • ON->UNKNOWN, OFF->UNKNOWN: The out-of-band management power state sync daemon loses connectivity to the system management controller
  • UNKNOWN->OFF, UNKNOWN->ON: The out-of-band management power state sync daemon gains connectivity to the system management controller
  • Any->DISABLED: Out-of-band management is disabled for a host
  • DISABLED->UNKNOWN, DISABLED->ON, DISABLED->OFF: When out-of-band management is enabled for a host, the state can transition out of DISABLED based on the event
  • User triggered power management action
  • All changes to a host's out-of-band power state would be broadcast on the event bus
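The transitions listed above can be written down as a set of (from, to) pairs. The sketch below is a simplified illustration, not the actual implementation: `can_transition` and `sync_state` are hypothetical names, and the choice to leave DISABLED hosts untouched during a sync is an assumption.

```python
from typing import Optional

STATES = ("ON", "OFF", "UNKNOWN", "DISABLED")

# Transitions taken from the list above.
TRANSITIONS = {
    ("ON", "OFF"), ("OFF", "ON"),
    ("ON", "UNKNOWN"), ("OFF", "UNKNOWN"),
    ("UNKNOWN", "ON"), ("UNKNOWN", "OFF"),
    ("DISABLED", "UNKNOWN"), ("DISABLED", "ON"), ("DISABLED", "OFF"),
}
# "Any->DISABLED": every other state may move to DISABLED.
TRANSITIONS |= {(state, "DISABLED") for state in STATES if state != "DISABLED"}


def can_transition(current: str, target: str) -> bool:
    """Staying in the same state is treated as a no-op and always allowed."""
    return current == target or (current, target) in TRANSITIONS


def sync_state(current: str, observed: Optional[str]) -> str:
    """Return the new state after one background sync check.

    `observed` is "ON" or "OFF" as reported by the management interface,
    or None when the power status could not be retrieved.
    """
    if current == "DISABLED":
        # Assumption: the sync daemon does not change disabled hosts.
        return current
    target = observed if observed in ("ON", "OFF") else "UNKNOWN"
    if not can_transition(current, target):
        raise ValueError(f"invalid transition {current}->{target}")
    return target
```

With this table, a host in ON whose management interface becomes unreachable moves to UNKNOWN, and moves back to ON or OFF once a later check succeeds.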

A finite state machine governs how states change in response to an action or event. (Note: the diagram lacks arrows from Disabled to On/Off, but the FSM should also honour these transitions.)

The out-of-band management power state FSM is described in the following diagram:

 

Implementation

The Power Management Service will provide the following new root admin APIs:

  • changeOutOfBandManagementPassword: Changes the system management interface password on the host and in the Management Server database. If the password change fails either on the host or in the database, the password will be reset to the old password in both the Management Server database and on the host.
  • configureOutOfBandManagement: Add or update a configuration about a host’s system management interface (e.g. host/ip address, port, credentials, driver, etc.)
  • enableOutOfBandManagementForHost/disableOutOfBandManagementForHost: Toggle whether or not CloudStack should use the configured out-of-band management for a host
  • enableOutOfBandManagementForCluster/disableOutOfBandManagementForCluster: Toggle whether or not CloudStack should use the configured out-of-band management in a cluster. The setting is not cascaded to the hosts, but kept separately in the cluster_details table.
  • enableOutOfBandManagementForZone/disableOutOfBandManagementForZone: Toggle whether or not CloudStack should use the configured out-of-band management in a zone. The setting is not cascaded to hosts/clusters, but kept separately in the data_center_details table.
  • issueOutOfBandManagementPowerAction: Initiates the specified power management operation (see above) on the system management interface. This API will support an optional operation timeout which overrides the global and cluster-level settings. If a host is in maintenance mode, executing any power action in the UI should show a warning to the admin user in the popup.
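The timeout precedence described for issueOutOfBandManagementPowerAction (API parameter, then cluster-level override, then global setting) could look roughly like the sketch below. The function name `resolve_timeout` and the rule that non-positive values are ignored are assumptions made for illustration only.

```python
from typing import Optional

# Default of the global setting outofbandmanagement.action.timeout (seconds).
GLOBAL_ACTION_TIMEOUT = 60


def resolve_timeout(api_timeout: Optional[int] = None,
                    cluster_timeout: Optional[int] = None,
                    global_timeout: int = GLOBAL_ACTION_TIMEOUT) -> int:
    """Pick the effective operation timeout in seconds.

    The timeout passed with the API call wins over the cluster-level
    override, which in turn wins over the global setting.
    """
    for candidate in (api_timeout, cluster_timeout):
        # Assumption: missing or non-positive values mean "not set".
        if candidate is not None and candidate > 0:
            return candidate
    return global_timeout
```

For example, `resolve_timeout(api_timeout=10, cluster_timeout=120)` returns 10, while `resolve_timeout(cluster_timeout=120)` returns 120.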

An out-of-band management driver can be implemented to support any of the popular variants. In general, most servers support IPMI 2.0 on their out-of-band management interface, so the feature will include an ipmitool-based driver. No reliable Java library with an Apache-compatible license was available, so this driver shells out to run ipmitool with the appropriate arguments.

The following diagram shows high level design and architecture of out-of-band management for CloudStack:

Global settings

| Global Setting Name | Description | Default value |
|---|---|---|
| outofbandmanagement.action.timeout | The out of band management action timeout in seconds, configurable by cluster | 60 |
| outofbandmanagement.ipmitool.interface | The out of band management IpmiTool driver interface to use. Default: lanplus. Valid values are: lan, lanplus, open etc. | lanplus |
| outofbandmanagement.ipmitool.path | The out of band management ipmitool path used by the IpmiTool driver. Default: /usr/bin/ipmitool. | /usr/bin/ipmitool |
| outofbandmanagement.ipmitool.retries | The out of band management IpmiTool driver retries option -R. Default 1. | 1 |
| outofbandmanagement.sync.interval | The out of band management background sync thread interval in seconds | 900 |
| outofbandmanagement.sync.poolsize | The out of band management background sync thread pool size | 50 |

The outofbandmanagement.ipmitool.* settings are specific to the ipmitool driver. Changing these values doesn't require restarting the management server; updated values are picked up dynamically.

The outofbandmanagement.ipmitool.interface setting is the interface option used when running the ipmitool driver (passed as -I <value>). The default value makes ipmitool use the lanplus interface, which is IPMI 2.0. The outofbandmanagement.ipmitool.retries setting is the number of retries for each command execution (passed as -R <value> when running ipmitool). For more details run ipmitool -h or read http://linux.die.net/man/1/ipmitool
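To make the role of these settings concrete, the following sketch shows how a driver might assemble and run an ipmitool command line. It is an illustration under stated assumptions, not the CloudStack driver: the function names are hypothetical, the defaults mirror the global settings above, and the parsing assumes the usual "Chassis Power is on/off" output of `chassis power status`.

```python
import subprocess
from typing import List


def build_ipmitool_command(host: str, port: int, username: str, password: str,
                           action: str = "status",
                           path: str = "/usr/bin/ipmitool",
                           interface: str = "lanplus",
                           retries: int = 1) -> List[str]:
    """Build an argument list such as: ipmitool -I lanplus -R 1 ... chassis power status."""
    return [path, "-I", interface, "-R", str(retries),
            "-H", host, "-p", str(port),
            "-U", username, "-P", password,
            "chassis", "power", action]


def parse_power_status(output: str) -> str:
    """Map ipmitool's status output to an FSM state name."""
    text = output.strip().lower()
    if text.endswith("is on"):
        return "ON"
    if text.endswith("is off"):
        return "OFF"
    return "UNKNOWN"


def query_power_state(host: str, port: int, username: str, password: str,
                      timeout: int = 60, **options) -> str:
    """Run the status command, bounded by a timeout; any failure yields UNKNOWN."""
    command = build_ipmitool_command(host, port, username, password,
                                     "status", **options)
    try:
        result = subprocess.run(command, capture_output=True, text=True,
                                timeout=timeout)
    except (subprocess.TimeoutExpired, OSError):
        # Slow interface or missing ipmitool binary.
        return "UNKNOWN"
    if result.returncode != 0:
        return "UNKNOWN"
    return parse_power_status(result.stdout)
```

Note that passing the password with -P makes it visible in the process list; this sketch does so only for brevity.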

For any IPMI 2.0 compliant out-of-band management interface, these options/values should work out of the box. For example, the feature should work out of the box on both iDRAC and iLO with IPMI 2.0 enabled.

The outofbandmanagement.sync.interval setting is the amount of time after which the background thread starts a new scan of the available hosts with out-of-band management enabled and tries to find and update their power state. Changing this setting requires restarting the management server, as the thread pool is only initialized when the management server starts.

The outofbandmanagement.sync.poolsize setting is the maximum number of ipmitool background power state scanners that can run at a time. Based on the maximum number of hosts you have, you can increase or decrease the value depending on how much load your management server host can endure. For example, in an environment with the default global settings and 5000 hosts, a background power-state sync scan will take at most (total number of out-of-band management enabled hosts in that round * outofbandmanagement.action.timeout / outofbandmanagement.sync.poolsize) seconds per round, i.e. (5000 * 60 / 50) = 6000 seconds. To ensure scan attempts do not overlap, you should set outofbandmanagement.sync.interval to at least 6000 or increase outofbandmanagement.sync.poolsize. Changing this setting requires restarting the management server, as the thread pool executor is only configured and initialized when the management server starts.
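The sizing formula above is easy to evaluate for other environments. The helper names below are hypothetical and the calculation is only the worst-case estimate from the text (every check runs into the action timeout).

```python
import math


def worst_case_scan_seconds(hosts: int, action_timeout: int = 60,
                            pool_size: int = 50) -> float:
    """Upper bound for one sync round: hosts * timeout / pool size."""
    return hosts * action_timeout / pool_size


def min_pool_size(hosts: int, sync_interval: int,
                  action_timeout: int = 60) -> int:
    """Smallest pool size for which a worst-case round fits in the interval."""
    return math.ceil(hosts * action_timeout / sync_interval)


print(worst_case_scan_seconds(5000))           # 6000.0
print(min_pool_size(5000, sync_interval=900))  # 334
```

With 5000 hosts and the default timeout and pool size, a worst-case round takes 6000 seconds, so either the sync interval has to be at least that long or the pool has to grow (to 334 threads for a 900 second interval).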

UI

A. Power state icons:

  • Green: On state
  • Red: Off state
  • Orange: Unknown state
  • Grey: Disabled state

B. Out-of-band management power state in Host view and Host metrics view:

 

C. Out-of-band management power state in Host metrics view:

 

D. Out-of-band management operation icons and information tab on Host view page:

 

E. Out-of-band management configuration dialog box:

 

F. Out-of-band management issue power action dialog:

 

G. Out-of-band management change password dialog:

Testing

To test the feature, a purely Python-based IPMI simulator can be used: https://pypi.python.org/pypi/ipmisim

`ipmisim` is also used as an importable library in the Marvin tests and can be installed using: pip install --upgrade ipmisim.

Note for OSX users: OSX comes with ipmitool version 2.5b1, which is not compatible with ipmisim. To use ipmisim, run brew install ipmitool and verify that the installed version is 1.8 or later.

 

 

To run out-of-band management test:

$ nosetests --with-xunit --xunit-file=integration-test-results.xml --with-marvin --marvin-config=<cfg file> -s -a tags=advanced,required_hardware=true --zone=LangurCloud-basic --hypervisor=KVM  test/integration/component/test_outofbandmanagement.py

test_01_configure_oobm_invalid (integration.component.test_outofbandmanagement.TestOutOfBandManagement) ... === TestName: test_01_configure_oobm_invalid | Status : SUCCESS ===
ok
test_02_configure_oobm_valid (integration.component.test_outofbandmanagement.TestOutOfBandManagement) ... === TestName: test_02_configure_oobm_valid | Status : SUCCESS ===
ok
test_03_enabledisable_oobm_invalid (integration.component.test_outofbandmanagement.TestOutOfBandManagement) ... === TestName: test_03_enabledisable_oobm_invalid | Status : SUCCESS ===
ok
test_04_enabledisable_oobm_valid (integration.component.test_outofbandmanagement.TestOutOfBandManagement) ... === TestName: test_04_enabledisable_oobm_valid | Status : SUCCESS ===
ok
test_05_enabledisable_across_clusterzones_oobm_valid (integration.component.test_outofbandmanagement.TestOutOfBandManagement) ... === TestName: test_05_enabledisable_across_clusterzones_oobm_valid | Status : SUCCESS ===
ok
test_06_oobm_issue_power_action (integration.component.test_outofbandmanagement.TestOutOfBandManagement) ... === TestName: test_06_oobm_issue_power_action | Status : SUCCESS ===
ok
test_07_oobm_background_powerstate_sync (integration.component.test_outofbandmanagement.TestOutOfBandManagement) ... === TestName: test_07_oobm_background_powerstate_sync | Status : SUCCESS ===
ok
test_08_multiple_mgmt_server_ownership (integration.component.test_outofbandmanagement.TestOutOfBandManagement) ... === TestName: test_08_multiple_mgmt_server_ownership | Status : SUCCESS ===
ok
test_09_oobm_change_password (integration.component.test_outofbandmanagement.TestOutOfBandManagement) ... === TestName: test_09_oobm_change_password | Status : SUCCESS ===
ok

----------------------------------------------------------------------
Ran 9 tests in 238.677s

OK