...
Given the wide range of system management interfaces (e.g. IPMI, iLO, DRAC), the out-of-band management service will separate the general out-of-band management functions from their implementation using plugins. The service uses a finite state machine (FSM) to transition a host's out-of-band management power state. The valid out-of-band management power states of a host are:
The FSM defaults to UNKNOWN and transitions based on the power state retrieved. If the power status cannot be retrieved (e.g. no provider available, not configured, unable to connect to the system management interface), the state remains UNKNOWN. The power management service will provide the following operations to effect changes to this state machine:
The Out-of-band Management service also needs to account for the following failure scenarios:
The Management Server will regularly poll the system management interface of each host to refresh its power state and verify credentials. A background sync thread will schedule STATUS check jobs and update the out-of-band management power state of each host using an FSM. Each out-of-band management enabled host will be owned by a management server, which is responsible for performing sync checks on it. In an environment with multiple management servers, for example, the servers share the load of performing these background sync checks. To protect against slow responses from the out-of-band management interface, all out-of-band management operations should be bounded by a timeout. The timeout will be configured globally with a per-cluster override. Finally, the power management FSM state transitions work as follows:
The FSM governs how the power state changes in response to an action or event. (Note: the diagram lacks arrows from Disable to On/Off, but the FSM should also honour such transitions.)
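The event-driven behaviour described above can be illustrated with a small table-driven state machine. This is only a sketch: the event names (`ON`, `OFF`, `AUTO_UNKNOWN`, `DISABLED`, `ENABLED`) and the exact transition table are assumptions made for illustration, not the transitions defined by the feature. It does include the Disable to On/Off transitions that the note says should be honoured.

```python
from enum import Enum


class PowerState(Enum):
    ON = "On"
    OFF = "Off"
    UNKNOWN = "Unknown"
    DISABLED = "Disabled"


class Event(Enum):
    # Hypothetical event names, chosen for this sketch only.
    ON = "On"
    OFF = "Off"
    AUTO_UNKNOWN = "AutoUnknown"
    DISABLED = "Disabled"
    ENABLED = "Enabled"


# Assumed transition table: (current state, event) -> next state.
TRANSITIONS = {}
for state in PowerState:
    TRANSITIONS[(state, Event.ON)] = PowerState.ON
    TRANSITIONS[(state, Event.OFF)] = PowerState.OFF
    TRANSITIONS[(state, Event.AUTO_UNKNOWN)] = PowerState.UNKNOWN
    TRANSITIONS[(state, Event.DISABLED)] = PowerState.DISABLED
# Re-enabling a disabled host starts again from UNKNOWN until a status is read.
TRANSITIONS[(PowerState.DISABLED, Event.ENABLED)] = PowerState.UNKNOWN


def next_state(current, event):
    """Return the state reached from `current` on `event`, or raise ValueError."""
    try:
        return TRANSITIONS[(current, event)]
    except KeyError:
        raise ValueError(f"no transition from {current.name} on {event.name}") from None
```

Keeping the transitions in a single table makes it easy to compare the code against the diagram and to reject any (state, event) pair that the diagram does not allow.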
The out-of-band management power state FSM is described in the following diagram:
The Power Management Service will provide the following new root admin APIs:
The out-of-band management driver can be implemented to support any of the popular variants. Most servers support IPMI 2.0 on their out-of-band management interface, so the feature will include an ipmitool-based driver. No reliable Java library with an Apache-compatible license was available, so this driver shells out to run ipmitool with the appropriate arguments.
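As a rough illustration of the shell-out approach, the sketch below runs a status command with a bounded timeout and maps its output to a power state. It is written in Python for brevity and is not the driver's actual code. The function name `read_power_state` is hypothetical, and the parsing assumes output of the form "Chassis Power is on" or "Chassis Power is off", as printed by `ipmitool chassis power status`.

```python
import subprocess


def read_power_state(argv, timeout_seconds=60):
    """Run a power status command and return 'On', 'Off' or 'Unknown'.

    Any failure (timeout, missing binary, non-zero exit, unexpected output)
    is reported as 'Unknown' rather than raised.
    """
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_seconds
        )
    except (subprocess.TimeoutExpired, OSError):
        return "Unknown"
    if result.returncode != 0:
        return "Unknown"
    output = result.stdout.strip().lower()
    if output.endswith("is on"):
        return "On"
    if output.endswith("is off"):
        return "Off"
    return "Unknown"
```

A caller would pass the full ipmitool argument list ending in `chassis power status`. Treating every failure as "Unknown" mirrors the rule that the state remains UNKNOWN when the power status cannot be retrieved.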
The following diagram shows the high-level design and architecture of out-of-band management for CloudStack:
Global Setting Name | Description | Default values |
outofbandmanagement.action.timeout | The out of band management action timeout in seconds, configurable by cluster | 60 |
outofbandmanagement.ipmitool.interface | The out of band management IpmiTool driver interface to use. Default: lanplus. Valid values are: lan, lanplus, open etc. | lanplus |
outofbandmanagement.ipmitool.path | The out of band management ipmitool path used by the IpmiTool driver. Default: /usr/bin/ipmitool. | /usr/bin/ipmitool |
outofbandmanagement.ipmitool.retries | The out of band management IpmiTool driver retries option -R. Default 1. | 1 |
outofbandmanagement.sync.interval | The out of band management background sync thread interval in seconds | 900 |
outofbandmanagement.sync.poolsize | The out of band management background sync thread pool size | 50 |
The outofbandmanagement.ipmitool.* settings are specific to the ipmitool driver. Changing these values does not require restarting the management server; updated values are picked up dynamically.
The outofbandmanagement.ipmitool.interface setting is the interface option passed to ipmitool by the driver (as -I <value>). The default value makes ipmitool use the lanplus interface, which is IPMI 2.0. The outofbandmanagement.ipmitool.retries setting is the number of retries for each command execution (passed to ipmitool as -R <value>). For more details run ipmitool -h or read http://linux.die.net/man/1/ipmitool.
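To make the mapping from settings to command line concrete, here is a minimal sketch of how an ipmitool invocation could be assembled. The function name and parameters are hypothetical. The `-I` and `-R` options come from the text above, and `-H`, `-U` and `-P` are ipmitool's standard host, username and password options. The defaults mirror the global settings table.

```python
def build_ipmitool_command(address, username, password, action_args,
                           path="/usr/bin/ipmitool",
                           interface="lanplus",
                           retries=1):
    """Return the argument list for one ipmitool invocation.

    path      -> outofbandmanagement.ipmitool.path
    interface -> outofbandmanagement.ipmitool.interface (passed as -I)
    retries   -> outofbandmanagement.ipmitool.retries   (passed as -R)
    """
    return [
        path,
        "-I", interface,
        "-R", str(retries),
        "-H", address,
        "-U", username,
        "-P", password,
        *action_args,
    ]


# Example: the argument list for a power status query (address is a placeholder).
command = build_ipmitool_command(
    "10.0.0.10", "admin", "secret", ["chassis", "power", "status"]
)
```

Building an argument list, rather than a single shell string, avoids quoting problems when credentials contain spaces or shell metacharacters.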
For any IPMI 2.0 compliant out-of-band management interface, these options and values should work out of the box. For example, the feature should work out of the box on both iDRAC and iLO with IPMI 2.0 enabled.
The outofbandmanagement.sync.interval setting is the interval after which the background sync thread starts a new scan of the hosts that have out-of-band management enabled and tries to find and update their power state. Changing this setting requires restarting the management server, as the thread pool is only initialized when the management server starts.
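One round of such a background scan could look roughly like the sketch below, which checks every host through a bounded thread pool. The names `sync_round` and `check_power_state` are hypothetical, and this Python sketch only illustrates the idea; it is not the management server's scheduler. A caller would invoke it once per sync interval.

```python
from concurrent.futures import ThreadPoolExecutor


def sync_round(hosts, check_power_state, pool_size=50):
    """Check each host once, running at most `pool_size` checks concurrently.

    `check_power_state` is a caller-supplied function taking a host and
    returning its power state. A failing check is recorded as 'Unknown'.
    Returns a dict mapping each host to its reported state.
    """
    hosts = list(hosts)

    def safe_check(host):
        try:
            return check_power_state(host)
        except Exception:
            return "Unknown"

    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # pool.map preserves input order, so states line up with hosts.
        states = list(pool.map(safe_check, hosts))
    return dict(zip(hosts, states))
```

Catching exceptions per host keeps one unreachable management interface from aborting the whole round.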
The outofbandmanagement.sync.poolsize setting is the maximum number of ipmitool background power state scanners that can run at a time. Depending on the number of hosts you have, you can increase or decrease the value according to how much load your management server host can handle. A background power-state sync round will take at most (number of out-of-band management enabled hosts in that round * outofbandmanagement.action.timeout / outofbandmanagement.sync.poolsize) seconds to complete. Changing this setting requires restarting the management server, as the thread pool executor is only configured and initialized when the management server starts.
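The upper bound above is easy to evaluate when sizing the pool. The helper below simply applies the formula as stated. The function name is hypothetical, and the default arguments are the default timeout and pool size from the settings table.

```python
def worst_case_sync_seconds(enabled_hosts, action_timeout=60, pool_size=50):
    """Upper bound on one sync round: hosts * timeout / pool size, in seconds."""
    return enabled_hosts * action_timeout / pool_size


# With the default settings, 1000 enabled hosts give 1000 * 60 / 50 seconds.
bound = worst_case_sync_seconds(1000)
```

For 1000 enabled hosts the bound is 1200 seconds, which is longer than the default sync interval of 900 seconds. In a deployment of that size, a larger pool or a shorter timeout may be worth considering.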
A. Power state icons:
B. Out-of-band management power state in Host view and Host metrics view:
C. Out-of-band management power state in Host metrics view:
D. Out-of-band management operation icons and information tab on Host view page:
E. Out-of-band management configuration dialog box:
F. Out-of-band management issue power action dialog:
G. Out-of-band management change password dialog:
To test the feature, a pure-Python IPMI simulator can be used: https://pypi.python.org/pypi/ipmisim
`ipmisim` was also used as an importable library to write the Marvin tests. It can be installed using: pip install --upgrade ipmisim.
Note for OSX users: OSX comes with ipmitool version 2.5b1, which is not compatible with ipmisim. To use ipmisim, run brew install ipmitool and verify that the installed version is 1.8 or later.
To run the out-of-band management tests:
$ nosetests --with-xunit --xunit-file=integration-test-results.xml --with-marvin --marvin-config=<cfg file> -s -a tags=advanced,required_hardware=true --zone=LangurCloud-basic --hypervisor=KVM test/integration/component/test_outofbandmanagement.py
test_01_configure_oobm_invalid (integration.component.test_outofbandmanagement.TestOutOfBandManagement) ... === TestName: test_01_configure_oobm_invalid | Status : SUCCESS ===
ok
test_02_configure_oobm_valid (integration.component.test_outofbandmanagement.TestOutOfBandManagement) ... === TestName: test_02_configure_oobm_valid | Status : SUCCESS ===
ok
test_03_enabledisable_oobm_invalid (integration.component.test_outofbandmanagement.TestOutOfBandManagement) ... === TestName: test_03_enabledisable_oobm_invalid | Status : SUCCESS ===
ok
test_04_enabledisable_oobm_valid (integration.component.test_outofbandmanagement.TestOutOfBandManagement) ... === TestName: test_04_enabledisable_oobm_valid | Status : SUCCESS ===
ok
test_05_enabledisable_across_clusterzones_oobm_valid (integration.component.test_outofbandmanagement.TestOutOfBandManagement) ... === TestName: test_05_enabledisable_across_clusterzones_oobm_valid | Status : SUCCESS ===
ok
test_06_oobm_issue_power_action (integration.component.test_outofbandmanagement.TestOutOfBandManagement) ... === TestName: test_06_oobm_issue_power_action | Status : SUCCESS ===
ok
test_07_oobm_background_powerstate_sync (integration.component.test_outofbandmanagement.TestOutOfBandManagement) ... === TestName: test_07_oobm_background_powerstate_sync | Status : SUCCESS ===
ok
test_08_multiple_mgmt_server_ownership (integration.component.test_outofbandmanagement.TestOutOfBandManagement) ... === TestName: test_08_multiple_mgmt_server_ownership | Status : SUCCESS ===
ok
test_09_oobm_change_password (integration.component.test_outofbandmanagement.TestOutOfBandManagement) ... === TestName: test_09_oobm_change_password | Status : SUCCESS ===
ok
----------------------------------------------------------------------
Ran 9 tests in 238.677s
OK