Ambari auto start for services and components
Summary
This document describes the Ambari auto start feature before and after version 2.4.0.
Ambari auto start is a feature that enables certain components to be marked for auto start so that whenever a node restarts, ambari agent automatically restarts the stopped components. Auto start of a component is based on its current state and desired state.
Before Ambari 2.4.0
Auto start of services and components is supported through several properties in the ambari.properties file. However, this approach is static: whenever auto start has to be turned on or off for a service component, these properties have to be modified and the ambari server restarted for the changes to take effect. Moreover, the ambari agent has to be restarted so that it can bootstrap with the server and pick up the new auto start configuration.
Ambari 2.4.0
Auto start is dynamic. No restart of ambari server or ambari agent is required for any changes to take effect. All auto start properties reside in the database. API support has been added to configure the auto start setting for services and have ambari server communicate the changes to the ambari agents during the subsequent registration or heartbeat. Ambari web (UI) uses the APIs to dynamically control the auto start settings.
How auto start works in Ambari versions prior to 2.4.0
When an ambari agent starts, it bootstraps with the ambari server via registration. The server sends information to the agent about the components that have been enabled for auto start along with the other auto start properties in ambari.properties. The agent compares the current state of these components against the desired state, to determine if these components are to be installed, started, restarted or stopped.
Ambari.properties
To enable components for auto start, specify them using recovery.enabled_components=A,B,C. For example:
# Enable Metrics Collector auto-restart
recovery.type=AUTO_START
recovery.enabled_components=METRICS_COLLECTOR
recovery.lifetime_max_count=1024
Here’s a sample snippet of the auto start configuration that is sent to the agent by the server during agent registration:
"recoveryConfig": {
"type" : "AUTO_START",
"maxCount" : 10,
"windowInMinutes" : 60,
"retryGap" : 0,
"enabledComponents" : "a,b",
"disabledComponents" : "c,d"
}
For example, if the current state of METRICS_COLLECTOR component on a host is INSTALLED but it is enabled for auto start, the desired state is STARTED. The recovery manager generates a start command for METRICS_COLLECTOR which is executed by the controller.
Recovery scenarios
Depending on the value of recovery_type (DEFAULT, AUTO_START, FULL) attribute in ambari.properties file, the following recovery commands are supported. DEFAULT means auto start is disabled by default.
Summary of recovery_type values and state transitions
Attribute: recovery_type | Commands | State Transitions |
AUTO_START | Start | INSTALLED → STARTED |
FULL | Install, Start, Restart, Stop | INIT → INSTALLED, INIT → STARTED, INSTALLED → STARTED, STARTED → STARTED, STARTED → INSTALLED |
DEFAULT | None | Auto start feature disabled |
Detailed state transitions for various recovery_type values
Current state | Desired state | Recovery command | Recovery mode | Remarks |
INSTALLED | STARTED | Start | AUTO_START | Start a component |
INSTALLED | STARTED | Start | FULL | Start a component |
INSTALLED | INSTALLED | Install | FULL | Stale component configurations. |
INIT | STARTED | Install | FULL | Start a component |
INIT | INSTALLED | Install | FULL | Install a component |
STARTED | STARTED | Restart | FULL | Stale component configurations |
STARTED | INSTALLED | Stop | FULL | Stop a component |
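The table above can be read as a lookup keyed on current state, desired state and recovery mode. The following is a minimal Python sketch of that lookup, not the agent's actual recovery manager code. The function name recovery_command and the stale_config flag are hypothetical, and the sketch assumes that the two same-state rows apply only when the component's configuration is stale, as the Remarks column suggests.

```python
from typing import Optional

# (current state, desired state) -> (recovery command, modes that allow it).
# Rows are copied from the detailed state transition table.
TRANSITIONS = {
    ("INSTALLED", "STARTED"): ("Start", {"AUTO_START", "FULL"}),
    ("INSTALLED", "INSTALLED"): ("Install", {"FULL"}),
    ("INIT", "STARTED"): ("Install", {"FULL"}),
    ("INIT", "INSTALLED"): ("Install", {"FULL"}),
    ("STARTED", "STARTED"): ("Restart", {"FULL"}),
    ("STARTED", "INSTALLED"): ("Stop", {"FULL"}),
}


def recovery_command(current: str, desired: str, mode: str,
                     stale_config: bool = False) -> Optional[str]:
    """Return the recovery command for a component, or None if nothing applies."""
    entry = TRANSITIONS.get((current, desired))
    if entry is None:
        return None
    command, allowed_modes = entry
    if mode not in allowed_modes:
        # DEFAULT never appears in the table, so it never yields a command.
        return None
    if current == desired and not stale_config:
        # Assumption: same-state rows are triggered only by stale configurations.
        return None
    return command


print(recovery_command("INSTALLED", "STARTED", "AUTO_START"))  # Start
print(recovery_command("STARTED", "INSTALLED", "AUTO_START"))  # None
```

Keeping the rules in a table rather than in nested conditionals makes it easy to check each row against the documentation.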
How auto start works in Ambari version 2.4.0
Recovery scenarios
Please note that only Auto start recovery mode is supported, i.e., components that are in INSTALLED state can be transitioned to STARTED state. Ambari server sends the AUTO_START value for recovery type to the agent. Sample recovery configuration sent by the server to the agent:
"recoveryConfig": {
"type" : "AUTO_START",
"maxCount" : 10,
"windowInMinutes" : 60,
"retryGap" : 0,
"components" : "a,b",
"recoveryTimestamp" : 1458150424380
}
Enabling or disabling auto start feature from the UI:
New RESTful APIs to capture the service and component names for auto start
Support for multi instance services and components
Fresh installs and upgrades
In a fresh install, all services will be set to auto start by default. In upgrades this will not be the default. The user has to enable auto start via the UI.
Maintenance mode
Auto start will be ignored for host components which are in maintenance mode. A host component can be in maintenance mode due to one or more of the following reasons:
The host component was placed in maintenance mode
The host was placed in maintenance mode
The service was placed in maintenance mode
The cluster that the host belongs to was placed in maintenance mode.
The maintenance state of a component is read from the maintenance_state field in the hostcomponentdesiredstate table:
cluster_id | host_id | service_name | component_name | maintenance_state |
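The maintenance rule above amounts to a logical OR over four levels. A minimal Python sketch follows, assuming each level has already been reduced to a boolean; the function names are hypothetical and the actual values stored in maintenance_state are not modelled here.

```python
def is_in_maintenance(component: bool = False, host: bool = False,
                      service: bool = False, cluster: bool = False) -> bool:
    """A host component is in maintenance mode if any enclosing level is."""
    return any((component, host, service, cluster))


def eligible_for_auto_start(recovery_enabled: bool, in_maintenance: bool) -> bool:
    """Auto start is ignored for host components in maintenance mode."""
    return recovery_enabled and not in_maintenance


# A component enabled for auto start on a host that is in maintenance mode.
in_maintenance = is_in_maintenance(host=True)
print(eligible_for_auto_start(True, in_maintenance))  # False
```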
Auto start properties
The auto start setting is per service instance and is stored in the recovery_enabled field of the servicecomponentdesiredstate table. All other properties, such as recovery.type, recovery.lifetime_max_count, recovery.max_count, recovery.window_in_minutes and recovery.retry_interval, are global: they apply to all service and component instances in the cluster and are stored in the clusterconfig table under the cluster-env type. Per service instance or per component instance settings would be too noisy for little or no benefit.
Persistence
Properties for auto start will be stored in the database. The idea is to use servicecomponentdesiredstate and clusterconfig table and distribute the information across these tables.
Blueprint based deployments
Microsoft uses blueprints for deployment (headless deployments). Blueprints do not have any room for specifying settings properties. Blueprint schema will have to be modified to accommodate settings. All components are auto started.
Specify a set or all of the components to be auto started. If it is a set, then explicitly call out the list of components. For all components, specify * since we don’t know the list of all components.
Specify METRICS_COLLECTOR as the default auto started component for both the UI and blueprints in the stack definition, with the ability for blueprint authors to exclude METRICS_COLLECTOR from auto start.
Blueprints can override the default list specified in the stack definition. During deployment, the servicecomponentdesiredstate table’s recovery_enabled field is set to true or false for each component.
Attributes will be stored in cluster-env.xml. Cluster-env.xml contains the following non-volatile properties:
recovery_type
recovery_lifetime_max_count
recovery_max_count
recovery_window_in_minutes
recovery_retry_interval
recovery_enabled
/var/lib/ambari-server/resources/stacks/HDP/<version>/configuration/cluster-env.xml
<configuration>
<property>
<name>recovery_type</name>
<value>AUTO_START</value>
<description>Recovery type</description>
</property>
:
:
</configuration>
Enabling components for auto start
Components can be enabled for auto start in any of the following ways:
Stack definition:
/var/lib/ambari-server/resources/common-services/<service_name>/<version>/metainfo.xml specifies whether a component is enabled for auto start.
To enable a component for auto start in the stack definition, the XML snippet <recovery_enabled>true</recovery_enabled> should be specified. For example, to enable METRICS_COLLECTOR for auto start, its stack definition file common-services/AMBARI_METRICS/0.1.0/metainfo.xml should contain the <recovery_enabled> line shown at the end of this excerpt:
<metainfo>
<schemaVersion>2.0</schemaVersion>
<services>
<service>
<name>AMBARI_METRICS</name>
<displayName>Ambari Metrics</displayName>
<version>0.1.0</version>
<comment>A system for metrics collection that provides storage and retrieval capability for metrics collected from the cluster
</comment>
<components>
<component>
<name>METRICS_COLLECTOR</name>
<displayName>Metrics Collector</displayName>
<category>MASTER</category>
<recovery_enabled>true</recovery_enabled>
Blueprint definition:
When using blueprint deployments, the components specified in the blueprint JSON will override the ones specified in the stack definition.
UI based deployments:
Based on the stack definition, while deploying a cluster using the UI, the servicecomponentdesiredstate table’s new field recovery_enabled is updated by the backend with true/false based on whether the component is enabled or disabled for auto start.
Changes to the auto start value of one or more components are made from the UI. The changes are written to the servicecomponentdesiredstate table (recovery_enabled column), which is the source of truth when the ambari server communicates with the ambari agent.
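To illustrate how the stack definition default could be read, here is a minimal Python sketch that extracts the recovery_enabled flag per component from a metainfo.xml document using the standard library. The embedded XML is a shortened, closed-off version of the excerpt above; EXAMPLE_COMPONENT is a hypothetical second component, and treating a missing <recovery_enabled> element as false is an assumption rather than documented behaviour.

```python
import xml.etree.ElementTree as ET

METAINFO = """
<metainfo>
  <schemaVersion>2.0</schemaVersion>
  <services>
    <service>
      <name>AMBARI_METRICS</name>
      <components>
        <component>
          <name>METRICS_COLLECTOR</name>
          <category>MASTER</category>
          <recovery_enabled>true</recovery_enabled>
        </component>
        <component>
          <name>EXAMPLE_COMPONENT</name>
          <category>SLAVE</category>
        </component>
      </components>
    </service>
  </services>
</metainfo>
"""


def recovery_defaults(metainfo_xml: str) -> dict:
    """Map each component name to its auto start default."""
    root = ET.fromstring(metainfo_xml)
    defaults = {}
    for component in root.iter("component"):
        name = component.findtext("name")
        # Assumption: components without the element are not auto started.
        flag = component.findtext("recovery_enabled", default="false")
        defaults[name] = flag.strip().lower() == "true"
    return defaults


print(recovery_defaults(METAINFO))
```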
Blueprint schema
Use cluster-env section in the blueprint JSON to specify cluster specific auto start attributes.
JSON for enabling auto start:
"settings" : [
  { "recovery_settings" : [ { "recovery_enabled" : "true" } ] },
  { "service_settings" : [
    { "name" : "HDFS", "recovery_enabled" : "false" },
    { "name" : "TEZ", "recovery_enabled" : "false" }
  ] },
  { "component_settings" : [ { "name" : "DATANODE", "recovery_enabled" : "true" } ] }
]
Blueprint processor hands off this list to the deployment module so that servicecomponentdesiredstate table can be updated.
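As an illustration of what the blueprint processor has to extract, the Python sketch below pulls the per-service and per-component flags out of the settings section. It is only a sketch: the helper named_flags is hypothetical, the fragment is wrapped in braces so that it parses as a standalone JSON document, and how service-level and component-level flags interact is not defined here.

```python
import json

BLUEPRINT_FRAGMENT = """
{
  "settings": [
    {"recovery_settings": [{"recovery_enabled": "true"}]},
    {"service_settings": [
      {"name": "HDFS", "recovery_enabled": "false"},
      {"name": "TEZ", "recovery_enabled": "false"}
    ]},
    {"component_settings": [{"name": "DATANODE", "recovery_enabled": "true"}]}
  ]
}
"""


def named_flags(blueprint: dict, section_name: str) -> dict:
    """Collect name -> recovery_enabled for one settings section."""
    flags = {}
    for section in blueprint.get("settings", []):
        for entry in section.get(section_name, []):
            if "name" in entry and "recovery_enabled" in entry:
                # The blueprint stores booleans as the strings "true"/"false".
                flags[entry["name"]] = entry["recovery_enabled"] == "true"
    return flags


blueprint = json.loads(BLUEPRINT_FRAGMENT)
service_flags = named_flags(blueprint, "service_settings")
component_flags = named_flags(blueprint, "component_settings")
print(service_flags)
print(component_flags)
```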
Component autostart hierarchy
Stack definition will contain the default list of components to be enabled or disabled.
Blueprint definition can use the cluster-env section to specify a list which will override the one specified in the stack definition.
UI will get its list from the stack definition.
The backend will update the servicecomponentdesiredstate table with the list coming in from the UI or Blueprint.
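The hierarchy above can be sketched as a two-level merge. This Python example assumes that a blueprint overrides the stack definition component by component and leaves unlisted components at their stack default; the text does not say whether the override is per component or replaces the whole list, so treat that as an assumption. The function name is hypothetical.

```python
from typing import Dict, Optional


def resolve_recovery_enabled(stack_defaults: Dict[str, bool],
                             blueprint_overrides: Optional[Dict[str, bool]] = None
                             ) -> Dict[str, bool]:
    """Return the flags the backend would write to servicecomponentdesiredstate."""
    resolved = dict(stack_defaults)  # copy so the stack defaults stay untouched
    if blueprint_overrides is not None:
        resolved.update(blueprint_overrides)
    return resolved


stack = {"METRICS_COLLECTOR": True, "DATANODE": False}

# UI based deployment: the stack definition is used as is.
print(resolve_recovery_enabled(stack))

# Blueprint deployment: the blueprint author removes METRICS_COLLECTOR.
print(resolve_recovery_enabled(stack, {"METRICS_COLLECTOR": False}))
```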
Ambari Metric Service specific changes
The METRICS_COLLECTOR component is set to auto start by default in ambari.properties in Ambari versions earlier than 2.4.0. In 2.4.0, this setting has been migrated to /var/lib/ambari-server/resources/common-services/AMBARI_METRICS/<version>/metainfo.xml with the <recovery_enabled>true</recovery_enabled> entry.
Backward compatibility
Ambari.properties will be ignored. All values come from either the stack definition for UI based deployments or blueprint for blueprint based deployments. Cluster-env.xml or the cluster-env section of the blueprint supplies the auto start properties listed above.
Pre-populate settings in the DB: During deployment, the backend will populate the servicecomponentdesiredstate table with true/false values for the various components, taken from the stack definition or the blueprint.
Communication
The ambari agent communicates with the ambari server during registration (at start up) and through periodic heartbeats. On either event the server can send changes to the auto start settings of services and components, giving the agent an opportunity to apply them.
Registration
The server sends the following JSON to the agent during registration.
{
"recoveryConfig":
{
"type" : "AUTO_START",
"maxCount" : "5",
"windowInMinutes" : 20,
"retryGap" : 2,
"maxLifetimeCount" : 5,
"components" : "METRICS_COLLECTOR, OOZIE_SERVER"
}
}
The components member contains a list of components enabled for auto start and not in maintenance mode.
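A minimal Python sketch of how such a payload could be assembled is shown below. It is not the server's implementation: the function name and the input dictionaries are hypothetical, and the mapping from cluster-env property names to JSON members (for example recovery_retry_interval to retryGap) is an assumption based on the similarity of the names.

```python
import json


def build_recovery_config(cluster_env: dict, components: list) -> dict:
    """Assemble the recoveryConfig structure sent to an agent."""
    enabled = [
        c["component_name"]
        for c in components
        # Only components enabled for auto start and not in maintenance mode.
        if c["recovery_enabled"] and not c["in_maintenance"]
    ]
    return {
        "recoveryConfig": {
            "type": cluster_env["recovery_type"],
            "maxCount": int(cluster_env["recovery_max_count"]),
            "windowInMinutes": int(cluster_env["recovery_window_in_minutes"]),
            "retryGap": int(cluster_env["recovery_retry_interval"]),
            "maxLifetimeCount": int(cluster_env["recovery_lifetime_max_count"]),
            "components": ",".join(enabled),
        }
    }


cluster_env = {
    "recovery_type": "AUTO_START",
    "recovery_max_count": "5",
    "recovery_window_in_minutes": "20",
    "recovery_retry_interval": "2",
    "recovery_lifetime_max_count": "5",
}
components = [
    {"component_name": "METRICS_COLLECTOR", "recovery_enabled": True, "in_maintenance": False},
    {"component_name": "OOZIE_SERVER", "recovery_enabled": True, "in_maintenance": True},
    {"component_name": "YARN_CLIENT", "recovery_enabled": False, "in_maintenance": False},
]
payload = build_recovery_config(cluster_env, components)
print(json.dumps(payload, indent=2))
```

In this example only METRICS_COLLECTOR ends up in the components member, because OOZIE_SERVER is in maintenance mode and YARN_CLIENT is not enabled for auto start.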
Heartbeat
If the auto start value for one or more components changes and/or the cluster-env level recovery properties change, the above JSON is constructed with the changed components and sent to the agent during the subsequent heartbeat.
Database
Cluster specific properties
The following cluster level properties will be stored under the cluster-env type in clusterconfig table as a JSON:
Property name | Value(s) | Description |
recovery_type | DEFAULT, AUTO_START | DEFAULT: No auto start. AUTO_START: auto start only. |
recovery_lifetime_max_count | ||
recovery_max_count | ||
recovery_window_in_minutes | ||
recovery_retry_interval | ||
recovery_enabled | true, false | Cluster level recovery |
Cluster config table:
cluster_id | type_name | version_tag | version | config_data |
2 | cluster-env | version1 | 1 | {...,"recovery_lifetime_max_count":"1024","recovery_max_count":"6","recovery_type":"AUTO_START","recovery_retry_interval":"5"} |
The recovery_enabled value from clusterconfig overrides the value from servicecomponentdesiredstate for that cluster.
Service component specific properties
The servicecomponentdesiredstate table will be used to specify whether a component is enabled for auto start or not. The recovery_enabled column is new. Existing attributes in ambari.properties are mapped to the new column as follows.
recovery.disabled_components/recovery.enabled_components → recovery_enabled (boolean)
cluster_id | component_name | service_name | recovery_enabled |
2 | YARN_CLIENT | YARN | 0 |
2 | METRICS_COLLECTOR | AMBARI_METRICS | 1 |
2 | OOZIE_SERVER | OOZIE | 1 |
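The mapping from the two comma-separated properties to per-component boolean rows can be sketched as follows. This is an illustration only: the function name is hypothetical, and letting the disabled list win when a component appears in both lists is an assumption.

```python
def to_recovery_enabled(enabled_components: str,
                        disabled_components: str = "") -> dict:
    """Convert recovery.enabled_components / recovery.disabled_components
    values into component_name -> recovery_enabled flags."""
    def split(value: str) -> list:
        return [name.strip() for name in value.split(",") if name.strip()]

    flags = {name: True for name in split(enabled_components)}
    for name in split(disabled_components):
        flags[name] = False  # assumption: an explicit disable takes precedence
    return flags


print(to_recovery_enabled("METRICS_COLLECTOR,OOZIE_SERVER", "YARN_CLIENT"))
```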
REST API
Get auto-start flags of a cluster
Type: GET
Request: api/v1/clusters/<cluster_name>?fields=Clusters/desired_configs/cluster-env
{
"href" : "http://c6404.ambari.apache.org:8080/api/v1/clusters/testcluster?fields=Clusters/desired_configs/cluster-env",
"Clusters" : {
"cluster_name" : "testcluster",
"version" : "HDP-2.2",
"desired_configs" : {
"cluster-env" : {
"tag" : "version1",
"user" : "admin",
"version" : 1
}
}
}
}
Type: GET
Request: api/v1/clusters/<cluster_name>/configurations?type=cluster-env&tag=version<xxx>
Example Response:
{
href: "...",
items: [
{
href: "...",
tag: "version<xxx>",
type: "cluster-env",
version: 2,
Config: {
cluster_name: "c1",
stack_id: "HDP-2.3"
},
properties: {
fetch_nonlocal_groups: "true",
ignore_groupsusers_create: "false",
kerberos_domain: "EXAMPLE.COM",
override_uid: "true",
repo_suse_rhel_template: "...",
repo_ubuntu_template: "{{package_type}} {{base_url}} {{components}}",
security_enabled: "false",
smokeuser: "ambari-qa",
smokeuser_keytab: "/etc/security/keytabs/smokeuser.headless.keytab",
user_group: "hadoop",
recovery_enabled: "false",
recovery_type: "AUTO_START",
recovery_lifetime_max_count: 10,
recovery_max_count: 2,
recovery_window_in_minutes: 10,
recovery_retry_interval: 5000
}
}
]
}
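The two GET requests above form a two-step lookup: the first returns the current cluster-env tag, the second returns the properties for that tag. The Python sketch below shows that flow offline, using canned responses trimmed from the samples above instead of real HTTP calls. The helper names are hypothetical, and the second response is written as strict JSON with quoted keys, which is an assumption about the wire format.

```python
import json
from urllib.parse import quote

BASE_URL = "http://c6404.ambari.apache.org:8080"  # host taken from the sample above


def desired_config_url(cluster_name: str) -> str:
    """Step 1: URL that reports the current cluster-env tag."""
    return (f"{BASE_URL}/api/v1/clusters/{quote(cluster_name)}"
            "?fields=Clusters/desired_configs/cluster-env")


def cluster_env_tag(step1_body: str) -> str:
    """Extract the tag from the step 1 response."""
    return json.loads(step1_body)["Clusters"]["desired_configs"]["cluster-env"]["tag"]


def cluster_env_url(cluster_name: str, tag: str) -> str:
    """Step 2: URL that returns the cluster-env properties for a tag."""
    return (f"{BASE_URL}/api/v1/clusters/{quote(cluster_name)}"
            f"/configurations?type=cluster-env&tag={quote(tag)}")


def recovery_properties(step2_body: str) -> dict:
    """Keep only the auto start (recovery_*) properties."""
    properties = json.loads(step2_body)["items"][0]["properties"]
    return {k: v for k, v in properties.items() if k.startswith("recovery_")}


# Canned responses so that the sketch runs without a server.
STEP1 = ('{"Clusters": {"cluster_name": "testcluster", "desired_configs": '
         '{"cluster-env": {"tag": "version1", "user": "admin", "version": 1}}}}')
STEP2 = ('{"items": [{"type": "cluster-env", "properties": {"user_group": "hadoop", '
         '"recovery_enabled": "false", "recovery_type": "AUTO_START"}}]}')

tag = cluster_env_tag(STEP1)
print(cluster_env_url("testcluster", tag))
print(recovery_properties(STEP2))
```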
Set auto-start flags of a cluster
Type: PUT
Request: api/v1/clusters/<cluster_name>
{
Clusters: {
desired_config: {
tag: "version<xxx>",
type: "cluster-env",
properties: {
fetch_nonlocal_groups: "true",
ignore_groupsusers_create: "false",
kerberos_domain: "EXAMPLE.COM",
override_uid: "true",
repo_suse_rhel_template: "...",
repo_ubuntu_template: "...",
security_enabled: "false",
smokeuser: "ambari-qa",
smokeuser_keytab: "...",
user_group: "hadoop",
recovery_enabled: "true",
recovery_type: "AUTO_START",
recovery_lifetime_max_count: 10,
recovery_max_count: 2,
recovery_window_in_minutes: 10,
recovery_retry_interval: 5000
}
}
}
}
Get auto-start flags of all components
Type: GET
Request: api/v1/clusters/<cluster_name>/components?fields=ServiceComponentInfo/component_name,ServiceComponentInfo/service_name,ServiceComponentInfo/category,ServiceComponentInfo/recovery_enabled
Success Response: 200 - application/json
Example Response
{
href: "...",
items: [
{
href: "...",
ServiceComponentInfo: {
category: "SLAVE",
cluster_name: "c1",
component_name: "DATANODE",
service_name: "HDFS",
recovery_enabled: "true"
}
},
{
href: "...",
ServiceComponentInfo: {
category: "MASTER",
cluster_name: "c1",
component_name: "NAMENODE",
service_name: "HDFS",
recovery_enabled: "true"
}
},
{
href: "...",
ServiceComponentInfo: {
category: "SLAVE",
cluster_name: "c1",
component_name: "JOURNALNODE",
service_name: "HDFS",
recovery_enabled: "false"
}
}
]
}
Error Response: 400 - Bad Request
{
"status" : <status>,
"message" : <error message>
}
Set auto-start flags of all components
Type: PUT
Request 1: api/v1/clusters/<cluster_name>/components?ServiceComponentInfo/component_name.in(<enabled_component_names>)
Request Params: application/json
{
ServiceComponentInfo: {
recovery_enabled: "true"
}
}
Request 2:
api/v1/clusters/<cluster_name>/components?ServiceComponentInfo/component_name.in(<disabled_component_names>)
Request Params: application/json
{
ServiceComponentInfo: {
recovery_enabled: "false"
}
}
Success Response: 202 OK
Error Response: 400 - Bad Request
{
"status" : <status>,
"message" : <error message>
}
Request 3:
api/v1/clusters/testcluster/components/ZOOKEEPER_SERVER -d '{"ServiceComponentInfo" : {"recovery_enabled":"true"}}'
Request 4:
api/v1/clusters/testcluster/components?ServiceComponentInfo/component_name=ZOOKEEPER_SERVER -d '{"ServiceComponentInfo" : {"recovery_enabled":"false"}}'
Request 5:
curl -u admin:admin -H "X-Requested-By: ambari" -X PUT 'http://c6401.ambari.apache.org:8080/api/v1/clusters/testcluster/components?ServiceComponentInfo/component_name.in(ZOOKEEPER_SERVER)' -d '{"ServiceComponentInfo" : {"recovery_enabled":"false"}}'
Request 6:
curl -u admin:admin -H "X-Requested-By: ambari" -X PUT http://c6401.ambari.apache.org:8080/api/v1/clusters/testcluster/components -d '{"RequestInfo": {"query": "ServiceComponentInfo/component_name.in(ZOOKEEPER_CLIENT,ZOOKEEPER_SERVER)"} , "ServiceComponentInfo" : {"recovery_enabled":"true"}}'
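For callers that prefer a script over curl, the request shape of Request 6 can be reproduced with the Python standard library. The sketch below only builds the request object and does not send it; the function name is hypothetical, and the admin:admin credentials and host are simply the ones used in the curl examples above.

```python
import base64
import json
from urllib.request import Request


def build_recovery_request(base_url: str, cluster_name: str,
                           component_names: list, enabled: bool,
                           user: str = "admin", password: str = "admin") -> Request:
    """Build a PUT that sets recovery_enabled for the named components."""
    query = "ServiceComponentInfo/component_name.in({})".format(
        ",".join(component_names))
    body = {
        "RequestInfo": {"query": query},
        "ServiceComponentInfo": {"recovery_enabled": "true" if enabled else "false"},
    }
    # Equivalent of curl's -u option: HTTP basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return Request(
        url=f"{base_url}/api/v1/clusters/{cluster_name}/components",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            "X-Requested-By": "ambari",
            "Authorization": f"Basic {token}",
        },
    )


request = build_recovery_request(
    "http://c6401.ambari.apache.org:8080", "testcluster",
    ["ZOOKEEPER_CLIENT", "ZOOKEEPER_SERVER"], enabled=True)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the returned object to urllib.request.urlopen would send it to the server; that step is left out so that the sketch runs without a cluster.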