The Myriad high availability (HA) feature prevents job failure and downtime when a failure occurs. In addition, Myriad recovers from the failure on its own and returns to a highly available state.

A Myriad HA environment allows the Node Managers to reconnect to the new Resource Manager instance upon failover.


On failover, the following occurs:

  • Marathon re-launches the Resource Manager as a new task.
  • Mesos-DNS updates the IP address for the Resource Manager Mesos task to the new IP address.

All clients that are connected to the Resource Manager continue to work, as long as they use the FQDN (for example, rm.marathon.mesos) rather than an IP address to connect to it.
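The reliance on the FQDN can be illustrated with a short Python sketch. It is not part of Myriad: the helper names `resolve_ipv4` and `wait_for_new_address` are hypothetical, and the sketch assumes the host already uses Mesos-DNS as its name server. The idea is that a client looks the name up again after a failover instead of caching the old IP address.

```python
import socket
import time


def resolve_ipv4(fqdn, resolver=socket.gethostbyname_ex):
    """Return the IPv4 addresses currently published for fqdn ([] on lookup failure)."""
    try:
        _name, _aliases, addresses = resolver(fqdn)
    except OSError:  # socket.gaierror is a subclass of OSError
        return []
    return addresses


def wait_for_new_address(fqdn, old_ip, attempts=10, delay=1.0,
                         resolver=socket.gethostbyname_ex, sleep=time.sleep):
    """Poll DNS until fqdn resolves to an address other than old_ip.

    Returns the new address, or None if none appears within `attempts` lookups.
    """
    for _ in range(attempts):
        fresh = [ip for ip in resolve_ipv4(fqdn, resolver) if ip != old_ip]
        if fresh:
            return fresh[0]
        sleep(delay)
    return None


# Example call (the old address here is made up for illustration):
# new_ip = wait_for_new_address("rm.marathon.mesos", old_ip="10.10.100.21")
```

Because Mesos-DNS only refreshes its records periodically, a client written this way may need to retry for a short time before the new address appears.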

Prerequisites

  • Deploy mesos-master, mesos-slave (per node), zookeeper, marathon, and mesos-dns on your cluster.

Setting up Mesos-DNS

Mesos-DNS is available on the Mesosphere GitHub. For an online version of the Mesos-DNS documentation, see https://mesosphere.github.io/mesos-dns.

  1. Create a directory for Mesos-DNS. For example, /etc/mesos-dns.
  2. Install Mesos-DNS on one node in your cluster.
  3. Configure Mesos-DNS by providing the required parameters in the /etc/mesos-dns/config.json file. See the Mesos-DNS configuration documentation for more information. The following example parameters represent a minimum configuration.

    {
    	"zk": "zk://10.10.100.19:2181/mesos",
    	"refreshSeconds": 60,
    	"ttl": 60,
    	"domain": "mesos",
    	"port": 53,
    	"resolvers": ["10.10.1.10"],
    	"timeout": 5
    }
  4. If you are on Linux, add the following Mesos-DNS name server entry at the top of the /etc/resolv.conf file on all cluster nodes and on all clients, such as those running the RM UI or the Myriad UI.

    nameserver <Mesos-DNS IP address>

Add the entry at the very top of the /etc/resolv.conf file. If it is not at the top, Mesos-DNS may not work correctly.
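As a rough illustration of the placement rule, the following Python sketch moves a name server entry to the top of the text of a resolv.conf file. The function name `prepend_nameserver` is hypothetical, and the sketch only transforms a string; writing the result back to /etc/resolv.conf requires root privileges, and on systems where another service regenerates that file the change may need to be made in that service's configuration instead.

```python
def prepend_nameserver(resolv_conf_text, dns_ip):
    """Return resolv.conf text with `nameserver dns_ip` as the first line.

    Any existing entry for the same address is dropped first, so the
    function can be applied repeatedly without creating duplicates.
    """
    entry = f"nameserver {dns_ip}"
    kept = [
        line
        for line in resolv_conf_text.splitlines()
        if line.split() != ["nameserver", dns_ip]
    ]
    return "\n".join([entry] + kept) + "\n"


# Example with the Mesos-DNS node address used in the configuration above
# taken as an assumption:
# with open("/etc/resolv.conf") as f:
#     updated = prepend_nameserver(f.read(), "10.10.100.19")
```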

Configuring HA

Configuring Myriad for HA involves adding HA configuration properties to the $YARN_HOME/etc/hadoop/yarn-site.xml file and the $YARN_HOME/etc/hadoop/myriad-config-default.yml file. 

To the $YARN_HOME/etc/hadoop/yarn-site.xml file, add the following properties:

<!--  HA configuration properties -->

<property>
	<name>yarn.resourcemanager.store.class</name>
	<value>org.apache.hadoop.yarn.server.resourcemanager.recovery.MyriadFileSystemRMStateStore</value>
</property>
<property>
	<name>yarn.resourcemanager.fs.state-store.uri</name>
	<!-- Path on HDFS, MapRFS, etc. -->
	<value>/var/mapr/cluster/yarn/rm/system</value>
</property>
<property>
	<name>yarn.resourcemanager.recovery.enabled</name>
	<value>true</value>
</property>
<!-- If using MapR distro
 <property>
	<name>yarn.resourcemanager.ha.custom-ha-enabled</name>
	<value>false</value>
 </property> -->
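A quick way to confirm that these properties are in place is to parse the file and compare it against the values listed above. The Python sketch below is one possible check, not a Myriad tool: the function names are hypothetical, and it assumes the usual Hadoop layout in which `<property>` elements sit under a single root element.

```python
import xml.etree.ElementTree as ET

# Properties whose values must match exactly (taken from the list above).
REQUIRED_VALUES = {
    "yarn.resourcemanager.store.class":
        "org.apache.hadoop.yarn.server.resourcemanager.recovery."
        "MyriadFileSystemRMStateStore",
    "yarn.resourcemanager.recovery.enabled": "true",
}
# Property that only needs a non-empty value (the path is site specific).
STATE_STORE_URI = "yarn.resourcemanager.fs.state-store.uri"


def read_properties(xml_text):
    """Map each property name in the XML text to its value."""
    props = {}
    for prop in ET.fromstring(xml_text).iter("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def missing_ha_settings(xml_text):
    """Return the names of HA properties that are absent or set incorrectly."""
    props = read_properties(xml_text)
    problems = [n for n, v in REQUIRED_VALUES.items() if props.get(n) != v]
    if not props.get(STATE_STORE_URI):
        problems.append(STATE_STORE_URI)
    return problems


# Example (path assumed from the text above):
# with open("/path/to/etc/hadoop/yarn-site.xml") as f:
#     print(missing_ha_settings(f.read()))
```

Properties that are commented out, such as the MapR-specific block, are ignored by the parser and therefore do not affect the result.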

To the $YARN_HOME/etc/hadoop/myriad-config-default.yml file, modify the following values:

frameworkFailoverTimeout: <non-zero value>
haEnabled: true

The Myriad frameworkFailoverTimeout parameter is specified in milliseconds. This parameter tells Mesos that Myriad will fail over within this time interval.
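Since the value is given in milliseconds, it is easy to be off by a factor of 1000. The Python sketch below converts a human-readable interval into the number to place in the file. The helper name `to_failover_timeout_ms` is hypothetical, and the one-day window is only an illustrative value, not a recommendation.

```python
from datetime import timedelta


def to_failover_timeout_ms(window):
    """Convert a timedelta into whole milliseconds for frameworkFailoverTimeout."""
    ms = window // timedelta(milliseconds=1)
    if ms <= 0:
        raise ValueError("frameworkFailoverTimeout must be a non-zero, positive value")
    return ms


# Prints the two lines to set in myriad-config-default.yml for a one-day window.
print(f"frameworkFailoverTimeout: {to_failover_timeout_ms(timedelta(days=1))}")
print("haEnabled: true")
```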
