Comment: Added section on slave repair.

...

Upon detecting a violation, the slave monitor will report it to the Mesos master on the newly extended HTTP endpoint described earlier. This notification results in a call to the RepairCoordinator, which processes it.

...

Slave Repair

Before the coordinator can issue a repair, it must first know four things:

1) The total set of repairs that are available for the coordinator to choose from.

2) The order in which repairs should be attempted. For example, it is likely best to try restarting a host before re-imaging it, since re-imaging takes far longer.

3) The scope at which the repair should take place. Some repairs, such as rebooting, are executed on the slave itself; others may need to be executed by the master if the slave is unresponsive or otherwise unable to receive the repair request.

4) The number of concurrent repairs allowed. This limit ensures that a rush of simultaneous repairs does not result in a service outage.

To that end, this design proposes adding several new flags to the master and slave processes:

  • -repair_set=<repair_id>=<path_to_cmd>,<repair_id>=<path_to_cmd>
  • -repair_order=<repair_id>:<scope>,<repair_id>:<scope>
  • -num_allowed_repairs=<integer>

The first flag associates each repair_id (which must be unique) with the command to execute in order to issue that repair. The second flag establishes both the order in which repairs are attempted and the scope at which each repair is executed. If a repair_id appears in repair_order but cannot be found in repair_set, an error will be issued. The scope must be either "master" or "slave"; any other value will result in an error at startup time.
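
As a rough illustration of the validation rules for the first two flags, the following Python sketch parses both values and rejects the error cases described above. The function names, the repair ids, and the command paths are hypothetical and exist only for this example; rejecting malformed entries is likewise an assumption, and the actual flag handling would live in the master and slave startup code.

```python
VALID_SCOPES = {"master", "slave"}


def parse_repair_set(value):
    """Parse '<repair_id>=<path_to_cmd>,...' into a dict of id -> command path."""
    repairs = {}
    for entry in value.split(","):
        repair_id, sep, cmd = entry.partition("=")
        if not sep or not repair_id or not cmd:
            raise ValueError(f"malformed repair_set entry: {entry!r}")
        if repair_id in repairs:
            # repair ids must be unique
            raise ValueError(f"duplicate repair_id: {repair_id!r}")
        repairs[repair_id] = cmd
    return repairs


def parse_repair_order(value, repairs):
    """Parse '<repair_id>:<scope>,...' into an ordered list of (id, scope)."""
    order = []
    for entry in value.split(","):
        repair_id, sep, scope = entry.partition(":")
        if not sep:
            raise ValueError(f"malformed repair_order entry: {entry!r}")
        if repair_id not in repairs:
            # every ordered repair must be defined in repair_set
            raise ValueError(f"unknown repair_id: {repair_id!r}")
        if scope not in VALID_SCOPES:
            raise ValueError(f"invalid scope {scope!r} for {repair_id!r}")
        order.append((repair_id, scope))
    return order


# Hypothetical flag values, for illustration only.
repairs = parse_repair_set("reboot=/opt/repairs/reboot.sh,reimage=/opt/repairs/reimage.sh")
order = parse_repair_order("reboot:slave,reimage:master", repairs)
```

Here `order` keeps the sequence given on the command line, so a coordinator walking the list would try the reboot on the slave before asking the master to re-image.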
The final flag controls the number of concurrent repairs allowed to take place. The RepairCoordinator will take this limit into account before issuing repairs to slaves.
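
One way the concurrency limit could be enforced is sketched below in Python. The `RepairLimiter` class and its method names are hypothetical and not part of the design. The sketch also assumes that a repair request arriving while the limit is reached is simply deferred, and that at most one repair runs per slave at a time; the design does not specify either behavior.

```python
class RepairLimiter:
    """Tracks in-flight repairs against a fixed concurrency limit."""

    def __init__(self, num_allowed_repairs):
        if num_allowed_repairs < 0:
            raise ValueError("num_allowed_repairs must be non-negative")
        self.limit = num_allowed_repairs
        self.in_flight = set()  # ids of slaves currently being repaired

    def try_start(self, slave_id):
        """Return True and record the repair if it may start now."""
        if slave_id in self.in_flight:
            return False  # assumption: one repair per slave at a time
        if len(self.in_flight) >= self.limit:
            return False  # limit reached; the caller should defer
        self.in_flight.add(slave_id)
        return True

    def finish(self, slave_id):
        """Release the slot held by a completed (or failed) repair."""
        self.in_flight.discard(slave_id)
```

A coordinator built this way would call `try_start` before issuing each repair and `finish` once the repair reports back, so the number of slaves under repair never exceeds the configured value.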

Implementation Plan

It's proposed that this project be carried out in several stages.  

...