Introduction:

The virtual router (VR) runs services that must stay up until CloudStack disables them.

Currently, if a service in the VR goes down, there is no mechanism to alert the admin or to take action on the crashed service.

This feature is about monitoring the services rendered by the VR.

The goal of this feature is to monitor all VR services and ensure they keep running throughout the lifetime of the VR.

On service failure:

a)   Restart the service

b)   Generate an alert and event indicating failure

Monitoring VR services involves two tasks:

  1. monitoring services in VR
  2. sending alerts from router to external receivers

Jira ticket:

https://issues.apache.org/jira/browse/CLOUDSTACK-4736

Monitoring services:

Services to be monitored in VR

  • dnsmasq
  • haproxy
  • sshd
  • apache webserver

Note: The monitoring process can monitor only services that run as daemons.

Design:

CloudStack sends the router a config file listing the services to be monitored. Services such as dnsmasq and haproxy are included only if the corresponding service is selected in the network offering.

The sshd and web server services are selected by default from the DB.
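The selection rule above can be sketched as follows. This is only an illustration in Python, not the management server code: the `CATALOG` dictionary, the `select_monitored_services` function, and the service name `apache2` are assumptions made for the example. Only the idea of a per-service default flag comes from this page.

```python
# Hypothetical in-memory stand-in for the monitoring_services table.
# Only the "default" flag mirrors a column described on this page.
CATALOG = {
    "dnsmasq": {"is_default": False},
    "haproxy": {"is_default": False},
    "sshd": {"is_default": True},
    "apache2": {"is_default": True},  # assumed name for the web server
}


def select_monitored_services(offering_services, catalog=CATALOG):
    """Return the sorted names of services the VR should monitor.

    Default services are always included; the others are included only
    when the network offering lists them.
    """
    selected = {name for name, row in catalog.items() if row["is_default"]}
    selected |= {name for name in offering_services if name in catalog}
    return sorted(selected)
```

For an offering that provides only DNS/DHCP through dnsmasq, the sketch would select dnsmasq plus the two default services, and haproxy would be left out.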

New DB table:

table name: monitoring_services

...

isDefault: whether the service is monitored by default

Inside the VR, a Python script reads the config file and periodically checks the status of each service.

The monitor script monitors only services that have a pid file. If multiple processes share the same name, the monitor checks the process whose pid is recorded in the service's pid file (e.g. /var/run/<servicename>.pid).
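A pid-file check of this kind might look like the sketch below. It is a minimal illustration that assumes a Linux VR, and the function names are hypothetical rather than taken from the actual monitor script. It reads the pid from the file and probes the process with signal 0, which tests for existence without delivering a signal.

```python
import os


def pid_from_file(pid_file):
    """Return the pid stored in pid_file, or None if missing or malformed."""
    try:
        with open(pid_file) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def is_service_running(pid_file):
    """Return True if the process named in pid_file currently exists."""
    pid = pid_from_file(pid_file)
    if pid is None or pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

Because the check goes through the pid file rather than the process name, a second process with the same name does not affect the result. A stale pid that the kernel has reused for an unrelated process would still be reported as running, which this sketch does not guard against.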

If a service is not running, the script rechecks its status for 5 seconds at 1-second intervals. If the service is still not running, the monitoring script does the following:

1. Writes a syslog entry about the service failure and restarts the service.

2. If the restart fails, writes an event log entry to syslog.

3. A service whose restart failed is left unmonitored for the next 30 minutes. After 30 minutes, the monitor tries to restart the service again.

The monitor script is added to crontab to run every 3 minutes.
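The recheck, restart, and 30-minute back-off behaviour described above can be sketched as a single monitoring pass. This is an illustrative outline, not the shipped script: `check_service`, its callable parameters (`is_running`, `restart`, `log`), and the `state` dictionary are hypothetical names, and the real script would write to syslog rather than call a generic `log` hook.

```python
import time

RECHECK_ATTEMPTS = 5           # recheck for 5 seconds...
RECHECK_INTERVAL = 1           # ...at 1-second intervals
BACKOFF_SECONDS = 30 * 60      # leave a service alone after a failed restart


def check_service(name, is_running, restart, log, state,
                  now=time.time, sleep=time.sleep):
    """Run one monitoring pass for a single service.

    is_running, restart, and log are caller-supplied callables (assumed
    hooks). state maps a service name to the time of its last failed
    restart. Returns "ok", "skipped", "restarted", or "restart_failed".
    """
    failed_at = state.get(name)
    if failed_at is not None and now() - failed_at < BACKOFF_SECONDS:
        return "skipped"  # still inside the 30-minute unmonitored window

    if is_running():
        state.pop(name, None)
        return "ok"

    # Recheck before declaring the service down.
    for _ in range(RECHECK_ATTEMPTS):
        sleep(RECHECK_INTERVAL)
        if is_running():
            state.pop(name, None)
            return "ok"

    log("service %s is down, restarting" % name)
    if restart():
        state.pop(name, None)
        return "restarted"

    log("restart of service %s failed" % name)
    state[name] = now()
    return "restart_failed"
```

Since cron starts a fresh process every 3 minutes, the `state` dictionary would have to be persisted between runs (for example in a small file) for the 30-minute window to work. How the actual script stores this is not specified here.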

Supported VR networks:

1. Advanced zone Isolated networks

2. Basic zone shared network

3. Advanced zone shared network

Sending alerts from router:

...

2. Also overload existing VR usage polling threads.

Note: This task is out of scope for the 4.3 release 

UI Changes:

No UI changes.

Supported Hypervisors:

XenServer, KVM, VMware

Upgrade:

Because this feature adds new script files, existing routers must be rebooted to pick them up.

References:

https://cwiki.apache.org/confluence/display/CLOUDSTACK/System+VMs+and+services+resiliency