Monitoring VR services

Introduction:

The virtual router (VR) runs services that must stay up until CloudStack disables them.

Currently, if a service in the VR goes down, there is no mechanism to alert the admin and take action on the crashed service.

This feature is about monitoring the services rendered by the VR.

The goal of this feature is to monitor all the VR services and ensure they are running throughout the lifetime of the VR.

On service failure:

a) Restart the service

b) Generate an alert and event indicating failure

Monitoring VR services involves two tasks:

1. Monitoring services in the VR

2. Sending alerts from the router to external receivers

Jira ticket:

https://issues.apache.org/jira/browse/CLOUDSTACK-4736

Monitoring services:

Services to be monitored in the VR:

dnsmasq
haproxy
sshd
Apache web server

Note: The monitoring process can monitor only services that run as daemons.

Design:

CloudStack sends the router a config file listing the services to be monitored. Services such as dnsmasq and haproxy are included only if the corresponding service is selected in the network offering.

The sshd and web server services are selected by default from the DB.
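The selection rule above can be sketched as follows. This is only an illustration: the function name, the daemon names used for the defaults, and the mapping from network offering services to daemons are assumptions, not taken from the CloudStack code.

```python
from typing import Iterable, List

# Assumption: names used for the two services that are monitored by default.
ALWAYS_MONITORED = ["sshd", "apache"]

# Assumption: which network offering service implies which daemon in the VR.
OFFERING_TO_DAEMON = {
    "Dns": "dnsmasq",
    "Dhcp": "dnsmasq",
    "Lb": "haproxy",
}


def services_to_monitor(offering_services: Iterable[str]) -> List[str]:
    """Return the default services plus the daemons implied by the offering."""
    selected = list(ALWAYS_MONITORED)
    for offered in offering_services:
        daemon = OFFERING_TO_DAEMON.get(offered)
        # Skip services with no daemon and avoid listing a daemon twice.
        if daemon is not None and daemon not in selected:
            selected.append(daemon)
    return selected
```

For a network offering with only DNS and DHCP, this sketch selects sshd, the web server and dnsmasq, and leaves haproxy unmonitored.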

New DB table:

table name: monitoring_services

Columns:

id, uuid: row id and UUID

service: general name of the service

process_name: name of the service's process in the list of running processes

service_name: name of the service as used in the service path

service_path: service path (Ex: /etc/init.d/<service>)

pidfile: path of the pid file

isDefault: whether the service is monitored by default or not
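As an illustration of what one row of this table carries, the columns can be modelled as a small Python record. The class name, the helper and the sample values are hypothetical; the sample paths only follow the /etc/init.d/<service> and /var/run/<servicename>.pid patterns used as examples on this page.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class MonitoringService:
    """One row of monitoring_services (hypothetical in-memory form)."""
    id: int
    uuid: str
    service: str        # general name of the service
    process_name: str   # name in the running processes list
    service_name: str   # name used in the service path
    service_path: str   # e.g. /etc/init.d/<service>
    pidfile: str        # path of the pid file
    is_default: bool    # monitored by default or not


def default_services(rows: Iterable[MonitoringService]) -> List[MonitoringService]:
    """Return only the rows that are monitored regardless of the offering."""
    return [row for row in rows if row.is_default]


# Illustrative values only, not actual table contents.
example_row = MonitoringService(
    id=1,
    uuid="00000000-0000-0000-0000-000000000001",
    service="dns",
    process_name="dnsmasq",
    service_name="dnsmasq",
    service_path="/etc/init.d/dnsmasq",
    pidfile="/var/run/dnsmasq.pid",
    is_default=False,
)
```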

Inside the VR, a Python script reads the config file and periodically checks the status of each service.

The monitor script monitors only services that have a pid file. If there are multiple processes with the same name, the monitor checks the process whose pid is in the service's pid file (Ex: /var/run/<servicename>.pid).
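A minimal sketch of such a pid-file-based check is shown below. It assumes the check is done by reading the pid file and then looking the pid up under /proc; the actual monitor script may do this differently, and the function names and the proc_root parameter are hypothetical.

```python
import os


def pid_from_file(pidfile):
    """Return the pid stored in pidfile, or None if it is missing or invalid."""
    try:
        with open(pidfile) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def is_running(process_name, pidfile, proc_root="/proc"):
    """True if the pid in pidfile belongs to a live process named process_name.

    Only the pid recorded in the pid file is inspected, so other processes
    that happen to share the same name are ignored.
    """
    pid = pid_from_file(pidfile)
    if pid is None:
        return False
    try:
        # /proc/<pid>/cmdline holds the NUL-separated command line.
        with open(os.path.join(proc_root, str(pid), "cmdline"), "rb") as handle:
            cmdline = handle.read()
    except OSError:
        return False  # no such process
    return process_name.encode() in cmdline
```

A call such as is_running("dnsmasq", "/var/run/dnsmasq.pid") would then report whether the dnsmasq instance that owns the pid file is alive.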

If a service is not running, the script rechecks its status for 5 seconds at 1-second intervals. If the service is still not running, the monitoring script does the following:

1. Writes a syslog entry about the service failure and restarts the service.

2. If the restart fails, writes an event log entry in syslog.

3. A service whose restart failed is left unmonitored for the next 30 minutes. After 30 minutes, the monitor tries to restart the service again.
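The recheck, restart and 30-minute back-off steps above could be sketched as follows. This is not the actual monitor script: the function name, the return values, and the way the status check, restart command, logger and clock are passed in as callables are all assumptions made to keep the example self-contained. The unmonitored-until timestamps are kept in a plain dict here; a script started from cron would have to persist them between runs.

```python
import time

RECHECKS = 5                  # number of rechecks before declaring failure
RECHECK_INTERVAL = 1          # seconds between rechecks
UNMONITOR_SECONDS = 30 * 60   # back-off after a failed restart


def check_service(name, is_running, restart, log, state,
                  now=time.time, sleep=time.sleep):
    """Run one monitoring pass for a single service.

    is_running() and restart() return True on success, log(msg) stands in
    for a syslog write, and state maps a service name to the time until
    which it is left unmonitored.
    """
    if now() < state.get(name, 0):
        return "unmonitored"

    if is_running():
        return "running"
    for _ in range(RECHECKS):
        sleep(RECHECK_INTERVAL)
        if is_running():
            return "running"

    # Step 1: log the failure and restart the service.
    log("%s is not running, restarting" % name)
    if restart():
        return "restarted"

    # Step 2: log the failed restart.
    log("failed to restart %s" % name)
    # Step 3: leave the service unmonitored for the next 30 minutes.
    state[name] = now() + UNMONITOR_SECONDS
    return "restart_failed"
```

Once the 30 minutes have passed, the next pass goes through the normal path again, so a service that is still down gets another restart attempt.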

The monitor script is added to crontab to run every 3 minutes.

Supported VR networks:

1. Advanced zone Isolated networks

2. Basic zone shared network

3. Advanced zone shared network

Sending alerts from router:

How logs are sent from the VR to the management server (MS) or to external receivers still needs to be discussed and finalised.

One possible solution for sending monitor logs from the VR to the MS is:

1. Polling the VR from the management server for logs.

2. Overloading the existing VR usage polling threads for this purpose.

Note: This task is out of scope for the 4.3 release

UI Changes:

No UI changes.

Supported Hypervisors:

XenServer, KVM, VMware

Upgrade:

Since this feature adds new script files, existing routers must be rebooted to pick them up.

References:

https://cwiki.apache.org/confluence/display/CLOUDSTACK/System+VMs+and+services+resiliency
