Introduction:

The virtual router (VR) runs services that must keep running until CloudStack disables them.

Currently, if a service in the VR goes down, there is no mechanism to alert the admin or to take action on the crashed service.

This feature is about monitoring the services rendered by the VR.

The goal of this feature is to monitor all the VR services and ensure they are running throughout the lifetime of the VR.

On service failure:

a)   Restart the service

b)   Generate an alert and event indicating failure

Monitoring VR services involves two tasks:

  1. monitoring services in VR
  2. sending alerts from router to external receivers

Jira ticket:

https://issues.apache.org/jira/browse/CLOUDSTACK-4736

Monitoring services:

Services to be monitored in VR

  • dnsmasq
  • haproxy
  • sshd
  • apache webserver

Note: The monitoring process can monitor only services that run as daemons.

Design:

CloudStack sends the router a config file listing the services to be monitored. Services such as dnsmasq and haproxy are included only if the corresponding service is selected in the network offering.

The sshd and web server services are selected by default from the DB.

New DB table:

table name: monitoring_services

Columns:

id,uuid: id and uuid

service  : General name of the service

process_name: service name in running processes list

service_name: service name as used in the service path

service_path : Service path (Ex: /etc/init.d/<service>)

pidfile : path of the pid file

isDefault: whether the service is monitored by default
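
The table and the selection rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the class name `MonitoringService`, the function `select_services`, and every row value (service names, paths, pid files) are hypothetical, and the spec does not say how network offering services map to rows in the table.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class MonitoringService:
    """One row of monitoring_services (id and uuid omitted for brevity)."""
    service: str        # general name of the service
    process_name: str   # name in the running processes list
    service_name: str   # name used in the service path
    service_path: str   # e.g. /etc/init.d/<service>
    pidfile: str        # path of the pid file
    is_default: bool    # isDefault: monitored regardless of the offering


def select_services(rows: Iterable[MonitoringService],
                    offering_services: Set[str]) -> List[MonitoringService]:
    """Keep default rows plus rows whose service is in the network offering."""
    return [r for r in rows if r.is_default or r.service in offering_services]


# Hypothetical rows for illustration; the real table contents are not
# given in this spec.
ROWS = [
    MonitoringService("ssh", "sshd", "ssh",
                      "/etc/init.d/ssh", "/var/run/sshd.pid", True),
    MonitoringService("webserver", "apache2", "apache2",
                      "/etc/init.d/apache2", "/var/run/apache2.pid", True),
    MonitoringService("dns", "dnsmasq", "dnsmasq",
                      "/etc/init.d/dnsmasq", "/var/run/dnsmasq.pid", False),
    MonitoringService("loadbalancing", "haproxy", "haproxy",
                      "/etc/init.d/haproxy", "/var/run/haproxy.pid", False),
]
```

With these rows, `select_services(ROWS, {"dns"})` keeps the two default services plus dnsmasq and leaves haproxy out.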

Inside the VR, a Python script reads the config file and periodically checks the status of each service.

The monitor script monitors only services that have a pid file. If there are multiple processes with the same name, the monitor checks the process whose pid is in the service's pid file (Ex: /var/run/<servicename>.pid).
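
A minimal sketch of such a pid-file check is shown below. It assumes a Linux `/proc` filesystem and assumes the process is matched by looking for `process_name` in the command line of the pid; the spec does not say how the actual script does the match. The names `read_pid`, `is_running`, and the `proc_root` parameter are hypothetical.

```python
import os
from typing import Optional


def read_pid(pidfile: str) -> Optional[int]:
    """Return the pid stored in pidfile, or None if missing or malformed."""
    try:
        with open(pidfile) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def is_running(pidfile: str, process_name: str,
               proc_root: str = "/proc") -> bool:
    """True if the pid from pidfile is alive and looks like process_name."""
    pid = read_pid(pidfile)
    if pid is None:
        return False
    try:
        # /proc/<pid>/cmdline holds the NUL-separated argument vector.
        with open(os.path.join(proc_root, str(pid), "cmdline"), "rb") as f:
            cmdline = f.read()
    except OSError:
        return False  # no /proc entry: the process is gone
    # Assumption: a substring match on the command line is good enough.
    return process_name in cmdline.decode(errors="replace")
```

Because only the pid recorded in the pid file is inspected, other processes that happen to share the same name are ignored.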

If a service is not running, the script rechecks its status for 5 seconds at 1-second intervals. If the service is still not running, the monitoring script does the following:

1. Writes a syslog entry about the service failure and restarts the service.

2. If the restart fails, writes an event log entry to syslog.

3. Stops monitoring a service whose restart failed for the next 30 minutes. After 30 minutes, the monitor tries to restart the service again.
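
The recheck, restart, and 30-minute back-off steps above can be sketched as one function. This is an illustration under assumptions, not the actual script: `check_service` and its parameters are hypothetical, the status check, restart, and logger are passed in as callables, and the `unmonitored_until` dictionary stands in for whatever state the real script keeps between runs.

```python
import time
from typing import Callable, Dict

RECHECKS = 5            # rechecks before the service is treated as failed
RECHECK_INTERVAL = 1    # seconds between rechecks
BACKOFF = 30 * 60       # seconds a restart-failed service stays unmonitored


def check_service(name: str,
                  is_running: Callable[[], bool],
                  restart: Callable[[], bool],
                  log: Callable[[str], None],
                  unmonitored_until: Dict[str, float],
                  now: Callable[[], float] = time.time,
                  sleep: Callable[[float], None] = time.sleep) -> str:
    """Run one monitoring pass for a service and report what happened."""
    if now() < unmonitored_until.get(name, 0.0):
        return "skipped"            # still inside the 30-minute back-off
    if is_running():
        return "running"
    for _ in range(RECHECKS):       # recheck for 5 seconds, once per second
        sleep(RECHECK_INTERVAL)
        if is_running():
            return "running"
    log(f"{name} is not running, restarting")
    if restart():
        unmonitored_until.pop(name, None)
        return "restarted"
    log(f"restart of {name} failed, unmonitored for 30 minutes")
    unmonitored_until[name] = now() + BACKOFF
    return "restart_failed"
```

Passing `syslog.syslog` from the Python standard library as `log` would send both messages to syslog. Once the back-off expires, the next pass finds the service down and attempts the restart again.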

The monitor script is added to crontab to run every 3 minutes.

Supported VR networks:

1. Advanced zone Isolated networks

2. Basic zone shared network

3. Advanced zone shared network

Sending alerts from router:

How logs are sent from the VR to the management server or to external receivers still needs to be discussed and finalised.

One possible solution for sending monitor logs from the VR to the management server (MS) is:

1. Poll the VR from the management server for logs.

2. Reuse the existing VR usage polling threads for this polling.

Note: This task is out of scope for the 4.3 release 

UI Changes:

No UI changes.

Supported Hypervisors:

XenServer, KVM, VMware

Upgrade:

Since this feature adds new script files, existing routers must be rebooted.

References:

https://cwiki.apache.org/confluence/display/CLOUDSTACK/System+VMs+and+services+resiliency

1 Comment

    1. The first line of the Introduction section says that VR services must run until CloudStack disables them. What does "disabled by CloudStack" mean? If CloudStack disables a few services, how does the monitoring tool tell whether a service was disabled by the CloudStack admin or went down because of a failure?
    It means the services should run until CloudStack instructs them to stop.
    Services are enabled or disabled through the network offering. On VR boot, the monitor configuration is updated with the new services. There are also default services.
    2. Is monitoring of VR services optional, or are they always monitored? Is there any way to enable or disable this feature?
    Currently it is not configurable. By default, the default services such as sshd and the web server are monitored.
    3. Is the service monitoring frequency configurable? If yes, how is it configured? The FS says the default value is 5 seconds.
    No.
    4. FS says monitoring VR services has two tasks.
    1. monitoring services in VR
    2. sending alerts from router to external receivers
    Which external receivers will be supported? Also, please specify all the ways in which the monitoring tool indicates a failure. Will the existing CloudStack Alerts and Events framework be used to indicate the failure?
    This item will be updated once the approach for sending alerts from the VR is finalised.
    5. If multiple instances of the same process are running, are all the instances monitored?
    It monitors the parent service, whose pid is in the pid file.
    6. After how many restarts does the monitoring service decide that something is wrong with bringing the process up?
    five
    7. If the process is still not running after N restarts, is it removed from the list of monitored processes? If yes, how does the tool inform the admin that it could not restart the process? Or does it keep restarting the process forever?
    Unmonitoring a process after N retries is not implemented.
    The monitor logs the service failure. The admin can learn of it only from the logs.
    For this release, sending alerts from the VR is not implemented.
    8. Is there a way for the admin to tell the tool to monitor only particular services?
    Currently the services are selected based on the network offering and the default services from the DB.
    Configuring services from the API/UI is not supported.
    9. Apart from the dnsmasq, haproxy, sshd, and apache webserver services, is the password service (socat) not monitored? The socat process is not mentioned in the Monitoring services section of the FS.
    socat is not monitored because it is automatically restarted by the password server.
    10. Is this supported in RVR case as well?
    No.
    11. Which hypervisors are supported for this feature?
    Xen, KVM, and VMware.
    12. As I understand it, this tool will be part of systemvm.iso. After an upgrade from a pre-4.3 release to 4.3, the ISO will be pushed to the hypervisors, so a stop and start of the VR is required for existing VRs to get this service. Please confirm.
    yes