Baseline Reports for Pre-4.x Performance Runs

The following are the results of the performance runs carried out before the 4.x release.

CONFIGURATIONS

1. Management server

Processor

Dual-core Intel(R) Xeon(R) CPU, 2.27 GHz, ht enabled, 4 processors

Operating System

CentOS release 5.5 (Final), x86_64

Configuration Parameters

The following configuration parameters were used on both management servers:

- Java heap size = 5 GB

- db.cloud.maxActive = 250

- db.cloud.url.params=prepStmtCacheSize=517&cachePrepStmts=true&prepStmtCacheSqlLimit=4096&includeInnodbStatusInDeadlockExceptions=true&logSlowQueries=true
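The db.cloud.url.params value above is a single '&'-separated string, which is hard to read at a glance. As a minimal sketch (the helper name parse_url_params is illustrative and not part of the product), the string can be split into individual settings for review:

```python
from urllib.parse import parse_qsl

# The db.cloud.url.params value exactly as listed in this report.
URL_PARAMS = (
    "prepStmtCacheSize=517&cachePrepStmts=true&prepStmtCacheSqlLimit=4096"
    "&includeInnodbStatusInDeadlockExceptions=true&logSlowQueries=true"
)


def parse_url_params(params: str) -> dict:
    """Split an '&'-separated key=value string into a dict of strings."""
    # strict_parsing=True raises ValueError on a malformed pair instead of
    # silently dropping it.
    return dict(parse_qsl(params, strict_parsing=True))


settings = parse_url_params(URL_PARAMS)
for key, value in settings.items():
    print(f"{key} = {value}")
```

All values stay as strings here; converting them to integers or booleans is left to whoever consumes the result.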

Java version

java version "1.6.0"

OpenJDK Runtime Environment (build 1.6.0-b09)

OpenJDK 64-Bit Server VM (build 1.6.0-b09, mixed mode)

2. Database

Processor

Quad-Core AMD Opteron(tm) Processor, 2.1 GHz, ht enabled, 8 processors

Operating System

CentOS release 6.2 (Final), x86_64

Configuration Parameters

The DB configuration for this run is detailed in the attached my.cnf file.

MySQL version

MySQL-server-5.5.21-1.linux2.6.x86_64

TEST ENVIRONMENT SET UP

The test setup for this run consists of 1 zone with about 1800 simulated hosts spread over more than a hundred pods. 4000 accounts were created, each with one network.

The detailed configuration of the infrastructure is as follows:

1 Zone

112 Pods [Each Pod having 2 Clusters]

224 Clusters [Each cluster having 8 hosts and one primary storage]

1782 Hosts

4000 User accounts [Each account having one network]

12000 User instances

8000 Virtual Routers [Since we are using Redundant Virtual Router offering]

This run was carried out with delays induced through the simulator for the following agent commands:

DhcpEntryCommand - 10s

CreateCommand - 20s

StartCommand - 20s

ClusterDeltaSyncCommand - 3s

PingCommand - 300 ms

PingTestCommand - 300 ms

CheckRouterCommand - 5s and 10s

ManageSnapshotCommand

BackupSnapshotCommand

TEST ENVIRONMENT SET UP

The test setup for this run consists of 1 zone with about 1800 simulated hosts spread over more than a hundred pods. 4000 accounts were created, each with one network.

The detailed configuration of the infrastructure is as follows:

1 Zone

115 Pods [Each Pod having 2 Clusters]

230 Clusters [Each cluster having 8 hosts and one primary storage]

1840 Hosts

4000 User accounts [Each account having one network]

12000 User instances

8000 Virtual Routers [Since we are using Redundant Virtual Router offering]
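The figures in the list above follow from a few multipliers, so they can be cross-checked with simple arithmetic. The sketch below assumes that the redundant virtual router offering yields two routers per network (which is what 8000 routers for 4000 networks implies); the function name and parameters are illustrative only.

```python
def expected_topology(pods, clusters_per_pod, hosts_per_cluster,
                      accounts, routers_per_network):
    """Derive the expected object counts from the per-level multipliers."""
    clusters = pods * clusters_per_pod
    return {
        "clusters": clusters,
        "hosts": clusters * hosts_per_cluster,
        "primary_storage_pools": clusters,  # one primary storage per cluster
        "networks": accounts,               # one network per account
        "virtual_routers": accounts * routers_per_network,
    }


# Values taken from the list above; routers_per_network=2 is an assumption.
topology = expected_topology(pods=115, clusters_per_pod=2, hosts_per_cluster=8,
                             accounts=4000, routers_per_network=2)
for name, count in topology.items():
    print(f"{name}: {count}")
```

Running the same check against a deployed environment is a quick way to spot hosts or clusters that failed to register.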

USE CASES

1. Deploy VM
   - CPU Utilization
   - No. of DB Connections
   - Time for async job to complete
   - Time to return job id
2. Steady state Measures
   - CPU Utilization
   - No. of DB Connections
3. Restart Management Server (agent load size 500, 1000, 1500)
   - Time to Stop MS
   - Time taken to Start MS and rebalance hosts
4. Restart MS measures with Host in maintenance mode (agent load size 500, 1000, 1500)
   - Time to Stop
   - Time taken to Start MS and rebalance hosts
5. List* API Response Time
6. Creation of Snapshots for all VMs

RESULTS

Use case 1: Deploy VM

CPU UTILIZATION

The following graph shows the CPU utilization of one of the management servers while the simulator VMs were being deployed. The total time taken for all the VMs to complete deployment was ~3 hrs.

No. OF DB CONNECTIONS

The following shows the number of DB connections to the MySQL DB during Deploy VM.

Observation:

Approximately every 8 minutes, the number of DB connections spikes to almost 250, which is the configured db.cloud.maxActive value. The frequency of the spikes increases with time.

ASYNC JOB RESPONSE TIME

The following shows the time taken for the Deploy VM async job to complete. The measures are derived from the DB for each job id.

Observation:

As the number of VMs increases, the async job takes longer to complete, with the longest time being 51 sec. As seen from the graph, the first few VMs took around 5-10 sec, while the last VMs deployed (> 11000) took almost 50 sec.
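Deriving the completion time per job id comes down to subtracting a start timestamp from an end timestamp for each job row. The following is a minimal sketch of that calculation; the row layout (job id, created, completed), the timestamp format, and the sample rows are all assumptions made for illustration and are not taken from the actual schema or from this run.

```python
from datetime import datetime

# Assumed timestamp format of the exported rows.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def job_durations(rows):
    """Map job id -> seconds between its created and completed timestamps.

    rows is an iterable of (job_id, created, completed) string tuples.
    """
    durations = {}
    for job_id, created, completed in rows:
        start = datetime.strptime(created, TIMESTAMP_FORMAT)
        end = datetime.strptime(completed, TIMESTAMP_FORMAT)
        durations[job_id] = (end - start).total_seconds()
    return durations


# Made-up sample rows, not measurements from this run.
sample_rows = [
    (1, "2012-01-01 10:00:00", "2012-01-01 10:00:07"),
    (2, "2012-01-01 10:00:05", "2012-01-01 10:00:56"),
]
durations = job_durations(sample_rows)
longest_job = max(durations, key=durations.get)
print(f"longest job: {longest_job} ({durations[longest_job]:.0f} sec)")
```

Plotting the resulting durations against job id gives the kind of trend described in the observation.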

TIME TAKEN BY ASYNC JOB TO RETURN JOB ID

This shows the time taken for the job id to be returned in response to the Deploy VM async call. The average time across Deploy VM API calls is 0.7 sec and the median is 0.418 sec. This means that half of the API calls returned the job id in 0.418 sec or less, while a smaller number of slow calls pulled the average up.
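An average well above the median usually points to a right-skewed distribution, where a few slow calls dominate the mean. The sketch below illustrates this with made-up response times (not data from this run), using only the standard statistics module.

```python
from statistics import mean, median

# Made-up response times in seconds, NOT measurements from this run:
# four fast calls and one slow outlier.
sample_times = [0.3, 0.35, 0.4, 0.45, 2.0]

average = mean(sample_times)
middle = median(sample_times)

# Share of calls that finished at or below the median value.
share_at_or_below_median = (
    sum(1 for t in sample_times if t <= middle) / len(sample_times)
)

print(f"mean={average:.3f} sec, median={middle:.3f} sec")
print(f"share of calls at or below the median: {share_at_or_below_median:.0%}")
```

For this reason the median (or a high percentile) is often a better summary of typical API latency than the mean.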

Graphs for Deploy VM:

Use case 2: Steady State Measures

CPU UTILIZATION

The highlighted area shows the readings taken during Deploy VM. The graphs cover a total time of around 9 hours (including Deploy VM, which took ~3 hours).

MS RESTARTS

direct.agent.load.size	Time for all hosts to connect to MS2 after stopping MS1	Time for all hosts to get disconnected after stopping MS2	Time for all hosts to connect to MS1 after it is started	Time for rebalancing the hosts between the two MSs
500	460 s	135 s	120 s	265 s
1000	140 s	50 s	100 s	202 s

MS RESTARTS WITH HOSTS IN MAINTENANCE MODE

direct.agent.load.size	Time for all hosts to connect to MS2 after stopping MS1	Time for all hosts to get disconnected after stopping MS2	Time for all hosts to connect to MS1 after it is started	Time for rebalancing the hosts between the two MSs
500	135 s	52 s	110 s	213 s
1000	92 s	82 s	120 s	248 s

MEASURING THE DELAY BETWEEN SENDING AND EXECUTING AGENT COMMANDS

The delay between Sending... and Executing... was measured for various agent commands, which also had a simulated delay induced. The following commands were measured:

DhcpEntryCommand

CreateCommand

StartCommand

CheckRouterCommand

ManageSnapshotCommand

BackupSnapshotCommand

The delay for all commands was well within 100 ms, occasionally going up to 400 ms.

The VMs were deployed in 3 iterations of 4K VMs each. Recurring snapshots were also set up for 1000 volumes.
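One way to obtain such a delay is to pair the Sending and Executing log entries that share a sequence id and subtract their timestamps. The sketch below does this under an assumed log line layout (a leading "YYYY-MM-DD HH:MM:SS,mmm" timestamp followed somewhere by "Seq <id>: Sending" or "Seq <id>: Executing"); the real log format may differ, and the sample lines are invented.

```python
import re
from datetime import datetime, timedelta

# Assumed line layout; adjust the pattern to the actual log format.
LINE_PATTERN = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}).*?"
    r"Seq (?P<seq>[\w-]+): (?P<phase>Sending|Executing)"
)


def send_to_execute_delays(lines):
    """Return {sequence id: delay in ms} between Sending and Executing."""
    sent = {}
    delays = {}
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match is None:
            continue  # not a Sending/Executing line
        timestamp = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
        seq = match.group("seq")
        if match.group("phase") == "Sending":
            sent[seq] = timestamp
        elif seq in sent:
            delays[seq] = (timestamp - sent.pop(seq)) / timedelta(milliseconds=1)
    return delays


# Invented sample lines, not taken from this run.
sample_lines = [
    "2012-01-01 10:00:00,100 DEBUG Seq 1-42: Sending StartCommand",
    "2012-01-01 10:00:00,180 DEBUG Seq 1-42: Executing StartCommand",
]
delays = send_to_execute_delays(sample_lines)
print(delays)
```

Executing entries without a matching Sending entry are ignored, so a truncated log does not produce bogus delays.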

Use case 5: List* API Response Time

The following are the results of a first attempt at measuring the List* API response time for a few APIs:

Observations:

For all APIs (except listVirtualMachines, which was fixed recently), with a pagesize beyond 5000 it takes too long (> 10 mins at times) to return the results. In many cases, the request also fails with the error: “HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers”
In some cases, the JSON response takes longer than the XML response.
For some APIs such as listStoragePools, although the total count is 224, the time taken for pagesize=5000 is greater than for pagesize=1000. The reason for this is not clear.
Calls on port 8096 were occasionally observed to take much longer to return. Is this expected? Shouldn't calls on 8080 take longer because of the authentication?

The following table shows an initial measurement for a few APIs. Cases that failed with the error message “HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers” are marked "F".

API	pagesize	Response time XML in sec	Response Time - JSON in sec	Comments
listHosts - count:1794	100	6	12
	1000	117	114
	5000	160	196
	10000	175	176
	no pagesize	209	168
listVolumes-count:12K	100	6	5
	1000	104	85
	5000	F	833	Failed - XML
	10000	F	F	Failed
	no pagesize	Didn’t try	Didn’t try
listVirtualMachines-count:12K	100	2	2
	1000	35	14
	5000	193	145
	10000	330	269
	no pagesize
listRouters- count:8K	100	32	39
	1000	374	F	Failed
	5000	F
	10000	Didn’t try	Didn’t try
	no pagesize
listAccounts-count:4K	100	62	59
	1000		F	Failed
	5000	F		Failed
	10000	NA	NA	since count=4K
	no pagesize	Didn’t try	Didn’t try
listUsers-count:4K	100		13
	1000	49	37
	5000	136	74
	10000	NA	NA	since count=4K
	no pagesize	NA	NA	since count=4K
listAsyncJobs	100	6	11
	1000	68	96
	5000	F		Failed
	10000
	no pagesize
listStoragePools-count:224	100	2	5
	1000	15	7
	5000	25	32
	10000	NA	NA	since count=224
	no pagesize
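Measurements like those in the table can be collected with a small timing harness. The sketch below builds a List* URL using the command, response, page and pagesize query parameters and times an arbitrary fetch function, recording "F" when the connection fails, as in the table. The base URL and the helper names are assumptions for illustration; the fetch function is stubbed so the example runs without a management server.

```python
import time
from urllib.parse import urlencode


def build_list_url(base_url, command, pagesize=None, response_format="json"):
    """Build a List* API URL; omit paging when pagesize is None."""
    params = {"command": command, "response": response_format}
    if pagesize is not None:
        params["page"] = 1
        params["pagesize"] = pagesize
    return f"{base_url}?{urlencode(params)}"


def timed_call(fetch, url):
    """Return elapsed seconds for fetch(url), or "F" on a connection error."""
    start = time.perf_counter()
    try:
        fetch(url)
    except OSError:  # includes ConnectionResetError and urllib's URLError
        return "F"
    return time.perf_counter() - start


def failing_fetch(url):
    """Stub that mimics the 'Connection reset by peer' failure."""
    raise ConnectionResetError("Connection reset by peer")


# Hypothetical base URL for the unauthenticated port; adjust as needed.
url = build_list_url("http://localhost:8096/client/api", "listHosts", pagesize=100)
print(url)
print(timed_call(lambda u: None, url))  # stub fetch that succeeds immediately
print(timed_call(failing_fetch, url))   # prints F
```

Against a real server, the stub can be replaced with something like lambda u: urllib.request.urlopen(u).read(), so that the full response body is included in the timing.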

Use case 6: Snapshots

This use case covers the measures taken while snapshots are being triggered by the MS, including the CPU load during that time.

snapshot.poll.interval was set to its default value of 300 sec.

Following are the results:

Hourly snapshots for 1000 Volumes
- Snapshots were triggered for all 1000 VMs, and the job ids were all generated within the 300 sec interval before the next poll began.
Hourly snapshots for 10000 Volumes
- Snapshots were triggered for all 10000 Volumes, but the time taken was beyond 300 sec, so the polling continued only after all snapshots were triggered.

The following graph shows the CPU utilization while snapshots were being triggered (for the 10000 volume case).
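The two cases above suggest that a poll cycle cannot begin until the previous round of snapshot triggering has finished. A simplified model of that behaviour, assuming the next poll starts at whichever is later (the end of the interval or the end of triggering), is sketched below; the function name and the example durations are illustrative, not measured values.

```python
def next_poll_start(poll_started_at, trigger_duration, poll_interval=300.0):
    """Return the time (in sec) at which the next poll can begin.

    Simplified model: the next poll waits for both the poll interval and
    the completion of the current round of snapshot triggering.
    """
    return poll_started_at + max(poll_interval, trigger_duration)


# Triggering finishes within the interval: the next poll is on schedule.
on_time = next_poll_start(poll_started_at=0.0, trigger_duration=120.0)

# Triggering overruns the interval: the next poll is delayed accordingly.
delayed = next_poll_start(poll_started_at=0.0, trigger_duration=450.0)

print(on_time, delayed)
```

Under this model, any triggering time above snapshot.poll.interval translates directly into poll drift.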
