Broker Statistics

The following statistics are provided as a basic useful minimum:

  1. Peak bytes per second
  2. Current bytes per second
  3. Total bytes
  4. Peak messages per second
  5. Current messages per second
  6. Total messages

These statistics will be generated for both delivered and received messages, per connection and aggregated per virtualhost, and also for the entire broker. Delivered messages are those that have been sent to a client by the broker, and may be counted more than once due to rollbacks and re-delivery. Received messages are those that are published by a client and received by the broker.

Mechanism

The statistics are generated using a sample counter, which is thread-safe and is triggered on incoming message events. The totals for messages and data (bytes) are recorded as type long and the rates are recorded as double since they may be fractional.

The default sample period is one second, during which events are recorded cumulatively. This can be changed for all counters in the broker by setting the system property qpid.statistics.samplePeriod. Rates are calculated as an average over this sample period, so if the traffic is very bursty a small sample period will fail to capture events most of the time, and you will end up with inflated peak rates. Since rates are calculated per second, the sample period defaults to 1000ms, but if this is increased to, say, 10000ms then messages will be counted over ten-second periods and the resulting totals divided by ten to give the rate.
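As an illustration of the averaging just described, the following is a minimal sketch of deriving a per-second rate from events accumulated over a sample period. It is not the broker's actual counter implementation; the class and field names are illustrative only.

{code:java}
// Minimal sketch (not the broker's StatisticsCounter): events are accumulated
// within a sample period and averaged into a per-second rate at the period boundary.
public class RateSketch
{
    private final long _samplePeriod;                        // in milliseconds, e.g. 1000
    private long _periodStart = System.currentTimeMillis();  // start of the current period
    private long _countInPeriod = 0;                         // events (or bytes) seen this period
    private double _rate = 0.0;                              // last calculated per-second rate

    public RateSketch(long samplePeriodMillis)
    {
        _samplePeriod = samplePeriodMillis;
    }

    public synchronized void registerEvent(long value, long timestamp)
    {
        if (timestamp - _periodStart >= _samplePeriod)
        {
            // End of the sample period: average the accumulated count over the
            // period length in seconds, then start a new period.
            _rate = _countInPeriod / (_samplePeriod / 1000.0d);
            _periodStart = timestamp;
            _countInPeriod = 0;
        }
        _countInPeriod += value;
    }

    public synchronized double getRate()
    {
        return _rate;
    }
}
{code}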

Info

Since qpid.statistics.samplePeriod determines the cut-off point at which the rates are calculated, stored, and then reset, changing this property should affect performance. Increasing the value should improve throughput; decreasing it should reduce throughput.

When all statistics generation is enabled, there will be 12 counters that are used, since statistics are calculated for the product of the following sets:

{ Broker, Virtualhost, Connection } x { Delivery, Receipt } x { Messages, Data }

Reporting

If statistics are being generated for the broker or for virtualhosts then periodic reporting can be enabled. This is generated at specified intervals, with the option to reset the statistics after outputting the current data.

No Format

[Statistics-Reporting] BRK-1008 : received : 4.395 kB/s peak : 4500 bytes total
[Statistics-Reporting] BRK-1008 : delivered : 4.395 kB/s peak : 4500 bytes total
[Statistics-Reporting] BRK-1009 : received : 45 msg/s peak : 45 msgs total
[Statistics-Reporting] BRK-1009 : delivered : 45 msg/s peak : 45 msgs total
[Statistics-Reporting] VHT-1003 : localhost : received : 1.465 kB/s peak : 1500 bytes total
[Statistics-Reporting] VHT-1003 : localhost : delivered : 1.465 kB/s peak : 1500 bytes total
[Statistics-Reporting] VHT-1004 : localhost : received : 15 msg/s peak : 15 msgs total
[Statistics-Reporting] VHT-1004 : localhost : delivered : 15 msg/s peak : 15 msgs total
[Statistics-Reporting] VHT-1003 : test : received : 0.977 kB/s peak : 1000 bytes total
[Statistics-Reporting] VHT-1003 : test : delivered : 0.977 kB/s peak : 1000 bytes total
[Statistics-Reporting] VHT-1004 : test : received : 10 msg/s peak : 10 msgs total
[Statistics-Reporting] VHT-1004 : test : delivered : 10 msg/s peak : 10 msgs total
[Statistics-Reporting] VHT-1003 : development : received : 1.953 kB/s peak : 2000 bytes total
[Statistics-Reporting] VHT-1003 : development : delivered : 1.953 kB/s peak : 2000 bytes total
[Statistics-Reporting] VHT-1004 : development : received : 20 msg/s peak : 20 msgs total
[Statistics-Reporting] VHT-1004 : development : delivered : 20 msg/s peak : 20 msgs total

Sample statistics report log messages

The reported statistics are output using the operational logging mechanism, and can therefore be extracted from the log files using the same mechanisms as are currently used for other such messages. The new message types are as follows:

  • BRK-1008 - Broker data statistics (peak bytes per second and total bytes)
  • BRK-1009 - Broker message statistics (peak messages per second and total messages)
  • VHT-1003 - Virtualhost data statistics (peak bytes per second and total bytes)
  • VHT-1004 - Virtualhost message statistics (peak messages per second and total messages)
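The reporting described above could be driven by a simple scheduled task, sketched below. This is illustrative only: the broker uses its own operational logging and configuration plumbing, the counter accessor names (getPeak, getTotal, reset) are assumptions based on the Design section later in this page, and the StatisticsCounter package name is assumed.

{code:java}
// Illustrative sketch of periodic reporting with an optional reset, mirroring the
// //statistics/reporting/period and //statistics/reporting/reset configuration.
// Counter accessor names and the StatisticsCounter package are assumptions.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.qpid.server.stats.StatisticsCounter;

public class ReportingSketch
{
    public static void start(final StatisticsCounter dataReceived,
                             final long periodSeconds,
                             final boolean resetAfterReport)
    {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(new Runnable()
        {
            public void run()
            {
                // Emit a BRK-1008 style line: peak rate and running total for received data.
                System.out.println("[Statistics-Reporting] BRK-1008 : received : "
                        + dataReceived.getPeak() + " bytes/s peak : "
                        + dataReceived.getTotal() + " bytes total");

                if (resetAfterReport)
                {
                    dataReceived.reset();
                }
            }
        }, periodSeconds, periodSeconds, TimeUnit.SECONDS);
    }
}
{code}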

Configuration

The statistics generation will be configured via the config.xml broker configuration mechanism.

Code Block: config.xml
<statistics>
    <generation>
        <!-- default false -->
        <broker>true</broker>
        <virtualhosts>true</virtualhosts>
        <connections>true</connections>
    </generation>
    <reporting>
        <period>3600</period><!-- seconds -->
        <reset>true</reset>
    </reporting>
</statistics>
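As a rough sketch of how these elements map onto configuration keys, the fragment below reads them with Apache Commons Configuration 1.x, assuming the statistics element shown above has been saved to its own file (statistics.xml here). The broker's real configuration classes may expose these values differently.

{code:java}
// Rough sketch: reading the <statistics> element with Commons Configuration 1.x.
// The file name is an example; the broker's own configuration classes may differ.
import org.apache.commons.configuration.XMLConfiguration;

public class StatisticsConfigSketch
{
    public static void main(String[] args) throws Exception
    {
        XMLConfiguration config = new XMLConfiguration("statistics.xml");

        boolean brokerStats = config.getBoolean("generation.broker", false);
        boolean vhostStats  = config.getBoolean("generation.virtualhosts", false);
        boolean connStats   = config.getBoolean("generation.connections", false);
        long reportPeriod   = config.getLong("reporting.period", 0L);      // seconds
        boolean resetAfter  = config.getBoolean("reporting.reset", false);

        System.out.println("broker=" + brokerStats + ", virtualhosts=" + vhostStats
                + ", connections=" + connStats + ", period=" + reportPeriod
                + "s, reset=" + resetAfter);
    }
}
{code}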

It is not possible to enable statistics generation for a single virtualhost - all virtualhosts will generate statistics if the //statistics/generation/virtualhosts element is set to true. It only makes sense to enable reporting if either broker or virtualhosts have statistics generation enabled also. However, the broker will simply ignore the reporting configuration if no statistics are being generated. If the //statistics/reporting/reset element is set to true then, after reporting on the statistics in the log, the statistics will be reset to zero for the entire broker.

Even if statistics generation is completely disabled, it is still possible to activate statistics on an individual connection while the broker is running. The JMX attribute statisticsEnabled on a connection MBean can be set to true, which will start statistics generation; totals and rates will be calculated from this point onward, until it is set to false again.
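A hedged example of doing this programmatically is shown below. The JMX service URL, port and the concrete connection name are hypothetical (connection names follow the pattern listed in the JMX Interface section), and any authentication required by the broker's JMX connector is omitted.

{code:java}
// Hedged sketch: enabling statistics on a single connection at runtime via JMX.
// The service URL, port and ObjectName value are hypothetical examples.
import javax.management.Attribute;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class EnableConnectionStatistics
{
    public static void main(String[] args) throws Exception
    {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:8999/jmxrmi"); // hypothetical port
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try
        {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            ObjectName connection = new ObjectName(
                    "org.apache.qpid:type=VirtualHost.Connection,VirtualHost=localhost,"
                    + "name=\"/127.0.0.1:54321\""); // hypothetical connection name

            // Start statistics generation for this connection only.
            mbsc.setAttribute(connection, new Attribute("statisticsEnabled", Boolean.TRUE));
        }
        finally
        {
            connector.close();
        }
    }
}
{code}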

Additionally, the following two system properties can be set to configure the statistics counter:

  • qpid.statistics.samplePeriod - sample period in ms - default 1000
  • qpid.statistics.disable - override configuration and disable statistics collection - default false (not set)

These two properties are exposed on the StatisticsCounter class as static fields DEFAULT_SAMPLE_PERIOD and DISABLE_STATISTICS.
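A plausible initialisation of those fields from the two properties is sketched below; the field and property names come from this page, but the exact initialisation in the broker may differ.

{code:java}
// Sketch only: how the static fields named above could be initialised from the
// system properties. The real StatisticsCounter may differ in detail.
public final class StatisticsCounterDefaults
{
    public static final long DEFAULT_SAMPLE_PERIOD =
            Long.getLong("qpid.statistics.samplePeriod", 1000L);

    public static final boolean DISABLE_STATISTICS =
            Boolean.getBoolean("qpid.statistics.disable");

    private StatisticsCounterDefaults()
    {
    }
}
{code}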

Design

StatisticsCounter Class

This implements the counting of event data and generates the total and rate statistics. There should be one instance of this class per type of statistic, such as messages or bytes. The instance methods that are called to add an event are:

  • void registerEvent()
    Registers a single event at the current system time, such as the arrival of a message.
  • void registerEvent(long value)
    Registers a cumulative value, such as the size of a message, at the current system time.
  • void registerEvent(long value, long timestamp)
    Registers a cumulative value at the specified time. As with the previous method, the value could be 1, allowing timestamped single events to be registered if the no-argument version is not sufficient.

There are three constructors:

  • StatisticsCounter()
  • StatisticsCounter(String name)
  • StatisticsCounter(String name, long samplePeriod)

These are chained in that order, using a default name of counter and a default sample period of 2000 ms, or the value of the qpid.statistics.samplePeriod property if it is set.

To retrieve the data, there are methods to return the current rate, peak rate and total, as well as the start time, sample period and name of the counter, and also a method to reset the counter.
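For illustration, typical use of such a counter might look like the following. The accessor names (getRate, getPeak, getTotal) and the package name are assumptions based on the description above rather than a documented API.

{code:java}
// Illustrative usage of the counter described above; accessor and package names
// are assumptions based on this page.
import org.apache.qpid.server.stats.StatisticsCounter;

public class CounterUsageSketch
{
    public static void main(String[] args) throws InterruptedException
    {
        StatisticsCounter messagesDelivered = new StatisticsCounter("messages-delivered");
        StatisticsCounter dataDelivered = new StatisticsCounter("data-delivered");

        // Record ten deliveries of 1 kB messages, spread over a couple of seconds.
        for (int i = 0; i < 10; i++)
        {
            messagesDelivered.registerEvent();   // one message at the current time
            dataDelivered.registerEvent(1024L);  // its size in bytes
            Thread.sleep(200);
        }

        System.out.println("current rate: " + messagesDelivered.getRate() + " msg/s");
        System.out.println("peak rate:    " + messagesDelivered.getPeak() + " msg/s");
        System.out.println("total:        " + messagesDelivered.getTotal() + " msgs");
    }
}
{code}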

StatisticsGatherer Interface

This is implemented by the broker business object that is generating statistics. It provides the following methods:

  • void initialiseStatistics()
  • void registerMessageReceived(long messageSize, long timestamp)
  • void registerMessageDelivered(long messageSize)
  • StatisticsCounter getMessageDeliveryStatistics()
  • StatisticsCounter getDataDeliveryStatistics()
  • StatisticsCounter getMessageReceiptStatistics()
  • StatisticsCounter getDataReceiptStatistics()
  • void resetStatistics()
  • void setStatisticsEnabled(boolean enabled)
  • boolean isStatisticsEnabled()

These statistics are exposed using the separate JMX MBean interfaces detailed below, which call these methods to retrieve the underlying StatisticsCounter objects and return their attributes. This interface gives a standard way for parts of the broker to set up and configure statistics generation.

When creating these objects, there should be a parent/child relationship between them, normally from broker to virtualhost to connection. This means that the lowest level gatherer can record statistics (if enabled) and also pass on the notification to its parent object to allow higher level aggregation of statistics. When resetting statistics, this works in the opposite direction, with higher level gatherers also resetting all of their child objects. Note that this parent/child relationship is never explicitly specified, and is dependent on the implementation of registerMessageDelivered and resetStatistics in order to allow more flexibility.
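A simplified sketch of that propagation is shown below; the class and field names are illustrative, the StatisticsCounter package name is assumed, and the real broker classes implement considerably more of the interface.

{code:java}
// Simplified sketch of the parent/child propagation described above.
// Class, field and reset() names are illustrative, not the broker's actual code.
import org.apache.qpid.server.stats.StatisticsCounter;

public class GathererSketch
{
    private final GathererSketch _parent;  // connection -> virtualhost -> broker; null at the top
    private final StatisticsCounter _messagesDelivered = new StatisticsCounter("messages-delivered");
    private final StatisticsCounter _dataDelivered = new StatisticsCounter("data-delivered");
    private volatile boolean _enabled;

    public GathererSketch(GathererSketch parent)
    {
        _parent = parent;
    }

    public void registerMessageDelivered(long messageSize)
    {
        if (_enabled)
        {
            _messagesDelivered.registerEvent();
            _dataDelivered.registerEvent(messageSize);
        }
        if (_parent != null)
        {
            // The notification travels upwards so virtualhost and broker totals aggregate.
            _parent.registerMessageDelivered(messageSize);
        }
    }

    public void resetStatistics()
    {
        _messagesDelivered.reset();
        _dataDelivered.reset();
        // In the broker, resetting would also walk downwards over child gatherers.
    }

    public void setStatisticsEnabled(boolean enabled)
    {
        _enabled = enabled;
    }

    public boolean isStatisticsEnabled()
    {
        return _enabled;
    }
}
{code}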

JMX Interface

The Qpid JMX interface level that supports statistics is 1.9. Each object (MBean) that can generate statistics will add the following attributes and operations:

Operation                 Description
resetStatistics           Reset the statistics counters for this object, and all children.

Attribute                 Permissions              Description
peakMessageDeliveryRate   R                        Peak number of messages delivered per second
peakDataDeliveryRate      R                        Peak number of bytes delivered per second
messageDeliveryRate       R                        Current number of messages delivered per second
dataDeliveryRate          R                        Current number of bytes delivered per second
totalMessagesDelivered    R                        Total number of messages delivered since start or reset
totalDataDelivered        R                        Total number of bytes delivered since start or reset
peakMessageReceiptRate    R                        Peak number of messages received per second
peakDataReceiptRate       R                        Peak number of bytes received per second
messageReceiptRate        R                        Current number of messages received per second
dataReceiptRate           R                        Current number of bytes received per second
totalMessagesReceived     R                        Total number of messages received since start or reset
totalDataReceived         R                        Total number of bytes received since start or reset
statisticsEnabled         R (W Connections only)   Is statistics generation enabled or disabled

The following MBeans have had statistics attributes added:

  • ServerInformation
    The entire broker, statistics for all virtualhosts. The totals for the broker will be equal to the sum of the totals for all virtualhosts. Rates for the broker include messages from all virtualhosts, so the broker peak can be higher than individual virtualhosts, and possibly also higher than the sum, due to slight timing differences. This is to be expected.
    org.apache.qpid:type=ServerInformation,name=ServerInformation,*
  • ManagedBroker
    This is just a misleadingly named virtualhost. Note that the sum of all connections' statistics will not always equal the statistics for the virtualhost, as connection MBeans are discarded when the connection is closed.
    org.apache.qpid:type=VirtualHost.VirtualHostManager,VirtualHost=*,*
  • ManagedConnection
    Individual connections, which can be selected on a per-virtualhost basis. The name of a connection is usually the Java toString() representation of the address of the remote internet socket making the connection, so it is usually not possible to determine these in advance. These allow enabling and disabling of statistics generation.
    org.apache.qpid:type=VirtualHost.Connection,VirtualHost=*,name=*
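A hedged example of reading the broker-level statistics attributes over JMX follows; the service URL and port are hypothetical and authentication is omitted, while the ObjectName and attribute names are taken from the patterns and table above.

{code:java}
// Hedged sketch: reading broker statistics attributes from the ServerInformation MBean.
// The JMX URL and port are hypothetical examples.
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ReadBrokerStatistics
{
    public static void main(String[] args) throws Exception
    {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:8999/jmxrmi"); // hypothetical port
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try
        {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            ObjectName broker = new ObjectName(
                    "org.apache.qpid:type=ServerInformation,name=ServerInformation");

            Object total = mbsc.getAttribute(broker, "totalMessagesDelivered");
            Object rate = mbsc.getAttribute(broker, "messageDeliveryRate");
            Object peak = mbsc.getAttribute(broker, "peakMessageDeliveryRate");

            System.out.println("delivered " + total + " msgs, current " + rate
                    + " msg/s, peak " + peak + " msg/s");
        }
        finally
        {
            connector.close();
        }
    }
}
{code}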

The JMX attributes that record statistics are always present, and will have a value of 0 / 0.0 if generation is not enabled, and the statisticsEnabled attribute will be set to false.

Note

The qpid.statistics.disable system property will override all configuration on the broker. All attributes will be reported as 0 / 0.0 and JMX will show statistics as being disabled via the statisticsEnabled property, although it will be possible to selectively enable generation on a per-connection basis.

h2. Usage

The JMX JConsole application can be used to view the attributes, both as discrete values and as graphs. Sample output is shown below, illustrating the virtualhost statistics for a broker with a producer and a consumer attached and sending messages.

!jconsole-statistics-zero.png|thumbnail!
*Virtualhost statistics attributes at broker startup*

!jconsole-statistics-graphs.png|thumbnail!
*Virtualhost statistics graphs and attributes after sending and receiving messages through the broker*

h1. Testing

One caveat related to testing is that these statistics, apart from totals, are by their nature time dependent, and subject to race conditions, where sending messages as part of a test can overlap two sample periods, instead of one, causing seemingly incorrect results.

h2. Unit

The {{StatisticsCounter}} class is suitable for unit testing. Tests checking class creation and operation check the total, rate and peak data, as well as other class properties. It is difficult to write a test to provably show thread safety when updating the counter with new events, but there are tests to check that out-of-order events will not cause problems with the data generated.

h2. System

System testing covers the generation of statistics using a running broker, as well as the configuration mechanisms and the operation of the JMX interface to the data and the operational logging output. The following system tests have been written, all using a shared parent test case class, {{MessageStatisticsTestCase}}, which sets up the JMX connection to the broker.

{note}
In order for these JMX based tests to work correctly with the _default_ test profile, a fix had to be back-ported from trunk which disables the use of a custom {{ServerSocketFactory}} in the JMX registry. The property that disables this is set only in the {{JMXUtils}} helper class, and should not be used in normal broker operation.
{note}

* {{MessageStatisticsTest}}
* {{MessageStatisticsDeliveryTest}}
* {{MessageStatisticsReportingTest}}
* {{MessageConnectionStatisticsTest}}
* {{MessageStatisticsConfigurationTest}}

These tests can be run with the following command, from the _systests_ module directory:

{{_$_ *ant test -Dtest=Message\*Statistics\*Test*}}

h2. Performance

To measure performance, a release of _2.6.0.7_ was generated, using the standard release mechanism. The data should be compared with the previous [results|2.6.0.6 Testing] for the _2.6.0.6_ release. The performance test suite was run on {{noc-qpiddev02.jpmorganchase.com}} with the broker running on {{noc-qpiddev01.jpmorganchase.com}}. Statistics were enabled on the broker by editing the {{etc/persistent_config.xml}} file to set statistics generation to _true_ for the broker, virtualhosts and connections, as well as to report every hour (without resetting the statistics). The _Disabled_ configuration used the provided default configuration file, and the _Broker Only_ configuration used the same as the first set of tests, but with the {{//statistics/generation/virtualhosts}} and {{//statistics/generation/connections}} elements set to _false_. The following command was executed for each of the configurations, and the results are aggregated into the tables below:

{{_$_ *./Throughput.sh*}}

{note:title=Accuracy}
The measured standard deviation of these results is *2.2%* (about *500*) of the mean (about *23000*), based on ten sample runs of the _TTBT-TX_ test with no statistics generation enabled.
For the _TTBT-NA_ test, a set of [102 test runs|^throughput.csv] indicated that the standard deviation was around *0.8%* of the mean, with a range of *3.8%* of the mean between the minimum and maximum result.
{note}

h3. Initial Results

These results use the code from SVN revision r1033077, and were generated on 09 and 10 November 2010.

h4. Throughput Numbers

{table:cellpadding=3|cellspacing=0|border=1}
{tr}{td:bgcolor=#eeeeff}*Test*\\&nbsp;{td}{td:bgcolor=#eeeeff}*Throughput*\\&nbsp;{td}{td:bgcolor=#eeeeff}*Throughput*\\Broker Only{td}{td:bgcolor=#eeeeff}*Throughput*\\Disabled{td}{tr}
{tr}{td}PQBT-AA-Qpid-01{td}{td}2937.3007{td}{td}0{td}{td}3239.986{td}{tr}
{tr}{td}PQBT-TX-Qpid-01{td}{td}3142.3297{td}{td}0{td}{td}3240.4037{td}{tr}
{tr}{td}PTBT-AA-Qpid-01{td}{td}3817.1709{td}{td}3897.9127{td}{td}3941.363{td}{tr}
{tr}{td}PTBT-TX-Qpid-01{td}{td}3971.9589{td}{td}0{td}{td}4087.7404{td}{tr}
{tr}{td}TQBT-AA-Qpid-01{td}{td}52299.27{td}{td}0{td}{td}53195.957{td}{tr}
{tr}{td}TQBT-NA-Qpid-01{td}{td}50979.767{td}{td}48090.19{td}{td}51696.26{td}{tr}
{tr}{td}TQBT-TX-Qpid-01{td}{td}7321.2233{td}{td}0{td}{td}7438.8633{td}{tr}
{tr}{td}TTBT-AA-Qpid-01{td}{td}86934.654{td}{td}0{td}{td}92742.66{td}{tr}
{tr}{td}TTBT-NA-Qpid-01{td}{td}103829.22{td}{td}0{td}{td}128478.24{td}{tr}
{tr}{td}TTBT-TX-Qpid-01{td}{td}21685.606{td}{td}0{td}{td}22985.97{td}{tr}
{table}

There was no obvious difference between broker only and all statistics, so that testing was discontinued.

h4. Comparison with previous release

{table:cellpadding=3|cellspacing=0|border=1}
{tr}{td:bgcolor=#eeeeff}*Test*\\&nbsp;{td}{td:bgcolor=#eeeeff}*2.6.0.6*\\&nbsp;{td}{td:bgcolor=#eeeeff}*2.6.0.7*\\&nbsp;{td}{td:bgcolor=#eeeeff}*Change*\\&nbsp;{td}{td:bgcolor=#eeeeff}*2.6.0.7*\\Disabled{td}{td:bgcolor=#eeeeff}*Change*\\Disabled{td}{tr}
{tr}{td}PQBT-AA-Qpid-01{td}{td}3214.728{td}{td}2937.3007{td}{td}-8.63%{td}{td}3239.986{td}{td}0.79%{td}{tr}
{tr}{td}PQBT-TX-Qpid-01{td}{td}3224.1871{td}{td}3142.3297{td}{td}-2.54%{td}{td}3240.4037{td}{td}0.50%{td}{tr}
{tr}{td}PTBT-AA-Qpid-01{td}{td}4069.4785{td}{td}3817.1709{td}{td}-6.20%{td}{td}3941.363{td}{td}-3.15%{td}{tr}
{tr}{td}PTBT-TX-Qpid-01{td}{td}4088.1968{td}{td}3971.9589{td}{td}-2.84%{td}{td}4087.7404{td}{td}-0.1%{td}{tr}
{tr}{td}TQBT-AA-Qpid-01{td}{td}51871.49{td}{td}52299.27{td}{td}0.82%{td}{td}53195.957{td}{td}2.55%{td}{tr}
{tr}{td}TQBT-NA-Qpid-01{td}{td}54278.095{td}{td}50979.767{td}{td}-6.08%{td}{td}51696.26{td}{td}-4.76%{td}{tr}
{tr}{td}TQBT-TX-Qpid-01{td}{td}7496.785{td}{td}7321.2233{td}{td}-2.34%{td}{td}7438.8633{td}{td}-0.77%{td}{tr}
{tr}{td}TTBT-AA-Qpid-01{td}{td}94392.1{td}{td}86934.654{td}{td}-7.90%{td}{td}92742.66{td}{td}-1.75%{td}{tr}
{tr}{td}TTBT-NA-Qpid-01{td}{td}129089.54{td}{td}103829.22{td}{td}-19.57%{td}{td}128478.24{td}{td}-0.47%{td}{tr}
{tr}{td}TTBT-TX-Qpid-01{td}{td}23134.289{td}{td}21685.606{td}{td}-6.26%{td}{td}22985.97{td}{td}-0.64%{td}{tr}
{tr}{td:colspan=2}{td}{td:bgcolor=#eeeeff}*Average*{td}{td}*-6.84%*{td}{td:bgcolor=#eeeeff}*Average*{td}{td}*-0.77%*{td}{tr}
{table}

h4. Analysis

{warning}
As can be seen, this results in an average of around *7%* reduction in throughput with all statistics generation enabled, and less than *1%* reduction with them disabled. One anomaly that stands out is the _TTBT-NA_ test (transient, non-acknowledge messages), which shows a nearly *20%* decrease in throughput and warrants further investigation. However, this is obviously too great a penalty, and will require code changes and further investigation and testing.
{warning}

Further testing with different combinations of broker, virtualhost and connection statistics enabled will need to be carried out to determine where the greatest penalty in performance lies. Based on this, there may be changes required to the {{StatisticsCounter}} class, or the propagation mechanism that links child and parent counter objects. It is also important to determine what the maximum acceptable loss of throughput is to a customer. It would seem desirable to have less than a *5%* decrease in throughput due to statistics generation; however, it should be noted that a performance penalty is inevitable when adding code to the message delivery and receipt paths in the broker, and the aim is only to minimise this, not to eliminate it.

h3. Further Testing

The *7%* performance loss was felt to be unacceptable, so modifications were made to the {{StatisticsCounter}} code to reduce the time spent carrying out computations. The following changes were made:

# The system property {{qpid.statistics.samplePeriod}} was changed from _1000_ to _2000_ and _5000_.
# The code in {{StatisticsCounter}} was modified to defer more calculation to the end of a sample period.

h4. Throughput Numbers

The same _2.6.0.7_ broker was used, with _QPID_OPTS_ set to {{-Dqpid.statistics.samplePeriod=N}} where _N_ is _2000_ or _5000_.

{table:cellpadding=3|cellspacing=0|border=1}
{tr}{td:bgcolor=#eeeeff}*Test*\\&nbsp;{td}{td:bgcolor=#eeeeff}*2.6.0.6*\\&nbsp;{td}{td:bgcolor=#eeeeff}*2.6.0.7*\\2000{td}{td:bgcolor=#eeeeff}*Change*\\2000{td}{td:bgcolor=#eeeeff}*2.6.0.7*\\5000{td}{td:bgcolor=#eeeeff}*Change*\\5000{td}{tr}
{tr}{td}PQBT-AA-Qpid-01{td}{td}3214.728{td}{td}3198.1702{td}{td}-0.52%{td}{td}2970.78{td}{td}-7.59%{td}{tr}
{tr}{td}PQBT-TX-Qpid-01{td}{td}3224.1871{td}{td}3214.087{td}{td}-0.31%{td}{td}2875.1144{td}{td}-10.82%{td}{tr}
{tr}{td}PTBT-AA-Qpid-01{td}{td}4069.4785{td}{td}4030.176{td}{td}-0.97%{td}{td}3872.7224{td}{td}-4.83%{td}{tr}
{tr}{td}PTBT-TX-Qpid-01{td}{td}4088.1968{td}{td}4175.8924{td}{td}2.15%{td}{td}3962.1336{td}{td}-3.08%{td}{tr}
{tr}{td}TQBT-NA-Qpid-01{td}{td}54278.095{td}{td}52546.314{td}{td}-3.19%{td}{td}48368.572{td}{td}-10.89%{td}{tr}
{tr}{td}TTBT-TX-Qpid-01{td}{td}23134.289{td}{td}22086.33{td}{td}-4.53%{td}{td}18947.935{td}{td}-18.10%{td}{tr}
{tr}{td:colspan=2}{td}{td:bgcolor=#eeeeff}*Average*{td}{td}*-1.23%*{td}{td:bgcolor=#eeeeff}*Average*{td}{td}*-9.22%*{td}{tr}
{table}

Old is an unmodified _2.6.0.7_ broker; the new broker has _QPID_OPTS_ set to {{-Dqpid.statistics.samplePeriod=5000}} and some changes to the {{StatisticsCounter}} class that move computation to the end of the sample period. All three have _QPID_JAVA_MEM_ set to {{-Xmx2g}}. Results are the average of three runs.

{table:cellpadding=3|cellspacing=0|border=1}
{tr}{td:bgcolor=#eeeeff}*TTBT-NA-Qpid-01*{td}{td:bgcolor=#eeeeff}*Result*{td}{td:bgcolor=#eeeeff}*Change*{td}{tr}
{tr}{td:bgcolor=#eeeeff}*2.6.0.6*{td}{td}113206{td}{td}0.0%{td}{tr}
{tr}{td:bgcolor=#eeeeff}*2.6.0.7* Old{td}{td}103077{td}{td}-8.9%{td}{tr}
{tr}{td:bgcolor=#eeeeff}*2.6.0.7* New 5s{td}{td}106202{td}{td}-6.1%{td}{tr}
{tr}{td:bgcolor=#eeeeff}*2.6.0.7* New 10s{td}{td}104196{td}{td}-8.0%{td}{tr}
{table}

h3. More Tests

Two 5-minute runs of a simple test, producing and consuming messages as fast as possible in a single thread. This was run on an HP DL585, quad dual-core 2.2GHz Opteron, 8Gb, hence the difference in reported rates.
* Statistics Configuration Enabled {{_$_ *QPID_OPTS=-Dqpid.statistics.samplePeriod=2500 QPID_JAVA_MEM=-Xmx4g ./qpid-server*}} 3792.9 msgs/s
* Statistics Configuration Enabled {{_$_ *QPID_OPTS=-Dqpid.statistics.samplePeriod=500 QPID_JAVA_MEM=-Xmx4g ./qpid-server*}} 3961.8 msgs/s
* Statistics Configuration Enabled {{_$_ *QPID_OPTS=-Dqpid.statistics.samplePeriod=30000 QPID_JAVA_MEM=-Xmx4g ./qpid-server*}} 3999.3 msgs/s
* Statistics Configuration Enabled {{_$_ *QPID_OPTS=-Dqpid.statistics.disable=true QPID_JAVA_MEM=-Xmx4g ./qpid-server*}} 4156.3 msgs/s
* Stock Apache 0.5 Download - http://apache.mirror.anlx.net//qpid/0.5/qpid-java-broker-0.5.tar.gz {{_$_ *QPID_JAVA_MEM=-Xmx4g ./qpid-server*}} 4068.3 msgs/s

{info}
These results would seem to indicate that the act of generating statistics is causing the slowdown, rather than updating the peak value at the end of a sample period. Additionally, the results for the *single-threaded* testing are much closer, suggesting that lock contention with many connections is an issue.
{info}

h3. Profiling

To properly determine hot-spots in the code, a profiling run was made using the _YourKit_ profiler. The simple test used in the previous section was run with the profiler attached, and CPU statistics were collated. The profiler detected the following hotspots:

* {{StatisticsCounter.registerEvent}} - *4%*
* {{AtomicLong.addAndGet}} - *3%*
* {{AtomicLong.compareAndSet}} - *2%*

Since the {{AtomicLong}} calls are made inside {{registerEvent}}, it seems that this is where the CPU is spending the extra time that is causing the slowdown. In order to mitigate this, the {{registerEvent}} method was changed to be {{synchronized}} and the internal logic changed to use primitive {{long}} types and standard arithmetic operations. After this change, the profiler no longer showed {{registerEvent}} as a hotspot in the code when another test run was made.

The updated code was copied to the broker being used for throughput testing, and the *TTBT-NA-Qpid-01* test was re-run, with results shown below (again, an average of three runs was taken).

{table:cellpadding=3|cellspacing=0|border=1}
{tr}{td:bgcolor=#eeeeff}*TTBT-NA-Qpid-01*{td}{td:bgcolor=#eeeeff}*Result*{td}{td:bgcolor=#eeeeff}*Change*{td}{tr}
{tr}{td:bgcolor=#eeeeff}*2.6.0.6*{td}{td}113206{td}{td}0.0%{td}{tr}
{tr}{td:bgcolor=#eeeeff}*2.6.0.7* Latest{td}{td}103698{td}{td}-8.3%{td}{tr}
{table}

Finally, the single-threaded test was also run against _2.6.0.6_ and the latest _2.6.0.7_ build, with the following results:

{table:cellpadding=3|cellspacing=0|border=1}
{tr}{td:bgcolor=#eeeeff}*TTBT-NA-Qpid-01*{td}{td:bgcolor=#eeeeff}*Msgs/Second*{td}{td:bgcolor=#eeeeff}*Change*{td}{tr}
{tr}{td:bgcolor=#eeeeff}*2.6.0.6*{td}{td}1714.5{td}{td}0.0%{td}{tr}
{tr}{td:bgcolor=#eeeeff}*2.6.0.7* Latest{td}{td}1660{td}{td}-3.2%{td}{tr}
{table}

In order to more fully test this, the script which starts the _TTBT-NA_ test, and the *Throughput.sh* script itself, were copied and modified. A series of scripts, *STAT-*_XX_*.sh*, was created, consisting of identical calls to the _TTBT-NA_ test apart from the number of consumers, which varied from _01_ to _16_ and is represented by _XX_, and a *Statistics.sh* script was created which calls each of these in turn. The _STAT_ throughput tests were run against a standard _2.6.0.6_ broker and the _2.6.0.7_ build with all statistics generation enabled. It is expected that the difference in throughput should _increase_ as the number of consumers increases.
Note that these tests were only run *once*, so are not as accurate as the averaged, multi-run datasets.

h4. 2.6.0.6 Throughput Numbers

{{_$_ *QPID_JAVA_MEM=-Xmx2g ./qpid-server -c ../etc/persistent_config.xml*}}

{table:cellpadding=3|cellspacing=0|border=1}
{tr}{td:bgcolor=#eeeeff}*Test*{td}{td:bgcolor=#eeeeff}*Throughput*{td}{tr}
{tr}{td}STAT-01{td}{td}57415.543{td}{tr}
{tr}{td}STAT-02{td}{td}92730.82{td}{tr}
{tr}{td}STAT-03{td}{td}105619.415{td}{tr}
{tr}{td}STAT-04{td}{td}116192.78{td}{tr}
{tr}{td}STAT-05{td}{td}119669.87{td}{tr}
{tr}{td}STAT-06{td}{td}122700.195{td}{tr}
{tr}{td}STAT-07{td}{td}111541.725{td}{tr}
{tr}{td}STAT-08{td}{td}126887.405{td}{tr}
{tr}{td}STAT-09{td}{td}110207.38{td}{tr}
{tr}{td}STAT-10{td}{td}122272.66{td}{tr}
{tr}{td}STAT-11{td}{td}123028.33{td}{tr}
{tr}{td}STAT-12{td}{td}127724.23{td}{tr}
{tr}{td}STAT-13{td}{td}114467.72{td}{tr}
{tr}{td}STAT-14{td}{td}128505.86{td}{tr}
{tr}{td}STAT-15{td}{td}116783.16{td}{tr}
{tr}{td}STAT-16{td}{td}127264.14{td}{tr}
{table}

h4. 2.6.0.7 Throughput Numbers

{{_$_ *QPID_OPTS=-Dqpid.statistics.samplePeriod=5000 QPID_JAVA_MEM=-Xmx2g ./qpid-server -c ../etc/persistent_config.xml*}}

{table:cellpadding=3|cellspacing=0|border=1}
{tr}{td:bgcolor=#eeeeff}*Test*{td}{td:bgcolor=#eeeeff}*Throughput*{td}{td:bgcolor=#eeeeff}*Change*{td}{tr}
{tr}{td}STAT-01{td}{td}57101.006{td}{td}-0.5%{td}{tr}
{tr}{td}STAT-02{td}{td}83016.304{td}{td}-10.5%{td}{tr}
{tr}{td}STAT-03{td}{td}92533.17{td}{td}-12.4%{td}{tr}
{tr}{td}STAT-04{td}{td}99060.04{td}{td}-14.8%{td}{tr}
{tr}{td}STAT-05{td}{td}103271.49{td}{td}-13.7%{td}{tr}
{tr}{td}STAT-06{td}{td}103339.02{td}{td}-15.8%{td}{tr}
{tr}{td}STAT-07{td}{td}103619.804{td}{td}-7.1%{td}{tr}
{tr}{td}STAT-08{td}{td}105613.2{td}{td}-16.8%{td}{tr}
{tr}{td}STAT-09{td}{td}104110.99{td}{td}-5.5%{td}{tr}
{tr}{td}STAT-10{td}{td}107108.84{td}{td}-12.4%{td}{tr}
{tr}{td}STAT-11{td}{td}106436.25{td}{td}-13.5%{td}{tr}
{tr}{td}STAT-12{td}{td}108537.865{td}{td}-15.0%{td}{tr}
{tr}{td}STAT-13{td}{td}104056.22{td}{td}-9.1%{td}{tr}
{tr}{td}STAT-14{td}{td}108337.49{td}{td}-15.7%{td}{tr}
{tr}{td}STAT-15{td}{td}106900.96{td}{td}-8.5%{td}{tr}
{tr}{td}STAT-16{td}{td}109222.84{td}{td}-14.2%{td}{tr}
{table}

Although it is difficult to draw any real or meaningful conclusions from these results, note that *single* consumer performance degradation is negligible. A useful further test would be to run the _2.6.0.7_ broker with only connection-level statistics being generated. This ought to remove the contention for the lock on the counters for virtualhosts and the broker when many consumers are being delivered to.

h4. 2.6.0.7 Throughput Numbers (Connection Statistics Only)

{{_$_ *QPID_OPTS=-Dqpid.statistics.samplePeriod=5000 QPID_JAVA_MEM=-Xmx2g ./qpid-server -c ../etc/connection_config.xml*}}

{table:cellpadding=3|cellspacing=0|border=1}
{tr}{td:bgcolor=#eeeeff}*Test*{td}{td:bgcolor=#eeeeff}*Throughput*{td}{td:bgcolor=#eeeeff}*Change*{td}{tr}
{tr}{td}STAT-01{td}{td}55884.613{td}{td}-2.7%{td}{tr}
{tr}{td}STAT-02{td}{td}91857.41{td}{td}-0.9%{td}{tr}
{tr}{td}STAT-07{td}{td}114754.22{td}{td}2.9%{td}{tr}
{tr}{td}STAT-08{td}{td}117917.595{td}{td}-7.1%{td}{tr}
{tr}{td}STAT-09{td}{td}114506.71{td}{td}3.9%{td}{tr}
{tr}{td}STAT-10{td}{td}122330.33{td}{td}0.0%{td}{tr}
{tr}{td}STAT-11{td}{td}118661.09{td}{td}-3.5%{td}{tr}
{tr}{td}STAT-13{td}{td}116632.744{td}{td}1.9%{td}{tr}
{tr}{td}STAT-15{td}{td}118917.79{td}{td}1.8%{td}{tr}
{tr}{td}STAT-16{td}{td}120704.68{td}{td}-5.2%{td}{tr}
{table}

h2. Conclusion?

It would appear that there is a penalty of a few percent (up to 4%) for a single thread, rising to over 10% when there are many threads. Further testing will be required to determine how the number of connections (i.e. threads) causes the performance to vary. This is, of course, hampered by the fact that the throughput tests exhibit up to 4% variability between the minimum and maximum results for a particular test.

Additionally, it is unclear which of the two mechanisms ({{AtomicLong}} versus a synchronized {{registerEvent}} method) used in the counter is preferable. For the many-threaded, multi-consumer/producer case the {{AtomicLong}} seems the better choice; for a single producer/consumer the synchronized {{registerEvent}} method is better.

Another point to note is that the actual per-message latency increase is quite small, and it is only when sending large numbers of messages _using many connections_ that the degradation comes into play, and the ratio of messages to data is small; i.e. the *data* throughput does not concern us, it is purely the number of messages. This means that for most applications there should not be a noticeable difference in performance, with any small change in per-message latency being eaten up in noise. Another point to note is that in these tests garbage collection

h3. Further Modifications

This data was generated after modifying the broker code to place statistics recording for message *delivery* after the message is enqueued at the broker (or at least, after an attempt has been made to do so).

{table:cellpadding=3|cellspacing=0|border=1}
{tr}{td:bgcolor=#eeeeff}*Test*{td}{td:bgcolor=#eeeeff}*2.6.0.7*{td}{td:bgcolor=#eeeeff}*2.6.0.6*{td}{td:bgcolor=#eeeeff}*2.6.0.7*{td}{td:bgcolor=#eeeeff}*2.6.0.7*{td}{td:bgcolor=#eeeeff}*Disabled*{td}{tr}
{tr}{td}PQBT-AA-Qpid-01{td}{td}2777.4448{td}{td}2696.0464{td}{td}2702.624{td}{td}2815.1379{td}{td}2730.4008{td}{tr}
{tr}{td}PQBT-TX-Qpid-01{td}{td}3206.5027{td}{td}3147.2404{td}{td}3179.9607{td}{td}2701.3173{td}{td}3148.375{td}{tr}
{tr}{td}PTBT-AA-Qpid-01{td}{td}4014.191{td}{td}3977.7153{td}{td}3902.3087{td}{td}3983.5536{td}{td}3996.9091{td}{tr}
{tr}{td}PTBT-TX-Qpid-01{td}{td}4023.8338{td}{td}3996.622{td}{td}4079.2136{td}{td}3694.2487{td}{td}4006.0062{td}{tr}
{tr}{td}TQBT-AA-Qpid-01{td}{td}47845.87{td}{td}51156.82{td}{td}52660.16{td}{td}45339.924{td}{td}51316.2{td}{tr}
{tr}{td}TQBT-NA-Qpid-01{td}{td}48345.562{td}{td}47846.085{td}{td}48281.647{td}{td}49123.188{td}{td}47374.016{td}{tr}
{tr}{td}TQBT-TX-Qpid-01{td}{td}6301.3253{td}{td}7262.565{td}{td}7209.0416{td}{td}6326.9844{td}{td}7234.8924{td}{tr}
{tr}{td}TTBT-AA-Qpid-01{td}{td}78419.44{td}{td}89033.62{td}{td}79150.925{td}{td}80203.896{td}{td}90342.84{td}{tr}
{tr}{td}TTBT-NA-Qpid-01{td}{td}97468.78{td}{td}110630.585{td}{td}97623.314{td}{td}98914.57{td}{td}120317.604{td}{tr}
{tr}{td}TTBT-TX-Qpid-01{td}{td}18498.884{td}{td}22650.204{td}{td}21379.168{td}{td}18915.474{td}{td}22702.417{td}{tr}
{table}

As a cross-check, the tests were repeated using a stock _2.6.0.6_ broker, then again (twice, in fact) with the _2.6.0.7_ broker, like the first time, in case there were any environmental factors skewing the results, and finally with the _2.6.0.7_ broker but with statistics generation disabled. These results show that the persistent testing has results that are about the same for both versions, whether the statistics are enabled or disabled. This is due to the naturally lower throughput causing less contention and allowing the statistics generation to run unimpeded.
The transient tests get worse with statistics generation enabled, due to the much higher message throughput and the fact that many messages are trying to update the counters at the same time.

h3. Latency

As a general recommendation, I would suggest that statistics generation is only enabled in cases of low message (count, not size) traffic, where the broker is not being 100% utilised in terms of CPU. Persistent messaging is necessarily I/O bound, in terms of disk access, and this is usually the determining factor in message latency, so adding the statistics counters places no large additional burden, and will not add a large amount to the latency - certainly less than 5% degradation. It is obvious that *any* additional processing that takes place during the act of delivery or receipt of a message will inevitably add latency and decrease throughput on the broker. If a broker is already operating at peak performance, something that is only possible with transient messages, which are only dependent on the amount of CPU and RAM available, then the enabling of statistics counters will have a greater effect, possibly degrading performance by over 10%.

h1. Tasks

* Create generic data capture utility for gathering stats on received/delivered messages inc current/peak messages/sec and rolling total messages inc message count and bytes (payload only) stats, inc unit tests. Requires further discussion.
* Configure stats capture (on/off) for broker (config.xml - new section) and per connection (jmx) and data extract dump interval, and code to dump stats data
* Expose stats via JMX MBean attributes for query by users (??notification?? - reqs to be confirmed with client), add reset feature via JMX and tests for these features. JMX changes to expose new stats attributes and reset function, inc tests
* Performance test to measure overhead of stats calculation (see RG email/example class) - separate from perf test harness, small class to measure calc cost only

[QPID-2932|https://issues.apache.org/jira/browse/QPID-2932] Add statistics generation for broker message delivery
[QPID-????|https://issues.apache.org/jira/browse/QPID-????] Port to trunk
