Steps to Add a Meter
- Clarify the Purpose of the Meter
- Define the Measurement
- Design the Meter
- Evaluate Sources of Information
- Instrument the Code
Each step is described more fully in the sections below.
We strongly advise making every attempt to complete each step before proceeding to the next.
Rationale.
- It is very, very tempting to assume that, because an existing method has a name similar to the attribute we want to measure, the method is "obviously" the best place to measure the attribute.
- It is very, very tempting to assume that, because an existing stat has a name similar to the attribute we want to measure, the stat "obviously" measures exactly the right attribute.
We have made each of these assumptions numerous times. Each time, we came to regret the assumption. The "obvious" place to measure often turns out to be incomplete, incorrect, or otherwise inappropriate.
We acknowledge that it can be difficult or impossible to complete each step before proceeding to the next. We strongly advise making the attempt.
Clarify the Purpose of the Meter
Clarify the purpose for adding this meter to Geode.
- Who is the audience for the measurement?
- What goals will the measurement help them achieve?
- How does the measurement help them achieve these goals?
Rationale. Clarifying the purpose of the meter will help you:
- Name and describe the meter.
- Determine what tags to add to the meter to allow the audience to filter and sort measurements.
- Identify and evaluate potential instrumentation sites.
- Ensure that measurements are relevant to your audience's needs, rather than merely easy to measure.
Define the Measurement
Describe as precisely as possible the attribute measured by the meter.
Key questions:
- What attribute does the meter measure?
- What operations, events, or conditions cause the attribute to change?
- In what scope (see below) does the meter measure this attribute?
- What conditions (see below) govern whether to measure the attribute and how to report the measurements?
Audience focus. Define the measurement entirely in terms that the audience understands. Define the measurement in such a way that the audience can easily understand which measurements relate most directly to their current goals, questions, and challenges.
Rationale. Answering these questions will help you:
- Name and describe the meter.
- Define the units in which the meter reports measurements.
- Identify tags to add to the meter.
- Identify and evaluate potential instrumentation sites.
- Ensure that measurements are accurate and meaningful, rather than merely easy to measure.
Define the Scope of the Measurement
When you define a measurement, clearly identify the key scope or scopes in which the measurement is made. Look for two common kinds of scopes:
- The entity about which the measurement is made.
- The boundaries within which the measurement is made.
The entity. Each measurement is about some entity. Each meter measures some attribute of that entity or on behalf of that entity or in relation to that entity.
Example: Each geode.cache.entries gauge reports the number of entries in a particular region. The measurement is about that region. The gauge includes a region tag that identifies which region it measures.
Boundaries. Each meter measures within one or more boundaries of interest to your audience.
Example: Each geode.cache.entries gauge measures within several boundaries:
- The region that holds the entries counted by the meter.
- The cache server in which the region holds the entries counted by the meter.
- The host on which the server is running.
- The cluster in which the server is a member.
As the example shows, there are several kinds of boundaries to consider:
- The region is an example of an entity boundary.
- The cache server is an example of a process boundary.
- The host is an example of a hardware or virtual machine boundary.
- The cluster is an example of a conceptual or domain boundary.
The example also shows that:
- Boundaries may be nested. A given host may encompass several cache servers.
- Boundaries may overlap. A region holds entries across numerous servers, and a server may hold entries for numerous regions.
- Some boundaries serve to uniquely identify the attribute being measured. The cluster is an essential part of the identity of a region.
Audience focus. Identify scopes of interest to your audience—those scopes that your audience may wish to use to select and sort measurements for display and analysis. Of particular interest are the scopes necessary to uniquely identify the attribute being measured.
Rationale. Defining the scope of the measurement will help you:
- Name and describe the meter.
- Identify tags to add to the meter.
- Identify and evaluate potential instrumentation sites.
Define the Conditions of the Measurement
You may wish to report measurements selectively, either by reporting a measurement only in certain circumstances, or by reporting a given measurement differently in different circumstances.
Key questions:
- Under what conditions do you want to measure the attribute?
- Under what conditions do you want to report the measurement?
- What conditions govern which meter to use to report the measurement?
Deciding whether to measure. You may wish to measure the attribute (or report a measurement) only under certain conditions.
Example: As we initially defined the geode.function.executions timer, we intended to report only executions of user-defined functions, and not functions defined internally by Geode. Though we have not implemented this distinction, it is an example of the kind of distinction we considered.
Choosing among meters. You may wish to create multiple meters for the same attribute, and select among them to record measurements in different circumstances.
Example: Geode defines two geode.cache.gets timers for each region. One timer reports cache hits, and one reports cache misses. Together these two meters report all get operations on the region.
Example: Geode defines two geode.function.executions timers for each function. One timer reports successful executions, and one reports failed executions. Together these two meters report all executions of the function.
Rationale. Defining the selection criteria for the measurement will help you:
- Name and describe the meter.
- Identify tags to add to the meter.
- Identify and evaluate potential instrumentation sites.
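The hit/miss split described above can be sketched in plain Java. This is a simplified model, not Geode's implementation: the class GetTimings and its endGet method are hypothetical names, and LongAdder fields stand in for the two geode.cache.gets timers.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical stand-in for two timers that share a name and differ only in a "result" tag. */
final class GetTimings {
  // One count/duration pair per tag value: result=hit and result=miss.
  private final LongAdder hitCount = new LongAdder();
  private final LongAdder hitNanos = new LongAdder();
  private final LongAdder missCount = new LongAdder();
  private final LongAdder missNanos = new LongAdder();

  /** Reports one completed get; the outcome selects which "timer" records it. */
  void endGet(long durationNanos, boolean hit) {
    if (hit) {
      hitCount.increment();
      hitNanos.add(durationNanos);
    } else {
      missCount.increment();
      missNanos.add(durationNanos);
    }
  }

  long hits() { return hitCount.sum(); }
  long misses() { return missCount.sum(); }

  // Every get lands in exactly one of the two meters, so together they cover all gets.
  long totalGets() { return hits() + misses(); }
  long totalNanos() { return hitNanos.sum() + missNanos.sum(); }
}
```

Because each event is recorded by exactly one of the two meters, the total for the region can always be recovered by adding them.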
Design the Meter
- Select the type of meter to use to record and report measurements
- Name the meter
- Describe the meter
- Identify the unit of measure reported by the meter
- Define tags that identify the scope, circumstances, and other details of the meter's measurements
Select the Type of Meter
Select the general type of meter you want to use to report measurements:
- A gauge reports a quantity that can go up or down. Example: The number of entries in a region.
- A counter reports a quantity that can only go up. Example: The number of gateway events received by a gateway receiver.
- A timer reports the number and durations of completed tasks, operations, and other events. Example: The number and durations of get operations processed by a server.
Select the category of meter that best suits the nature of the measurement.
The Micrometer library defines Java interfaces and classes that represent several variations of these categories. For details, see Instrument the Code, below.
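As a rough illustration of how the three categories behave, the following plain-Java sketch models each one. The classes are hypothetical stand-ins written for this page; they are not Micrometer's Gauge, Counter, or Timer, and they report long values only to keep the sketch short.

```java
import java.util.concurrent.atomic.LongAdder;
import java.util.function.DoubleSupplier;

final class MeterKinds {
  /** Gauge: samples a current value on demand, so it can go up or down. */
  static final class SketchGauge {
    private final DoubleSupplier source;
    SketchGauge(DoubleSupplier source) { this.source = source; }
    double value() { return source.getAsDouble(); }
  }

  /** Counter: accumulates increments and never decreases. */
  static final class SketchCounter {
    private final LongAdder total = new LongAdder();
    void increment(long amount) {
      if (amount < 0) {
        throw new IllegalArgumentException("a counter can only go up");
      }
      total.add(amount);
    }
    long count() { return total.sum(); }
  }

  /** Timer: accumulates both the number of events and their total duration. */
  static final class SketchTimer {
    private final LongAdder count = new LongAdder();
    private final LongAdder totalNanos = new LongAdder();
    void record(long durationNanos) {
      count.increment();
      totalNanos.add(durationNanos);
    }
    long count() { return count.sum(); }
    long totalTimeNanos() { return totalNanos.sum(); }
  }
}
```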
Name the Meter
Identify the attribute. Name each meter in a way that clearly identifies the attribute it measures.
Example: jvm.memory.used identifies that the gauge reports some amount of JVM memory used.
Example: geode.function.executions identifies that the timer reports the number and durations of function executions.
Example: geode.cache.entries identifies that the gauge reports a number of entries.
Consider (with caution) identifying the entity type. Consider including the entity type in the name, though it is often (or usually) better to omit it.
Example: geode.function.executions identifies that the meter reports executions of a function. Executions is the attribute being reported. Function is the type of entity whose executions are being reported.
Before including the entity type in the meter name, consider:
- You will also likely want a tag that identifies the particular entity.
- The tag's key will likely be exactly the same word or words (e.g. region) that you would include in the meter name.
- If that tag makes the scope of the measurement sufficiently clear, then including the entity type in the meter name would be redundant.
Example: We considered (and rejected) geode.cache.region.entries, which would identify that the meter reports not on the cache as a whole, but on a particular region. In the end, we decided that the region tag sufficed to identify the kind of entity whose entry count the meter reports.
Style. After reviewing the naming conventions of meters packaged with Micrometer, we have adopted these style guidelines for naming meters:
- Brevity. Name the meter using as few words as possible without sacrificing clarity.
- Prefix. Start the meter's name with the prefix geode to indicate that the meter reports a Geode-specific attribute.
- Multiple words. Separate words with dots.
- Capitalization. Spell each word using only lower case letters.
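The mechanical parts of these guidelines (prefix, dots, lower case) could be checked with a small helper such as the one below. This is only a sketch: the class MeterNameStyle is hypothetical, and the pattern encodes one reading of the guidelines, namely that each word consists of lower case letters only. It applies to Geode-defined meters, not to meters defined by Micrometer such as jvm.memory.used.

```java
import java.util.regex.Pattern;

/** Hypothetical helper that checks the mechanical naming guidelines for Geode meters. */
final class MeterNameStyle {
  // The "geode" prefix, followed by one or more dot-separated words of lower case letters.
  private static final Pattern STYLE = Pattern.compile("geode(\\.[a-z]+)+");

  static boolean followsStyle(String name) {
    return name != null && STYLE.matcher(name).matches();
  }
}
```

Brevity and clarity cannot be checked this way; they still need a reviewer's judgment.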
Describe the Meter
Concisely describe the meter, including all key details of your definition.
Example (geode.cache.gets): "Total time and count for GET requests from Java or native clients."
Note how this description identifies an important boundary of measurement: It measures only those GET requests from Java clients and native clients. Including such details in your description helps your audience understand what is included in the measurement and what is excluded.
Identify the Unit of Measure
If the unit of measure is not obvious from the meter name, identify the unit of measure.
Define Tags
A tag is a key/value pair that represents some detail about the source or circumstances of a measurement.
General advice:
- Define tags to identify each important scope, and especially those scopes required to uniquely identify the attribute being measured.
- Define tags to identify the circumstances in which this meter is selected to report measurements.
- Spell each word in the tag key using only lower case letters.
- If a tag key has multiple words, separate the words with dots.
- Each combination of tag values results in a separate meter. Therefore:
- Add a tag only if it clearly helps to satisfy your stated purpose for adding the meter.
- Use caution when deciding whether to add a tag that merely describes a scope.
Example: The geode.cache.gets meter has these tags:
- The region tag identifies the entity whose get operations are reported by the meter.
- The result tag describes a circumstance under which this meter is selected to report a measurement: Cache hit or cache miss.
The geode.cache.gets meter also has these pre-defined tags, which Geode automatically adds to every meter:
- The member tag identifies a scope: The member that served the get operations.
- The host tag identifies a scope: The host on which the member is running.
- The cluster tag identifies a scope: The cluster in which the region exists.
- The member_type tag describes a scope by giving additional facts: The type of the member.
Example: The jvm.memory.used meter (defined by Micrometer) has these tags:
- The id tag identifies the entity whose memory is being measured: The specific pool of memory (e.g. PS Eden Space).
- The area tag identifies a scope: The memory area that manages the pool (heap or nonheap).
- Other pre-defined tags added by Geode, which together identify and describe additional scopes.
Pre-defined tags. Geode's metrics framework automatically adds several tags to each meter:
- member: The name of the member in which the meter is registered.
- member_type: The type of member in which the meter is registered.
- host: The name of the host on which the member is running.
- cluster: The ID of the cluster that includes the member.
You do not need to add these tags yourself.
Tag names and values. Micrometer does not allow null tag keys or null tag values. Some meter registry implementations also do not allow empty tag values.
Meter ID = name + tags. A meter is identified not only by its name, but by its name and its tags. Thus each combination of name and tags creates a distinct meter.
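The identity rule can be modeled in a few lines of plain Java. The class SketchMeterId below is a hypothetical illustration, not Micrometer's own ID type: two IDs are equal only when both the name and every tag key/value pair match, and null tag keys or values are rejected.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical model of a meter's identity: its name plus all of its tags. */
final class SketchMeterId {
  private final String name;
  private final Map<String, String> tags = new HashMap<>();

  SketchMeterId(String name, Map<String, String> tags) {
    this.name = Objects.requireNonNull(name, "name");
    // Defensive copy that also enforces "no null tag keys or values".
    for (Map.Entry<String, String> tag : tags.entrySet()) {
      this.tags.put(
          Objects.requireNonNull(tag.getKey(), "tag key"),
          Objects.requireNonNull(tag.getValue(), "tag value"));
    }
  }

  @Override
  public boolean equals(Object other) {
    if (!(other instanceof SketchMeterId)) {
      return false;
    }
    SketchMeterId that = (SketchMeterId) other;
    // Same name with different tags is a different meter.
    return name.equals(that.name) && tags.equals(that.tags);
  }

  @Override
  public int hashCode() {
    return Objects.hash(name, tags);
  }
}
```

Under this model, geode.cache.gets with result=hit and geode.cache.gets with result=miss are two distinct meters, even for the same region.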
Combinations of tag keys. Within a single meter registry, make sure that every meter with a given name has exactly the same set of tag keys:
- If any meter has a foo tag, then every meter with the same name must also have a foo tag.
- If any meter lacks a foo tag, then no meter with the same name may have a foo tag.
This restriction arises from certain meter registry implementations, such as Micrometer's PrometheusMeterRegistry, that users may wish to use to publish Geode's meters to external monitoring systems.
Note that it is specifically the PrometheusMeterRegistry, and not Prometheus itself, that enforces the restriction. Prometheus appears to allow similarly-named meters to have different sets of tag keys. This means it is permissible (by Prometheus, at least) for tag keys to differ between Geode instances.
We have not tested other monitoring systems to verify whether they similarly allow tag keys to differ between Geode instances.
Note also that this restriction applies only to the set of tag keys. Tag values may vary freely from meter to meter.
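One way to catch violations of this rule early, for example in a unit test, is to track the tag keys seen for each meter name. The sketch below is a hypothetical plain-Java helper; it does not reproduce what any particular meter registry does internally.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical checker: all meters with a given name must use the same set of tag keys. */
final class TagKeyConsistency {
  private final Map<String, Set<String>> keysByName = new HashMap<>();

  /**
   * Records the tag keys of a meter that is about to be registered.
   * Returns false if an earlier meter with the same name used a different set of keys.
   */
  boolean accept(String meterName, Set<String> tagKeys) {
    Set<String> keys = new TreeSet<>(tagKeys);
    // The first meter seen for a name fixes the key set for that name.
    Set<String> first = keysByName.putIfAbsent(meterName, keys);
    return first == null || first.equals(keys);
  }
}
```

Only tag keys are compared, so meters that differ in tag values alone are always accepted.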
Evaluate Sources of Information
General advice (details TBD):
- Before looking for instrumentation sites:
- Define the purpose of the meter as well as you can. Without a clear purpose to guide instrumentation, it is distressingly easy to select instrumentation sites that are incomplete, incorrect, or otherwise inappropriate.
- Define the measurement as well as you can. Without a clear definition of the measurement—and especially a clear definition of the scope of the measurement—it is distressingly easy to select instrumentation sites that are incomplete, incorrect, or otherwise inappropriate.
- Identify candidate sources of information.
- To identify candidate sources of information:
- Identify each class that forms part or all of a boundary, entity, or other scope that you identified in your definition of the measurement.
- Identify each class that participates in the kind of event or operation that you want to measure.
- Use extreme caution when considering existing stats classes as candidate sources of information. Existing stats classes:
- Are never primary sources of information.
- Are often surprisingly unreliable sources of information.
- Can be useful starting points for identifying potential sources of information. If a stats class appears to report some or all of the desired measurement:
- Identify the stats class methods that update the stat.
- Identify the Geode code that calls those methods.
- Consider each caller of those methods (and not the stats class itself) a candidate source of information.
- Ask of each candidate source of information:
- Does this source already compute exactly the quantity you want to measure?
- Does this source know all of the information required to make and report a measurement?
- Does this source already apply the desired selection criteria to decide whether and how to report a measurement?
- Does this source observe all of the events you want to measure? If not, you will need to identify sites that observe the remaining events.
- Does this source observe only the events you want to measure? If not, does the site have sufficient information to decide whether it is the kind of event you want to measure?
- The challenge is to find a set of instrumentation sites that, together, observe all and only those events you want to measure, with sufficient information to select whether and how to report the measurement as you have defined it, for the purpose you have described.
Instrument the Code
General advice (details in sections below):
- Select an appropriate Meter implementation
- Place the meter in a stats class
- Manage the meter's lifetime
- Avoid redundant meters
Select a Meter Implementation
Micrometer defines a number of meter types. See the Micrometer documentation for details. Geode adds several custom meter types (noted below) that associate meters with stats.
Choose the appropriate meter implementation depending on:
- The nature of the measured attribute.
- Whether you want to report measurements through associated stats.
Counters. A counter represents a monotonically increasing quantity. Each counter has a count() method that reports its measured value.
- A Counter accumulates the values reported to its increment() methods and stores the accumulated value. Use a Counter when no existing source naturally accumulates or computes the desired count, and you do not wish to report the count via a stat.
- A FunctionCounter retrieves a fresh measurement from a supplier or other object each time its count() method is called. Use a FunctionCounter when an existing non-stat source naturally accumulates or computes the desired count.
- A LegacyStatCounter (a custom Geode meter type) accumulates the values reported to its increment() methods, and forwards each increment to both an associated stat and a registered Counter. Use a LegacyStatCounter when you want to report identical counts through both a stat and a meter.
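The difference between the three counter styles is where the count lives and who updates it. The plain-Java sketch below models that difference only; the class names are hypothetical, and the real Counter, FunctionCounter, and LegacyStatCounter types have richer APIs than shown here.

```java
import java.util.concurrent.atomic.LongAdder;
import java.util.function.LongConsumer;
import java.util.function.LongSupplier;

final class CounterStyles {
  /** Stores its own total (the Counter style). */
  static final class Accumulating {
    private final LongAdder total = new LongAdder();
    void increment(long amount) { total.add(amount); }
    long count() { return total.sum(); }
  }

  /** Stores nothing; asks an existing source each time it is read (the FunctionCounter style). */
  static final class Sampling {
    private final LongSupplier source;
    Sampling(LongSupplier source) { this.source = source; }
    long count() { return source.getAsLong(); }
  }

  /** Forwards each increment to a legacy stat and to its own total (the LegacyStatCounter style). */
  static final class Forwarding {
    private final LongConsumer legacyStat;
    private final Accumulating meter = new Accumulating();
    Forwarding(LongConsumer legacyStat) { this.legacyStat = legacyStat; }
    void increment(long amount) {
      legacyStat.accept(amount); // keeps the stat in step with the meter
      meter.increment(amount);
    }
    long count() { return meter.count(); }
  }
}
```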
Gauges. A gauge represents a quantity that can go up or down. Each gauge has a value() method that reports its measured value.
- A Gauge retrieves a fresh measurement from a supplier or other object each time its value() method is called. Use a Gauge when an existing non-stat source naturally accumulates or computes the desired value.
- A TimeGauge represents an instant or duration, and retrieves a fresh measurement from a supplier or other object each time its value() method is called. Use a TimeGauge when an existing non-stat source naturally accumulates or computes the desired instant or duration.
Timers. A timer represents both the total number of occurrences of some event and the total duration of those events. Each timer has a count() method that reports the number of events and a totalTime() method that reports the total duration of events.
- A Timer accumulates the number and durations of events reported to its record() methods and stores the accumulated count and duration. Use a Timer if no existing source naturally accumulates or computes the desired counts and durations, and you do not wish to report the measurements via stats.
- A FunctionTimer retrieves the relevant fresh measurement from an object each time its count() or totalTime() method is called. Use a FunctionTimer when an existing non-stat source naturally accumulates or stores the desired counts and durations.
- A LegacyStatTimer (a custom Geode meter type) accumulates the counts and durations reported to its record() methods, and forwards the increments both to associated stats and to a registered Timer. Use a LegacyStatTimer when you want to report identical counts and durations through both stats and meters.
Place the Meter in a Stats Class
Encapsulate meters in stats classes. Create and register meters only in stats classes. Interact with meters only in stats classes. Use stats classes to manage the lifetime of meters.
Rationale. Much existing Geode code already uses one or more domain-specific stats classes for instrumentation. Placing meters in existing stats classes avoids complicating the domain code with additional instrumentation noise.
Even if no relevant stats class exists, creating a new stats class to encapsulate meters allows the instrumented code to focus on reporting domain events (e.g. reporting a get operation just finished) rather than on the non-domain details of what and how to measure. And adding a stats class allows instrumenting the code using an already ubiquitous style of instrumentation.
Adding or changing a stats class. It is uncommon for an existing stats method to know exactly the information required for a new meter.
- You may need to add a meter registry parameter to an existing stats class's constructor, so that the class can register the new meters.
- You may need to add new instrumentation methods to an existing stats class. If so, name the method to express the domain event that it reports (e.g. endGet()) and not the details of what it measures (e.g. incGetCount()).
- You may need to add new parameters to an existing stats class's instrumentation methods.
- You may need to create a new stats class to encapsulate the new meters.
Obtain the Meter Registry
During cache creation, Geode automatically creates and configures its meter registry. The registry is managed by a "metrics service" owned by the InternalDistributedSystem. You can obtain the meter registry through the InternalDistributedSystem or, for convenience, from the InternalCache:
MeterRegistry meterRegistry = internalDistributedSystem.getMeterRegistry();
MeterRegistry meterRegistry = internalCache.getMeterRegistry();
The code you are instrumenting, or the stats class in which you are adding the meters, may also offer access to Geode's meter registry.
Add the Meter to Geode's Meter Registry
Each meter type includes a builder that you can use to progressively define a meter, then register the defined meter with the meter registry.
Timer example:
Timer cacheGetsHitTimer = Timer.builder("geode.cache.gets")
    .description("Total time and count for GET requests from Java or native clients.")
    .tag("region", region.getName())
    .tag("result", "hit")
    .register(meterRegistry);
Gauge example:
Gauge entriesGauge = Gauge.builder("geode.cache.entries", region::getLocalSize)
    .description("Current number of entries in the region.")
    .tag("region", region.getName())
    .tag("data.policy", region.getDataPolicy().toString())
    .baseUnit("entries")
    .register(meterRegistry);
Note that when you build a Gauge, you must tell it how to make a measurement. In this example, line 1 configures the gauge to use a Supplier<Number> (defined by the region::getLocalSize method reference) to measure the entry count.
An alternate builder() method takes a T object and a ToDoubleFunction<T>, and creates its own measurement supplier that applies the given function to the given object.
FunctionCounter and FunctionTimer are configured similarly. You must tell them how to make their measurements.
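The "object plus function" form of measurement can be pictured as follows. This is a conceptual plain-Java sketch with a hypothetical helper name (GaugeSources.measuring); it shows how an object and a ToDoubleFunction combine into something that measures on demand, and it makes no claim about how Micrometer holds or references the object internally.

```java
import java.util.function.DoubleSupplier;
import java.util.function.ToDoubleFunction;

/** Hypothetical helper illustrating the "object plus function" style of measurement. */
final class GaugeSources {
  /** Combines a source object and a function of it into a supplier that measures on demand. */
  static <T> DoubleSupplier measuring(T source, ToDoubleFunction<T> measurement) {
    // Nothing is measured here; the function runs each time the supplier is asked for a value.
    return () -> measurement.applyAsDouble(source);
  }
}
```

Each call to the resulting supplier produces a fresh value, which is why function-style meters suit sources that already maintain the quantity being measured.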
LegacyStatCounter example:
Counter eventsReceivedCounter = LegacyStatCounter.builder("geode.gateway.receiver.events")
    .longStatistic(stats, eventsReceivedId)
    .description("total number events across the batched received by this GatewayReceiver")
    .baseUnit("operations")
    .register(meterRegistry);
Note that line 2 links the LegacyStatCounter to a specific statistic (eventsReceivedId) in a specific Statistics instance (stats).
The LegacyStatCounter builder also has a doubleStatistic method that links the counter to a double stat.
LegacyStatTimer is configured similarly, using builder methods that allow you to forward its count and duration increments to associated long or double stats.
Manage the Meter's Lifetime
Give each meter the same lifetime as the entity whose attributes it measures:
- Edit or add the stats class's close() method to remove all of its meters from the registry and close them.
- Ensure the owner of the stats calls stats.close() when the relevant entity is destroyed, removed, or closed.
Example: Each geode.cache.entries meter reports the number of entries in a given region. Each region's geode.cache.entries meter should be registered when the region is created and removed from the registry and closed when the region is destroyed.
Rationale: Each meter consumes memory. Publishing each meter consumes CPU cycles. In long-running systems, where the measured objects come and go, leftover meters can accumulate, consuming an increasing amount of memory and CPU time.
Geode's use of Micrometer allows the user to publish the measurements to external monitoring systems for long-term storage. As a result, it is unnecessary for Geode to retain meters that measure objects that no longer exist.
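The lifetime rule can be sketched with a stats class that registers its meter on construction and removes it in close(). To keep the example free of dependencies, a plain Map stands in for the meter registry; the class RegionStatsSketch and the key format are hypothetical. In real code the registry would be Geode's meter registry, and close() would remove and close each meter as described above.

```java
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical stats class whose meter lives exactly as long as its region. */
final class RegionStatsSketch implements AutoCloseable {
  private final Map<String, LongSupplier> registry; // stand-in for a meter registry
  private final String meterKey;

  RegionStatsSketch(Map<String, LongSupplier> registry, String regionName,
      LongSupplier entryCount) {
    this.registry = registry;
    this.meterKey = "geode.cache.entries{region=" + regionName + "}";
    // Register when the region (and therefore its stats object) is created.
    registry.put(meterKey, entryCount);
  }

  /** The region's owner calls this when the region is destroyed, removed, or closed. */
  @Override
  public void close() {
    // Remove the meter so that it does not outlive the region it measures.
    registry.remove(meterKey);
  }
}
```

Tying removal to close() means the registry never holds a meter for a region that no longer exists, provided the owner reliably calls close().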
Avoid Redundant Meters
Do not create meters whose values can be derived from other meters. For example:
- Do not report rates. Example: Do not report operations per second.
- Do not report ratios. Example: Do not report hit percentage.
- Do not report aggregates. Example: Do not sum region entry counts to compute the entry count for the cache as a whole.
Rationale. External monitoring systems can compute these derived values from the series of measurements over time.
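To make the rationale concrete, the sketch below shows the kind of arithmetic a monitoring system can apply to published measurements. The class and method names are hypothetical, and real monitoring systems use their own query languages rather than Java; the point is only that rates, ratios, and aggregates need nothing beyond the underlying counts.

```java
/** Hypothetical helper showing that derived values follow from the underlying meters. */
final class DerivedValues {
  /** Rate: the change in a counter between two samples, divided by the elapsed seconds. */
  static double ratePerSecond(long earlierCount, long laterCount, double elapsedSeconds) {
    return (laterCount - earlierCount) / elapsedSeconds;
  }

  /** Ratio: hits as a fraction of all gets, computed from the hit and miss counts. */
  static double hitRatio(long hits, long misses) {
    long total = hits + misses;
    return total == 0 ? 0.0 : (double) hits / total;
  }

  /** Aggregate: a cache-wide entry count as the sum of per-region entry counts. */
  static long totalEntries(long... regionEntryCounts) {
    long sum = 0;
    for (long count : regionEntryCounts) {
      sum += count;
    }
    return sum;
  }
}
```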