Status
Current state: "Under Discussion"
Discussion thread: here
JIRA:
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
When running a Trogdor cluster, it is useful to get information about the health of the Trogdor cluster itself.
Currently, a user must query Trogdor’s REST API to get any information about a Trogdor cluster. This places a significant burden on the user and limits how much information about the cluster's health is readily available. Adding metrics would make it much easier to monitor agents and tasks in Trogdor clusters.
Public Interfaces
We define a new trogdor-metrics group that captures the metrics as defined below.
Metric/Attribute Name | Description |
---|---|
active-agents-count | The total number of active agents in the Trogdor cluster |
created-task-count | The total number of created tasks in the Trogdor cluster |
running-task-count | The total number of running tasks in the Trogdor cluster |
done-task-count | The total number of done tasks in the Trogdor cluster |
Each metric listed above is a cumulative count of the tasks or agents that have entered the corresponding state. Because the counts are cumulative, created-task-count = running-task-count = done-task-count is expected once a Trogdor cluster has finished all tasks.
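The cumulative behaviour can be illustrated with a minimal sketch. The class `CumulativeTaskCounters` and its method names are hypothetical and are not part of this proposal. The sketch assumes each counter is incremented once when a task enters the corresponding state and is never decremented.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration only; not the TrogdorMetrics class proposed in this KIP.
final class CumulativeTaskCounters {
    private final AtomicLong created = new AtomicLong();
    private final AtomicLong running = new AtomicLong();
    private final AtomicLong done = new AtomicLong();

    // Assumption: each method is called once per task when it enters that state.
    void taskCreated() { created.incrementAndGet(); }
    void taskRunning() { running.incrementAndGet(); }
    void taskDone() { done.incrementAndGet(); }

    long createdTaskCount() { return created.get(); }
    long runningTaskCount() { return running.get(); }
    long doneTaskCount() { return done.get(); }

    public static void main(String[] args) {
        CumulativeTaskCounters counters = new CumulativeTaskCounters();
        // Three tasks pass through every state in order.
        for (int i = 0; i < 3; i++) {
            counters.taskCreated();
        }
        for (int i = 0; i < 3; i++) {
            counters.taskRunning();
        }
        for (int i = 0; i < 3; i++) {
            counters.taskDone();
        }
        // Counters are never decremented, so all three are equal at the end.
        System.out.println("created-task-count=" + counters.createdTaskCount());
        System.out.println("running-task-count=" + counters.runningTaskCount());
        System.out.println("done-task-count=" + counters.doneTaskCount());
    }
}
```

Under these assumptions, all three counters read 3 after the three tasks finish, which matches the equality stated above.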
Proposed Changes
We propose adding a TrogdorMetrics class to Trogdor that exposes the aforementioned metrics. Since Trogdor agents and tasks share a common Platform class, a TrogdorContainer class will be created inside the Platform class to allow for the creation of a shared TrogdorMetrics instance between the Agent and Coordinator classes.
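One way to picture the sharing arrangement is the sketch below. Every name in it (`SharedMetricsSketch`, `Metrics`, `Platform`, `Container`, `Agent`, `Coordinator`, and their methods) is a hypothetical stand-in chosen for illustration; none of it is Trogdor's real API or the final design. It only shows the idea of a container, owned by the platform, handing the same metrics instance to both sides.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-ins only; these are not Trogdor's real classes.
final class SharedMetricsSketch {

    static final class Metrics {
        private final AtomicLong activeAgents = new AtomicLong();
        private final AtomicLong createdTasks = new AtomicLong();

        void agentStarted() { activeAgents.incrementAndGet(); }
        void taskCreated() { createdTasks.incrementAndGet(); }
        long activeAgentsCount() { return activeAgents.get(); }
        long createdTaskCount() { return createdTasks.get(); }
    }

    static final class Platform {
        // Container nested in the platform; it owns the single metrics instance.
        static final class Container {
            private final Metrics metrics = new Metrics();
            Metrics metrics() { return metrics; }
        }

        private final Container container = new Container();
        Container container() { return container; }
    }

    static final class Agent {
        private final Metrics metrics;

        Agent(Platform platform) {
            // The agent picks up the shared instance from the platform.
            this.metrics = platform.container().metrics();
            this.metrics.agentStarted();
        }
    }

    static final class Coordinator {
        private final Metrics metrics;

        Coordinator(Platform platform) {
            // The coordinator receives the very same instance.
            this.metrics = platform.container().metrics();
        }

        void createTask() { metrics.taskCreated(); }
    }

    public static void main(String[] args) {
        Platform platform = new Platform();
        new Agent(platform);
        Coordinator coordinator = new Coordinator(platform);
        coordinator.createTask();

        Metrics shared = platform.container().metrics();
        System.out.println("active-agents-count=" + shared.activeAgentsCount());
        System.out.println("created-task-count=" + shared.createdTaskCount());
    }
}
```

Because both constructors read the instance from the same container, updates made by the agent and by the coordinator land in one set of counters.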
Compatibility, Deprecation, and Migration Plan
There should be no impact on compatibility, deprecation, or migration since this KIP simply adds some metrics to Trogdor.
Rejected Alternatives
Since a task can technically be in a STOPPING state in addition to PENDING, RUNNING, and DONE, one alternative is to expose a separate metric for each of these states.
However, because the counts are cumulative, the number of tasks currently in each state follows from simple subtraction: the number of pending tasks is created-task-count minus running-task-count, and the number of running tasks is running-task-count minus done-task-count. The number of done tasks is done-task-count itself, with no arithmetic necessary. This allows fewer metrics to be tracked. STOPPING is a transient state and adds little to the metrics, so tracking only PENDING, RUNNING, and DONE tasks was deemed sufficient.
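The subtraction can be written out as a small sketch. The class `DerivedTaskCounts` and its methods are hypothetical helpers, and the example values are made up. It assumes the cumulative reading from Public Interfaces, with every task passing through the created, running, and done states in that order.

```java
// Hypothetical helper for illustration; not part of the proposed metrics.
final class DerivedTaskCounts {

    // Tasks that were created but have not started running yet.
    static long pending(long createdTaskCount, long runningTaskCount) {
        return createdTaskCount - runningTaskCount;
    }

    // Tasks that started running but have not finished yet.
    static long running(long runningTaskCount, long doneTaskCount) {
        return runningTaskCount - doneTaskCount;
    }

    public static void main(String[] args) {
        // Made-up cumulative readings: 10 created, 7 started, 4 finished.
        long created = 10;
        long started = 7;
        long done = 4;
        System.out.println("pending now: " + pending(created, started)); // prints "pending now: 3"
        System.out.println("running now: " + running(started, done));    // prints "running now: 3"
        System.out.println("done: " + done);                             // prints "done: 4"
    }
}
```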