...
LEVEL 0 | LEVEL 1 | LEVEL 2 | LEVEL 3 | LEVEL 3 | LEVEL 3 | |||||
Per-Client | Per-Thread | Per-Task | Per-Processor-Node | Per-State-Store | Per-Cache | |||||
---|---|---|---|---|---|---|---|---|---|---|
TAGS | type=stream-metrics,client-id=[client-id] | type=stream-thread-metrics,thread-name=[threadId] (! tag name changed) | type=stream-task-metrics,thread-name=[threadId],task-id=[taskId] (! tag name changed) | type=stream-processor-node-metrics,thread-name=[threadId],task-id=[taskId],processor-node-id=[processorNodeId] (! tag name changed) | type=stream-state-metrics,thread-name=[threadId],task-id=[taskId],[storeType]-state-id=[storeName] (! tag name changed) | type=stream-record-cache-metrics,thread-name=[threadId],task-id=[taskId],record-cache-id=[storeName] (! tag name changed) | ||||
version | commit-id (static gauge) | INFO ($) | |||||||||
application-id (static gauge) | INFO ($) | |||||||||
topology-description (static gauge) | INFO ($) | |||||||||
state (dynamic gauge) | INFO ($) | |||||||||
process-latency (avg | max) | INFO | DEBUG | (! removed for now) | |||||||
process (rate | total) | INFO | DEBUG ( → ) on source-nodes only | DEBUG | |||||||
punctuate-latency (avg | max) | INFO | DEBUG | ||||||||
punctuate (rate | total) | INFO | DEBUG | ||||||||
commit-latency (avg | max) | INFO | DEBUG | ||||||||
commit (rate | total) | INFO | DEBUG | ||||||||
poll-latency (avg | max) | INFO | |||||||||
poll (rate | total) | INFO | |||||||||
task-created | closed (rate | total) | INFO | |||||||||
active-task-process (ratio) | INFO ($) | |||||||||
standby-task-process (ratio) | INFO ($) | |||||||||
dropped-records (rate | total) | INFO * (→) | DEBUG * (a subset of processor only) (! name changed) | $) | |||||||
enforced-process | skipped-records (rate | total) | DEBUG (! | moved to lower level)INFO * | enforced-processing (rate | total) | DEBUGrenamed) | |||||
record-lateness (avg | max) | DEBUG | |||||||||
suppression-emit (rate | total) | DEBUG * (suppress processor only) | |||||||||
suppression-buffer-size (avg | max) | DEBUG * (suppression buffer only) | |||||||||
suppression-buffer-count (avg | max) | DEBUG * (suppression buffer only) | |||||||||
expired-window-record-drop (rate | total) | DEBUG * (window store only) | |||||||||
put | put-if-absent .. | get-latency (avg | max) | DEBUG * (excluding suppression buffer) (! name changed) | |||||||||
put | put-if-absent .. | get (rate) | DEBUG * (excluding suppression buffer) (! name changed) | |||||||||
hit-ratio (avg | min | max) | DEBUG (! name changed) |
...
- We will remove most of the parent sensors with `level-tag=all`, with two exceptions. The main idea is to let users do the roll-ups themselves only when necessary, so that we avoid unnecessary metric value aggregations. For the two exceptional cases, the parent-child sensor relationship is maintained because it is a bit tricky for users to do the roll-up correctly.
- We will keep all LEVEL-0 (instance) and LEVEL-1 (thread) sensors at the INFO reporting level, and most of the lower-level sensors at DEBUG. The only exceptions are active/standby-task-process and dropped/skipped-records:
- active/standby-task-process indicate the percentage of time that the current hosting thread spends processing those tasks.
- dropped/skipped records indicate unexpected errors during processing and hence deserve users' attention. Their semantics differ slightly, though: skipped records are skipped at the very beginning of processing and hence never traverse the topology at all; dropped records are dropped in the middle of the topology and do not necessarily map 1-1 to source records, since one source record may be transformed into multiple intermediate records that are then dropped later.
- For some metrics that are only useful for a specific type of entity, like "expired-window-record-drop" and "suppression-emit", we will create the sensors lazily, in order to save metrics reporters the unnecessary cost of iterating over empty sensors.
- Some of the lower-level metrics, like "forward-rate" and "destroy-rate", are removed directly since they overlap with other existing metrics. Here is a list of removed / replaced sensors:
Code Block |
---|
late-records-drop: INFO at processor-node level, replaced by INFO task-level "dropped-records". |
skipped-records: INFO at thread and processor-node level, replaced by INFO task-level "dropped-records". |
expired-window-record-drop: DEBUG at state-store level, replaced by INFO task-level "dropped-records". |
forward-rate: DEBUG at processor-node level, replaced by DEBUG processor-node level "process-rate". |
destroy-rate: DEBUG at task level, covered by INFO thread-level "task-closed-rate". |
create-rate: DEBUG at task level, covered by INFO thread-level "task-created-rate". |
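
Since most parent sensors go away, a user who still wants a coarser view has to do the roll-up on their side. The following is a minimal sketch of one such roll-up, assuming the task-level tag layout from the table above: it sums "dropped-records-total" readings per `thread-name` tag. `MetricSample` and `MetricRollUp` are hypothetical helper classes written for this illustration, not Kafka classes; in a real application the samples might be populated from `KafkaStreams#metrics()`. Summation is only appropriate for totals; avg and max metrics need a different aggregation, which is part of why rolling up correctly can be tricky.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical snapshot of one metric reading (illustration only, not a Kafka class). */
final class MetricSample {
    final String name;
    final Map<String, String> tags;
    final double value;

    MetricSample(String name, Map<String, String> tags, double value) {
        this.name = name;
        this.tags = tags;
        this.value = value;
    }
}

/** Hypothetical user-side roll-up helper (illustration only). */
final class MetricRollUp {

    /** Sums all samples named metricName, grouped by the value of the tag groupByTag. */
    static Map<String, Double> sumByTag(List<MetricSample> samples,
                                        String metricName,
                                        String groupByTag) {
        Map<String, Double> totals = new HashMap<>();
        for (MetricSample sample : samples) {
            String key = sample.tags.get(groupByTag);
            // Skip other metrics and samples that do not carry the grouping tag.
            if (sample.name.equals(metricName) && key != null) {
                totals.merge(key, sample.value, Double::sum);
            }
        }
        return totals;
    }

    /** Builds task-level tags following the tag layout shown in the table above. */
    static Map<String, String> taskTags(String threadName, String taskId) {
        Map<String, String> tags = new HashMap<>();
        tags.put("type", "stream-task-metrics");
        tags.put("thread-name", threadName);
        tags.put("task-id", taskId);
        return tags;
    }
}
```

Calling `MetricRollUp.sumByTag(samples, "dropped-records-total", "thread-name")` on a list of task-level samples returns one summed value per thread; grouping by `task-id` instead would leave the task-level values unchanged.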
Proposed Changes
As above.
...