...

skipped-records  moved to lower level)

LEVEL 0: Per-Client
TAGS: type=stream-metrics,client-id=[client-id]

LEVEL 1: Per-Thread
TAGS: type=stream-thread-metrics,thread-name=[threadId] (! tag name changed)

LEVEL 2: Per-Task
TAGS: type=stream-task-metrics,thread-name=[threadId],task-id=[taskId] (! tag name changed)

LEVEL 3: Per-Processor-Node
TAGS: type=stream-processor-node-metrics,thread-name=[threadId],task-id=[taskId],processor-node-id=[processorNodeId] (! tag name changed)

LEVEL 3: Per-State-Store
TAGS: type=stream-state-metrics,thread-name=[threadId],task-id=[taskId],[storeType]-state-id=[storeName] (! tag name changed)

LEVEL 3: Per-Cache
TAGS: type=stream-record-cache-metrics,thread-name=[threadId],task-id=[taskId],record-cache-id=[storeName] (! tag name changed)

version | commit-id (static gauge)
INFO ($)




application-id (static gauge)
INFO ($)




topology-description (static gauge)
INFO ($)




state (dynamic gauge)
INFO ($)




process-latency (avg | max)

INFO

DEBUG (! removed for now)

process (rate | total)

INFO

DEBUG ( → ) on source-nodes only

DEBUG

punctuate-latency (avg | max)

INFO

DEBUG


punctuate (rate | total)

INFO

DEBUG


commit-latency (avg | max)

INFO

DEBUG


commit (rate | total)

INFO

DEBUG


poll-latency (avg | max)

INFO



poll (rate | total)

INFO



task-created | closed (rate | total)

INFO



active-task-process (ratio)


INFO ($)


standby-task-process (ratio)


INFO ($)


dropped-records (rate | total)


INFO * (→)

DEBUG * (a subset of processor only)

                 (! name changed)

$)




enforced-process
 (rate | total)


DEBUG (! INFO *
enforced-processing (rate | total)
DEBUGrenamed)


record-lateness (avg | max)


DEBUG


suppression-emit (rate | total)



DEBUG * (suppress processor only)

suppression-buffer-size (avg | max)




DEBUG * (suppression buffer only)
suppression-buffer-count (avg | max)




DEBUG * (suppression buffer only)
expired-window-record-drop (rate | total)




DEBUG * (window store only)
put | put-if-absent .. | get-latency (avg | max)




DEBUG * (excluding suppression buffer)

                 (! name changed)


put | put-if-absent .. | get (rate)




DEBUG * (excluding suppression buffer)

                 (! name changed)


hit-ratio (avg | min | max)





DEBUG  (! name changed)
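
To make the TAGS entries at the top of the table more concrete, here is a minimal sketch of how a task-level tag set could be turned into a JMX object name and matched with a pattern. It assumes `kafka.streams` as the JMX domain, and the thread and task IDs are made up for illustration; the class `TaskMetricName` and its helper are hypothetical and use only `javax.management.ObjectName` from the JDK, not any Kafka API.

```java
import java.util.Hashtable;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

class TaskMetricName {
    // Assumption: "kafka.streams" is the JMX domain; tag keys follow the task-level TAGS entry.
    static ObjectName taskLevelName(String threadId, String taskId)
            throws MalformedObjectNameException {
        Hashtable<String, String> tags = new Hashtable<>();
        tags.put("type", "stream-task-metrics");
        tags.put("thread-name", threadId);
        tags.put("task-id", taskId);
        return new ObjectName("kafka.streams", tags);
    }

    public static void main(String[] args) throws Exception {
        // Illustrative IDs only.
        ObjectName name = taskLevelName("app-StreamThread-1", "0_1");

        // A property-list pattern that selects every task-level metric group.
        ObjectName allTasks = new ObjectName("kafka.streams:type=stream-task-metrics,*");

        System.out.println(name.getKeyProperty("task-id"));
        System.out.println(allTasks.apply(name));
    }
}
```

A monitoring tool could use the same kind of pattern to select one level of the hierarchy at a time, since each level carries its own `type` tag.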

...

  1. We will remove most of the parent sensors with `level-tag=all`, with two exceptions. The main idea is to let users do the roll-ups themselves, and only when necessary, so that we can avoid unnecessary metric value aggregations. For the two exceptional cases, the parent-child sensor relationship is maintained because it is a bit tricky for users to do the roll-up correctly.
  2. We will keep all LEVEL-0 (instance) and LEVEL-1 (thread) sensors at the INFO reporting level, and most of the lower-level sensors at DEBUG. The only exceptions are active/standby-task-process and dropped/skipped-records:
    1. active/standby-task-process indicates the percentage of time that the current hosting thread spends processing those tasks.
    2. dropped/skipped records indicate unexpected errors during processing and hence need users' attention. Their semantics differ slightly, though: skipped records are those skipped at the very beginning of processing, so they never traverse the topology at all; dropped records are those dropped in the middle of the topology, and they do not necessarily map 1-1 to source records, since one source record may be transformed into multiple intermediate records that are dropped later.
  3. For some metrics that are only useful for a specific type of entity, like "expired-window-record-drop" or "suppression-emit", we will only create the sensors lazily, in order to save metrics reporters the unnecessary cost of iterating over empty sensors.
  4. Some of the lower-level metrics, like "forward-rate" and "destroy-rate", are removed directly since they overlap with other existing metrics. Here is a list of removed / replaced sensors:

late-records-drop: INFO at processor node level, replaced by INFO task-level "dropped-records".

skipped-records: INFO at thread and processor node level, replaced by INFO task-level "dropped-records".

expired-window-record-drop: DEBUG at state store level, replaced by INFO task-level "dropped-records".

forward-rate: DEBUG at processor-node level, replaced by DEBUG processor node level "process-rate".

destroy-rate: DEBUG at task-level, covered by INFO thread-level "task-closed-rate".

create-rate: DEBUG at task-level, covered by INFO thread-level "task-created-rate".
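
Since most parent sensors are removed and roll-ups are left to users, the sketch below shows one way a user might sum a task-level total, such as "dropped-records", into a per-thread value. The `Sample` class and its fields are hypothetical stand-ins for whatever a metrics reporter delivers; nothing here is a Kafka Streams API, and the numbers are made up.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class DroppedRecordsRollup {
    // Hypothetical holder for one task-level reading; not part of any Kafka API.
    static final class Sample {
        final String threadName;
        final String taskId;
        final double total;

        Sample(String threadName, String taskId, double total) {
            this.threadName = threadName;
            this.taskId = taskId;
            this.total = total;
        }
    }

    // Sums task-level totals per thread, i.e. the aggregation a parent sensor would otherwise do.
    static Map<String, Double> rollUpByThread(List<Sample> samples) {
        Map<String, Double> perThread = new TreeMap<>();
        for (Sample s : samples) {
            perThread.merge(s.threadName, s.total, Double::sum);
        }
        return perThread;
    }

    public static void main(String[] args) {
        // Illustrative values only.
        List<Sample> samples = List.of(
            new Sample("thread-1", "0_0", 3.0),
            new Sample("thread-1", "0_1", 2.0),
            new Sample("thread-2", "1_0", 4.0));
        System.out.println(rollUpByThread(samples));
    }
}
```

Plain summation is appropriate for totals. Other statistics need different handling: a rolled-up max is the max of the child values, and a rolled-up average needs the per-child record counts as weights, which is part of why roll-ups can be tricky to get right.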



Proposed Changes

As above.

...