Logs are one of the most important parts of any running application. The operations team uses them to make sure the application runs smoothly. Developers use the log for troubleshooting.

Below are rules aimed to improve logging in Ignite 3 through log statement unifying and providing a guidelines on how to use the logging subsystem.

Static vs Non-Static Logger

Instantiation of a new Logger is depended upon particular logging backend and potentially may be a fairly expensive operation. This is why a static logger should be preferred over a non-static one for server side usage. On the other hand, client-side logging should not follow this rule as particular backend may be provided on per-client basis, hence all loggers should be instantiated with provided LoggerFactory.

Message Format

As more deployments move to public clouds, there is growing demand for a log management and analytics solution to handle any server logs. This is why it is important not only to make a log statement human readable, but also a machine readable. In addition, a unified log format eases the log analyzing and for the engineer too. Here is a log format to use within the Ignite-3:

<Short description of the event> [<key value pairs describing context of the event>]

, for example:

LOG.info("Start new raft node [group={}, peers={}]", groupId, peers)
LOG.debug("Replicator added [node={}, term={}, peer={}]", getNodeId(), this.currTerm, peer);

Besides the format, it's important to keep messages short and simple yet sufficient to understand what is happened:

log.info("New leader was elected for the raft group and pending reconfiguration was discovered [group={}, partition={}, table={}, peers={}]", ...); // this one too cumbersome
log.info("New leader elected. Going to reconfigure peers [group={}, partition={}, table={}, peers={}]", ...); // short and simple yet provides the same information as the first one

Message Templates

There is support for string substitution within log message with backend-independent anchors '{}'. Message templating should be preferred to string concatenation, because such approach looks a way cleaner while reading the code. Please take a look at example below:

log.debug("Move pages from heap to PageMemory [list=" + name() + ", bucket=" + bucket
    + ", pages=" + pages + ']');
                   VS
log.debug("Move pages from heap to PageMemory [list={}, bucket={}, pages={}]", name(), bucket, pages);

Context Matters

Please take a look at the example below:

Restore partitions state

This line provides neither information about table nor result of restoration.

It could be improved by providing a context as well as result of the operation:

Partitions state restored [tableName=<tName>, state=<restored state or null if there is no state to restore>]

Choosing the Log Level

While the right choice depends on many factors, here are some advices on what each level serves for:

ERROR

Operation has failed because of reasons not related to the user. Normal cluster functioning is in question (there could be auto recovery though). Attention of the maintenance team is required immediately.

Rule of thumb: ERROR is when you need a duty engineer to wake up in the middle of the night. Remember that sometimes it’s you. ERROR is always actionable - the user must understand how to react.

Examples:

  • Unable to recover from persisted state 
  • Found B-Tree corruption


NB: usually you don't need to log exception with ERROR or WARN level, if this exception would be re thrown up or returned as a result, because it's up to caller to decide whether try to recover or just to crash.

WARN

The cluster is working fine, but we are close to some kind of error. The attention of the maintenance team is required, but not immediately.

Rule of thumb: WARN is what the team will need to review at the start of the next business day, at a high priority. WARN is always actionable - the user must understand how to react.

Examples:

  • Resource consumption is above the threshold
  • Ignoring some configuration properties with reason disclosed
  • Unable to send a message to a node because of network problem (recovery is possible here)

INFO

A level for regular non frequent event. Should be used to log out any change of cluster state. Should be avoided for frequent operations like get or put.

Rule of thumb: INFO is good for anything that’s infrequent and will give info to support or development teams when troubleshooting. If an event happens no more frequently than every few minutes (such as page memory checkpoint), it probably should be logged at INFO.

Examples:

  • Remote node joining/leaving the cluster
  • Cluster activation/deactivation
  • Recovery of a component from persisted state

DEBUG

Additional information that could help to debug a problem. Could be used to log additional context to the INFO level or events that won't be needed in a normal situation.

Examples:

  • Additional context to the operation logged on INFO level
  • Attempt of retrying some operation
  • Errors occurring while doing some processing, before falling back to another approach

TRACE

Rule of thumb: TRACE is similar to DEBUG but used for stuff that can be really frequent (like a table insert). Usually you don't want this level until you are 100% sure you need this.

  • No labels