ID: IEP-84
Author:
Sponsor:
Created:
Status: ACTIVE

Motivation

It is important to have a uniform way to provide exceptions or error messages to end users. The goal of this design is to give a draft of the exception classes to be used in the public API, as well as the main expectations for their usage.

Requirements

  1. A public exception must have an error code. Each error code will be documented. This makes it easier to guide a user through troubleshooting.
  2. An internal exception can have an error code.
  3. Nevertheless, it is recommended to use standard Java exceptions where applicable (for public and internal APIs).
  4. In general, unchecked exceptions are preferred in the public API. Checked exceptions are allowed in the public API where the API must force a user to handle them, e.g. for retries. For internal APIs, the choice between checked and unchecked exceptions is left to developers.
  5. An error code consists of two parts:
    1. An error group - a text identifier that is unique and specific for some module (vendor, functional domain).
    2. An error identifier - an integer numeric identifier that is unique for a particular error group.
  6. An error code implementation must be extensible by modules, vendors, etc., and must not require modification of core modules in order to introduce a new error code.
  7. An exception must also provide a message which specifies the context and conditions under which the error occurred. 
  8. An exception should provide additional information about the error as an exception’s cause.
  9. Under normal conditions, we should avoid transferring stack traces over the network. But it must be possible to turn on some kind of debug mode that transfers stack traces from node to node in order to simplify development and debugging.
  10. An important concept is error traceability. It must be possible to track an error across the cluster. This can be achieved by introducing a unique error ID which is passed from one exception to another and is also printed in the log. Such an approach simplifies troubleshooting and log analysis.
  11. Since some programming languages do not support exceptions, it is the client/extension developers’ responsibility to translate Java exceptions from the public API to the language-specific error handling system.

Description

Error groups and error codes

The first proposed abstraction is the concept of error groups. It is similar to what was called an ErrorScope in the devlist discussion.

The main idea is that all errors in Ignite will be grouped. The way to identify an error type is to have a pair - (group; code), where code represents an integer number, unique within a group.

For example, (TABLE, 1) and (SQL, 1) are both valid errors despite the apparent collision.

Each group defines a collection of errors that belong to a single component/module/vendor. For example, RAFT, TABLE or SQL. It is convenient to have an integer code for groups as well, so that users won’t have to compare strings in their code. Hence each group must be identified by a unique name and code. See class ErrorGroup for the reference. Uniqueness must be guaranteed by a newGroup method, which is a single point to create new error groups.

So, each error type can be represented in two different ways - as a human-readable string and as an integer number. The first is used exclusively in text (logs and error messages), the second exclusively in code:

  • The name should be formatted as IGN-XXX-nnn, where XXX is a group name and nnn is a unique error code within the group. The additional IGN prefix makes the error easier to search for online.
  • The numeric code should be calculated using the following code fragment:
    (groupCode << 16) | (0xFFFF & errorCode).
    This restricts both group codes and individual error codes to 16 bits each, which is still more than enough.

So, the numeric error code includes both the group code and an internal unique code. These codes should be stored in constants and be documented. Please refer to the code examples for specifics: ErrorGroup, RaftErrors
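The packing described above can be illustrated with a minimal sketch. The class name ErrorCodes and its helper methods are assumptions made for this illustration only; they are not part of the proposed API. The group code 10 and the name RFT are borrowed from the RaftErrors draft.

```java
/** Illustrative helper (hypothetical name) for the 16-bit group / 16-bit error packing. */
final class ErrorCodes {
    private ErrorCodes() {
    }

    /** Packs a group code and an in-group error code into a single int. */
    static int compose(int groupCode, int errorCode) {
        if (groupCode < 0 || groupCode > 0xFFFF || errorCode < 0 || errorCode > 0xFFFF) {
            throw new IllegalArgumentException("Both codes must fit in 16 bits");
        }
        return (groupCode << 16) | (0xFFFF & errorCode);
    }

    /** Extracts the group code (upper 16 bits). */
    static int groupCode(int fullCode) {
        return fullCode >>> 16;
    }

    /** Extracts the in-group error code (lower 16 bits). */
    static int errorCode(int fullCode) {
        return fullCode & 0xFFFF;
    }

    /** Builds the textual form IGN-XXX-nnn. */
    static String name(String groupName, int fullCode) {
        return "IGN-" + groupName + "-" + errorCode(fullCode);
    }

    public static void main(String[] args) {
        int code = compose(10, 2);
        System.out.println(code);              // 655362
        System.out.println(groupCode(code));   // 10
        System.out.println(errorCode(code));   // 2
        System.out.println(name("RFT", code)); // IGN-RFT-2
    }
}
```

The unsigned shift (>>>) is used when decoding so that group codes with the highest bit set are still recovered correctly.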

Exceptions tracing

Transferring a stack trace to thin clients or other server nodes is not always necessary. It pollutes logs and puts extra pressure on the network, among other possible drawbacks.

So, stack trace transfer must be optional. It should be disabled by default and enabled only explicitly by the user through a debug mode, which could be a boolean cluster-wide setting.

When stack traces are not transferred, it’s still important to be able to locate a specific exception instance in server logs. For this purpose we should introduce a unique identifier for every public exception. Let’s call it a trace id. It is a UUID that is generated in the exception constructor, always passed to clients during serialization and always printed to logs. For example:

IGN-STORE-21: No space left on device. Please free more space and restart your Ignite node. Trace id: 0b3ce41b-000b-4301-83bb-ec2a306e123a

To improve usability, messages should have no line separators. This way, if the user copies only the first line of the exception, it still contains the description, the error code and the trace id. It could also be convenient to show error codes for all causes on the same line, because users often look at a single line only. Not all specifics are clear yet: since no real examples have been implemented, each case will need further small discussions.
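As a rough sketch of the single-line rule, the helper below collapses line separators before the message is assembled. The class ErrorMessages and its methods are hypothetical names used only for this illustration, and the exact punctuation before "Trace id" is an assumption.

```java
import java.util.UUID;

/** Hypothetical helper that keeps an error message on a single line. */
final class ErrorMessages {
    private ErrorMessages() {
    }

    /** Replaces every run of line separators with one space. */
    static String singleLine(String message) {
        return message.replaceAll("\\R+", " ").trim();
    }

    /** Builds "IGN-XXX-nnn: message Trace id: uuid" without line breaks. */
    static String format(String groupName, int errorCode, String message, UUID traceId) {
        return "IGN-" + groupName + "-" + errorCode + ": " + singleLine(message)
            + " Trace id: " + traceId;
    }

    public static void main(String[] args) {
        UUID traceId = UUID.fromString("0b3ce41b-000b-4301-83bb-ec2a306e123a");

        // The line break in the input is replaced by a space in the output.
        System.out.println(format("STORE", 21,
            "No space left on device.\nPlease free more space and restart your Ignite node.",
            traceId));
    }
}
```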

The originating node id could also help to locate the problem, unless it is a routine exception such as “column already exists”. For such exceptions, the trace id and originating node id are optional.

Exception classes

We need exception classes that carry error code information; let’s call them IgniteException and IgniteCheckedException. These classes, by themselves or via subclasses, should be thrown to users in the public API. See IgniteException as a draft of the final implementation.

Standard Java exceptions should be used when applicable; there’s no need to have our own IllegalArgumentException.

All public API methods should document the codes of the exceptions that they throw. Unchecked exceptions are preferred and should be used almost always in the public API. Critical exceptions that must be handled by the user (those that require a manual operation retry, for example) should be implemented as checked exceptions so that users are forced to handle them, or at least notice them.
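From the user's side, handling by error code could look roughly like the sketch below. CodedException is a minimal stand-in for the IgniteException draft, and operation() and TIMEOUT_ERR are hypothetical; only the idea of comparing errorCode() against a documented constant comes from this proposal.

```java
/** Sketch of a catch block that dispatches on the numeric error code. */
final class CatchByCodeExample {
    /** Hypothetical documented constant: group 10, error 2. */
    static final int TIMEOUT_ERR = (10 << 16) | 2;

    /** Minimal stand-in for the exception draft; only the code is modeled. */
    static final class CodedException extends RuntimeException {
        private final int code;

        CodedException(int code, String message) {
            super(message);
            this.code = code;
        }

        int errorCode() {
            return code;
        }
    }

    /** Hypothetical public API call that always fails with a timeout. */
    static void operation() {
        throw new CodedException(TIMEOUT_ERR, "Operation timed out");
    }

    /** Integer comparison replaces string matching on the message. */
    static String callWithHandling() {
        try {
            operation();
            return "ok";
        } catch (CodedException e) {
            if (e.errorCode() == TIMEOUT_ERR) {
                return "retry";
            }
            throw e; // Codes this caller does not know are propagated.
        }
    }

    public static void main(String[] args) {
        System.out.println(callWithHandling()); // retry
    }
}
```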

Examples of how specific exception classes could be integrated with the described model can be found in this section: Specific Exceptions

Exceptions serialization

Since there are languages other than Java, we should have a generic way to convert exceptions between different representations. The current error serialization design is described here: IEP-76 Thin Client Protocol for Ignite 3.0

We should expand the collection of required fields. It should be enough to deserialize exception objects in Java client as close to the original object as possible. This includes cause exceptions.

We are not yet sure about suppressed exceptions. For causes, we would need a list of errors, where the first element is the head and every subsequent element is the cause of the previous one. Each element contains:

  • Error code (int)
  • Error message (String)
  • [Optional] Stack trace, for head only (String)
  • Full class name (String)

Open problem:

  • What should we do with standard Java exceptions, like TimeoutException or IllegalArgumentException, or even NPE? For now, the preferred option is to have a reserved error group for them and assign specific codes to all "known" types.
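The reserved-group idea could be sketched as a lookup table from exception classes to codes. Everything concrete here is an assumption made for illustration: the class name StandardErrors, the group code 1, the individual error codes and the fallback code 0 are not defined anywhere in this proposal.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.TimeoutException;

/** Hypothetical reserved group that assigns codes to known standard exceptions. */
final class StandardErrors {
    /** Assumed group code, not defined by the proposal. */
    static final int GROUP_CODE = 1;

    /** Assumed fallback for exception types that are not in the table. */
    static final int UNKNOWN_ERR = code(0);

    private static final Map<Class<? extends Throwable>, Integer> KNOWN = new LinkedHashMap<>();

    static {
        KNOWN.put(TimeoutException.class, code(1));
        KNOWN.put(IllegalArgumentException.class, code(2));
        KNOWN.put(NullPointerException.class, code(3));
    }

    private StandardErrors() {
    }

    private static int code(int errorCode) {
        return (GROUP_CODE << 16) | (errorCode & 0xFFFF);
    }

    /** Walks up the class hierarchy so that subclasses inherit the parent's code. */
    static int codeOf(Throwable t) {
        for (Class<?> c = t.getClass(); c != null; c = c.getSuperclass()) {
            Integer code = KNOWN.get(c);
            if (code != null) {
                return code;
            }
        }
        return UNKNOWN_ERR;
    }
}
```

With this table, a NumberFormatException would receive the code of IllegalArgumentException because it is a subclass of it.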

Guidelines and restrictions

As you can see, all error groups are only added at runtime. There’s no compile-time validation that there are no collisions. This comes with a set of problems:

  • Late collision detection - we have to rely on tests to find collisions. Such checks can only be performed once the full set of error groups is registered; we have integration tests for this.
  • Difficulties in maintaining collision-free lists of errors between releases. Let’s say that a developer creates a patch for version 3.0.x with a new error code “IGN-ABC-123”. Nothing prevents the same code from being introduced for a different error in version 3.1.x (for example). This can only be resolved by a good set of compatibility tests (which is still hard for not yet released master versions) or by maintaining a golden standard list of errors somewhere independently from the source code, as was done for the IgniteFeatures class in Ignite 2.x.
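A golden-list check could be as simple as the sketch below, which a test might run against the errors registered in the current build. The class GoldenListCheck, the map-based representation (human-readable code mapped to the name of the error it denotes) and the sample names are assumptions for illustration; how the golden list is stored and loaded is left open.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical check of the current error list against a golden list. */
final class GoldenListCheck {
    private GoldenListCheck() {
    }

    /**
     * Both maps go from a human-readable code such as "IGN-ABC-123" to the
     * name of the error it denotes. A code that is already published must
     * not be reused for a different error.
     */
    static List<String> findConflicts(Map<String, String> golden, Map<String, String> current) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, String> e : current.entrySet()) {
            String published = golden.get(e.getKey());
            if (published != null && !published.equals(e.getValue())) {
                problems.add(e.getKey() + " is already assigned to " + published
                    + " and cannot be reused for " + e.getValue());
            }
        }
        return problems;
    }
}
```

An empty result means that every code in the current build is either new or unchanged with respect to the golden list.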

Implementation draft

ErrorGroup

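import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ErrorGroup {
    // Suggested name format, enforced by a regex check.
    private static final Pattern NAME_PATTERN = Pattern.compile("^[A-Z0-9]{3,7}$");

    // All groups registered so far; guarded by the class lock in newGroup.
    private static final List<ErrorGroup> GROUPS = new ArrayList<>();

    private final int code;
    private final String name;

    // Private constructor protects from arbitrary group creation.
    private ErrorGroup(int code, String name) {
        this.code = code;
        this.name = name;
    }

    public int code() {
        return code;
    }

    public String name() {
        return name;
    }

    public int makePublicCode(int code) {
        // Check code range: the error code must fit in 16 bits.
        checkRange(code);
        return (code() << 16) | (code & 0xFFFF);
    }

    public static synchronized ErrorGroup newGroup(int code, String name) {
        checkRange(code); // Range check for the code.
        if (!NAME_PATTERN.matcher(name).matches()) // Regex check for the name.
            throw new IllegalArgumentException("Invalid group name: " + name);
        for (ErrorGroup g : GROUPS) { // Uniqueness check for both name and code.
            if (g.code == code || g.name.equals(name))
                throw new IllegalArgumentException("Duplicate group: " + name);
        }
        ErrorGroup group = new ErrorGroup(code, name);
        GROUPS.add(group);
        return group;
    }

    private static void checkRange(int code) {
        if (code < 0 || code > 0xFFFF)
            throw new IllegalArgumentException("Code out of 16-bit range: " + code);
    }
}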

RaftErrors

// Usage example:
public class RaftErrors {
    // This is the error group for the RAFT.
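    public static final ErrorGroup RAFT_ERR_GROUP = ErrorGroup.newGroup(10, "RFT");

    // Group names must be unique, so a second group needs its own name
    // ("OTH" is only a placeholder).
    public static final ErrorGroup OTHER_ERR_GROUP = ErrorGroup.newGroup(11, "OTH");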


    // These are public constants for users to check in their catch blocks.
    public static final int SPLIT_BRAIN_ERR = RAFT_ERR_GROUP.makePublicCode(1);

    public static final int TIMEOUT_ERR = RAFT_ERR_GROUP.makePublicCode(2);

    public static final int TX_ERR = RAFT_ERR_GROUP.makePublicCode(3);
}

IgniteException

import java.util.UUID;

// This is a draft for public runtime exceptions implementation.
public class IgniteException extends RuntimeException {
    private final ErrorGroup group;
    private final int publicCode;

    // Trace id is a unique exception identifier that should help locating
    // the error message in logs.
    private final UUID traceId;

    // This constructor is only an example. Of course, there will be a
    // variety of constructors for different cases - with or without a
    // cause, different trace id generation strategies, etc.
    public /* ? */ IgniteException(
        ErrorGroup group, int code, String message, UUID traceId
    ) {
        super(makeMessage(group, code, message, traceId));

        // Check that error group from the code matches passed group.
        if ((code >>> 16) != group.code())
            throw new IllegalArgumentException("Error code does not belong to group " + group.name());

        this.group = group;
        this.publicCode = code;
        this.traceId = traceId;
    }

    // Accessor that's used by the end user. Returns constant, previously
    // generated by "makePublicCode".
    public int errorCode() {
        return publicCode;
    }

    public UUID traceId() {
        return traceId;
    }

    private static String makeMessage(
        ErrorGroup group, int code, String message, UUID traceId
    ) {
        return "IGN-" + group.name()
            + "-" + (code & 0xFFFF) + ": " + message
            + ". Trace id: " + traceId;
    }

    // This method might be useful, but I can't think of any specific
    // usages right now.
    public String humanReadableCode() {
        return "IGN-" + group.name() + "-" + (publicCode & 0xFFFF);
    }
}

Specific Exceptions

public class SqlCheckedException extends IgniteCheckedException {
    // Constructors of specific exception types should not take an
    // error group, because it's always the same.
    public SqlCheckedException(<params>) {
        super(SqlErrors.SQL_ERR_GROUP, <params>);
    }
}

public class SqlTxRollbackCheckedException extends SqlCheckedException {
    // Some exception types might fix the error code and differ
    // by message only. This way we could still have a flexible hierarchy
    // that makes sense from an error code standpoint.
    public SqlTxRollbackCheckedException(<params>) {
        super(SqlErrors.SQL_ERR_TX_ROLLBACK, <params>);
    }
}

public class IgniteInternalCodedException extends IgniteInternalException {
    ...
}

Open Tickets


Closed Tickets
