...
First, we can distinguish between recoverable and fatal exceptions. Recoverable exceptions should be handled internally and never bubble out to the user. For fatal exceptions, Kafka Streams is doomed to fail and cannot start or continue to process data.
Related to this are retriable exceptions. While retriable exceptions are recoverable in general, the (configurable) retry counter might be exceeded; in this case, we end up with a fatal exception.
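
The escalation from retriable to fatal can be sketched as follows. This is a minimal illustration, not Kafka Streams code: `RetriableError`, `RetriesExhaustedError`, `RetrySketch`, and the `maxRetries` parameter are hypothetical stand-ins introduced here, and a real implementation would likely add a backoff between attempts.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a retriable exception; not a Kafka class.
class RetriableError extends RuntimeException {
    RetriableError(String message) { super(message); }
}

// Hypothetical fatal exception raised once the retry budget is used up.
class RetriesExhaustedError extends RuntimeException {
    RetriesExhaustedError(String message, Throwable cause) { super(message, cause); }
}

final class RetrySketch {
    // Runs the action once and retries it up to maxRetries more times on RetriableError.
    // When the budget is exhausted, the last retriable error is wrapped in a fatal one.
    static <T> T callWithRetries(Supplier<T> action, int maxRetries) {
        int retries = 0;
        while (true) {
            try {
                return action.get();
            } catch (RetriableError e) {
                if (retries >= maxRetries) {
                    throw new RetriesExhaustedError("gave up after " + maxRetries + " retries", e);
                }
                retries++;
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Fails twice with a retriable error, then succeeds on the third call.
        String result = RetrySketch.<String>callWithRetries(() -> {
            if (++calls[0] < 3) {
                throw new RetriableError("transient failure");
            }
            return "ok";
        }, 5);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

Any exception type other than the retriable one passes through the helper untouched, so fatal exceptions are never retried.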
The second distinction is between "external" and "internal" exceptions. By "external" we refer to any exception that could be returned by the brokers. "Internal" exceptions are those that are raised locally.
...
- We should never try to handle any fatal exceptions but clean up and shut down.
- We should catch all those exceptions (all Java exceptions, generic KafkaException, ...) for cleanup only and rethrow them unmodified (they will eventually bubble out of the thread and trigger the uncaught exception handler if one is registered)
...
- We should only log those exceptions (with ERROR level) once at the thread level before they bubble out of the thread to avoid duplicate logging
- We need to do fine-grained exception handling, i.e., catch exceptions individually instead of coarse-grained, and react accordingly
- All methods should have complete JavaDocs about the exceptions they might throw
- All exception classes must have strictly defined semantics that are documented in their JavaDocs
- For retriable exceptions, we should throw a special StreamsException if retries are exhausted
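
The "clean up, rethrow unmodified, log once at the thread level" rules above could look roughly like this. It is a sketch under assumptions: `CleanupRethrowSketch` and its methods are hypothetical names, the `events` list stands in for a real logger so the ordering is observable, and only `RuntimeException` is caught to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

final class CleanupRethrowSketch {
    // Stand-in for a logger (assumption); records what happened and in which order.
    static final List<String> events = new ArrayList<>();

    // Lower layer: catches only to clean up, does not log, rethrows the same exception object.
    static void processTask(Runnable task) {
        try {
            task.run();
        } catch (RuntimeException e) {
            events.add("cleanup");
            throw e;
        }
    }

    // Thread level: the single place where the error is logged before it leaves the thread.
    static void threadLoop(Runnable task) {
        try {
            processTask(task);
        } catch (RuntimeException e) {
            events.add("ERROR " + e.getMessage());
            throw e;
        }
    }

    // Runs the task in its own thread and returns whatever reached the uncaught exception handler.
    static Throwable runInThread(Runnable task) {
        Throwable[] seen = new Throwable[1];
        Thread thread = new Thread(() -> threadLoop(task));
        thread.setUncaughtExceptionHandler((t, error) -> {
            events.add("handler");
            seen[0] = error;
        });
        thread.start();
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the thread", e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        RuntimeException failure = new IllegalStateException("boom");
        Throwable seen = runInThread(() -> { throw failure; });
        System.out.println(events + ", same instance: " + (seen == failure));
    }
}
```

Because no layer wraps the exception, the registered handler receives the original object with its original stack trace.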
Category | KafkaConsumer | KafkaProducer | StreamsKafkaClient | AdminClient | Streams API |
---|---|---|---|---|---|
fatal (should never occur) | local: IllegalArgumentException, IllegalStateException, WakeupException, InterruptException; remote: UnknownServerException | local: IllegalArgumentException, IllegalStateException, WakeupException, InterruptException; remote: UnknownServerException, OffsetMetadataTooLarge, SerializationException (we use ...) | local: IllegalArgumentException, IllegalStateException, WakeupException, InterruptException; remote: UnknownServerException, InvalidTopicException | local: IllegalArgumentException, IllegalStateException, WakeupException, InterruptException; remote: UnknownServerException, InvalidTopicException | local: (none) |
fatal | local: ConfigException; remote: AuthorizationException (including all subclasses), AuthenticationException (including all subclasses), SecurityDisabledException, InvalidTopicException | local: ConfigException; remote: AuthorizationException (including all subclasses), AuthenticationException (including all subclasses), SecurityDisabledException, InvalidTopicException, UnknownTopicOrPartitionException (retriable? refresh metadata?), RecordBatchTooLargeException, RecordTooLargeException | local: ConfigException; remote: AuthorizationException (including all subclasses), AuthenticationException (including all subclasses), SecurityDisabledException | local: ConfigException; remote: AuthorizationException (including all subclasses), AuthenticationException (including all subclasses), SecurityDisabledException | local: ConfigException, SerializationException |
retriable | local: (none); remote: InvalidOffsetException (OffsetOutOfRangeException, NoOffsetForPartitionException), CommitFailedException, TimeoutException, QuotaViolationException? | local: (none); remote: CorruptRecordException, NotEnoughReplicasAfterAppendException, OffsetOutOfRangeException (when can producer get this?), TimeoutException, QuotaViolationException?, BufferExhaustedException (verify) | local: (none); remote: (none) | local: (none); remote: (none) | local: (none) |
recoverable | local: (none); remote: (none) | local: (none); remote: ProducerFencedException | local: (none); remote: (none) | local: (none); remote: (none) | local: LockException |
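
One way to make such a table actionable in code is a lookup from exception type to category. The sketch below encodes a few entries from the KafkaProducer column only. `Category`, `ProducerExceptionTable`, and `classify` are hypothetical names, and matching on the simple class name is an assumption made purely so the example runs without a Kafka dependency; real code would match on the exception classes themselves. Treating unknown exceptions as fatal is also an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// The four row labels of the table.
enum Category { FATAL_SHOULD_NEVER_OCCUR, FATAL, RETRIABLE, RECOVERABLE }

final class ProducerExceptionTable {
    // Keyed by simple class name (assumption, see lead-in); only a subset of the column is listed.
    private static final Map<String, Category> BY_NAME = new HashMap<>();
    static {
        BY_NAME.put("IllegalArgumentException", Category.FATAL_SHOULD_NEVER_OCCUR);
        BY_NAME.put("IllegalStateException", Category.FATAL_SHOULD_NEVER_OCCUR);
        BY_NAME.put("ConfigException", Category.FATAL);
        BY_NAME.put("AuthorizationException", Category.FATAL);
        BY_NAME.put("TimeoutException", Category.RETRIABLE);
        BY_NAME.put("ProducerFencedException", Category.RECOVERABLE);
    }

    // Walks up the class hierarchy so that subclasses inherit the category of a listed parent
    // ("including all subclasses"). Unlisted exceptions default to FATAL (assumption).
    static Category classify(Throwable error) {
        for (Class<?> c = error.getClass(); c != null; c = c.getSuperclass()) {
            Category category = BY_NAME.get(c.getSimpleName());
            if (category != null) {
                return category;
            }
        }
        return Category.FATAL;
    }

    public static void main(String[] args) {
        // NumberFormatException is a subclass of IllegalArgumentException.
        System.out.println(classify(new NumberFormatException("not a number")));
    }
}
```

A caller could then switch on the returned category to decide between retrying, recovering, and shutting down.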
Having a look at all KafkaException subclasses, there are some exceptions for which we need to double check whether they could bubble out of any client (or maybe we should not care, and treat all of them as fatal/remote exceptions).
-> DataException, SchemaBuilderException, SchemaProjectorException, RequestTargetException, NotLeaderException, NotAssignedException, IllegalWorkerStateException, ConnectRestException, BadRequestException, AlreadyExistsException (might be possible to occur, or only TopicExistsException), NotFoundException, ApiException, InvalidTimestampException, InvalidGroupIdException, InvalidReplicationFactorException (might be possible, but indicates a bug), o.a.k.common.errors.InvalidOffsetException and o.a.k.common.errors.OffsetOutOfRangeException (side note: do those need cleanup – seem to be duplicates?), ReplicaNotAvailableException, UnknownServerException, OperationNotAttemptedException, TransactionCoordinatorFencedException, PolicyViolationException, ControllerMovedException, UnknownMemberIdException, InvalidConfigurationException, InvalidFetchSizeException, InvalidReplicaAssignmentException, OutOfOrderSequenceException, InconsistentGroupProtocolException, RebalanceInProgressException, LogDirNotFoundException, BrokerNotAvailableException, InvalidOffsetCommitSizeException, InvalidTxnTimeoutException, InvalidPartitionsException, TopicExistsException (cf. AlreadyExistsException), InvalidTxnStateException, ProducerFencedException, UnsupportedForMessageFormatException, UnsupportedVersionException, InvalidSessionTimeoutException, InvalidRequestException, IllegalGenerationException, InvalidRequiredAcksException
-> RetriableException, CoordinatorLoadInProgressException, GroupLoadInProgressException, CoordinatorNotAvailableException, NotControllerException, RetriableCommitFailedException, DuplicateSequenceNumberException, NotEnoughReplicasException, NotEnoughReplicasAfterAppendException, InvalidRecordException, NotCoordinatorForGroupException, DisconnectException, InvalidMetadataException (NotLeaderForPartitionException, NoAvailableBrokersException, NetworkException, UnknownTopicOrPartitionException, KafkaStorageException, StaleMetadataException, LeaderNotAvailableException), NotCoordinatorException, GroupCoordinatorNotAvailableException
Should never happen:
- UnsupportedVersionException (we do a startup check and should not start processing for this case in the first place)
Handled by client (consumer, producer, admin) internally and should never bubble out of a client: (verify)
- ConnectionException, RebalanceNeededException, InvalidPidMappingException, ConcurrentTransactionsException, NotLeaderException, TransactionCoordinatorFencedException, ControllerMovedException, UnknownMemberIdException, OutOfOrderSequenceException, CoordinatorLoadInProgressException
Might bubble out:
-
What about
- TopicExistsException (consumer group leader "split brain" – might be self-healing)
GroupLoadInProgressException, NotControllerException, NotCoordinatorException, NotCoordinatorForGroupException, StaleMetadataException, NetworkException