Problem Statement and Requirements

The problem: we are parsing a stream of messages from an input stream and bad data is encountered. Can we recover, somehow find our way to the next good data, and continue parsing correct data from there?

At a minimum, the parse must not hang when trying to recover.

We want to characterize the situations when recovery is possible, and how it might be carried out.

We assume there is no back path to the sender, the sender being the system that is putting messages (or bad data) onto the stream that the Daffodil parser is reading from.

We assume that we can break/close the connection so that the sender must reconnect. However, we assume the sender has no idea if the data they are sending is good or bad. Forcing them to break and reconnect may only cause them to retransmit bad data on the new connection.

So, the sending-side system doesn't know it is sending bad data. It's just sending, sending, sending....

The consuming system (Daffodil parsing) has to cope and not fail no matter what is being sent. It can log errors, but it can't hang or fail.

Requirements on the DFDL Schema

Any "message format" DFDL schema has to not consume to end-of-stream in the case of malformed data.

The DFDL schema has to be designed to recover by having an element definition for an "unrecognized" element, available as a last choice branch.

A good design principle for DFDL schemas is to have an external variable which controls whether the parse fails on unrecognized data, or instead creates and populates an "unrecognized" element containing that data.
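A minimal sketch of this design, using hypothetical names (knownMessage, unrecognized, allowUnrecognized), a prefix ex bound to the schema's target namespace, and the usual dfdl:format defaults (omitted), might look like the following. The length strategy for the unrecognized element is deliberately left open; it is the subject of the sections below.

<!-- On xs:schema: an external variable that selects fail-vs-capture behavior. -->
<xs:annotation>
  <xs:appinfo source="http://www.ogf.org/dfdl/">
    <dfdl:defineVariable name="allowUnrecognized" type="xs:boolean"
                         external="true" defaultValue="false"/>
  </xs:appinfo>
</xs:annotation>

<xs:element name="message">
  <xs:complexType>
    <xs:choice>
      <!-- Recognized message formats are tried first. -->
      <xs:element ref="ex:knownMessage"/>
      <!-- Last-resort branch: captures unrecognized data, but only when permitted. -->
      <xs:element name="unrecognized" type="xs:hexBinary">
        <!-- Length strategy (explicit, pattern, etc.) is discussed below. -->
        <xs:annotation>
          <xs:appinfo source="http://www.ogf.org/dfdl/">
            <!-- When the variable is false this branch fails, so the whole parse fails. -->
            <dfdl:assert test="{ $ex:allowUnrecognized }"
                         message="Unrecognized message data encountered"/>
          </xs:appinfo>
        </xs:annotation>
      </xs:element>
    </xs:choice>
  </xs:complexType>
</xs:element>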

The question is, if the data is unrecognized, how do we know how much to capture from the input stream?

Failure Modes

The principle here has to be that the amount of data consumed is always at least bounded by some upper limit.

There are two kinds of failures to consider:

  1. Unknown structure - the schema is parsing successfully and the header is recognized, but the payload within the message is unrecognized. There are two variants here:
    a. we parsed enough to have positive knowledge of the message length
    b. the length cannot be determined positively
  2. Garbage - the parse is failing - some fields in the header are malformed, so we cannot even parse the header.

Unknown Payload - Known Length

In case (1a), we can determine the length, so we can consume that much data into an unrecognized element. This is the easy case. However, we still need to handle the case where the given length is unreasonable: if the corruption falls exactly within the length information of the header, the length value may be absurdly large. So the length needs to be limited to some finite maximum.
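A sketch of case (1a), with hypothetical field names and an illustrative 64 KiB sanity limit, again assuming the usual dfdl:format defaults. The assert sits on the length field so that an absurd value is rejected before any payload bytes are consumed.

<xs:sequence>
  <xs:element name="payloadLength" type="xs:unsignedInt"
              dfdl:representation="binary"
              dfdl:lengthKind="explicit" dfdl:length="4" dfdl:lengthUnits="bytes">
    <xs:annotation>
      <xs:appinfo source="http://www.ogf.org/dfdl/">
        <!-- Reject corrupted or absurd lengths before consuming any payload. -->
        <dfdl:assert test="{ . le 65536 }"
                     message="Payload length exceeds sanity limit"/>
      </xs:appinfo>
    </xs:annotation>
  </xs:element>
  <!-- Capture exactly the stated number of bytes as opaque data. -->
  <xs:element name="unrecognizedPayload" type="xs:hexBinary"
              dfdl:lengthKind="explicit" dfdl:lengthUnits="bytes"
              dfdl:length="{ ../payloadLength }"/>
</xs:sequence>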

Unknown Payload - Unknown Length

In case (1b), the ability to recover depends on the ability to determine the length somehow. This is schema-specific. For textual data, useful heuristics include looking for a line ending, form feed, or other "likely terminator" that signifies the probable end of the unrecognized payload. The schema could also look for a few bytes that typically appear at the start of a new message. These are only heuristics, though.

But there also has to be a finite upper bound. A value might be chosen based on the minimum/maximum size of these messages if that is knowable.

I think that if the element that grabs the unrecognized part of the message uses dfdl:lengthKind="pattern" to express these heuristics, then Daffodil's tunable limits on regex match size will suffice to make this finite, and those limits can be tuned to system needs.
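For example, such a heuristic can be expressed with the common Daffodil idiom of treating the raw bytes as ISO-8859-1 characters so that a regular expression can scan binary data. The 2048-character cap and the terminator alternatives here are illustrative only, and the pattern assumes Java-style regex support for the lookahead.

<!-- Capture up to 2048 bytes, stopping just before a likely terminator
     (CR LF or form feed). If no terminator is found within the cap, the
     pattern fails to match and the element gets length 0. -->
<xs:element name="unrecognized" type="xs:string"
            dfdl:encoding="ISO-8859-1"
            dfdl:lengthKind="pattern"
            dfdl:lengthPattern="(?s).{0,2048}?(?=\r\n|\f)"/>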

So long as there is an upper bound on the length it will consume, the parse will not hang, provided the sender system continues to send more data.

If it cannot find a heuristic end of message within some number of bytes, it should probably just parse the header fields (if possible), consume no additional data, and try parsing again immediately after the header. Note that if the unrecognized element uses lengthKind="pattern", this is the default behavior, since failure to match is equivalent to finding length 0.

If this comes back with length 0, then I think it makes sense to use Daffodil's lookahead feature (not in standard DFDL as yet) to grab, say, the first 8 bytes of the message for use in the unrecognized element. We may want to distinguish unrecognized data with known length, with heuristic length, and starting with these 8 bytes. That may be overkill.
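If distinguishing these cases is worthwhile, the choice could be arranged along the following lines (element names hypothetical). Note that the fixed 8-byte fallback shown here simply consumes those bytes, rather than using the lookahead extension, which would leave them in place.

<xs:choice>
  <!-- Case 1a: the header gave a usable length. -->
  <xs:element ref="ex:unrecognizedKnownLength"/>
  <!-- Case 1b: heuristic terminator scan (lengthKind="pattern"); it should assert a
       non-empty match so that this branch fails, and falls through, when nothing is found. -->
  <xs:element ref="ex:unrecognizedHeuristic"/>
  <!-- Fallback: record just the first 8 bytes so the infoset captures something. -->
  <xs:element name="unrecognizedPrefix" type="xs:hexBinary"
              dfdl:lengthKind="explicit" dfdl:lengthUnits="bytes" dfdl:length="8"/>
</xs:choice>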

Garbage In Data Stream

In case (2), the data is garbage and we get a failed parse. We should log something (in software, or via a DFDL recoverable-error assert once Daffodil supports those), advance 1 byte, and try again. This will create a flood of log messages until something parses successfully again, but if there are well-formed messages later in the stream, the parser should re-synchronize with them.

A good design here would log a start-of-bad-data message, then enter a sub-loop of parsing that expects failures, counts and captures the failed bytes, and, upon a successful parse or reaching N failures, emits a log message about the span of bad data. This is far better than one log message per bad byte, but still consumes 1 byte at a time so as to eventually re-synchronize.

An assumption here is that the parses will likely fail fast until finally one succeeds.

A parse is considered successful if it returns a message infoset, even if that infoset contains an unrecognized payload element.

Timeouts - Finite Data Size and Finite Delay

Recovery requires that when data is unrecognized and does not match the schema, we can parse an "unrecognized" element with a finite maximum size, so that accumulating this unrecognized data cannot take forever.

The finiteness also needs to apply to time. That is, the amount of time to wait while parsing one message, from the moment any data arrives to the moment we give up, must also be finite. We can block indefinitely waiting for 0 bytes, but once we get at least 1 byte we start a timer, and if we do not get any more bytes, so that we cannot finish the parse within the time limit, we should give up. Each time we get some data, even just 1 more byte, the timer should reset; so it is not the total time to parse a message, it is the allowed delay between receipts of data for a message.

So, imagine the hard bound on maximum element size is 2K, but the stream blocks trying to get more data to find a terminator and has not yet reached the 2K limit. How long should we wait? I would claim something between 10 seconds and a minute is probably about right.

The key is that after some timeout the connection is broken, so that Daffodil gets an end-of-data for the parser, and both Daffodil (listening and reading) and the sender (connecting and writing) see a broken connection and have to go around a loop and re-establish the connection again.

A key principle: if you time out on a read request, that's it for that stream. The connection needs to be re-established by both parties.

Non-Solution: Full Message Handoffs - Using Delimiters or Headers

It has been suggested that the sending system should somehow separate messages so that we can tell when reading where one message ends and the next begins without having to parse the data.

This can be a valuable technique, but is not a general solution to the problem.

First, not all data can be delimited - binary data generally cannot support delimiters as any given delimiter pattern could appear within legal binary data.

Even when data is textual and can be delimited, there may not be any characters that can be reserved for delimiting.  That is, the textual format may allow any character string at all.

Because delimiters don't work for binary data, one must instead encapsulate the data or precede each message with a header that gives the length of the message. This is a common strategy and there are many semi-standard message header formats such as mil-std-2045, JREAP-C, NACT, and SIMPLE which are used with the binary military messaging formats.

One difficulty here is that a problem remains if the header information itself is corrupted. We have presumably made the problem simpler, because the header format is simpler than the general data format. But we still face the same challenges if the header, with its length information, has been corrupted in transport.

In addition, somebody has to "go first" and be the process that parses raw data messages "the hard way" so that standard headers can be created and the messages enhanced with those headers. It makes sense that this first processing stage is a parser, e.g., DFDL/Daffodil. 

So ultimately, headers/delimiters aren't a solution.
