
...

Any "message format" DFDL schema must not consume to end-of-stream in the case of malformed data. That is, a parse() call cannot hang.

The DFDL schema has to be designed to recover by having an element definition for an "unrecognized" element, available as a last choice branch.

...

The question is, if the data is unrecognized, how do we know how much to capture from the input stream?

The answer depends on the nature of the DFDL schema-described format, and whether we have any information about how much malformed data to consume from the data stream.

Failure Modes

The principle here has to be that the amount of data consumed is always at least bounded by some upper limit.

...

In case (1a), we can determine the length, so we can consume that much data into an unrecognized element. This is the easy case. We do, however, need to deal with the case where the given length is entirely unreasonable: if the corruption falls exactly within the length information of the header, the header may carry a vast length value. So the length needs to be limited to some finite maximum. If the length exceeds this maximum, then we can treat the case the same as the Unknown Payload - Unknown Length case below, or the Garbage In Data Stream case below.
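As a sketch of case (1a), the unrecognized element might take its length from the header, guarded by an assert enforcing the finite maximum. The element names, expression paths, and 64KB cap here are all hypothetical, not from any particular schema:

Code Block
<element name="unrecognized" type="xs:hexBinary"
    dfdl:lengthKind="explicit"
    dfdl:lengthUnits="bytes"
    dfdl:length="{ ../tns:header/tns:payloadLength }">
  <annotation>
    <appinfo source="http://www.ogf.org/dfdl/">
      <!-- hypothetical cap; if the assert fails, this branch fails
           and parsing falls through to a later choice branch -->
      <dfdl:assert test="{ ../tns:header/tns:payloadLength le 65536 }"
          message="Length field is unreasonably large."/>
    </appinfo>
  </annotation>
</element>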

Unknown Payload - Unknown Length

...

If this comes back with length 0, then I think it makes sense to use Daffodil's lookahead feature (not in standard DFDL as yet) to grab, say, the first 8 bytes of the message for use in the unrecognized element. We may want to distinguish unrecognized with known length, unrecognized with heuristic length, and unrecognized starting with these 8 bytes. That may be overkill.
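Capturing a fixed 8-byte prefix into the unrecognized element is straightforward (the element name is illustrative):

Code Block
<element name="unrecognizedPrefix" type="xs:hexBinary"
    dfdl:lengthKind="explicit"
    dfdl:length="8"
    dfdl:lengthUnits="bytes"/>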

If there is no reasonable heuristic to determine a likely run of malformed data, then treat as in the Garbage data case below.

Garbage In Data Stream

In case (2), the data is garbage, and we get a failed parse. We should log something (in software, or via a DFDL recoverable error assert, once Daffodil supports those). So long as the data stream is not at its end, what we want to do is advance 1 unit of data. This might be 1 byte for byte-oriented data formats, but could be smaller. It could be 4 bits for hex/nibble-oriented data, or even just 1 bit for bit-granularity data. For this discussion we'll assume 1 byte.

So we want to consume 1 byte and try the parse again.

This can still be done in the DFDL schema, by having a final choice branch that consists of:

Code Block
<element name="malformed" type="tns:invalidByte"/>

where the type tns:invalidByte is defined as:

Code Block
<simpleType name="invalidByte" dfdl:representation="binary" dfdl:lengthKind="implicit">
    <restriction base="xs:unsignedByte">
        <maxExclusive value="0"/> <!-- can never pass. Always will be invalid. -->
    </restriction>
</simpleType>

This will create a "well-formed" infoset containing these elements that are named "malformed". One byte of data will be consumed from the input stream.

Validation against the DFDL schema's facets will show that this data is invalid. However, in this case that invalidity really means the data is malformed. Based on this invalidity, an application can discard the element (possibly logging the issue) and parse again, now starting 1 byte later. It will continue to consume bytes in this way until it finds something that can parse successfully again; if there are well-formed messages later in the stream, it should re-synchronize with them. Done naively, though, this creates a flood of log messages, one per bad byte.

A good design here would log a start-of-bad-data message, then enter a sub-loop of parsing that expects failures, counts and captures the failed bytes, and, upon a successful parse or upon reaching N failures, emits a single log message summarizing the run of bad data. This is far better than 1 log message per bad byte, but still allows 1 byte at a time to be consumed so as to eventually re-synchronize. There might also be detection of a run longer than N bytes, which could trigger a higher-level failure that actually stops the processing (perhaps by breaking the connection).
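The sub-loop described above would live in the application driver around the parse calls. A minimal sketch in Python, where try_parse is a hypothetical wrapper around the real DFDL parse call (returning the message and new position on success, or None on failure):

```python
def resync_parse(data, try_parse, max_run=256):
    """Parse messages from data, skipping 1 byte at a time over garbage.

    try_parse(data, pos) -> (message, new_pos) on success, else None.
    Returns (messages, bad_runs) where each bad run is (offset, length),
    aggregated into one record per run rather than one per bad byte.
    """
    pos, messages, bad_runs = 0, [], []
    bad_start = None
    while pos < len(data):
        result = try_parse(data, pos)
        if result is not None:
            if bad_start is not None:
                # end of a run of malformed bytes: record it once
                bad_runs.append((bad_start, pos - bad_start))
                bad_start = None
            msg, pos = result
            messages.append(msg)
        else:
            if bad_start is None:
                bad_start = pos  # start-of-bad-data
            pos += 1             # consume 1 byte and try again
            if pos - bad_start >= max_run:
                # run too long: escalate to a higher-level failure
                raise RuntimeError("malformed run exceeded limit")
    if bad_start is not None:
        bad_runs.append((bad_start, pos - bad_start))
    return messages, bad_runs
```

Each entry in bad_runs would become one log message about the whole span of bad data.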

An assumption here is that the parses will likely fail fast until finally one succeeds. Formats that are not highly selective about identifying bad data won't work so well here.

This technique is used in an example on openDFDL: https://github.com/OpenDFDL/examples/tree/master/hexWords

It is considered to be a success if a parse returns a message infoset, even if that infoset contains an unrecognized payload element.

Timeouts - Finite Data Size and Finite Delay

For the 1a and 1b techniques described above, recovery requires that when data is unrecognized and doesn't match the schema, we can parse an "unrecognized" element that has a finite maximum size, so that the process of accumulating this unrecognized data will not take forever.

The finiteness needs also to apply to time. That is, the amount of time to wait while parsing 1 message, from the time it gets any data to the time when it gives up, must also be finite. So we can hang waiting for 0 bytes, but once we get at least 1 byte, we start a timer, and if we don't get any more bytes, so that we can't finish the parse within the time limit, we should give up. Each time we get some data, even just 1 more byte, the timer should reset; so it's not the total time to parse a message, it's the delay time between receipts of data for a message.

So, imagine the hard bound on max element size was 2K, but the stream blocks trying to get more data to find a terminator, and it hasn't reached the 2K limit yet. How long should we wait? I would claim something between 10 seconds and a minute is probably about right.
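This per-receipt timer can be sketched with a socket timeout, which applies to each individual recv and so resets whenever any data arrives. A minimal sketch in Python (function name and limits are illustrative):

```python
import socket

def read_with_idle_timeout(sock, idle_timeout=30.0, max_size=2048):
    """Read up to max_size bytes, giving up if the gap between
    receipts of data exceeds idle_timeout seconds."""
    # The timeout applies to each recv() separately, so it measures
    # the delay between receipts of data, not total message time.
    sock.settimeout(idle_timeout)
    buf = bytearray()
    while len(buf) < max_size:       # hard bound on element size
        try:
            chunk = sock.recv(max_size - len(buf))
        except socket.timeout:
            # idle too long between bytes: give up; the caller must
            # break the connection and re-establish it
            raise
        if not chunk:                # peer closed: end of data
            break
        buf.extend(chunk)
    return bytes(buf)
```

On timeout the exception propagates, modeling the rule that a read timeout ends the stream and forces both parties to reconnect.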

The key is that after some timeout, the connection is broken, so that daffodil gets an end-of-data for the parser, and both daffodil (listening and reading from) and the sender (connecting and writing to) get a broken connection and have to go around a loop and re-establish the connection again.

A key principle is if you timeout on a read request, that's it for that stream. The connection needs to be re-established from both parties.

Note that this timeout-related technique does not apply to the Garbage In Data Stream technique described above, since that technique waits for exactly 1 byte (or less) of data at a time.

Non-Solution: Full Message Handoffs - Using Delimiters or Headers

...