Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Code Block
<daf:defineFormat name="base64Format" >
    <dfdl:format layerTransformation="base64" layerLengthKind="implicit" />
</dafdfdl:defineLayer>defineFormat>

These properties are only relevant to xs:sequence constructs, and so a dfdl:ref to a named format using these layer properties is only sensible from a dfdl:sequence or on an xs:sequence.

...

A data layer is conceptually a stream of bytes. It can be an input layer for parsing, an output layer for unparsing.
Use of the term "stream" here is consistent with java's use of stream as in java.io.InputStream and java.io.OutputStream. These are sources and sinks of bytes. If one wants to decode characters from them you must do so by specifying the encoding explicitly.

A layer transform is a transformation that creates one layer of bytes from another. An underlying layer is encapsulated by a transformation to create an overlying layer.

When parsing, reading from the overlying layer causes reading of data from the underlying layer, which data is then transformed and becomes the bytes of the overlying layer returned from the read.

The layer properties apply to the underlying layer data and indicate how to identify its bounds/length, and if a layer transform is textual, what encoding is used to interpret the underlying bytes.

Some transformations are naturally binary bytes to bytes. Data decompress/compress are the typical example here. When parsing, the overlying layer's bytes are the result of decompression of the underlying layer's bytes.

If a transform requires text, then a dafdfdl:layer format encoding must be defined. For example, base64 is a transform that creates bytes from text. Hence, a layer encoding is needed to convert the underlying layer of bytes into text, then the base64 decoding occurs on that text, which produces the bytes of the overlying layer.

We think of some transforms as text-to-text. Line folding/unfolding is one such. Lines of text that are too long are wrapped by inserting a line-ending and either a space or tab. As a DFDL layer transform this line folding transform requires an encoding. The underlying bytes are decoded into characters according to the encoding. Those characters are divided into lines, and the line unfolding (for parsing) is done to create longer lines of data, the resulting data is then encoded from characters back into bytes using the same encoding.

(There may be opportunities to optimize/shortcut these transformations if the overlying layer is the data layer for an element with scannable text representation using the same character set encoding. The recoversion back to bytes, only to have to then decode bytes to characters of the same encoding again is overhead that can be avoided.)

DFDL can describe a mixture of character set decoding/encoding and binary value parsing/unparsing against the same underlying data representation; hence, the underlying data layer concept is always one of bytes.

(Note: bytes suffices even for mil-std-2045 which can hold a compressed VMF payload. This payload element is always byte aligned even in mil-std-2045, a very bit-oriented format. As of this writing we have no examples of layer transforms that require bit granularity; hence, this is a byte-oriented proposal.)

Daffodil parsing begins with a default standard data input stream. Unparsing begins with a default standard output stream. These are the ultimate underlying layer.

...

Code Block
<xs:schema ....>

 <dfdl:format separatorPosition="infix" lengthKind="delimited" encoding="utf-8"
  occursCountKind="parsed" separator="" sequenceKind="ordered"
  separatorPosition="infix"/>

 <daf <dfdl:defineLayerdefineFormat name="folded">
  <daf<dfdl:layerformat transformlayerTransform="foldedLines" lengthKindlayerLengthKind="delimited" encodinglayerEncoding="us-ascii"/>
  <!-- delimited here means to enclosing terminating markup, as no terminator is defined. -->
</dafdfdl:defineLayer>defineFormat>

<daf<dfdl:defineLayerdefineFormat name="qp">
  <daf<dfdl:layerformat transformlayerTransform="quotedPrintable" lengthKindlayerLengthKind="pattern"
     lengthPatternlayerLengthPattern="[^\n]*?(?=(?<!=)\n)"/>
 
 <!-- QPs are terminated by a newline that is not preceded by an =. 
      This final newline is not consumed as part of the content. -->
  
 <!-- Alternatively, the QP transform itself can determine the length 
      by searching for this final newline (but leaving it there).
      In which case the lengthKind would be "implicit" -->
</dafdfdl:defineLayer>defineFormat>

 <xs:element name="VCalendar" dfdl:initiator="BEGIN:VCALENDAR%NL;" dfdl:terminator="END:VCALENDAR%NL; END:VCALENDAR">
  <xs:complexType>
    <xs:sequence dfdl:separator="%NL;" dfdl:sequenceKind="unordered">
      <xs:element name="ProdID" type="xs:string" dfdl:initiator="PRODID:" minOccurs="0" dafdfdl:layerRefformatRef="tns:folded"/>
      <xs:element name="Version" type="xs:string" dfdl:initiator="VERSION:" minOccurs="0" />
      <xs:element name="VEvent" maxOccurs="unbounded" minOccurs="0" dfdl:occursCountKind="parsed"
        dfdl:initiator="BEGIN:VEVENT%NL;" dfdl:terminator="END:VEVENT">
        <xs:complexType>
          <xs:sequence dfdl:separator="%NL;" dfdl:sequenceKind="unordered">
            <xs:element name="DTStart" type="xs:string" dfdl:initiator="DTSTART:" />
            <xs:element name="DTEnd" type="xs:string" dfdl:initiator="DTEND:" />
            <!-- 
              content from here could have long lines, so must be folded 
            -->
            <xs:sequence dafdfdl:layerRefref="tns:folded">
              <xs:element name="Location" type="xs:string" dfdl:initiator="LOCATION:" minOccurs="0"/>
              <xs:element name="UID" type="xs:string" dfdl:initiator="UID:" minOccurs="0"/>
              <xs:element name="Description" dfdl:initiator="DESCRIPTION:" minOccurs="0">
                <xs:complexType>
                  <xs:sequence>              
                   <xs:element name="Encoding" type="xs:string" 
                               dfdl:initiator="ENCODING=" dfdl:terminator=":" minOccurs="0" />
                     <xs:choice dfdl:choiceDispatchKey="{ if (fn:exists(./Encoding)) then ./Encoding else '' }">
                       <!-- 
                         we inspect the value of the Encoding element and decide what branch of the choice
                         based on it 
                        -->
                       <xs:sequence dfdl:choiceBranchKey="QUOTED-PRINTABLE">
                         dfdl:separator="" dfdl:sequenceKind="unordered">
                         <!--
                          Each branch starts with a distinct dummy element to satisfy the UPA rules of XML Schema
                         -->
                         <xs:element name="QP" type="xs:string" dfdl:inputValueCalc="{ '' }" />
                         <!--
                          Here notice tha tthe layerRef for the qp data is scoped to just this inner element.
                         -->
                         <xs:sequence dfdl:layerRefref="tns:qp">
                           <xs:element name="Value" type="xs:string"/>
                         </xs:sequence><!-- end layer quoted printable -->
                       </xs:sequence>
                       <!-- 
                          repeat the above pattern for the choice branches for the various encodings 
                        -->
                    </xs:choice>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>           
              <xs:element name="Summary" type="xs:string"  dfdl:initiator="SUMMARY:" minOccurs="0"/>
              <xs:element name="Priority" type="xs:string" dfdl:initiator="PRIORITY:" minOccurs="0" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence><!-- end folded layer -->
    </xs:sequence>
  </xs:complexType>
</xs:element>
</xs:schema>

...

Code Block
<xs:complexType name="fileType">
  <xs:sequence dafdfdl:layerRefref="ex:base64">
    <xs:sequence dafdfdl:layerRefref="ex:gzip">
      <xs:group ref="ex:fileTypeGroup"/>
    </xs:sequence>
  </xs:sequence>
</xs:complexType>

Along with that we need the definitions of these named stream formats:

Code Block
<daf<dfdl:defineLayerdefineFormat name="base64">
   <daf<dfdl:layerformat transformlayerTransform="base64" lengthKindlayerLengthKind="implicit" />
</dafdfdl:defineLayer>defineFormat>

<daf<dfdl:defineLayerdefineFormat name="gzip">
   <daf<dfdl:layerformat transformlayerTransform="gzip" lengthKindlayerLengthKind="implicit"/>
</dafdfdl:defineLayer>defineFormat>

These transforms, with lengthKind implicit, are assumed to behave as "self-delimiting" meaning they know how much data to consume.

The gzip transform must consume every byte produced by the base64 transform - that is, when the gzip transform is finished, it must check that no more data can be pulled from the base64 transform. The actual parsing of data is consuming data from the gzip transform, and when that has finished parsing, it must insure all data has been pulled from the gzip stream.

...

  • allows stacking transforms one on top of another. So you can have base64 encoded compressed data as the payload representation of
    a child element within a larger element.

  • allows specifying properties of the underlying data layers separately from the properties of the logical data.

  • scopes the transforms over a xs:sequence body only.prevents inadvertent lexical scoping of a layering transform from a lexically enclosing top level format annotation

  • Avoids new annotation elements with particulars about scoping.
  • Simple: doesn't add new functions for layering use when existing dfdl:contentLength will already handle it.
  • Complex cases - e.g., initiator before layered data, are handled by encapsulating the layered sequence in another sequence or element that carries the initiator.
  • Layer annotations are only about the determining of the length of the layered region, and the algorithm for transforming the data.
  • Layer transforms have mandatory layer alignment (1 byte for now)

Open design issues

  • Parameterization of transform algorithms - many algorithms will have variations that can be controlled by parameters; however, whether there needs to be a parameterization method, or there can just be a large number of individual transforms each having specific configurations of parameters... it is unclear what is truly required and experience with these concepts will be needed before there will be enough information to proposed ideas here.
  • Debug and trace impact, and how to provide visibility to what is going on when an error occurs in the middle of parsing/unparsing when transforms are in use. E.g., the bit/byte position where a run time parse error occurs would be in some transformed stream, not the underlying stream. I suspect some experience with these transform concepts will be needed before there will be enough information to propose ideas here.

...