...

To show these new annotations at work, we first create a named format that specifies a layering:

Code Block
languagexml
<dfdl:defineFormat name="base64Format">
    <dfdl:format layerTransform="base64" layerLengthKind="implicit" />
</dfdl:defineFormat>

...

When a DFDL schema needs to describe, say, gzip encoding, the DFDL annotations might look like this:

Code Block
languagexml
<annotation><appinfo source="http://www.ogf.org/dfdl/">
  <dfdl:defineFormat name="compressed">
    <dfdl:format layerTransform="gzip" layerLengthKind="implicit" />
  </dfdl:defineFormat>
</appinfo></annotation>

<sequence dfdl:ref="tns:compressed">
  <group ref="tns:compressedGroupContents"/>
</sequence>

...

If we need to determine or verify the length of the layered data, then we must encapsulate the layered sequence in an element so that a path expression can refer to it.

Code Block
languagexml
<annotation><appinfo source="http://www.ogf.org/dfdl/">
  <dfdl:defineFormat name="compressed">
    <dfdl:format layerTransform="gzip" layerLengthKind="implicit" />
  </dfdl:defineFormat>
</appinfo></annotation>

<sequence>
  ...
  <element name="compressedPayloadLength" type="xs:int"
    dfdl:outputValueCalc='{ dfdl:contentLength(../compressedPayload, "bytes") }'/>

  <element name="compressedPayload" >
    <complexType>
      <sequence dfdl:ref="tns:compressed">
        <group ref="tns:compressedGroupContents"/>
      </sequence>
    </complexType>
  </element>

  <sequence>
    <xs:annotation><xs:appinfo source="http://www.ogf.org/dfdl/">
       <dfdl:assert>{ compressedPayloadLength eq dfdl:contentLength(compressedPayload, "bytes") }</dfdl:assert>
    </xs:appinfo></xs:annotation>
  </sequence>
  ....
</sequence>
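
To make the length check concrete, here is a minimal Python sketch of the behavior the schema above describes. It is an illustration only, not DFDL or Daffodil code. It assumes a 4-byte big-endian binary length field (the schema only says xs:int) and that the gzip layer runs to the end of the data; the names `unparse_payload` and `parse_payload` are hypothetical.

```python
import gzip
import struct


def unparse_payload(payload: bytes) -> bytes:
    """Compress the payload and prefix it with the compressed length in bytes."""
    compressed = gzip.compress(payload)
    # Assumption: the length is a 4-byte big-endian signed integer.
    # This plays the role of dfdl:outputValueCalc with dfdl:contentLength(..., "bytes").
    return struct.pack(">i", len(compressed)) + compressed


def parse_payload(data: bytes) -> bytes:
    """Read the length prefix, verify it against the layered bytes, then decompress."""
    (declared_length,) = struct.unpack(">i", data[:4])
    compressed = data[4:]
    # This plays the role of the dfdl:assert: the stored length must equal
    # the length of the layered (still compressed) representation.
    if declared_length != len(compressed):
        raise ValueError("compressedPayloadLength does not match the payload")
    return gzip.decompress(compressed)
```

Note that the length being compared is that of the compressed bytes, not of the data inside the layer, which is why the layered sequence has to be wrapped in an element that a path expression can name.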

...

Let's look at an example of two interacting data layer transforms.

Code Block
languagexml
<dfdl:defineFormat name="foldedLines">
  <dfdl:format layerTransform="foldedLines" layerLengthKind="delimited"/>
</dfdl:defineFormat>

<dfdl:defineFormat name="base64">
  <dfdl:format layerTransform="base64" layerEncoding="us-ascii" layerLengthKind="delimited" layerTerminator='{ ./marker }'/>
  <!-- note expression above is ./marker, not ../marker -->
</dfdl:defineFormat>

 <xs:sequence dfdl:ref="tns:foldedLines">
   <xs:sequence>
     ...
     ... presumably everything here is textual, and utf-8. FoldedLines only applies sensibly to text.
     ...
     <xs:element name="marker" type="xs:string" .../>
     <xs:sequence dfdl:ref="tns:base64">
        <xs:sequence>
          ...
          ... everything here is parsed against the bytes obtained from base64 decoding
          ... which is itself decoding the output of the foldedLines transform
          ... above. Base64 requires only us-ascii, which is a subset of utf-8.
          ...
        </xs:sequence><!-- end base64 data -->
      </xs:sequence><!-- end base64 sequence e.g., framing, aligning -->
   </xs:sequence><!-- end of foldedLines data-->
 </xs:sequence><!-- end foldedLines sequence e.g., framing, aligning -->
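
The nesting order can be sketched in Python as follows. This is only an illustration of how the two layers interact when parsing, under several assumptions that the schema above leaves open: a fold is taken to be a line break followed by whitespace, the marker element is taken to occupy the first line, and the function names are hypothetical.

```python
import base64
import re


def unfold(text: str) -> str:
    """Outer layer: undo line folding."""
    # Assumption: a fold is a line break followed by one or more spaces or tabs.
    return re.sub(r"\r?\n[ \t]+", "", text)


def parse_layers(text: str) -> bytes:
    """Apply the foldedLines layer first, then the base64 layer inside it."""
    unfolded = unfold(text)
    # Assumption: the marker element is the first line of the unfolded text.
    marker, _, rest = unfolded.partition("\n")
    # The base64 layer is delimited: it ends where the marker value reappears,
    # which corresponds to layerTerminator='{ ./marker }'.
    encoded, found, _ = rest.partition(marker)
    if not found:
        raise ValueError("layer terminator not found")
    # Everything inside the inner layer is parsed against these decoded bytes.
    return base64.b64decode(encoded)
```

The point of the sketch is the ordering: the base64 decoder never sees a fold, because it reads from the output of the unfolding step rather than from the raw data.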

...

Consider this VCALENDAR Data:

Code Block
languagetext
BEGIN:VCALENDAR
PRODID:
VERSION:1.0
BEGIN:VEVENT
DTSTART:20170903T170000Z
DTEND:20170903T173000Z
LOCATION:test location
UID:040000008200E00074C5B7101A82E0080000000010156B50B224D301000000000000000
    01000000083A43200A4E43F4E800BE12703B99BF0
DESCRIPTION;ENCODING=QUOTED-PRINTABLE:=
 Text that will require line folding: Lorem ipsum dolor sit amet, consecte=
 tur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore=
 magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco=
 laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor i=
 n reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla par=
 iatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui =
 officia deserunt mollit anim id est laborum.=0D=0A=0D=0A =0D=0A=0D=0A=0D==
 =0A
SUMMARY:test subject
PRIORITY:3
END:VEVENT
END:VCALENDAR

We want to create a schema that describes this.

In the above there are two behaviors that require use of stream transforms. First is the UID. This has been broken to a maximum line length of 76 characters by way of the folded-lines transformation.

The second is the DESCRIPTION, which uses a transformation called QUOTED-PRINTABLE that both keeps lines short and enables embedding of CR, LF, and other characters at the ends of lines.
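
The two transforms can be illustrated with a short Python sketch using only the standard library. The unfolding rule (a line break followed by spaces or tabs) is an assumption chosen to fit the UID lines above, and the quoted-printable value is a shortened, hypothetical sample rather than the full DESCRIPTION.

```python
import quopri
import re


def unfold(text: str) -> str:
    """Undo line folding by dropping each line break and the whitespace after it."""
    # Assumption: a fold is a line break followed by one or more spaces or tabs.
    return re.sub(r"\r?\n[ \t]+", "", text)


def decode_quoted_printable(text: str) -> str:
    """Decode quoted-printable text: remove soft line breaks and expand =XX escapes."""
    return quopri.decodestring(text.encode("ascii")).decode("utf-8")


# The folded UID lines from the data above.
folded_uid = (
    "UID:040000008200E00074C5B7101A82E0080000000010156B50B224D301000000000000000\n"
    "    01000000083A43200A4E43F4E800BE12703B99BF0\n"
)
uid_line = unfold(folded_uid)

# Hypothetical shortened value: "=" at end of line is a soft break,
# and =0D=0A encodes an embedded CRLF.
qp_value = "Lorem ipsum dolor sit amet, consecte=\ntur adipiscing elit.=0D=0A"
description = decode_quoted_printable(qp_value)
```

After unfolding, the UID occupies a single line; after quoted-printable decoding, the soft line break disappears and the escaped CR and LF become real characters in the value.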

The result is that we want this XML Infoset:

Code Block
languagexml
<VCalendar>
  <ProdID>-//Microsoft Corporation//Outlook 15.0 MIMEDIR//EN</ProdID>
  <Version>1.0</Version>
  <VEvent>
    <DTStart></DTStart>
    <DTEnd></DTEnd>
    <Location>test location</Location>
    <UID>040000008200E00074C5B7101A82E0080000000010156B50B224D30100000000000000001000000083A43200A4E43F4E800BE12703B99BF0</UID>
    <Description>
      <Encoding>QUOTED-PRINTABLE</Encoding>
      <QP/>
      <Value>Text that will require line folding: Lorem ipsum dolor sit
amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut 
labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud 
exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. 
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum 
dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non 
proident, sunt in culpa qui officia deserunt mollit anim id est 
laborum.&#xE00D;
&#xE00D;
 &#xE00D;
&#xE00D;
&#xE00D;
</Value>
    </Description>
    <Summary>test subject</Summary>
    <Priority>3</Priority>
  </VEvent>
</VCalendar>

Notice the CRLFs at the end. Each CR is remapped to the Private Use Area (PUA) code point U+E00D and appears in the infoset as a character entity.
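
The remapping itself is a simple character substitution. XML line-end normalization would otherwise turn a literal CR into LF when the infoset is read back, so the CR is carried as U+E00D instead. A minimal Python sketch, with hypothetical function names:

```python
CR = "\r"
PUA_CR = "\uE00D"  # Private Use Area stand-in for CR


def remap_for_infoset(value: str) -> str:
    """Replace each CR with its PUA stand-in before writing the XML infoset."""
    return value.replace(CR, PUA_CR)


def restore_from_infoset(value: str) -> str:
    """Reverse the remapping when turning the infoset back into data."""
    return value.replace(PUA_CR, CR)


def as_entity(ch: str) -> str:
    """Render a single character as a hexadecimal XML character reference."""
    return "&#x{:04X};".format(ord(ch))
```

LF is left alone, which is why each remapped entity in the infoset above is followed by an ordinary line break.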

The DFDL schema for this, including the specification of the layering transform behaviors, is:

Code Block
languagexml
<xs:schema ....>

 <dfdl:format separatorPosition="infix" lengthKind="delimited" encoding="utf-8"
  occursCountKind="parsed" separator="" sequenceKind="ordered"/>

 <dfdl:defineFormat name="folded">
  <dfdl:format layerTransform="foldedLines" layerLengthKind="delimited" layerEncoding="us-ascii"/>
  <!-- delimited here means delimited by the enclosing terminating markup, as no layer terminator is defined. -->
</dfdl:defineFormat>

<dfdl:defineFormat name="qp">
  <dfdl:format layerTransform="quotedPrintable" layerLengthKind="pattern"
     layerLengthPattern="[^\n]*?(?=(?&lt;!=)\n)"/>
 
 <!-- QPs are terminated by a newline that is not preceded by an =. 
      This final newline is not consumed as part of the content. -->
  
 <!-- Alternatively, the QP transform itself can determine the length 
      by searching for this final newline (but leaving it there).
      In which case the lengthKind would be "implicit" -->
</dfdl:defineFormat>

 <xs:element name="VCalendar" dfdl:initiator="BEGIN:VCALENDAR%NL;" dfdl:terminator="END:VCALENDAR%NL; END:VCALENDAR">
  <xs:complexType>
    <xs:sequence dfdl:separator="%NL;" dfdl:sequenceKind="unordered">
      <xs:element name="ProdID" type="xs:string" dfdl:initiator="PRODID:" minOccurs="0" dfdl:formatRef="tns:folded"/>
      <xs:element name="Version" type="xs:string" dfdl:initiator="VERSION:" minOccurs="0" />
      <xs:element name="VEvent" maxOccurs="unbounded" minOccurs="0" dfdl:occursCountKind="parsed"
        dfdl:initiator="BEGIN:VEVENT%NL;" dfdl:terminator="END:VEVENT">
        <xs:complexType>
          <xs:sequence dfdl:separator="%NL;" dfdl:sequenceKind="unordered">
            <xs:element name="DTStart" type="xs:string" dfdl:initiator="DTSTART:" />
            <xs:element name="DTEnd" type="xs:string" dfdl:initiator="DTEND:" />
            <!-- 
              content from here could have long lines, so must be folded 
            -->
            <xs:sequence dfdl:ref="tns:folded">
              <xs:element name="Location" type="xs:string" dfdl:initiator="LOCATION:" minOccurs="0"/>
              <xs:element name="UID" type="xs:string" dfdl:initiator="UID:" minOccurs="0"/>
              <xs:element name="Description" dfdl:initiator="DESCRIPTION:" minOccurs="0">
                <xs:complexType>
                  <xs:sequence>              
                   <xs:element name="Encoding" type="xs:string" 
                               dfdl:initiator="ENCODING=" dfdl:terminator=":" minOccurs="0" />
                     <xs:choice dfdl:choiceDispatchKey="{ if (fn:exists(./Encoding)) then ./Encoding else '' }">
                       <!-- 
                         we inspect the value of the Encoding element and decide what branch of the choice
                         based on it 
                        -->
                       <xs:sequence dfdl:choiceBranchKey="QUOTED-PRINTABLE"
                         dfdl:separator="" dfdl:sequenceKind="unordered">
                         <!--
                          Each branch starts with a distinct dummy element to satisfy the UPA rules of XML Schema
                         -->
                         <xs:element name="QP" type="xs:string" dfdl:inputValueCalc="{ '' }" />
                         <!--
                          Notice here that the layer reference for the qp data is scoped to just this inner element.
                         -->
                         <xs:sequence dfdl:ref="tns:qp">
                           <xs:element name="Value" type="xs:string"/>
                         </xs:sequence><!-- end layer quoted printable -->
                       </xs:sequence>
                       <!-- 
                          repeat the above pattern for the choice branches for the various encodings 
                        -->
                    </xs:choice>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>           
              <xs:element name="Summary" type="xs:string"  dfdl:initiator="SUMMARY:" minOccurs="0"/>
              <xs:element name="Priority" type="xs:string" dfdl:initiator="PRIORITY:" minOccurs="0" />
            </xs:sequence><!-- end folded layer -->
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:element>
</xs:schema>

...

Here's some CSV data:

Code Block
languagetext
last,first,middle,DOB
smith,robert,brandon,1988-03-24
johnson,john,henry,1986-01-23
jones,arya,cat,1986-02-19

Here's that data gzipped, then base64 encoded.

Code Block
languagetext
H4sICBqITloAA3NpbXBsZUNTVi5jc3YALclBCoAgEIXhvWeZgbSI3Eb7zjCmoWEjjG66fQZt3g/v
y1QbnEn63sn7HGDbV1Xv1CJIcUEaOCH2hUHbZcFhRDOpq0Su/foKMbA8n844aDRjVw4VSB6Cg9ov
BrVVL2G135RuAAAA
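
The encoding step can be reproduced with the Python standard library. This sketch is only an illustration of "gzipped, then base64 encoded". Its output will not be byte-identical to the text above, because a gzip header carries details such as a timestamp and an optional file name, but it starts with the same `H4sI` prefix (the base64 form of the gzip magic bytes) and wraps at 76 characters in the same way.

```python
import base64
import gzip

CSV_TEXT = (
    "last,first,middle,DOB\n"
    "smith,robert,brandon,1988-03-24\n"
    "johnson,john,henry,1986-01-23\n"
    "jones,arya,cat,1986-02-19\n"
)


def encode(data: bytes) -> bytes:
    """Unparse direction: gzip first, then base64 the compressed bytes."""
    return base64.encodebytes(gzip.compress(data))


def decode(encoded: bytes) -> bytes:
    """Parse direction: undo base64 first, then gunzip the result."""
    return gzip.decompress(base64.decodebytes(encoded))


encoded = encode(CSV_TEXT.encode("ascii"))
```

Decoding applies the transforms in the reverse order of encoding, so base64 is the outermost layer when parsing.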

The schema that describes the CSV data without the stream transforms is this:

Code Block
languagexml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:fn="http://www.w3.org/2005/xpath-functions"
  xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/" xmlns:ex="http://example.com"
  targetNamespace="http://example.com" elementFormDefault="unqualified">

  <xs:include schemaLocation="built-in-formats.xsd" />

  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <dfdl:format ref="ex:daffodilTest1" separator="" initiator=""
        terminator="" leadingSkip='0' textTrimKind="none" initiatedContent="no"
        alignment="implicit" alignmentUnits="bits" trailingSkip="0" ignoreCase="no"
        separatorPosition="infix" occursCountKind="implicit"
        emptyValueDelimiterPolicy="both" representation="text" textNumberRep="standard"
        lengthKind="delimited" encoding="ASCII" />
    </xs:appinfo>
  </xs:annotation>

    <xs:element name="file" type="ex:fileType"/>

    <!-- broke this up to provide some reusable types and groups here -->

    <xs:complexType name="fileType">
      <xs:group ref="ex:fileTypeGroup"/>
    </xs:complexType>

    <xs:group name="fileTypeGroup">
      <xs:sequence dfdl:separator="%NL;" dfdl:separatorPosition="postfix">
        <xs:element name="header" minOccurs="0" maxOccurs="1"
          dfdl:occursCountKind="implicit">
          <xs:complexType>
            <xs:sequence dfdl:separator=",">
              <xs:element name="title" type="xs:string" maxOccurs="unbounded" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element name="record" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence dfdl:separator=",">
              <xs:element name="item" type="xs:string" maxOccurs="unbounded"
                dfdl:occursCount="{ fn:count(../../header/title) }"
                dfdl:occursCountKind="expression" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:group>

</xs:schema>
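
As a rough illustration of what this schema accepts, the following Python sketch mimics its structure: lines separated by a postfix newline, comma-separated fields, and the number of items in each record fixed by the number of titles in the header, as in the `dfdl:occursCount` expression. It assumes the optional header is present, and `parse_csv` is a hypothetical name.

```python
def parse_csv(text: str) -> dict:
    """Parse CSV text into a header and records, following the schema's structure."""
    lines = text.split("\n")
    # The line separator is postfix, so a final empty piece is not a record.
    if lines and lines[-1] == "":
        lines.pop()
    # Assumption: the optional header is present.
    titles = lines[0].split(",")
    records = []
    for line in lines[1:]:
        items = line.split(",")
        # Corresponds to occursCount="{ fn:count(../../header/title) }".
        if len(items) != len(titles):
            raise ValueError("record does not have one item per header title")
        records.append(items)
    return {"header": titles, "record": records}
```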

We can annotate this schema with additional stream transform information to enable it to describe the base64 encoded, compressed data.

One easy way to do this is by modifying the complex type definition for fileType to this:

Code Block
languagexml
<xs:complexType name="fileType">
  <xs:sequence dfdl:ref="ex:base64">
    <xs:sequence dfdl:ref="ex:gzip">
      <xs:group ref="ex:fileTypeGroup"/>
    </xs:sequence>
  </xs:sequence>
</xs:complexType>

Along with that we need the definitions of these named stream formats:

Code Block
languagexml
<dfdl:defineFormat name="base64">
   <dfdl:format layerTransform="base64" layerLengthKind="implicit" />
</dfdl:defineFormat>

<dfdl:defineFormat name="gzip">
   <dfdl:format layerTransform="gzip" layerLengthKind="implicit"/>
</dfdl:defineFormat>

...