Proposal: Pretty Print Safe XML Output

We need a tunable that affects the way we project the DFDL infoset into the textual XML representation.

The problem:

- Output XML from parsing cannot be pretty printed (for readability, differencing, etc.) without changing whitespace, and whitespace is part of the value of elements of type xs:string.
- Re-reading the XML converts CRLF → LF and isolated CR → LF. This changes the values, and even the lengths, of strings.
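
The second point is easy to reproduce with any conforming XML parser. The following is a minimal Python sketch that uses the standard library's xml.etree.ElementTree rather than Daffodil itself; the element name foo and the sample string are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A string value containing one CRLF and one isolated CR.
original = "line one\r\nline two\rline three"

# Written into XML with no special handling of the line endings.
xml_text = "<foo>" + original + "</foo>"

# Re-reading normalizes both line endings to a single LF.
reread = ET.fromstring(xml_text).text

print(repr(reread))
print(len(original), len(reread))  # the CRLF shrank, so the lengths differ
```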

There are several ways the Daffodil infoset can become XML text:

- via the XMLTextInfosetOutputter
- via the ScalaXMLInfosetOutputter, which is then converted to text
- via the JDOMInfosetOutputter, which is then converted to text
- via the W3CDOMInfosetOutputter, which is then converted to text

The below deals only with the XMLTextInfosetOutputter. Analogous changes are needed for the other InfosetOutputters.

Proposed Solution

We need a tunable to enable the new pretty-print-safe XML output:

- name: xmlOutputStyle
- values:
  - "default": current behavior. This is OK if the data is not being pretty printed, or will not be re-read, or if whitespace is fungible in the actual data format.
  - "prettyPrintSafe": preserves the XML Infoset exactly, including whitespace characters. This XML can be pretty printed without indentation changes modifying element values.

Algorithm

The algorithm assumes the text consists entirely of XML-legal characters.
- So remapping of things like NUL -> E000 and Ctrl-A -> E001 is already done.
The algorithm also assumes we know which elements are strings and which are of some other type where whitespace around the value is fungible.
- This may require xsi:type="xs:string" to recognize strings (DAFFODIL-182), or at least require the infoset outputter to have access to the type.
  - ex: <someHexBinary xsi:type="xs:hexBinary"> AF29B3 </someHexBinary> where the whitespace does not, and should not, matter.
  - ex: <someDouble xsi:type="xs:double"> 6.847 </someDouble> where, again, the whitespace does not matter.
  - NOTE: verify that infoset inputters do not trip over such whitespace around non-string simple values.

Algorithm steps:
- For each element of simple type xs:string:
  - Replace all CR with "&#xE00D;".
  - Replace "]]>" with "]]&gt;".
  - Replace any characters remapped into the PUA with character entities. E.g., the 0xE000 for a NUL becomes "&#xE000;".
  - Split the data at each sequence of one or more XML entities (e.g., &amp; or &quot; or &#x7D;), keeping track of the sequence of entities at each split.
  - For each split section:
    - Surround it with CDATA bracketing.
  - Reassemble the string by concatenating all sections with the splitting sequences of entities between them.

The resulting string is alternating CDATA bracketed regions and character entities. The only whitespace is within CDATA bracketed sections.
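
The steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the XMLTextInfosetOutputter implementation: the names pretty_print_safe and _reference are hypothetical, the PUA is assumed to be U+E000 through U+F8FF, and the input is assumed to be the already-remapped string value. The sketch also makes one design choice the proposal does not state: it emits the CDATA sections and references in a single pass over the value, instead of doing textual replacement and then splitting, so that text in the data that merely looks like an entity (a literal "&amp;", say) stays inside CDATA and is never mistaken for one.

```python
import re

# Characters that must leave the CDATA sections and be written as references:
#   - CR, which an XML reader would otherwise normalize to LF
#   - any character in the Private Use Area (assumed: U+E000..U+F8FF)
#   - the ">" that completes a literal "]]>", which cannot appear inside CDATA
_SPECIAL = re.compile(r"\r|[\uE000-\uF8FF]|(?<=\]\])>")


def _reference(ch: str) -> str:
    """Return the entity or character reference for one special character."""
    if ch == "\r":
        return "&#xE00D;"
    if ch == ">":
        return "&gt;"
    return f"&#x{ord(ch):X};"


def pretty_print_safe(value: str) -> str:
    """Return element content as alternating CDATA sections and references."""
    out = []
    pos = 0
    for m in _SPECIAL.finditer(value):
        if m.start() > pos:
            # Literal run before the special character goes into CDATA.
            out.append("<![CDATA[" + value[pos:m.start()] + "]]>")
        out.append(_reference(m.group()))
        pos = m.end()
    if pos < len(value):
        # Trailing literal run, if any.
        out.append("<![CDATA[" + value[pos:] + "]]>")
    return "".join(out)


print(pretty_print_safe("NO_WHITESPACE_AT_ALL"))
print(pretty_print_safe("   'some' stuff   here \uE000 and ]]> even"))
```

Empty literal runs are skipped, so two adjacent references are written back to back with no empty CDATA section between them.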

Examples

This illustrates how whitespace characters appear only within CDATA brackets for simple text, and how CDATA and standard escaping work together.

data: "   'some' stuff   here &#xE000; and ]]> even"

xml: <foo><![CDATA[   'some' stuff   here ]]>&#xE000;<![CDATA[ and ]]]]>&gt;<![CDATA[ even]]></foo>

This illustrates how enumerated tokens with no whitespace still require this CDATA treatment. Keep in mind that this applies only to elements of primitive type xs:string.

data: "NO_WHITESPACE_AT_ALL"

xml: <foo><![CDATA[NO_WHITESPACE_AT_ALL]]></foo>

Even though the string contained no whitespace, the CDATA is needed because a pretty printer on deep indent might otherwise create:

<foo>
  NO_WHITESPACE_AT_ALL
</foo>
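
To make the contrast concrete, the sketch below reads both indented forms back. It rests on an assumption about the reading side, which this proposal does not spell out: the reader keeps whitespace only when it arrives inside a CDATA section and discards whitespace found outside one, since the output algorithm never writes whitespace outside CDATA. It uses Python's xml.parsers.expat and xml.etree.ElementTree; read_string_value is a hypothetical helper, not a Daffodil API.

```python
import xml.etree.ElementTree as ET
import xml.parsers.expat


def read_string_value(xml_text: str) -> str:
    """Return the root element's text, keeping whitespace only inside CDATA."""
    parts = []
    in_cdata = False

    def start_cdata():
        nonlocal in_cdata
        in_cdata = True

    def end_cdata():
        nonlocal in_cdata
        in_cdata = False

    def char_data(data):
        if in_cdata:
            parts.append(data)
        else:
            # Outside CDATA, whitespace can only have come from a pretty
            # printer, so drop it and keep everything else (references).
            parts.append("".join(ch for ch in data if ch not in " \t\r\n"))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartCdataSectionHandler = start_cdata
    parser.EndCdataSectionHandler = end_cdata
    parser.CharacterDataHandler = char_data
    parser.Parse(xml_text, True)
    return "".join(parts)


# Without CDATA, an ordinary reader sees the indentation as part of the value.
indented_plain = "<foo>\n  NO_WHITESPACE_AT_ALL\n</foo>"
plain_value = ET.fromstring(indented_plain).text

# With CDATA, the indentation sits outside the brackets and can be dropped.
indented_cdata = "<foo>\n  <![CDATA[NO_WHITESPACE_AT_ALL]]>\n</foo>"
cdata_value = read_string_value(indented_cdata)

print(repr(plain_value))
print(repr(cdata_value))
```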

This next example shows an ordinary case: just a string containing interior whitespace.

data: "this contains interior spaces"

xml: <foo><![CDATA[this contains interior spaces]]></foo>

This example shows a string with a CRLF embedded in it. The CRLF would immediately follow the letters "CRLF".

data: "this contains a CRLF
line ending"

xml: <foo><![CDATA[this contains a CRLF]]>&#xE00D;<![CDATA[
line ending]]></foo>

Note that if this XML is saved to disk on a typical CRLF-oriented system such as MS-Windows, there will be a CRLF in the file before the "line ending]]></foo>" line.

On re-reading this, an XML loader will convert this CRLF into a single LF.

If, however, the XML is read by other non-XML-reader software, it will contain a full CRLF. So we are depending on this XML being read by an XML-aware string reader that has the standard behavior of converting CRLF to LF and isolated CR to LF, interpreting CDATA bracketing, and interpreting XML entity syntax.
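
The round trip described above can be checked with a conforming parser. This sketch uses Python's xml.etree.ElementTree as a stand-in for the XML loader. The final step, mapping U+E00D back to CR, is assumed to be what a Daffodil infoset inputter would do when it reverses the PUA remapping; here it is a plain string replacement.

```python
import xml.etree.ElementTree as ET

# The example as stored on a CRLF-oriented system: the line break inside
# the second CDATA section is a CRLF on disk.
on_disk = (
    "<foo><![CDATA[this contains a CRLF]]>&#xE00D;<![CDATA[\r\n"
    "line ending]]></foo>"
)

# The XML loader turns the CRLF into LF and decodes the character reference.
loaded = ET.fromstring(on_disk).text

# Assumed inputter step: undo the CR -> U+E00D remapping.
restored = loaded.replace("\uE00D", "\r")

print(repr(loaded))
print(repr(restored))
```

The CR survives because it travels as the character reference, while the LF survives because CRLF-to-LF normalization leaves exactly one LF behind.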
