Proposal: Pretty Print Safe XML Output

We need a tunable that affects the way we project the DFDL infoset into the textual XML representation.

The problem:

Output XML from parsing cannot be pretty printed (for readability, differencing, etc.) without changing whitespace, and that whitespace is part of the value of elements of type xs:string.
Re-reading the XML converts CRLF → LF and isolated CR → LF. This changes the values, and even the lengths, of strings.
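
The line-ending problem can be reproduced with any conforming XML reader. The sketch below uses Python's standard xml.etree.ElementTree purely as a stand-in reader (it is not part of Daffodil); the helper name reread is made up for this illustration.

```python
import xml.etree.ElementTree as ET


def reread(xml_text: str) -> str:
    """Parse a one-element document and return that element's text value."""
    return ET.fromstring(xml_text).text


# A string value holding one CRLF and one isolated CR.
original = "line one\r\nline two\rline three"

# Written out naively, the CR characters sit raw in the XML text.
reread_value = reread("<foo>" + original + "</foo>")

# XML end-of-line handling turns CRLF and lone CR into LF on input,
# so the value read back differs from the value written.
print(repr(original))
print(repr(reread_value))

# A CR written as a numeric character reference is not normalized.
print(repr(reread("<foo>a&#xD;\nb</foo>")))
```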

There are several ways the Daffodil infoset can become XML text:

via the XMLTextInfosetOutputter
via the ScalaXMLInfosetOutputter which is then converted to text
via the JDOMInfosetOutputter which is then converted to text
via the W3CDOMInfosetOutputter which is then converted to text

The proposal below covers only the XMLTextInfosetOutputter.

Proposed Solution

We need a tunable to enable the new pretty-print-safe XML output:

- name: xmlTextInfosetOutputterTextStyle
- values:
  - "default" - current behavior. This is fine if the data is not being pretty printed, will not be re-read, or if whitespace is fungible in the actual data format.
  - "prettyPrintSafe" - preserves the XML Infoset exactly, including whitespace characters. This XML can be pretty printed without indentation changes modifying element values.

Algorithm

assumes text is all XML-legal characters
- so remapping of things like NUL -> E000 and Ctrl-A -> E001 is already done.
assumes we know what is a string and what is something else where whitespace around the value can be fungible.
- may require xsi:type="xs:string" to recognize strings (DAFFODIL-182), or at least for the infoset outputter to have access to the type.
  - ex: <someHexBinary xsi:type="xs:hexBinary"> AF29B3 </someHexBinary> where the whitespace should/does not matter.
  - ex: <someDouble xsi:type="xs:double"> 6.847 </someDouble> again the whitespace does not matter.
  - NOTE: verify that infoset inputters do not trip over such whitespace around non-string simple values.
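
To illustrate the second assumption: in XML Schema the whiteSpace facet is "preserve" for xs:string but "collapse" for xs:double and xs:hexBinary, so only strings need protection. The sketch below shows how a reader that knows the type might treat surrounding whitespace. The function simple_value is a hypothetical helper written for this page, not a Daffodil infoset inputter API.

```python
XML_WHITESPACE = " \t\r\n"


def simple_value(text: str, xsi_type: str):
    """Convert element text to a value, given its xsi:type (hypothetical helper)."""
    if xsi_type == "xs:string":
        # Whitespace is part of the value and must be kept as-is.
        return text
    # For the non-string types handled here, surrounding whitespace is fungible.
    trimmed = text.strip(XML_WHITESPACE)
    if xsi_type == "xs:double":
        return float(trimmed)
    if xsi_type == "xs:hexBinary":
        return bytes.fromhex(trimmed)
    raise ValueError("type not handled in this sketch: " + xsi_type)


print(simple_value(" 6.847 ", "xs:double"))
print(simple_value(" AF29B3 ", "xs:hexBinary"))
print(repr(simple_value(" keep me ", "xs:string")))
```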

algorithm steps
- for each element of type simple string
- replace all CR with "&#xD;"
- replace "]]>" by "]]&gt;"
- maybe: replace any characters remapped into the PUA by numeric character entities, e.g., the 0xE000 for a NUL will become "&#xE000;"
- maybe/option: replace any character with Unicode code point > N with a numeric character entity (N = 255, or maybe N = 127)
  - this would be single-byte charset mode (N = 255), or ASCII-only mode (N = 127).
- split the data at each sequence of 1 or more XML entities (e.g., &amp; or &quot; or &#x7D;), keeping track of the sequence of entities at each split.
- for each split section
  - if it contains any whitespace, surround with CDATA bracketing.
- reassemble the string, concatenating all segments with the splitting sequences of entities between them.
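
The steps above can be sketched in a few lines. This is only an illustration under stated assumptions, not the XMLTextInfosetOutputter implementation: the names pretty_print_safe, round_trips and max_plain are made up here, "&" and "<" are assumed to get standard escaping before the other steps, quotes are left unescaped (element content does not require it), the PUA step is always on, and max_plain (the N above) is assumed to be at least 127.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional

# A run of one or more entity/character references. The outer group is
# capturing so that re.split() returns the runs between the text pieces.
_ENTITY_RUN = re.compile(r"((?:&(?:#x[0-9A-F]+|[a-z]+);)+)")

XML_WHITESPACE = " \t\n\r"


def _to_entities(value: str, max_plain: Optional[int]) -> str:
    # Assumption: '&' and '<' are escaped first, so that every '&' left
    # in the string afterwards is the start of an entity.
    s = value.replace("&", "&amp;").replace("<", "&lt;")
    s = s.replace("\r", "&#xD;")
    s = s.replace("]]>", "]]&gt;")

    def needs_ref(cp: int) -> bool:
        in_pua = 0xE000 <= cp <= 0xF8FF
        above_limit = max_plain is not None and cp > max_plain
        return in_pua or above_limit

    return "".join("&#x%X;" % ord(c) if needs_ref(ord(c)) else c for c in s)


def pretty_print_safe(value: str, max_plain: Optional[int] = None) -> str:
    """Return element content in which whitespace occurs only inside CDATA."""
    parts = _ENTITY_RUN.split(_to_entities(value, max_plain))
    out = []
    for i, part in enumerate(parts):
        is_text = i % 2 == 0  # even indexes are text, odd are entity runs
        if is_text and any(c in XML_WHITESPACE for c in part):
            out.append("<![CDATA[" + part + "]]>")
        else:
            out.append(part)
    return "".join(out)


def round_trips(value: str) -> bool:
    """Check that a standard XML reader recovers a non-empty value exactly."""
    xml = "<foo>" + pretty_print_safe(value) + "</foo>"
    return ET.fromstring(xml).text == value
```

For instance, pretty_print_safe("this contains interior spaces") yields a single CDATA section, while a token with no whitespace passes through unchanged. Escaping "&" and "<" up front is a design choice of this sketch: it guarantees the segments between entities contain nothing that needs escaping, so they are valid both bare and inside CDATA.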

Examples

This illustrates how whitespace characters appear only within CDATA brackets for simple text, and how CDATA and standard escaping work together.

data: "   'some' stuff   here &#xE000; and ]]> even"

xml: <foo><![CDATA[   'some' stuff   here ]]>&#xE000;<![CDATA[ and ]]]]>&gt;<![CDATA[ even]]></foo>

This illustrates how enumerated tokens with no whitespace don't require this CDATA treatment.

data: "NO_WHITESPACE_AT_ALL"

xml: <foo>NO_WHITESPACE_AT_ALL</foo>

TBD: is this sufficient? May need CDATA brackets here too to avoid deep indents from wrapping this like:

<foo>
  NO_WHITESPACE_AT_ALL
</foo>

This example shows an ordinary string containing ordinary interior whitespace.

data: "this contains interior spaces"

xml: <foo><![CDATA[this contains interior spaces]]></foo>

This example shows a string with a CRLF embedded in it. The CRLF would immediately follow the letters "CRLF".

data: "this contains a CRLF
line ending"

xml: <foo><![CDATA[this contains a CRLF]]>&#xD;<![CDATA[
line ending]]></foo>

Note that if this XML is saved to disk on a typical CRLF-oriented system such as MS-Windows, there will be a CRLF in the file before the "line ending]]></foo>" line.

On re-reading this, an XML loader will convert this CRLF into a single LF. If the XML is, however, read by other non-XML-reader software, it will contain a full CRLF. So we are depending on this XML being read by an XML-aware string reader that has the standard behavior of converting CRLF to LF and CR to LF, interpreting CDATA bracketing, and interpreting XML entity syntax.
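
The re-reading behavior for the CRLF example can be checked with a small sketch. It uses Python's xml.etree.ElementTree as a stand-in for "an XML-aware reader" (an assumption of this illustration, not something Daffodil uses) and simulates saving on a CRLF-oriented system by rewriting the line endings of the XML text.

```python
import xml.etree.ElementTree as ET

# The infoset value: text with an embedded CRLF.
value = "this contains a CRLF\r\nline ending"

# The proposed XML form as written with LF line endings.
xml_lf = ("<foo><![CDATA[this contains a CRLF]]>&#xD;"
          "<![CDATA[\nline ending]]></foo>")

# The same document after being saved with CRLF line endings.
xml_crlf = xml_lf.replace("\n", "\r\n")

# An XML reader normalizes the raw CRLF inside the CDATA section to LF
# and turns &#xD; back into CR, so both files yield the original value.
value_from_lf = ET.fromstring(xml_lf).text
value_from_crlf = ET.fromstring(xml_crlf).text
print(value_from_lf == value, value_from_crlf == value)

# Re-indenting around the element leaves the element's value alone.
indented = "<root>\n        " + xml_lf + "\n</root>"
value_from_indented = ET.fromstring(indented).find("foo").text
print(value_from_indented == value)
```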
