Proposal: Pretty Print Safe XML Output

We need a tunable that affects the way we project the DFDL infoset into the textual XML representation.

The problem:

- Output XML from parsing cannot be pretty printed (for readability, diffing, etc.) without the risk of changing whitespace, and whitespace is part of the value of elements of type xs:string.
- Re-reading the XML converts CRLF → LF and isolated CR → LF. This changes the values, and even the lengths, of strings.
- Scripts and IDE tools often re-indent or pretty-print XML data, which corrupts it.
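
The line-ending conversion can be observed with any conforming XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree; the element name s and the helper name reread are made up for illustration and are not part of Daffodil.

```python
import xml.etree.ElementTree as ET


def reread(xml_text):
    """Parse a one-element document and return its text as the parser sees it."""
    return ET.fromstring(xml_text).text


# The original string value is 4 characters long: 'a', CR, LF, 'b'.
original = "a\r\nb"
seen_by_parser = reread("<s>" + original + "</s>")

# The XML parser normalizes CRLF (and an isolated CR) to a single LF,
# so both the value and its length change on re-reading.
print(repr(seen_by_parser), len(original), len(seen_by_parser))
print(repr(reread("<s>a\rb</s>")))
```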

There are several ways the Daffodil infoset can become XML text:

- via the XMLTextInfosetOutputter
- via the ScalaXMLInfosetOutputter, which is then converted to text
- via the JDOMInfosetOutputter, which is then converted to text
- via the W3CDOMInfosetOutputter, which is then converted to text

The remainder of this proposal deals only with the XMLTextInfosetOutputter. Analogous changes are needed for the other InfosetOutputters.

Naive solutions to this problem replace all whitespace characters by the corresponding character entities.

This is unacceptable: long multi-line strings become one giant line, which is no longer human friendly because such XML can no longer be manipulated with standard text editors.

Solutions must maintain, to the extent possible, human friendliness of the XML, which includes the ability to examine and change the XML using ordinary text editors.

Proposed Solution

We need a tunable to enable the new pretty-print-safe XML output.

- tunable name: xmlOutputStyle
- value: a whitespace-separated list of tokens drawn from this set:
  - "default" - current behavior. This is fine if the data is not being pretty printed, will not be re-read, or if whitespace is fungible in the actual data format.
  - "prettyPrintSafe" - preserves the XML Infoset exactly, including whitespace characters. This XML can be pretty printed without indentation changes modifying element values.
  - other values are reserved for future use.
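
As an illustration of the token-list format only, here is a minimal sketch of how such a value might be validated. The function name parse_xml_output_style and the choice to reject unknown tokens with an error are assumptions made for this example; the proposal does not say how reserved values should be handled or how tokens combine.

```python
# Tokens named in this proposal; anything else is reserved for future use.
KNOWN_TOKENS = {"default", "prettyPrintSafe"}


def parse_xml_output_style(value):
    """Split a whitespace-separated tunable value into its tokens.

    Hypothetical helper: raising on unknown tokens is an assumption,
    not behavior specified by the proposal.
    """
    tokens = value.split()  # splits on any run of whitespace
    unknown = [t for t in tokens if t not in KNOWN_TOKENS]
    if unknown:
        raise ValueError("unsupported xmlOutputStyle token(s): " + ", ".join(unknown))
    return tokens


print(parse_xml_output_style("prettyPrintSafe"))
```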

Assumptions & Limitations

We assume pretty printers must obey only a small set of constraints on how they inject whitespace for indenting or line breaking:

- Whitespace is never inserted before, after, or within a <![CDATA[ ....]]> region.
- Lines are only ever broken at existing whitespace, which implies never between character entities.

It follows that if all significant whitespace is within CDATA regions, the data can be pretty printed and the significant whitespace is unaffected.

For example, the following reformatting is not allowed, because the two forms are not equivalent:

<foo><![CDATA[some stuff]]></foo>

<!-- reformatted to --> 

<foo>
  <![CDATA[some stuff]]>
</foo>
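
That the two forms above are not equivalent can be checked with an XML parser. This is a small sketch using Python's standard xml.etree.ElementTree; the variable names are for illustration only.

```python
import xml.etree.ElementTree as ET

compact = "<foo><![CDATA[some stuff]]></foo>"
reformatted = "<foo>\n  <![CDATA[some stuff]]>\n</foo>"

# The parser merges the CDATA content with the surrounding character data,
# so whitespace added outside the CDATA section becomes part of the value.
compact_value = ET.fromstring(compact).text
reformatted_value = ET.fromstring(reformatted).text

print(repr(compact_value))
print(repr(reformatted_value))
print(compact_value == reformatted_value)
```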

Algorithm

assumes text is all XML-legal characters
- so remapping of things like NUL -> E000 and Ctrl-A -> E001 is already done.
- see: https://daffodil.apache.org/infoset/ section "XML Illegal Characters"
- see also: Daffodil source code object XMLUtils.remapXMLIllegalCharToPUA and other methods that invert this conversion.
assumes we know what is a string and what is not; for non-string values, whitespace around the value can be fungible.
- requires the infoset outputter to have access to the primitive type at the time it is outputting the string.
  - ex: <someHexBinary xsi:type="xs:hexBinary"> AF29B3 </someHexBinary> where the whitespace should/does not matter.
  - ex: <someDouble xsi:type="xs:double"> 6.847 </someDouble> again the whitespace does not matter.
  - NOTE: should verify that infoset inputters do not trip over such whitespace around non-string simple values.
  - NOTE: DAFFODIL-182 could also be addressed in this same change set by adding another token, 'addXSITypes', to xmlOutputStyle, in which case the infoset outputter would also add the xsi:type attributes to the simple elements.

algorithm steps
- for each element of simple type xs:string:
  - replace all CR with "&#xE00D;"
  - replace "]]>" by "]]&gt;"
  - replace any characters remapped into the PUA by character entities. E.g., the 0xE000 for a NUL will become '&#xE000;'
  - split the data into runs of character entities separating runs of non-character entities
  - for each split section of non-character entities:
    - surround with CDATA bracketing
  - reassemble the string by concatenating all runs

The resulting string is alternating CDATA bracketed regions and runs of 1 or more character entities. The only whitespace is within CDATA bracketed sections.
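
The steps can be sketched as follows. This is an illustrative Python sketch, not Daffodil code: the function name pretty_print_safe is hypothetical, and treating every character in U+E000..U+F8FF as a remapped character is an assumption made for the example. Following the examples in this document, CR becomes &#xE00D; and the ">" of "]]>" becomes &gt;. The sketch scans the value once instead of doing literal string replacement followed by a split, so that data which merely looks like an entity (for example a literal "&amp;") stays inside CDATA.

```python
def pretty_print_safe(value):
    """Encode one xs:string value so that whitespace appears only inside CDATA."""
    pieces = []    # finished output fragments, in order
    text_run = []  # ordinary characters waiting to be wrapped in CDATA

    def flush_text():
        # Wrap the pending run of ordinary characters in one CDATA section.
        if text_run:
            pieces.append("<![CDATA[" + "".join(text_run) + "]]>")
            text_run.clear()

    i = 0
    while i < len(value):
        ch = value[i]
        if value.startswith("]]>", i):
            # "]]" stays in the CDATA run; ">" is emitted as an entity outside it.
            text_run.append("]]")
            flush_text()
            pieces.append("&gt;")
            i += 3
        elif ch == "\r":
            # CR is written as the entity for its PUA remapping.
            flush_text()
            pieces.append("&#xE00D;")
            i += 1
        elif 0xE000 <= ord(ch) <= 0xF8FF:
            # Assumption: any BMP private-use character is a remapped character.
            flush_text()
            pieces.append("&#x%04X;" % ord(ch))
            i += 1
        else:
            text_run.append(ch)
            i += 1
    flush_text()
    return "".join(pieces)


print(pretty_print_safe("   'some' stuff   here \uE000 and ]]> even"))
```

Under these assumptions, the call above produces the same element content as the first example in the Examples section.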

Examples

This illustrates how whitespace characters appear only within CDATA brackets for simple text, and how CDATA and standard escaping work together.

data: "   'some' stuff   here &#xE000; and ]]> even"

xml: <foo><![CDATA[   'some' stuff   here ]]>&#xE000;<![CDATA[ and ]]]]>&gt;<![CDATA[ even]]></foo>

This illustrates how enumerated tokens with no whitespace still require this CDATA treatment. Keep in mind that this applies only to elements of primitive type xs:string.

data: "NO_WHITESPACE_AT_ALL"

xml: <foo><![CDATA[NO_WHITESPACE_AT_ALL]]></foo>

Even though the string contains no whitespace, the CDATA is needed because a pretty printer, for example at a deep indentation level, might otherwise create:

<foo>
  NO_WHITESPACE_AT_ALL
</foo>

which would break the string value by inserting whitespace characters into it.

Arguably, the above is annoying enough that users may need the ability to adjust this and not have CDATA around strings that contain no whitespace at all.

Note that for types other than string, it's perfectly ok for a pretty printer to do this. The following are equivalent:

<num>6.847</num>

<num>
  6.847
</num>

However, one must know that the type is numeric for this to be allowed. Our scheme for pretty-print safe format requires use of CDATA for all strings. Any non-string elements do not have CDATA bracketing and so pretty printing can change them all it wants.
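
The difference between numeric and string interpretation can be seen in a short sketch. Here Python's float() stands in for an xs:double reader; that substitution, and the variable names, are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

plain_text = ET.fromstring("<num>6.847</num>").text
indented_text = ET.fromstring("<num>\n  6.847\n</num>").text

# As strings, the two element values differ ...
strings_equal = plain_text == indented_text

# ... but a numeric reader ignores the surrounding whitespace
# (float() tolerates leading and trailing whitespace).
numbers_equal = float(plain_text) == float(indented_text)

print(strings_equal, numbers_equal)
```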

This next example shows an ordinary case. Just a string containing ordinary interior whitespace.

data: "this contains interior spaces"

xml: <foo><![CDATA[this contains interior spaces]]></foo>

This example shows a string with a CRLF embedded in it. The CRLF would immediately follow the letters "CRLF".

data: "this contains a CRLF
line ending"

xml: <foo><![CDATA[this contains a CRLF]]>&#xE00D;<![CDATA[
line ending]]></foo>

Note that if this XML is saved to disk on a typical CRLF-oriented system such as MS-Windows, there will be a CRLF in the file before the "line ending]]></foo>" line.

On re-reading this, an XML loader will convert this CRLF into a single LF.

If, however, the XML is read by other non-XML-reader software, it will contain a full CRLF.

So we're depending on this XML being read by an XML-aware string-reader having the standard behavior of converting CRLF to LF, and isolated CR to LF, interpreting CDATA bracketing, and interpreting XML entities syntax.
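
This dependency can be illustrated with a small round-trip sketch. It uses Python's standard xml.etree.ElementTree as the XML-aware reader; pua_to_original is a hypothetical stand-in for the inverse PUA remapping that Daffodil performs, and handles only the CR case needed here.

```python
import xml.etree.ElementTree as ET


def pua_to_original(text):
    # Hypothetical stand-in for the inverse PUA remapping; CR only.
    return text.replace("\uE00D", "\r")


# The CRLF example as it would sit on disk on a CRLF-oriented system:
# the line break inside the second CDATA section is stored as CR LF.
on_disk = ("<foo><![CDATA[this contains a CRLF]]>&#xE00D;"
           "<![CDATA[\r\nline ending]]></foo>")

# An XML-aware reader turns the on-disk CRLF into LF, interprets the CDATA
# bracketing, and expands the entity to U+E00D, which is then mapped back to CR.
recovered = pua_to_original(ET.fromstring(on_disk).text)

print(repr(recovered))
```

A reader that is not XML-aware would instead see the raw CR LF pair plus the unexpanded entity text, and would not recover the original value.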
