...

  1. Output XML from parsing cannot be pretty printed for readability, diffing, etc. without risk of changing whitespace, which affects the values of elements of type xs:string.
  2. Re-reading the XML converts CRLF → LF and isolated CR → LF. This changes the values, and even the lengths, of strings.
  3. Scripts or IDE tools will often re-indent or pretty-print XML data, which corrupts it.
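
Problem 2 can be reproduced with any conforming XML reader. The snippet below is a minimal sketch that uses Python's standard xml.etree.ElementTree as a stand-in reader (it is not Daffodil code); the element name and sample string are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A string value containing one CRLF and one isolated CR.
original = "line1\r\nline2\rline3"

# Write the value as-is into element content, then read it back.
xml_text = f"<s>{original}</s>"
reread = ET.fromstring(xml_text).text

# The reader normalizes both line endings to LF, so the value
# and its length differ from what was written.
print(repr(original), len(original))
print(repr(reread), len(reread))
```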

There are several ways the Daffodil infoset can become XML text.

...

The discussion below deals only with the XMLTextInfosetOutputter. Analogous changes are needed for the other InfosetOutputters.

Naive solutions to this problem replace all whitespace characters with the corresponding character entities.

This is unacceptable: long multi-line strings become one giant line, which is not human friendly because such XML can no longer be manipulated with standard text editors.

Solutions must maintain, to the extent possible, human friendliness of the XML, which includes the ability to examine and change the XML using ordinary text editors. 

Proposed Solution

We need a tunable to enable the new pretty-print-safe XML output:

    • tunable name: xmlOutputStyle
    • value is a whitespace-separated list of tokens drawn from this set:
      • "default" - current behavior. This is fine if the data is not being pretty printed, will not be re-read, or if whitespace is fungible in the actual data format.
      • "prettyPrintSafe" - preserves the XML Infoset exactly, including whitespace characters. This XML can be pretty printed without indentation changes modifying element values.
      • other values are reserved for future use.
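
To make the token-list form of the value concrete, here is a minimal sketch of how such a tunable value might be parsed and validated. The function name parse_xml_output_style is hypothetical, and rejecting unknown tokens is an assumption about how "reserved for future use" would be enforced; whether "default" and "prettyPrintSafe" may be combined is not decided here.

```python
# Tokens defined by the proposal; anything else is reserved.
KNOWN_TOKENS = {"default", "prettyPrintSafe"}

def parse_xml_output_style(value):
    """Split a whitespace-separated tunable value into a set of tokens.

    Hypothetical helper: raises ValueError for tokens that are not
    (yet) defined, which is an assumed policy, not a decided one.
    """
    tokens = value.split()  # splits on any run of whitespace
    unknown = [t for t in tokens if t not in KNOWN_TOKENS]
    if unknown:
        raise ValueError(f"unknown xmlOutputStyle token(s): {unknown}")
    return set(tokens)

print(parse_xml_output_style("prettyPrintSafe"))
```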

We assume pretty printers must obey only a small set of constraints on how they inject whitespace for indenting or line breaking:

  1. Whitespace is never inserted before, after, or within a <![CDATA[ ....]]> region
  2. Lines are only ever broken at existing whitespace, which implies never between character entities. 

It follows that if all significant whitespace is within CDATA regions, the data can be pretty printed without affecting the significant whitespace.

Algorithm

  • assumes text is all XML-legal characters
    • so remapping of things like NUL -> E000 and Ctrl-A -> E001 is already done.
    • see: https://daffodil.apache.org/infoset/ section "XML Illegal Characters"
    • see also: Daffodil source code object XMLUtils.remapXMLIllegalCharToPUA and other methods that invert this conversion.
  • assumes we know which values are strings and which are not; for non-strings, whitespace around the value is fungible.
    • may require xsi:type="xs:string" to recognize strings (DAFFODIL-182), or at least requires the infoset outputter to have access to the primitive type at the time it is outputting the string
      • ex: <someHexBinary xsi:type="xs:hexBinary">  AF29B3 </someHexBinary> where the whitespace should/does not matter.
      • ex: <someDouble xsi:type="xs:double">    6.847   </someDouble> again the whitespace does not matter.
      • NOTE: should verify that infoset inputters do not trip over such whitespace around non-string simple values
      • NOTE: consider whether DAFFODIL-182 could also be addressed in this same change set by adding another token, 'addXSITypes', to xmlOutputStyle, in which case the infoset outputter would also add the xsi:type attributes to the simple elements
  • algorithm steps
    • for each element of simple type string:
      • replace all CR with "&#xE00D;"
      • replace "]]>" by "]]&gt;"
      • replace any characters remapped into the PUA by character entities, e.g., the 0xE000 for a NUL becomes "&#xE000;"
      • split the data at each sequence of one or more XML entities (e.g., &amp; or &quot; or &#x7d;), keeping track of the entity sequence at each split. This yields runs of entities separating runs of non-entity text.
      • for each run of non-entity text:
        • surround with CDATA bracketing
      • reassemble the string by concatenating all segments with the splitting sequences of entities between them.

The resulting string alternates between CDATA-bracketed regions and runs of one or more character entities. The only whitespace is within CDATA-bracketed sections.
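
The algorithm steps can be sketched in a few lines. This is an illustrative Python sketch, not the Daffodil implementation. It assumes the input text has already had & and < escaped to entities (which is why entities can be found by pattern), treats the whole BMP Private Use Area U+E000..U+F8FF as "remapped" characters, and uses a simplified entity pattern. The name pretty_print_safe is hypothetical.

```python
import re

# Simplified pattern for one entity or character reference,
# e.g. &amp; &quot; &#125; &#x7d; (assumption: ASCII entity names only).
ENTITY = r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);"
# One capturing group, so re.split keeps each run of entities.
ENTITY_RUN = re.compile(f"((?:{ENTITY})+)")
# Assumption: any BMP Private Use Area character is a remapped one.
PUA = re.compile("[\ue000-\uf8ff]")

def pretty_print_safe(escaped_text):
    """Return element content where all whitespace sits inside CDATA."""
    s = escaped_text.replace("\r", "&#xE00D;")   # CR -> entity for U+E00D
    s = s.replace("]]>", "]]&gt;")               # cannot appear inside CDATA
    s = PUA.sub(lambda m: f"&#x{ord(m.group()):04X};", s)
    out = []
    # re.split yields: text, entity-run, text, entity-run, ..., text
    for i, part in enumerate(ENTITY_RUN.split(s)):
        if i % 2 == 1:
            out.append(part)                     # run of entities, kept as-is
        elif part:
            out.append(f"<![CDATA[{part}]]>")    # non-entity text
    return "".join(out)

print(pretty_print_safe("a b\r\nc &amp; d"))
```

For the sample input, the CR becomes an entity between two CDATA regions, and the LF and spaces end up inside CDATA, where the assumed pretty-printer constraints keep them untouched.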

...

Code Block
<foo>
  NO_WHITESPACE_AT_ALL
</foo>

which would break the string value by inserting whitespace characters into it. 

Arguably, the above is annoying enough that users may need the ability to adjust this and not have CDATA around strings that contain no whitespace at all. 

Note that for types other than string, it's perfectly ok for a pretty printer to do this. The following are equivalent:

Code Block
<num>6.847</num>

<num>
  6.847
</num>

However, one must know that the type is numeric for this to be allowed. Our scheme for pretty-print safe format requires use of CDATA for all strings. Any non-string elements do not have CDATA bracketing and so pretty printing can change them all it wants. 
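
The difference between the two cases can be seen by reading both forms back. In this minimal sketch, Python's xml.etree.ElementTree stands in for an XML reader and float() stands in for an xs:double parser that ignores surrounding whitespace; both are illustrative substitutes, not Daffodil components.

```python
import xml.etree.ElementTree as ET

compact = ET.fromstring("<num>6.847</num>").text
indented = ET.fromstring("<num>\n  6.847\n</num>").text

# As strings, the two element values differ: the indented form
# carries the whitespace the pretty printer injected.
print(repr(compact), repr(indented))

# As numbers, they are equal, because surrounding whitespace is
# ignored when the text is interpreted as a double.
print(float(compact) == float(indented))
```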

This next example shows an ordinary case. Just a string containing ordinary interior whitespace. 

...

If, however, the XML is read by other, non-XML-reader software, it will contain a full CRLF.

So we're depending on this XML being read by an XML-aware string reader that has the standard behavior of converting CRLF to LF and isolated CR to LF, interpreting CDATA bracketing, and interpreting XML entity syntax.
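
The reading side can be sketched the same way. The snippet below assumes Python's xml.etree.ElementTree as the XML-aware reader and uses a hypothetical helper, read_string_value, that inverts only the CR remapping (U+E00D back to CR); the real inverse mapping covers more characters than this.

```python
import xml.etree.ElementTree as ET

def read_string_value(xml_text):
    """Read a string element and map U+E00D back to CR.

    Hypothetical helper: only the CR case of the PUA remapping
    is inverted here, as a simplification.
    """
    # The reader merges CDATA regions and entities into one string.
    text = ET.fromstring(xml_text).text or ""
    return text.replace("\ue00d", "\r")

# Element content in the alternating CDATA / entity form.
encoded = "<s><![CDATA[a b]]>&#xE00D;<![CDATA[\nc ]]>&amp;<![CDATA[ d]]></s>"
print(repr(read_string_value(encoded)))
```

Software that reads the same text without these XML rules sees the CDATA brackets and entity text literally, so it does not recover the original string value.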

...