...

The discussion below deals only with the XMLTextInfosetOutputter. Analogous changes are needed for the other InfosetOutputters.

Proposed Solution

We need a tunable to enable the new pretty-print-safe XML output:

    • name xmlOutputStyle
    • values are
      • "default" - current behavior. This is fine if the data is not being pretty printed, will not be re-read, or if whitespace is fungible in the actual data format.
      • "prettyPrintSafe" - preserves the XML Infoset exactly, including whitespace characters. This XML can be pretty printed without indentation changes modifying element values.

...

  • algorithm steps
    • for each element of type simple string
    • replace all CR with "&#xE00D;"
    • replace "]]>" by "]]&gt;"
    • maybe/option: replace any character with Unicode code point > N by a numeric character entity (N = 255 for a single-byte charset mode, or N = 127 for an ASCII-only mode).
    • split the data at each sequence of 1 or more XML entities (e.g., &amp; or &quot; or a numeric character reference such as &#x7D;), keeping track of the sequence of entities at each split.
    • for each split section
      • if it contains any whitespace, surround with CDATA bracketing.
    • reassemble the string, concatenating all segments with the splitting sequences of entities between them.

The resulting string is alternating CDATA bracketed regions and character entities. The only whitespace is within CDATA bracketed sections. 
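
The steps above can be sketched in Python roughly as follows. This is a minimal sketch, not the Daffodil implementation. The function name pretty_print_safe, the wrap_all and max_code_point parameters, and the initial escaping of "&" and "<" are assumptions made for illustration. It also assumes CR is written as the reference "&#xE00D;". With wrap_all=True every non-empty text segment is CDATA bracketed (as in the NO_WHITESPACE_AT_ALL example). With wrap_all=False only segments containing whitespace are bracketed, as the step list states.

```python
import re
from typing import Optional

# One or more consecutive XML entity or character references, e.g. "&amp;&#xE00D;".
_ENTITY_RUN = re.compile(r"((?:&(?:#x[0-9A-Fa-f]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);)+)")
_XML_WHITESPACE = " \t\n\r"


def pretty_print_safe(text: str, wrap_all: bool = True,
                      max_code_point: Optional[int] = None) -> str:
    """Hypothetical helper: escape one xs:string value for pretty-print-safe output."""
    # Assumed prerequisite: standard escaping of "&" and "<" ("&" must go first).
    s = text.replace("&", "&amp;").replace("<", "&lt;")
    # "]]>" must not appear literally, inside or outside CDATA.
    s = s.replace("]]>", "]]&gt;")
    # Assumption: CR is written as a reference to its PUA remapping.
    s = s.replace("\r", "&#xE00D;")
    # Optional step: numeric character entities for code points above N.
    if max_code_point is not None:
        s = "".join(c if ord(c) <= max_code_point else "&#x%X;" % ord(c) for c in s)

    # Split at runs of entities. Because the pattern has one capturing group,
    # even indices hold text segments and odd indices hold the entity runs.
    parts = _ENTITY_RUN.split(s)
    out = []
    for i, part in enumerate(parts):
        is_text = i % 2 == 0
        needs_cdata = wrap_all or any(c in _XML_WHITESPACE for c in part)
        if is_text and part and needs_cdata:
            out.append("<![CDATA[" + part + "]]>")
        else:
            out.append(part)
    return "".join(out)


print(pretty_print_safe("this contains a CRLF\r\nline ending"))
```

For the CRLF input this prints a CDATA section, then "&#xE00D;", then a second CDATA section that starts with the LF, so all whitespace ends up inside CDATA brackets.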

Examples

This illustrates how whitespace characters appear only within CDATA brackets for simple text, and how CDATA and standard escaping work together.

...

This illustrates how enumerated tokens with no whitespace still require this CDATA treatment. Keep in mind that this applies only to elements of primitive type xs:string.

Code Block
data: "NO_WHITESPACE_AT_ALL"

xml: <foo><![CDATA[NO_WHITESPACE_AT_ALL]]></foo>

Even though the string contains no whitespace, the CDATA is needed because a pretty printer at a deep indent might otherwise create:

Code Block
<foo>
  NO_WHITESPACE_AT_ALL
</foo>
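
To make the hazard concrete, the following minimal sketch uses Python's standard xml.etree.ElementTree parser to compare the element value before and after such a re-indent. The helper name element_value is hypothetical. How a reader uses the CDATA brackets to discard the added whitespace is outside this sketch.

```python
import xml.etree.ElementTree as ET


def element_value(xml_text: str) -> str:
    """Return the root element's text as a standard XML parser reports it."""
    return ET.fromstring(xml_text).text or ""


compact = "<foo>NO_WHITESPACE_AT_ALL</foo>"
reindented = "<foo>\n  NO_WHITESPACE_AT_ALL\n</foo>"

print(repr(element_value(compact)))     # 'NO_WHITESPACE_AT_ALL'
print(repr(element_value(reindented)))  # '\n  NO_WHITESPACE_AT_ALL\n'
```

The re-indented form parses to a different string, so without further markers the original value can no longer be told apart from the pretty printer's whitespace.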

The next example shows an ordinary case: a string containing ordinary interior whitespace.

...

Code Block
data: "this contains a CRLF
line ending"

xml: <foo><![CDATA[this contains a CRLF]]>&#xE00D;<![CDATA[
line ending]]></foo>

...

On re-reading this, an XML loader will convert this CRLF into a single LF.

If, however, the XML is read by other non-XML-reader software, it will contain a full CRLF. So we depend on this XML being read by an XML-aware string reader with the standard behavior: converting CRLF to LF and isolated CR to LF, interpreting CDATA bracketing, and interpreting XML entity syntax.
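
The standard reader behavior relied on here can be checked with a short sketch using Python's xml.etree.ElementTree. The helper name read_value is hypothetical. The final step, mapping U+E00D back to CR, is an assumption about what the consuming software does after parsing; the XML parser itself does not do it.

```python
import xml.etree.ElementTree as ET


def read_value(xml_text: str) -> str:
    """Return the root element's text as a standard XML parser reports it."""
    return ET.fromstring(xml_text).text or ""


# Literal CRLF and isolated CR are both normalized to LF by the parser.
normalized = read_value("<foo>a\r\nb\rc</foo>")
print(repr(normalized))

# CDATA brackets and character references are interpreted and the pieces
# are concatenated into one string value.
doc = "<foo><![CDATA[this contains a CRLF]]>&#xE00D;<![CDATA[\nline ending]]></foo>"
value = read_value(doc)

# Assumed post-processing step: map the PUA character U+E00D back to CR.
restored = value.replace("\ue00d", "\r")
print(repr(restored))
```

Because the CR travels as a character reference rather than a literal character, line-ending normalization cannot collapse it, and the original CRLF can be recovered after parsing.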

...