This page collects notes on how to create DFDL schemas in a way that keeps you out of various XSD snarls and complexities.

As of this writing (2023-02-13) many of the DFDL Schemas we have created do not follow all these conventions perfectly. We have learned as we have gone along.

This set of notes represents best practices after learning from many debugging exercises. 

Avoid Element References and Global Element Declarations

DFDL Schemas should use elementFormDefault="unqualified" (which is the default for XML Schemas).  There's no need for every child element to have a namespace (hence prefix), when the tree they are part of has a namespace prefix somewhere further towards the root which makes the identity of those child elements unambiguous. 

Global elements should be defined only as an assistance for testing the schema. 

Those elements should do nothing more than use a complex type definition.

DFDL schemas should not use element references. 

The content of the schema should always be in a complex type definition. This gives the schema user the choice of what they want to call their elements, whether they want a global element, or to use the schema as a child element within a larger structure, without the burden of introducing global namespace prefix management to their schemas. 

Defining only global types and groups, leaving the global elements for the end-user of the schema provides greater flexibility. 

Hence, the standard start of a DFDL schema is going to be:

<schema 
  targetNamespace="urn:mySchemaNamespace"
  xmlns:msns="urn:mySchemaNamespace" 
  ... >

... import/include and top level format annotations...

<!-- 
  This one-liner below is the ONLY global element in the entire schema, 
  and schema users can always ignore it and just use the complex type, so 
  they can call the element in their schema whatever they want.

  At the same time this single root allows users to easily 
  test the schema with the daffodil CLI or daffodil-vscode extension, 
  without having to specify a root element in a separate file. 
--> 

<element name="mySchema" type="msns:mySchemaType"/> 

<complexType name="mySchemaType">
     ... the real schema content is all reachable from here. ...
</complexType>

... other types and groups ...

</schema>

Included files, and imported files that are part of the same schema project should either have no global elements at all, or one, like the above, to facilitate testing. 

But they should always include equivalent complex type or group definitions allowing those global elements to be bypassed/ignored. 

Rationale: This makes schemas more flexible for reuse because the schema imposes no element names that the schema user cannot bypass if they so choose.
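As a sketch of the reuse this enables, a schema user might import the namespace and declare a local element with a name of their own choosing. The namespace urn:userSchemaNamespace, the prefix usr, the names envelope, envelopeType, and payload, and the file name mySchema.dfdl.xsd are all assumptions made for illustration. DFDL format annotations are omitted, so this shows only the XSD structure.

```xml
<schema xmlns="http://www.w3.org/2001/XMLSchema"
  targetNamespace="urn:userSchemaNamespace"
  xmlns:usr="urn:userSchemaNamespace"
  xmlns:msns="urn:mySchemaNamespace">

  <!-- Hypothetical file name for the reusable schema. -->
  <import namespace="urn:mySchemaNamespace" schemaLocation="mySchema.dfdl.xsd"/>

  <!-- The user's own single global element, present only for testing. -->
  <element name="envelope" type="usr:envelopeType"/>

  <complexType name="envelopeType">
    <sequence>
      <!-- A local element declaration: the user picks the name "payload"
           and bypasses the global element mySchema entirely. -->
      <element name="payload" type="msns:mySchemaType"/>
    </sequence>
  </complexType>

</schema>
```

Because payload is a local element in an unqualified schema, it needs no namespace prefix in instance documents.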

A second global element can also sometimes be useful for testing against files that contain multiple data items. This second global element would almost always look like:

<element name="mySchemaFile">
  <complexType>
    <sequence>
      <element name="mySchema" type="msns:mySchemaType" maxOccurs="unbounded" dfdl:occursCountKind='implicit'/>
    </sequence>
  </complexType>
</element>

Note how this does not have an element reference in it, but a local element declaration for the mySchema child element. 
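For contrast, here is a sketch of the discouraged form of the same test element, written with an element reference. It is shown only to make clear what to avoid, and it uses the same names as the fragment above.

```xml
<element name="mySchemaFile">
  <complexType>
    <sequence>
      <!-- Discouraged: an element reference. It ties the child to the global
           element mySchema, and that child is then namespace-qualified in
           XML instance documents. -->
      <element ref="msns:mySchema" maxOccurs="unbounded" dfdl:occursCountKind="implicit"/>
    </sequence>
  </complexType>
</element>
```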

Lastly, no other structured data system has anything like element references, so in the interests of being able to use DFDL and transform data into the data models used by other processing fabrics, element references should be avoided. 

Summary: schema files should have zero, one, or at most two global element declarations in them, and those are there for convenient testing, and may be ignored entirely when the schema is reused.

Namespaces, Namespace Prefixes, Import, Include, and the schemaLocation Attribute

Namespaces and namespace prefixes in XSD seem simple enough until you start building a very large DFDL schema from multiple disjoint component schemas that are intended for reuse.

DFDL does not have any namespace features of its own; it simply passes through XML Schema's namespace and prefix system.

(Note however: DFDL does not implement the XML Schema "redefine" construct, but neither do many regular XML Schema software platforms.)

Without following a reasonable set of standard practices it is quite easy to end up in what we call namespace hell. In this situation you get all sorts of diagnostic messages about symbols not being defined, even though your import/include files seem to be well specified. Debugging this can be difficult, and you end up with roughly the arrangement that the guidance below specifies, only after much work and wasted time.

It's also the case that many DFDL applications do not use XML as their output data format. JSON is very popular also, and direct connectors to other data transformation and processing fabrics are in the works which have their own particular data models. XML's data model, and namespace system, really have no corresponding features in many of these other systems like JSON. (E.g., JSON does not have namespaces.) 

The practices here ensure that a DFDL schema's use of namespaces does not prevent a DFDL processor's parser/unparser from creating/consuming JSON or other kinds of data output.

Staying out of Namespace Hell

The first set of simple rules for staying out of trouble is this:

  • For every target namespace, choose a unique prefix to use everywhere in your schema to refer to that namespace. 
    • The practice of using xmlns:tns prefix within schemas to refer to "this target namespace" should not be used.
  • Schema definitions should, with few exceptions, have a target namespace.
  • A default namespace should be used only for the XML Schema namespace to avoid having to type "xs:" or "xsd:" everywhere. 

Different schema projects can use different prefixes, but within one schema project one namespace should mean one prefix globally across all files. 
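A minimal sketch of a schema header that follows these rules is shown below. The namespace URIs, the prefixes hdr and bdy, the type names, and the file name header.dfdl.xsd are hypothetical; format annotations are omitted.

```xml
<schema xmlns="http://www.w3.org/2001/XMLSchema"
  xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
  xmlns:hdr="urn:example:header"
  xmlns:bdy="urn:example:body"
  targetNamespace="urn:example:body">

  <!-- The default namespace is used only for XML Schema itself.
       hdr and bdy are the project-wide prefixes; there is no tns prefix. -->
  <import namespace="urn:example:header" schemaLocation="header.dfdl.xsd"/>

  <complexType name="bodyType">
    <sequence>
      <!-- Type from the other namespace, referenced by its one prefix. -->
      <element name="header" type="hdr:headerType"/>
      <!-- Unprefixed "int" resolves to the XML Schema built-in type. -->
      <element name="length" type="int"/>
    </sequence>
  </complexType>

</schema>
```

Every other file in the same project would bind hdr and bdy to these same two namespaces.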

The most critical guidance rules are these:

  • For every target namespace, one file must be the single distinguished one for that namespace. It is the one-and-only schemaLocation file that is xs:import-ed anywhere one must import that namespace.
  • That distinguished file must xs:include all the other files that share that target namespace.

Note that cyclic usage between namespaces is allowed. Two schema files can xs:import each other, so long as they have different target namespaces.

However, xs:include relationships cannot be cyclic.
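The two rules above can be sketched with a pair of files. All file names and namespace URIs here are hypothetical, and the sketch uses the default namespace for XML Schema, so xs:include and xs:import appear as include and import. First, the distinguished file for one namespace:

```xml
<!-- header.dfdl.xsd: the single distinguished file for urn:example:header. -->
<schema xmlns="http://www.w3.org/2001/XMLSchema"
  xmlns:hdr="urn:example:header"
  targetNamespace="urn:example:header">

  <!-- It pulls in every other file that contributes to this namespace. -->
  <include schemaLocation="header-types.dfdl.xsd"/>
  <include schemaLocation="header-groups.dfdl.xsd"/>

</schema>
```

Then, any file in another namespace that needs those definitions imports only the distinguished file:

```xml
<!-- body.dfdl.xsd: a file in a different namespace. -->
<schema xmlns="http://www.w3.org/2001/XMLSchema"
  xmlns:hdr="urn:example:header"
  xmlns:bdy="urn:example:body"
  targetNamespace="urn:example:body">

  <!-- Always the distinguished file, never header-types.dfdl.xsd directly. -->
  <import namespace="urn:example:header" schemaLocation="header.dfdl.xsd"/>

</schema>
```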

The rest of this section is effectively just providing rationale for the above guidance. 

Things that Don't Work

Sometimes people want to decompose one namespace into several sub-units, and only import the symbols for the features of that namespace they need and are using. So they expect they can import a namespace by importing only a specific file that contributes part of the definitions for that namespace.

This does not work. To achieve that sort of modularity you must decompose to different namespaces.

The best mental model to understand this is: imagine all the schemaLocation attributes were erased from all xs:import statements. Imagine the namespace URIs are actually being used to retrieve the namespace file. With this erasure you can only have one place where everything is getting that namespace because that namespace is defined by its URI, and that's also how you retrieve it.

That's how XSD works. One namespace == one source == one file providing its definition.

Some people actually create schemas this way, without schemaLocation on xs:import statements. Then they use an XML Catalog to provide the 1 to 1 mapping of namespaces to the single distinguished file that provides its definition.

We have not used XML Catalogs much and they are not recommended, as they introduce their own complexities.

Going back to practices for xs:import, and adding the schemaLocation attributes back in, it should now be clear that all across a schema there is a 1 to 1 association of namespaces to a specific schemaLocation. So every xs:import anywhere in your schema, for a given namespace X, must provide the exact same schemaLocation Y.

If you have, anywhere in your schema:

<xs:import namespace="ns" schemaLocation="location"/>

then for any specific ns, the location must always be the exact same location. 
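As an illustration of the mistake this rule excludes, consider two files in one schema project that import the same namespace from different locations. The file names and the namespace URI are hypothetical.

```xml
<!-- In fileA.dfdl.xsd -->
<xs:import namespace="urn:example:header" schemaLocation="header.dfdl.xsd"/>

<!-- In fileB.dfdl.xsd. Problematic: same namespace, different location. -->
<xs:import namespace="urn:example:header" schemaLocation="header-types.dfdl.xsd"/>
```

The fix is to make the import in fileB.dfdl.xsd use the same schemaLocation as the one in fileA.dfdl.xsd.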

What is the problem with the tns prefix?

It often results in bigger XML due to the need to have xmlns:tns="...." rebindings in multiple places in XML instance documents. When these are deep in the element nest they can be hard to find.

It also makes XML instance documents harder for people to interpret: deep inside an XML document an element has tns:someName, but the binding of the tns prefix is far away (textually, perhaps many pages earlier, and not necessarily at the start), and so is not clear in that context. Basically, when looking at an XML instance document, a person gets very little information from a tns prefix.

If tns prefixes are used only for type and group references, and never for element references, one might find that this reduces some editing, and as element references are generally frowned upon this should not come up often. However, if the prefix definition xmlns:tns="...." appears on the xs:schema element, even when some other prefix is also bound to the same namespace, there is no telling whether a given XSD tool will actually use tns or the other prefix when identifying the root element in XML instance documents. So even if the schema author only ever uses tns for type and group references, the tns prefix can still show up and cause (albeit minor) confusion in XML instance documents.

Best practice is simply to avoid this tns convention entirely.
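To illustrate the readability problem, here is a hypothetical instance document, assuming a schema that uses element references across two namespaces, each of which calls itself tns. All element names and namespace URIs are invented for illustration.

```xml
<tns:message xmlns:tns="urn:example:message">
  <body>
    <!-- ... imagine many pages of content here ... -->
    <!-- tns is rebound here, so it no longer means what it meant at the root. -->
    <tns:position xmlns:tns="urn:example:geo">
      <lat>0.0</lat>
      <lon>0.0</lon>
    </tns:position>
  </body>
</tns:message>
```

With one unique prefix per namespace, the same data could be written so that each prefix identifies its namespace wherever it appears:

```xml
<msg:message xmlns:msg="urn:example:message" xmlns:geo="urn:example:geo">
  <body>
    <geo:position>
      <lat>0.0</lat>
      <lon>0.0</lon>
    </geo:position>
  </body>
</msg:message>
```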

Avoid Child Elements with the Same Name

XML Schema has a data model with some flexibility needed only for markup languages intended for human authoring. 

DFDL uses XML Schema to describe structured data, where this flexibility is not needed. 

DFDL omits many XML Schema constructs, but DFDL version 1.0 still allows some things that are best avoided to ensure the ability to interoperate with other data models.

One such feature is the ability in XML Schema to have multiple child elements with the same name. So long as it is unambiguous what element declaration is intended, XML Schema allows things like:

...
<element name="foo" ..../>
<element name="bar" ..../>
<element name="foo" ..../>

This is allowed because the element bar separates the two different declarations of the foo element; hence, when parsing XML, the first foo declaration is used until a bar element is encountered, and after that the second foo declaration is used.

That's all interesting and useful for markup languages, but no other structured data system allows this. Hence, it is best avoided to enable DFDL schemas to be interfaced to data systems having other data models. 
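One way to follow this guidance is simply to give each child a distinct name. The sketch below is an assumption about how that might look; the names fooBefore and fooAfter, the prefix msns, and the types fooType and barType are hypothetical.

```xml
<sequence>
  <!-- Each child of the parent has its own name, so the structure maps
       cleanly onto data models where field names must be unique. -->
  <element name="fooBefore" type="msns:fooType"/>
  <element name="bar" type="msns:barType"/>
  <element name="fooAfter" type="msns:fooType"/>
</sequence>
```

Both foo-like children can still share one complex type, so nothing about the data format itself has to change.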

Avoid Anonymous Choices

XML Schema allows a choice to be an anonymous thing within the data model of an element. For example:

<element name="myElement">
  <complexType>
    <sequence>
       ... various elements ...
       <choice>
         ... choice branches ...
       </choice>
       ... various more elements
    </sequence>
  </complexType>
</element>

The choice above appears in the middle of a sequence group, with elements before and after it. Note that there is no element name associated with the choice. Rather in XML data, the choice branches would contain elements and these would appear as direct children of the myElement parent element.

No other data modeling language allows this. 

Hence, this is to be avoided. Choice groups should always be the model-groups of named elements.  

This is analogous to, but goes a step further than, the DFDL requirement that optional/recurring data can only be elements, not sequences/choices.

By using only named choices, one ensures one's DFDL schema can be mapped to the data structures of other data systems which do not allow anonymous choices.
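A sketch of the named-choice form of such a structure follows. The element names before, myChoice, branchA, branchB, and after are hypothetical, and the unprefixed types assume the default namespace is XML Schema, as recommended earlier on this page.

```xml
<element name="myElement">
  <complexType>
    <sequence>
      <element name="before" type="int"/>
      <!-- The choice is now the model group of a named element. -->
      <element name="myChoice">
        <complexType>
          <choice>
            <element name="branchA" type="int"/>
            <element name="branchB" type="string"/>
          </choice>
        </complexType>
      </element>
      <element name="after" type="int"/>
    </sequence>
  </complexType>
</element>
```

In the resulting data, branchA or branchB appears as a child of myChoice rather than as a direct child of myElement.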




1 Comment

  1. See also this email about choices with empty branches, for example:

    <xs:choice>
    <xs:element name="foo" type="xs:int" />
    <xs:sequence />
    </xs:choice>

    This is best avoided as it causes incorrect XSD validation in current versions of Xerces C, a popular XML validator library. 
