Daffodil Primary Data Structures

Daffodil is a compiler and runtime. The primary data structures are:

  • DSOM - Daffodil Schema Object Model - the Abstract Syntax Tree (AST) of the DFDL schema.
  • Gram - The data "Grammar" objects - an intermediate compiler structure to support rule-based, backend-independent optimizations
  • Parser/Unparser objects - these are the runtime-specific objects that actually carry out parsing/unparsing.

The current Parser/Unparser objects are specific to daffodil-runtime1.  Introduction of daffodil-runtime2 requires replacing Parser/Unparser above with RuntimeGenerator. This is an object which is the output of the Daffodil schema compiler and which encapsulates runtime-specific optimizations and behavior.  The above list becomes:

  • DSOM
  • Gram
  • RuntimeGenerator
    • For daffodil-runtime1 - Parser/Unparser objects - these are the runtime-specific objects that actually carry out parsing/unparsing.
    • For daffodil-runtime2 - codgen.ast.Generator objects. (TBD: or objects encapsulating those in some manner.)

The DSOM and Gram layers of the Daffodil schema compiler should be runtime-independent.

Note: We expect some refactoring, to move runtime-specific code down into the RuntimeGenerator layer.

DSOM - Daffodil Schema Object Model

Daffodil contains a DFDL schema compiler, primarily implemented in the daffodil-core module, and a runtime (currently as of 2019-09-25, daffodil-runtime1).

The schema compiler parses a DFDL schema into a DSOM tree. DSOM is DFDL Schema Object Model. It is a set of classes that directly represent the DFDL schema. It is the abstract syntax tree (AST) of the DFDL schema.

...

Nothing computed on the DSOM tree should be in any way specific to any runtime system, with the exception of the runtimeData methods (runtimeData, termRuntimeData, modelGroupRuntimeData, ElementRuntimeData, etc.). These methods return RuntimeData objects, which are defined in daffodil-runtime1.

Gram or Grammar Objects

The DFDL specification contains something called the "Data Syntax Grammar". The notion is that data describable by DFDL must be something the data syntax grammar can describe.

Internally to Daffodil there is an object taxonomy rooted at the Gram trait (a Scala trait is much like a Java interface). The Gram objects were intended to be a realistic implementation of the data syntax grammar found in the specification, but they have drifted substantially from what is in the DFDL spec. The grammar in the DFDL specification is not suited to actual implementation. As a result, the Gram structures in Daffodil are, in many places, not structured much like grammar rules at all.

...

An important goal for the Gram objects is that the rules and the optimizations they perform are independent of any back-end/runtime strategy. That is, they are universal. For example, a data format that does not use any sort of delimiters does not need any Gram objects corresponding to initiators, separators, or terminators. Hence, all grammar rules associated with those regions of the grammar are folded away and disappear from the effective grammar describing the data format.

As with DSOM, the Gram objects use lazy evaluation to avoid the need to organize the compilation process into passes.

The leaves or terminals of the grammar are implementations of the Terminal class. These are generally called the grammar primitives, and those are where the back-end independent code meets the back-end specific code.
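
The folding described above can be illustrated with a minimal sketch. All names here (GramFoldingSketch, prod, seq, element) are hypothetical and chosen for illustration; Daffodil's real Gram classes are written in Scala and are considerably richer. The sketch assumes a guarded production that yields an empty grammar when its guard is false, and a sequence combinator that drops empty children.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch; these types are illustrative and are not Daffodil's API.
final class GramFoldingSketch {

    /** Stand-in for the Gram trait: a grammar node that may turn out to be empty. */
    interface Gram {
        boolean isEmpty();
        String describe();
    }

    /** The empty grammar; anything composed with it folds away. */
    static final Gram EMPTY = new Gram() {
        public boolean isEmpty() { return true; }
        public String describe() { return "(empty)"; }
    };

    /** A grammar primitive (leaf), identified only by name here. */
    static final class Terminal implements Gram {
        private final String name;
        Terminal(String name) { this.name = name; }
        public boolean isEmpty() { return false; }
        public String describe() { return name; }
    }

    /** Sequential composition of two or more non-empty children. */
    static final class Seq implements Gram {
        private final List<Gram> children;
        Seq(List<Gram> children) { this.children = children; }
        public boolean isEmpty() { return false; }
        public String describe() {
            List<String> parts = new ArrayList<>();
            for (Gram g : children) parts.add(g.describe());
            return String.join(" ~ ", parts);
        }
    }

    /** Guarded production: the body is only constructed when the guard holds. */
    static Gram prod(boolean guard, Supplier<Gram> body) {
        return guard ? body.get() : EMPTY;
    }

    /** Builds a sequence, dropping empty children and collapsing trivial cases. */
    static Gram seq(Gram... children) {
        List<Gram> kept = new ArrayList<>();
        for (Gram g : children) {
            if (!g.isEmpty()) kept.add(g);
        }
        if (kept.isEmpty()) return EMPTY;
        if (kept.size() == 1) return kept.get(0);
        return new Seq(kept);
    }

    /** Grammar for one element; delimiter regions exist only if the format uses them. */
    static Gram element(String initiator, String terminator) {
        return seq(
            prod(!initiator.isEmpty(), () -> new Terminal("Initiator")),
            new Terminal("Value"),
            prod(!terminator.isEmpty(), () -> new Terminal("Terminator")));
    }

    public static void main(String[] args) {
        System.out.println(element("", "").describe());   // Value
        System.out.println(element("[", "]").describe()); // Initiator ~ Value ~ Terminator
    }
}
```

Because the body of a guarded production is supplied lazily, the objects for a folded-away region are never constructed, which loosely mirrors the lazy-evaluation point made above.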

RuntimeGenerator Objects

Ultimately, the Gram objects construct a RuntimeGenerator object. The RuntimeGenerator object is specific to each separate runtime implementation strategy. For the daffodil-runtime1, the RuntimeGenerator object has parser() and unparser() methods which generate runtime1 parser and unparser objects, parameterizing them with information from the schema that controls behavior. The parser and unparser methods recursively construct these runtime1-specific Parser and Unparser class instances.

...

The RuntimeGenerator trait is new. The daffodil-runtime1 is actually implemented directly by the Gram objects. These can/should be refactored onto a runtime1 RuntimeGenerator class so as to provide a uniform API for development of runtime2 and other runtimes. The creation of runtime2 is largely about refactoring members and methods on the Gram objects into:

  • general purpose shared members usable by multiple runtimes

  • runtime1-specific members (e.g., parser() and unparser())

Runtime Objects

The primitives of the grammar are where the schema compiler actually constructs the runtime artifacts.

For daffodil-runtime1, this is done with the parser() and unparser() methods of the grammar primitives. These return runtime1's Parser and Unparser class instances.

For runtime2, the generateCode() method invokes the runtime2 backend to generate the implementation.

To avoid exposing much about the runtime2 to the schema compiler, the signature of the generateCode() method takes and returns a CodeGeneratorState object. The notion is that the code generator starts with some initial state, and builds up declarations and code within it. Ultimately, the CodeGeneratorState object ends up with a complete copy of the generated code, and has the ability to write it out to source files.
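
As a rough illustration of threading a state object through the grammar primitives, here is a minimal sketch. The class below only borrows the name CodeGeneratorState from the text; its fields and methods (withField, render), the Primitive interface, and the binaryInt/binaryShort helpers are assumptions made for this example and do not reflect the actual runtime2 signatures.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the actual runtime2 classes or signatures.
final class CodeGenSketch {

    /** Immutable accumulator of generated declarations and parse statements. */
    static final class CodeGeneratorState {
        private final List<String> declarations;
        private final List<String> statements;

        CodeGeneratorState() {
            this(new ArrayList<>(), new ArrayList<>());
        }

        private CodeGeneratorState(List<String> declarations, List<String> statements) {
            this.declarations = declarations;
            this.statements = statements;
        }

        /** Returns a new state with one more field declaration and parse statement. */
        CodeGeneratorState withField(String declaration, String statement) {
            List<String> d = new ArrayList<>(declarations);
            List<String> s = new ArrayList<>(statements);
            d.add(declaration);
            s.add(statement);
            return new CodeGeneratorState(d, s);
        }

        /** Renders the accumulated pieces as the source text of one class. */
        String render(String className) {
            StringBuilder sb = new StringBuilder();
            sb.append("public class ").append(className).append(" {\n");
            for (String d : declarations) sb.append("    ").append(d).append("\n");
            sb.append("    public void parse(java.io.DataInputStream in) throws java.io.IOException {\n");
            for (String s : statements) sb.append("        ").append(s).append("\n");
            sb.append("    }\n}\n");
            return sb.toString();
        }
    }

    /** A grammar primitive as seen by the code generator: state in, state out. */
    interface Primitive {
        CodeGeneratorState generateCode(CodeGeneratorState state);
    }

    static Primitive binaryInt(String name) {
        return state -> state.withField("public int " + name + ";", name + " = in.readInt();");
    }

    static Primitive binaryShort(String name) {
        return state -> state.withField("public short " + name + ";", name + " = in.readShort();");
    }

    /** Starts from an initial state and lets each primitive add to it in order. */
    static String generate(String className, Primitive... primitives) {
        CodeGeneratorState state = new CodeGeneratorState();
        for (Primitive p : primitives) state = p.generateCode(state);
        return state.render(className);
    }

    public static void main(String[] args) {
        System.out.print(generate("Header", binaryInt("length"), binaryShort("kind")));
    }
}
```

The schema compiler side of this sketch only ever sees the Primitive interface and an opaque state value, which is the kind of separation the paragraph above is aiming for.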

DPath - Expressions in the

...

DFDL Path Language

Expression - the DPath Abstract Syntax Tree

...

The DPath compiler attempts to fold constants in expressions by evaluating them at compile time. Expressions that produce values without attempting to parse data are replaced by those constant values. This uses the daffodil-runtime1 implementation of DPath, in a mode where attempting to access data or the runtime infoset tree results in failure and the expression being deemed "not constant".

Note: Even when compiling expressions for different backend runtime implementations of DPath, the constant folding by way of daffodil-runtime1’s implementation of DPath can still be used.

Ultimately, the DPath compiler produces a CompiledExpression object which is implemented either as a ConstantExpression (when constant folding worked), or a RuntimeExpressionDPath object which contains the CompiledDPath.
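
The fold-by-trial-evaluation idea can be sketched as follows. This is a simplified illustration under stated assumptions: the Expr and Infoset interfaces, the NotConstantException class, and the single CompiledExpression class with an isConstant flag are hypothetical stand-ins, not Daffodil's ConstantExpression or RuntimeExpressionDPath classes.

```java
// Hypothetical sketch; the names echo the text but are not Daffodil's classes.
final class ConstantFoldingSketch {

    /** Raised when an expression touches data that only exists at runtime. */
    static final class NotConstantException extends RuntimeException {}

    /** Minimal view of the runtime infoset: look up a numeric value by path. */
    interface Infoset { long valueOf(String path); }

    /** Minimal expression node. */
    interface Expr { long eval(Infoset infoset); }

    static Expr literal(long value) { return infoset -> value; }
    static Expr path(String p) { return infoset -> infoset.valueOf(p); }
    static Expr add(Expr a, Expr b) { return infoset -> a.eval(infoset) + b.eval(infoset); }

    /** Compile-time infoset: every access fails, marking the expression non-constant. */
    static final Infoset COMPILE_TIME = p -> { throw new NotConstantException(); };

    /** Either a folded constant or a wrapper that evaluates at runtime. */
    static final class CompiledExpression {
        final boolean isConstant;
        private final long constantValue;
        private final Expr expr;

        CompiledExpression(boolean isConstant, long constantValue, Expr expr) {
            this.isConstant = isConstant;
            this.constantValue = constantValue;
            this.expr = expr;
        }

        long evaluate(Infoset infoset) {
            return isConstant ? constantValue : expr.eval(infoset);
        }
    }

    /** Tries to evaluate at compile time; falls back to runtime evaluation. */
    static CompiledExpression compile(Expr e) {
        try {
            return new CompiledExpression(true, e.eval(COMPILE_TIME), null);
        } catch (NotConstantException notConstant) {
            return new CompiledExpression(false, 0L, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(compile(add(literal(2), literal(3))).isConstant);       // true
        System.out.println(compile(add(literal(2), path("../count"))).isConstant); // false
    }
}
```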


Introducing additional runtimes beyond runtime1 requires introducing a new class ExpressionRuntimeGenerator. The compile() method of DFDLPathExpressionParser currently returns a CompiledExpression which is a daffodil-runtime1 object. We need the compile() method to instead return an ExpressionRuntimeGenerator which subsequently can be called for the runtime1 case to produce a CompiledExpression object.

The RecipeOp classes currently have a run() method. This must be refactored so that the run() method becomes part of a Runtime 1 data structure, and alternate runtimes can have their own realizations. Effectively each RecipeOp becomes a generator of a "real" runtime1 RecipeOp, or of that of some other runtime. It is TBD whether this is too late, i.e., whether the DPathExpressionParser’s compile method contains runtime1-specific assumptions.

Runtime 2 Design

Simplifying Assumptions

DFDL allows conforming subsets of features. Runtime 2 will be, roughly, the smallest possible conforming subset of DFDL.

The table below is derived from Section 21 of the DFDL specification 1.0, with a third column added specifying the implementation goal for Runtime 2.

| Feature | Detection | Implemented in Runtime 2? |
|---|---|---|
| Validation | External switch | No |
| Named Formats | dfdl:defineFormat or dfdl:ref | Yes |
| Choices | xs:choice in xsd | Not initially. Will be added with restrictions: choiceDispatchKey only, no backtracking choices. Depends on Expressions. |
| Arrays where size not known in advance | dfdl:occursCountKind 'implicit', 'parsed', 'stopValue' | Not initially. Will be added with restrictions: occursCountKind='expression' only, no backtracking. Depends on Expressions. |
| Expressions | Use of a DFDL expression in any property value | Not initially. Will be added with restrictions: no runtime-valued properties except for dfdl:length and dfdl:occursCount. Note that if length and occursCount work, then implementing other runtime-valued properties may not be hard. |
| End of parent | dfdl:lengthKind "endOfParent" | No |
| Simple type restrictions | xs:simpleType in xsd | Yes/Tolerated. (Requires no work for a runtime backend, but the additional facets such types can provide are not checked as there is no validation. Runtime 2 effectively only implements the underlying primitive simple type of such derivations.) |
| Text representation for types other than String | dfdl:representation "text" for Number, Calendar or Boolean types | No |
| Delimiters | dfdl:separator <> "" or dfdl:initiator <> "" or dfdl:terminator <> "" or dfdl:lengthKind "delimited" | No. TBD: might have to soften this and allow terminators on simple type string only. Restricting the delimiter to 1 character only may be ok. This allows implementing, e.g., null-terminated strings. |
| Nils | XSDL nillable 'true' in xsd | No |
| Defaults | XSDL default or fixed in xsd | No |
| Bi-Directional text | dfdl:textBiDi 'yes' | No. (Note: This is being dropped from DFDL v1.0 because there are no implementations as yet.) |
| Lengths in Bits | dfdl:alignmentUnits 'bits' or dfdl:lengthUnits 'bits' | No. Lengths may be expressed in bits, but must be multiples of 8. So really it is bytes. |
| Delimited lengths and representation binary element | dfdl:representation 'binary' (or implied binary) and dfdl:lengthKind 'delimited' | No |
| Regular expressions | dfdl:lengthKind 'pattern', dfdl:assert with dfdl:testkind 'pattern', dfdl:discriminator with dfdl:testkind 'pattern' | No |
| Zoned numbers | dfdl:textNumberRep 'zoned' | No |
| IBM 390 packed numbers | dfdl:binaryNumberRep 'packed' | No |
| IBM 390 packed calendars | dfdl:binaryCalendarRep 'packed' | No |
| IBM 390 floats | dfdl:binaryFloatRep 'ibm390Hex' | No |
| Unordered sequences | dfdl:sequenceKind 'unordered' | No |
| Floating elements | dfdl:floating 'yes' | No |
| dfdl functions in expression language | dfdl:functions in expression | Not initially. Will be added with expression language. The set of available functions may be limited, increasing over time. |
| Hidden groups | dfdl:hiddenGroupRef <> '' | Not initially. Eventually should be added. (Is used with Calculated Values feature.) |
| Calculated values | dfdl:inputValueCalc <> '' or dfdl:outputValueCalc <> '' | Not initially. Eventually should be added. |
| Escape schemes | dfdl:defineEscapeScheme in xsd | No |
| Extended encodings | Any dfdl:encoding value beyond the core list | No |
| Asserts | dfdl:assert in xsd | No |
| Discriminators | dfdl:discriminator in xsd | No |
| Prefixed lengths | dfdl:lengthKind 'prefixed' | No |
| Variables | dfdl:defineVariable, dfdl:newVariableInstances, dfdl:setVariable. Variables in DFDL expression language. Note that variables as a feature is dependent on the Expressions feature. | No |
| BCD calendars | dfdl:binaryCalendarRep "bcd" | No |
| BCD numbers | dfdl:binaryNumberRep "bcd" | No |
| Multiple schemas | xs:include or xs:import in xsd | Yes. (Requires no work in a runtime backend.) |
| IBM 4690 packed numbers | dfdl:binaryNumberRep "ibm4690Packed" | No |
| IBM 4690 packed calendars | dfdl:binaryCalendarRep "ibm4690Packed" | No |
| DFDL Byte Value Entities | Use of %#r syntax in a DFDL String Literal other than the dfdl:fillByte property | No |
| DFDL Standard Character Set Encodings | dfdl:encoding name begins with "X-DFDL-" | No |
| Bit Order - Least Significant Bit First | dfdl:bitOrder with value 'leastSignificantBitFirst' | No |

Daffodil extensions to DFDL (layering, blob objects, the dfdlx:emptyElementParsePolicy property, additional character sets such as hex, octal, bits, and specialty sets, etc.) will not be supported.

...

  • each parse operation consumes data from an input stream, and produces a data structure. This data structure is not produced incrementally, but all at once.

    • Rationale: This eliminates the technological hurdles of streaming parsing, which is really only needed for massively large data objects, that is, objects large enough that bringing them into memory as an infoset object is problematic.

  • each unparse operation consumes one entire fully populated data structure, and produces data to an output stream.

    • Rationale: This massively simplifies unparsing by allowing expression evaluation to always assume the entire "infoset" object is already constructed. Expression evaluation never needs to support streaming, that is, be suspended waiting for additional infoset events to arrive.

  • I/O is byte-centric

    • data is byte-centric. That is, the size of every element is a whole number of bytes. Alignment is always 1 byte.

    • data can be big or little endian.

    • character sets are all byte-oriented. Their code units are 8 bit bytes minimum.

    • Rationale: This set of constraints ensures ordinary Java I/O supplies most of the I/O layer natively.
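
To make the byte-centric constraints above concrete, here is a minimal sketch of what parsing and unparsing a fixed-length record could look like using only java.nio.ByteBuffer. The Header class and its two fields are invented for illustration; code actually generated by Runtime 2 may look quite different.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch; the Header record and its fields are illustrative only.
final class ByteCentricSketch {

    /** Plain-object infoset for a 6-byte record: a 4-byte int then a 2-byte short. */
    static final class Header {
        int length;
        short kind;
    }

    /** Parses the whole record at once; byte order is the only variable. */
    static Header parse(byte[] data, ByteOrder order) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(order);
        Header h = new Header();
        h.length = buf.getInt();   // fixed-length two's complement, byte aligned
        h.kind = buf.getShort();
        return h;
    }

    /** Unparses a fully populated record back to bytes. */
    static byte[] unparse(Header h, ByteOrder order) {
        ByteBuffer buf = ByteBuffer.allocate(6).order(order);
        buf.putInt(h.length).putShort(h.kind);
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] data = {0, 0, 0, 1, 0, 2};
        Header big = parse(data, ByteOrder.BIG_ENDIAN);
        Header little = parse(data, ByteOrder.LITTLE_ENDIAN);
        System.out.println(big.length + " " + big.kind);       // 1 2
        System.out.println(little.length + " " + little.kind); // 16777216 512
    }
}
```

With byte alignment and whole-byte lengths, no bit-level cursor is needed, so the standard library handles both endianness cases directly.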

In general, use of unsupported features will cause a compile-time SDE (Schema Definition Error).

  • Only lengthKind 'explicit' or 'implicit' for simple types, and only lengthKind 'implicit' for complex types.

  • Only types long, int, short, byte, unsignedLong, unsignedInt, unsignedShort, unsignedByte, float, double, string, and hexBinary are supported.

    • Leaves out the decimal, integers greater than 64 bits long, boolean and date/time related types

  • The dfdl:representation is always 'binary'. No text numbers are supported.

  • The dfdl:binaryNumberRep is always 'binary'.  Integers are fixed-length 2’s complement.

  • The dfdl:alignment is always 1 byte.

  • When added, arrays will use only occursCountKind="expression", and choices will use only dfdl:choiceDispatchKey and dfdl:choiceBranchKey, so no backtracking/discrimination is required.

    • Rationale: Restricting choices to direct dispatch and arrays to dfdl:occursCountKind='expression' means there are no points of uncertainty, so there is no backtracking.
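
The absence of points of uncertainty can be sketched as below. The data layout (a one-byte count followed by tagged items) and all names are assumptions invented for this example. The count stands in for a dfdl:occursCount expression and the tag for a dfdl:choiceDispatchKey expression, so the parser never has to try a branch and undo it.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the layout and class names are illustrative only.
final class DispatchSketch {

    /** One parsed array item: which branch was taken and the value it held. */
    static final class Item {
        final String kind;
        final int value;
        Item(String kind, int value) { this.kind = kind; this.value = value; }
    }

    static List<Item> parse(ByteBuffer buf) {
        // Plays the role of an occursCount expression such as { ../count }.
        int count = buf.get() & 0xFF;
        List<Item> items = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            // Plays the role of the choiceDispatchKey expression's value.
            int tag = buf.get() & 0xFF;
            switch (tag) { // each case corresponds to a choiceBranchKey
                case 1:
                    items.add(new Item("byte", buf.get()));
                    break;
                case 2:
                    items.add(new Item("short", buf.getShort()));
                    break;
                default:
                    // No speculative parsing: an unknown key is simply an error.
                    throw new IllegalStateException("no choice branch for key " + tag);
            }
        }
        return items;
    }

    public static void main(String[] args) {
        byte[] data = {2, 1, 7, 2, 1, 0};
        for (Item item : parse(ByteBuffer.wrap(data))) {
            System.out.println(item.kind + " " + item.value); // byte 7, then short 256
        }
    }
}
```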

A large number of DFDL properties will still be required to be defined, and it will be an SDE if a property does not have a value that the implementation supports.

...

Properties that end up needed, but shouldn't be - ex: anything about text numbers, anything about date/time - are bugs in Daffodil that should be reported. An include-file DFDL format definition should hide these from users so they are not distracting.

Phases

The above restrictions on the features suggest dividing up the implementation of Runtime 2 into 2 distinct phases:

  • Phase 1: (aka Runtime2P1) No expressions. All lengths are fixed. All arrays have fixed length.
  • Phase 2: (aka Runtime2P2) Adding the DFDL expression language, lengthKind 'explicit', occursCountKind 'expression'.

Goals

  • Use the code generation library contributed by Julian Feinauer so as to have the possibility of Java, C++, and Python backends for Runtime 2
  • Initial focus is a backend where the Infoset is Java POJO objects. The POJO definitions are part of the generated code, which is output as one or multiple text files.
  • DPath expressions (when implemented) compile into native language expressions that navigate Infoset objects the way handwritten code would.

  • Dependencies on Java garbage collection should be minimized and documented.
  • The runtime-library code should have a minimal footprint.

    • Selective linking can be assumed (even for Java - search for GraalVM)

  • Satisfy the requirements that caused the PLC4X project to create their own MSpec data format language. (Alternatively, Daffodil with Runtime 2 should be a good target for MSpec compilation.)

    • With one exception: DFDL is still going to be XML Schema based. Changing the syntax of the DFDL language is out of scope, as that's a front-end project. This is a runtime/back-end project.

...