Introduction

Daffodil has a module called daffodil-runtime1. The suffix "1" was intentional to suggest that there would be other kinds of runtimes in the future.

The goal of Runtime 1 was to get a correct, complete implementation of DFDL as quickly as possible. Efficiency was important, but secondary to completeness and correctness.

Today many parties are interested in Daffodil, but they have differing requirements:

  • selective linking or another technique to keep the memory footprint small

  • C/C++ code generation for non-JVM environments

  • native object population, e.g., for Java, directly populating POJOs corresponding to the logical DFDL schema objects. (Akin to how JAXB fills in objects from XML data.)

In many cases these requirements are primary, and are more important than numerous DFDL language features which are not used by the data formats of interest to these users.

The module daffodil-runtime2 is an additional runtime for Daffodil intended to be a first attempt to accommodate some of these needs, and to illustrate how alternative Daffodil back-ends can be created.

Runtime 2 is, initially, a very minimalist system, which handles only a tiny subset of DFDL, but which generates Java source code for separate compilation and use. The subset of DFDL can increase in size over time, but initially it is intended to be the smallest possible subset that will illustrate how an alternative runtime can/should be constructed.

Daffodil Primary Data Structures

DSOM - Daffodil Schema Object Model

Daffodil contains a DFDL schema compiler, primarily implemented in the daffodil-core module, and a runtime (currently as of 2019-09-25, daffodil-runtime1).

The schema compiler parses a DFDL schema into a DSOM tree. DSOM is DFDL Schema Object Model. It is a set of classes that directly represent the DFDL schema. It is the abstract syntax tree (AST) of the DFDL schema.

The DSOM tree has numerous members on it which are computed by reference to other parts of the DSOM model. Lazy evaluation is used to avoid the need to compose these into passes.
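The demand-driven style described above can be sketched in Java as follows. This is only an illustration: `Lazy` and `SketchElement` are hypothetical names, not DSOM classes, and the sketch is not thread-safe. Each member is computed on first access, possibly by pulling members of other nodes, so no explicit pass ordering is needed.

```java
import java.util.function.Supplier;

/** Hypothetical memoizing holder: computes its value on first access only. */
final class Lazy<T> {
    private Supplier<T> compute;
    private T value;
    private boolean done = false;

    Lazy(Supplier<T> compute) { this.compute = compute; }

    T get() {
        if (!done) {
            value = compute.get();
            done = true;
            compute = null; // release the closure once evaluated
        }
        return value;
    }
}

/** Hypothetical schema node; not an actual DSOM class. */
final class SketchElement {
    static int pathComputations = 0; // counts evaluations, for illustration only

    final String name;
    final SketchElement parent; // null for the root

    // Members refer to members of other nodes. Evaluation order is driven
    // by demand rather than by explicit compiler passes.
    final Lazy<String> path;
    final Lazy<Integer> depth;

    SketchElement(String name, SketchElement parent) {
        this.name = name;
        this.parent = parent;
        this.path = new Lazy<String>(() -> {
            pathComputations++;
            return parent == null ? "/" + name : parent.path.get() + "/" + name;
        });
        this.depth = new Lazy<Integer>(() -> parent == null ? 0 : parent.depth.get() + 1);
    }
}
```

Asking a child node for `path.get()` forces the parent's `path` as a side effect, and a second call reuses the stored value.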

Nothing computed on the DSOM tree should be in any way specific to any runtime system, with the exception of the runtimeData methods (runtimeData, termRuntimeData, modelGroupRuntimeData, elementRuntimeData, etc.). These methods return RuntimeData objects, which are defined in daffodil-runtime1.

Gram or Grammar Objects

The DFDL specification contains something called the "Data Syntax Grammar". The notion is that data describable by DFDL must be something the data syntax grammar can describe.

Internally to Daffodil there is an object taxonomy rooted at the Gram trait (a Scala trait is much like a Java Interface). The Gram objects implement something that was intended to be a realistic implementation of the data syntax grammar found in the specification, but it has drifted substantially from anything closely related to what is in the DFDL spec. In many places it is not structured very much like grammar rules at all.

The Gram objects use techniques similar to Scala’s Parser Combinators and something like guarded-clause logic to implement simple rule-based optimizers.

An important goal for the Gram objects is that the rules and the optimizations they perform are independent of any back-end strategy. That is, they are universal. For example, a data format that does not use any sort of delimiters does not need any Gram objects corresponding to initiators, separators, or terminators. Hence, all grammar rules associated with those regions of the grammar are folded away and disappear from the effective grammar describing the data format.
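A minimal sketch of how guarded clauses can fold unused grammar regions away is shown here. All names (`GramSketch`, `SeqSketch`, `GramRules`, and so on) are hypothetical and much simpler than the real Gram objects. A guarded term becomes the empty term when its guard is false, and a sequence drops empty children, collapsing to its only remaining child when possible.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical grammar term; the real Gram trait is richer than this. */
interface GramSketch {
    boolean isEmpty();
    String describe();
}

final class EmptyGramSketch implements GramSketch {
    static final EmptyGramSketch INSTANCE = new EmptyGramSketch();
    private EmptyGramSketch() {}
    public boolean isEmpty() { return true; }
    public String describe() { return "Empty"; }
}

final class TerminalSketch implements GramSketch {
    private final String name;
    TerminalSketch(String name) { this.name = name; }
    public boolean isEmpty() { return false; }
    public String describe() { return name; }
}

final class SeqSketch implements GramSketch {
    private final List<GramSketch> parts;
    private SeqSketch(List<GramSketch> parts) { this.parts = parts; }
    public boolean isEmpty() { return false; }
    public String describe() {
        List<String> names = new ArrayList<>();
        for (GramSketch p : parts) names.add(p.describe());
        return String.join(" ~ ", names);
    }

    /** Sequence combinator: empty children are folded away. */
    static GramSketch of(GramSketch... children) {
        List<GramSketch> kept = new ArrayList<>();
        for (GramSketch c : children) {
            if (!c.isEmpty()) kept.add(c);
        }
        if (kept.isEmpty()) return EmptyGramSketch.INSTANCE;
        if (kept.size() == 1) return kept.get(0);
        return new SeqSketch(kept);
    }
}

final class GramRules {
    /** Guarded clause: the term exists only when the guard holds. */
    static GramSketch guarded(boolean guard, String name) {
        return guard ? new TerminalSketch(name) : EmptyGramSketch.INSTANCE;
    }

    /** Hypothetical rule for an element with optional delimiters. */
    static GramSketch element(boolean hasInitiator, boolean hasTerminator) {
        return SeqSketch.of(
            guarded(hasInitiator, "Initiator"),
            new TerminalSketch("Value"),
            guarded(hasTerminator, "Terminator"));
    }
}
```

For a format without delimiters, `GramRules.element(false, false)` reduces to the single `Value` terminal, so a back-end never sees the delimiter rules.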

As with DSOM, the Gram objects use lazy evaluation to avoid the need to organize the compilation process into passes.

The leaves or terminals of the grammar are implementations of the Terminal class. These are generally called the grammar primitives, and those are where the back-end independent code meets the back-end specific code.

RuntimeGenerator Objects

Ultimately, the Gram objects construct a RuntimeGenerator object. The RuntimeGenerator object is specific to each separate runtime implementation strategy. For the daffodil-runtime1, the RuntimeGenerator object has parser() and unparser() methods which generate runtime1 parser and unparser objects, parameterizing them with information from the schema that controls behavior. The parser and unparser methods recursively construct these runtime1-specific Parser and Unparser class instances.


The RuntimeGenerator trait is new. Today, daffodil-runtime1 is actually implemented directly by the Gram objects. These can/should be refactored onto a runtime1 RuntimeGenerator class so as to provide a uniform API for development of runtime2 and other runtimes. The creation of runtime2 is largely about refactoring members and methods on the Gram objects into:

  • general purpose shared members usable by multiple runtimes

  • runtime1-specific members (e.g., parser() and unparser())
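One way to picture this split is the following sketch. It is an assumption about the shape such an API might take, not the actual trait: `RuntimeGeneratorSketch`, `BinaryIntPrimSketch`, `DirectGeneratorSketch`, `SourceGeneratorSketch`, and the `readUnsigned`/`writeUnsigned` calls in the generated text are all hypothetical. The shared primitive holds runtime-independent facts, one generator produces directly executable objects (runtime1 style), and another produces source text (runtime2 style).

```java
import java.nio.ByteBuffer;
import java.util.function.Function;

/** Hypothetical per-runtime generator API. */
interface RuntimeGeneratorSketch<P, U> {
    P parser();
    U unparser();
}

/** Shared, runtime-independent facts about one grammar primitive. */
final class BinaryIntPrimSketch {
    final String elementName;
    final int lengthInBytes;
    BinaryIntPrimSketch(String elementName, int lengthInBytes) {
        this.elementName = elementName;
        this.lengthInBytes = lengthInBytes;
    }
}

/** Runtime1 style: produces objects that are executed directly. */
final class DirectGeneratorSketch
        implements RuntimeGeneratorSketch<Function<ByteBuffer, Long>, Function<Long, byte[]>> {
    private final BinaryIntPrimSketch prim;
    DirectGeneratorSketch(BinaryIntPrimSketch prim) { this.prim = prim; }

    public Function<ByteBuffer, Long> parser() {
        return buf -> {
            long v = 0;
            // Big-endian unsigned read, one byte at a time.
            for (int i = 0; i < prim.lengthInBytes; i++) {
                v = (v << 8) | (buf.get() & 0xFF);
            }
            return v;
        };
    }

    public Function<Long, byte[]> unparser() {
        return value -> {
            long v = value;
            byte[] out = new byte[prim.lengthInBytes];
            for (int i = 0; i < out.length; i++) {
                out[out.length - 1 - i] = (byte) (v >>> (8 * i));
            }
            return out;
        };
    }
}

/** Runtime2 style: produces source text for separate compilation. */
final class SourceGeneratorSketch implements RuntimeGeneratorSketch<String, String> {
    private final BinaryIntPrimSketch prim;
    SourceGeneratorSketch(BinaryIntPrimSketch prim) { this.prim = prim; }

    public String parser() {
        // readUnsigned is a made-up runtime-library call in the emitted code.
        return "obj." + prim.elementName + " = in.readUnsigned(" + prim.lengthInBytes + ");";
    }

    public String unparser() {
        return "out.writeUnsigned(obj." + prim.elementName + ", " + prim.lengthInBytes + ");";
    }
}
```

Both generators read the same `BinaryIntPrimSketch`, which is the kind of general-purpose shared member the refactoring aims to separate from runtime1-specific code.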

DPath - DFDL Path Language

Expression - the DPath Abstract Syntax Tree

Daffodil’s schema compiler (in daffodil-core) also compiles DFDL’s expression language, which we call DPath (for DFDL Path) and which is closely related to the standard XPath 2.0 language. We refer to the part of Daffodil’s schema compiler that compiles DPath expressions as the DPath compiler.

The DFDLPathExpressionParser class is a parser for DPath which parses DPath expressions into an AST of Expression class/trait objects. The AST Expression instances have a compiledDPath member which evaluates into a CompiledDPath object (defined in daffodil-runtime1) containing a sequence of RecipeOp operations (also defined in daffodil-runtime1) which are runtime operations that actually carry out expression evaluation. During this DPath compilation, static type analysis is performed.

The DPath compiler attempts to fold constants in expressions by attempting to evaluate expressions at compile time. Expressions that produce values without attempting to parse data are replaced by those constant values. This uses the daffodil-runtime1 implementation of DPath, in a mode where attempting to access data or attempting to access the runtime infoset tree both result in failure and the expression being deemed "not constant".

[NOTE] Even when compiling expressions for different backend runtime implementations of DPath, the constant folding by way of daffodil-runtime1’s implementation of DPath can still be used.

Ultimately, the DPath compiler produces a CompiledExpression object which is implemented either as a ConstantExpression (when constant folding worked), or a RuntimeExpressionDPath object which contains the CompiledDPath.
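The fold-by-trial-evaluation idea can be sketched as below. Every name here is hypothetical (`ExprSketch`, `EvalContextSketch`, `NotConstantSketch`, `ConstantFolderSketch`), and the infoset is reduced to a map from path strings to numbers purely for illustration. The expression is evaluated in a compile-time context where any infoset access fails; success yields a constant, failure means the expression must be kept for runtime.

```java
import java.util.Map;
import java.util.Optional;

/** Signals that evaluation needed data that only exists at runtime. */
final class NotConstantSketch extends RuntimeException {
    NotConstantSketch(String msg) { super(msg); }
}

/** Hypothetical evaluation context; a null infoset means "compile time". */
final class EvalContextSketch {
    private final Map<String, Long> infoset;
    private EvalContextSketch(Map<String, Long> infoset) { this.infoset = infoset; }

    static EvalContextSketch compileTime() { return new EvalContextSketch(null); }
    static EvalContextSketch runtime(Map<String, Long> infoset) {
        return new EvalContextSketch(infoset);
    }

    long lookup(String path) {
        if (infoset == null) throw new NotConstantSketch("needs infoset: " + path);
        Long v = infoset.get(path);
        if (v == null) throw new IllegalArgumentException("no such element: " + path);
        return v;
    }
}

interface ExprSketch {
    long eval(EvalContextSketch ctx);
}

final class LitSketch implements ExprSketch {
    private final long value;
    LitSketch(long value) { this.value = value; }
    public long eval(EvalContextSketch ctx) { return value; }
}

final class PathSketch implements ExprSketch {
    private final String path;
    PathSketch(String path) { this.path = path; }
    public long eval(EvalContextSketch ctx) { return ctx.lookup(path); }
}

final class AddSketch implements ExprSketch {
    private final ExprSketch left, right;
    AddSketch(ExprSketch left, ExprSketch right) { this.left = left; this.right = right; }
    public long eval(EvalContextSketch ctx) { return left.eval(ctx) + right.eval(ctx); }
}

final class ConstantFolderSketch {
    /** Returns the constant value, or empty if the expression is not constant. */
    static Optional<Long> tryFold(ExprSketch e) {
        try {
            return Optional.of(e.eval(EvalContextSketch.compileTime()));
        } catch (NotConstantSketch notConstant) {
            return Optional.empty();
        }
    }
}
```

A present result plays the role of a ConstantExpression; an empty result means the expression has to be carried into the runtime form.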


Introducing additional runtimes beyond runtime1 requires introducing a new class ExpressionRuntimeGenerator. The compile() method of DFDLPathExpressionParser currently returns a CompiledExpression which is a daffodil-runtime1 object. We need the compile() method to instead return an ExpressionRuntimeGenerator which subsequently can be called for the runtime1 case to produce a CompiledExpression object.

The RecipeOp classes currently have a run() method. This must be refactored so that the run() method becomes part of a Runtime 1 data structure, and alternate runtimes can have their own realizations. Effectively each RecipeOp becomes a generator of a "real" runtime1 RecipeOp, or of that of some other runtime.

It is TBD whether this is too late, i.e., whether the DFDLPathExpressionParser’s compile method contains runtime1-specific assumptions.

Runtime 2 Simplifying Assumptions

  • each parse operation consumes data from an input stream, and produces a data structure. This data structure is not produced incrementally, but all at once.

    • Rationale: This eliminates the demands of streaming-parsing.

  • each unparse operation consumes one entire fully populated data structure, and produces data to an output stream.

    • Rationale: This massively simplifies unparsing by allowing expression evaluation to always assume the entire "infoset" object is already constructed. Expression evaluation never needs to support streaming, that is, be suspended waiting for additional infoset events to arrive.

  • I/O is byte-centric

    • data is byte-centric. That is, every element begins and ends on a byte boundary.

    • data can be big or little endian.

    • bitOrder is mostSignificantBitFirst always

    • character sets are all byte-oriented. Their code units are at least one 8-bit byte.

    • Rationale: This set of constraints ensures ordinary Java I/O supplies most of the I/O layer natively.
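Under these assumptions, a parse or unparse of a small fixed-layout record needs little beyond standard Java I/O. The sketch below is illustrative only: `PointPojoSketch` and `PointCodecSketch` are hypothetical names, and the record layout (two 4-byte integers) is invented for the example. The whole structure is produced at once on parse and consumed at once on unparse, with the byte order as a parameter.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical infoset object with two 4-byte integer fields. */
final class PointPojoSketch {
    int x;
    int y;
}

final class PointCodecSketch {
    private static final int LENGTH_IN_BYTES = 8;

    /** Consumes bytes from the stream and returns the fully populated object. */
    static PointPojoSketch parse(InputStream in, ByteOrder order) throws IOException {
        byte[] raw = new byte[LENGTH_IN_BYTES];
        // readFully throws EOFException if the stream ends early.
        new DataInputStream(in).readFully(raw);
        ByteBuffer buf = ByteBuffer.wrap(raw).order(order);
        PointPojoSketch p = new PointPojoSketch();
        p.x = buf.getInt();
        p.y = buf.getInt();
        return p;
    }

    /** Consumes the fully populated object and writes its bytes to the stream. */
    static void unparse(PointPojoSketch p, OutputStream out, ByteOrder order) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(LENGTH_IN_BYTES).order(order);
        buf.putInt(p.x).putInt(p.y);
        out.write(buf.array());
    }
}
```

Because every field is byte-aligned and bit order is fixed, `ByteBuffer` with an explicit `ByteOrder` covers both endianness cases without any bit-level buffering.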

Initial Simplifying Restrictions

These will cause a compile time SDE. These restrictions may be lifted over time.

  • No variables, no dfdl:setVariable or dfdl:newVariableInstance

  • No delimiters (TBD: might have to soften this and allow terminators on simple type string only. Restricting the delimiter to 1 character only may be OK.).

  • No runtime-valued properties except for dfdl:length and dfdl:occursCount

    • Note that if length and occursCount work, then implementing other runtime-valued properties may not be hard.

  • Only lengthKind 'explicit' or 'implicit' for simple types, and only lengthKind 'implicit' for complex types.

  • Limited set of expression functions

  • Only types long, int, short, byte, unsignedLong, unsignedInt, unsignedShort, unsignedByte, float, double, string, and hexBinary are supported. (Leaves out the boolean and date/time related types)

  • The dfdl:representation is always 'binary'. No text numbers are supported.

  • The dfdl:binaryNumberRep is always 'binary'. Integers are fixed-length 2’s complement.

  • dfdl:binaryFloatRep="ieee".

  • The dfdl:alignment is always 1 byte.

  • No unordered sequences or floating elements

  • No validation

  • Only occursCountKind="expression", and choices must use dfdl:choiceDispatchKey and dfdl:choiceBranchKey. No backtracking/discrimination allowed.

    • Rationale: This and requiring only dfdl:occursCountKind='expression' means there are no points of uncertainty, so there is no backtracking.
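The effect of these two restrictions on generated code can be sketched as follows. The record layout, `RecordSketch`, `ParseErrorSketch`, and `DispatchParserSketch` are hypothetical and chosen only for illustration. A dispatch key selects exactly one choice branch, and an already-parsed count drives the array loop, so the parser never has to try an alternative and back out.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical fatal parse error; there is nothing to backtrack to. */
final class ParseErrorSketch extends RuntimeException {
    ParseErrorSketch(String msg) { super(msg); }
}

/** Hypothetical infoset object for a tagged record. */
final class RecordSketch {
    int kind;
    List<Integer> values = new ArrayList<>();
    String text;
}

final class DispatchParserSketch {
    static RecordSketch parse(InputStream in) throws IOException {
        DataInputStream din = new DataInputStream(in);
        RecordSketch r = new RecordSketch();
        r.kind = din.readUnsignedByte();
        // The switch plays the role of dfdl:choiceDispatchKey; each case
        // label plays the role of a dfdl:choiceBranchKey.
        switch (r.kind) {
            case 1: {
                // The count plays the role of a dfdl:occursCount expression
                // that refers to an earlier element.
                int count = din.readUnsignedByte();
                for (int i = 0; i < count; i++) {
                    r.values.add(din.readInt()); // big-endian 4-byte integer
                }
                break;
            }
            case 2: {
                int length = din.readUnsignedByte();
                byte[] raw = new byte[length];
                din.readFully(raw);
                r.text = new String(raw, StandardCharsets.US_ASCII);
                break;
            }
            default:
                throw new ParseErrorSketch("no choice branch for key " + r.kind);
        }
        return r;
    }
}
```

An unknown key is simply a fatal error, since no other branch could be attempted.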

A large number of DFDL properties will still be required to be defined. If one of them does not have the value supported by the implementation, it will be an SDE.

Properties that end up being required but should not be (e.g., anything about text numbers or anything about date/time) are bugs in Daffodil that should be reported. An include-file DFDL format definition should hide these from users so they are not distracting.

Goals

  • Infoset is Java POJO objects. The POJO definitions are part of the generated code, which is output as one or multiple text files.

  • DPath expressions compile into Java expressions that navigate Java POJOs the way handwritten code would.

  • Translating Runtime 2 into a similar C/C++ runtime should be conceptually quite easy. Nothing should be done in Runtime 2 that is particularly Java specific. (I/O streams that operate like Java BufferedInputStream and OutputStream are assumed.)

    • Dependencies on Java garbage collection should be minimized and documented.

  • The amount of runtime-library code should be minimum footprint.

    • Selective linking can be assumed (even for Java - search for GraalVM)

  • Satisfy the requirements that caused the PLC4X project to create their own MSpec data format language. (Or to serve as a target for MSpec compilation.)

    • With one exception: DFDL is still going to be XML Schema based. Changing the syntax of the DFDL language is out of scope.
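As a sketch of the first two goals, the code below shows POJOs of the kind that might be generated, plus one possible way a relative DPath could be turned into plain field access. Everything here is an assumption for illustration: the class names, the `message`/`header`/`item` schema shape, and the approach of resolving `..` steps statically against the expression's location in the schema. Only child and parent steps are handled.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical generated POJOs for a message with a header and an array. */
final class HeaderPojoSketch {
    int count;
}

final class MessagePojoSketch {
    HeaderPojoSketch header = new HeaderPojoSketch();
    List<Integer> item = new ArrayList<>();
}

final class HandwrittenStyleSketch {
    /** What { ../header/count } on element "item" might compile into. */
    static int occursCountOfItem(MessagePojoSketch message) {
        return message.header.count;
    }
}

final class PathToJavaSketch {
    /**
     * Resolves a relative path against the schema location of the
     * expression and returns a Java field-access expression as text.
     */
    static String toJava(String contextPath, String relativeDPath) {
        Deque<String> steps = new ArrayDeque<>();
        for (String s : contextPath.split("/")) {
            if (!s.isEmpty()) steps.addLast(s);
        }
        for (String s : relativeDPath.split("/")) {
            if (s.equals("..")) {
                if (steps.isEmpty()) {
                    throw new IllegalArgumentException("path escapes the root: " + relativeDPath);
                }
                steps.removeLast();
            } else if (!s.isEmpty() && !s.equals(".")) {
                steps.addLast(s);
            }
        }
        return String.join(".", steps);
    }
}
```

Resolving the path at compile time means the POJOs need no parent pointers, which also keeps the same approach usable for a C/C++ translation.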

Implementation Notes

  • ParseError and UnparseError must be supported. Both are always fatal, as there are no points of uncertainty and no backtracking.

  • RuntimeSDE ??

    • TBD: Can runtime SDEs occur? We may have eliminated all possibilities for them.

  • PState and UState (state of parser/unparser) are thread specific. All other data structures are shared/thread-safe.

  • JUnit-style tests should be easily created. This can use Scala so as to take advantage of XML syntax in the language so that schemas can be created in the test files.
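The thread-safety note above can be sketched like this. `PStateSketch` and `SharedParserSketch` are hypothetical stand-ins, and the "format" (a fixed number of unsigned bytes) is invented for the example. The parser object holds only immutable configuration and can be shared, while all mutable state lives in an object created per parse call.

```java
/** Hypothetical per-parse mutable state; never shared between threads. */
final class PStateSketch {
    final byte[] data;
    int pos = 0;
    PStateSketch(byte[] data) { this.data = data; }
}

/** Hypothetical parser holding only immutable, shareable configuration. */
final class SharedParserSketch {
    private final int fieldCount;

    SharedParserSketch(int fieldCount) { this.fieldCount = fieldCount; }

    int[] parse(byte[] data) {
        // Fresh state for every call, so concurrent calls do not interfere.
        PStateSketch state = new PStateSketch(data);
        int[] out = new int[fieldCount];
        for (int i = 0; i < fieldCount; i++) {
            if (state.pos >= state.data.length) {
                // Always fatal: there is no point of uncertainty to return to.
                throw new IllegalStateException("parse error: out of data at byte " + state.pos);
            }
            out[i] = state.data[state.pos++] & 0xFF;
        }
        return out;
    }
}
```

One `SharedParserSketch` instance can then be called from several threads at once, each call working on its own `PStateSketch`.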
