Introduction
Daffodil has a module called daffodil-runtime1. The suffix "1" was intentional, to suggest that there would be other kinds of Daffodil backends in the future.
The goals of Runtime 1 were to get a correct, complete implementation of DFDL as quickly as possible. Making it as efficient as possible was important, but secondary to completeness and correctness. Other goals included streaming behavior (when the DFDL schema allows it), so that data larger than memory can be parsed/unparsed.
Today there are many parties interested in Daffodil who have different requirements:
...
In many cases these requirements are primary, and are more important than numerous DFDL language features which are not used by the data formats of interest to these users.
The module daffodil-codegen-c is an additional backend for Daffodil intended to be a first attempt to accommodate some of these needs, and to illustrate how alternative Daffodil backends can be created.
Initially, daffodil-codegen-c is a very minimalist system which handles only a tiny subset of DFDL, from which it generates C source code for separate compilation and use. The subset of DFDL can increase in size over time, but initially it is intended to be the smallest possible subset that will illustrate how an alternative backend can be constructed.
Daffodil Primary Data Structures
Daffodil is a compiler and runtime. The primary data structures are:
...
The current Parser/Unparser objects are specific to daffodil-runtime1. Introduction of daffodil-codegen-c requires replacing Parser/Unparser above with CodeGenerator. This is an object which is the output of the Daffodil schema compiler and which encapsulates generator-specific optimizations and behavior. The above list becomes:
- DSOM
- Gram
- RuntimeGenerator
- For daffodil-runtime1 - Parser/Unparser objects created by parser() and unparser() methods - these are the runtime-specific objects that actually carry out parsing/unparsing. There are also RuntimeData classes which store information used by parsers/unparsers, and Evaluatable objects which encapsulate compiled expressions for evaluation at runtime.
- For daffodil-codegen-c - CodeGenerator objects created by the forLanguage() method. Calling their generateCode() method returns a newly created code directory containing C source code for parsing/unparsing a given schema.
The DSOM and Gram layers of the Daffodil schema compiler should be runtime-independent.
Note: We expect some refactoring, to move runtime-specific code down into the RuntimeGenerator layer.
...
DSOM - Daffodil Schema Object Model
Daffodil contains a DFDL schema compiler, primarily implemented in the daffodil-core module, and a runtime (as of 2019-09-25, daffodil-runtime1).
...
The runtimeData methods (runtimeData, termRuntimeData, modelGroupRuntimeData, ElementRuntimeData, etc.). These methods return RuntimeData objects, which are defined in daffodil-runtime1.
Gram or Grammar Objects
The DFDL specification contains something called the "Data Syntax Grammar". The notion is that data describable by DFDL must be something the data syntax grammar can describe.
Internally to Daffodil there is an object taxonomy rooted at the Gram trait (a Scala trait is much like a Java interface). The Gram objects implement something that was intended to be a realistic implementation of the data syntax grammar found in the specification, but it has drifted substantially from anything closely related to what is in the DFDL spec. The grammar in the DFDL specification is not suited to actual implementation. As a result, the Gram structures in Daffodil in many places are not structured very much like grammar rules at all.
...
An important goal for the Gram objects is that the rules and the optimizations they perform are independent of any backend/runtime strategy. That is, they are universal. For example, a data format that does not use any sort of delimiters does not need any Gram objects corresponding to initiators, separators, or terminators. Hence, all grammar rules associated with those regions of the grammar are folded away and disappear from the effective grammar describing the data format.
...
The leaves or terminals of the grammar are implementations of the Terminal class. These are generally called the grammar primitives, and those are where the backend-independent code meets the backend-specific code.
Runtime Objects
The primitives of the grammar are where the schema compiler actually constructs the runtime artifacts.
For daffodil-runtime1, this is done with the parser() and unparser() methods of the grammar primitives. These return runtime1's Parser and Unparser class instances.
For daffodil-codegen-c, the generateCode() method invokes the codegen-c backend to generate the implementation.
To avoid exposing much about the code generator to the schema compiler, the schema compiler's forLanguage() method returns a CodeGenerator object instantiated via the reflection API. The notion is that the CodeGenerator object starts with some initial internal state and builds up declarations and code within it when you call its generateCode() method. Ultimately, the CodeGenerator object ends up with a complete copy of the generated code, writes it out to source files, and returns the directory containing the source files.
DPath - Expressions in the DFDL Path Language
Expression - the DPath Abstract Syntax Tree
Daffodil’s schema compiler (in daffodil-core) also compiles DFDL’s expression language, which we call DPath (for DFDL Path), which is closely related to the standard XPath 2.0 language. We refer to the part of Daffodil’s schema compiler that compiles DPath expressions as the DPath compiler.
...
The DPath compiler attempts to fold constants in expressions by attempting to evaluate expressions at compile time. Expressions that produce values without attempting to parse data are replaced by those constant values. This uses daffodil-runtime1's implementation of DPath, in a mode where attempting to access data or attempting to access the runtime infoset tree both result in failure and the expression being deemed "not constant".
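The folding rule described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the node kinds, the Expr struct, and try_fold() are hypothetical and are not Daffodil's DPath data structures (which are written in Scala). The sketch only shows the idea that evaluation either yields a constant or is abandoned as "not constant" the moment it would need the infoset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical expression node kinds; not Daffodil's actual DPath AST. */
typedef enum { EXPR_LITERAL, EXPR_ADD, EXPR_MUL, EXPR_INFOSET_REF } ExprKind;

typedef struct Expr {
    ExprKind kind;
    long value;               /* used only by EXPR_LITERAL */
    const struct Expr *left;  /* used only by EXPR_ADD and EXPR_MUL */
    const struct Expr *right; /* used only by EXPR_ADD and EXPR_MUL */
} Expr;

/* Tries to evaluate expr at compile time.
 * Returns true and stores the result in *out when the expression is constant.
 * Returns false ("not constant") as soon as evaluation would need the infoset. */
bool try_fold(const Expr *expr, long *out)
{
    long l = 0;
    long r = 0;

    assert(expr != NULL && out != NULL);
    switch (expr->kind) {
    case EXPR_LITERAL:
        *out = expr->value;
        return true;
    case EXPR_ADD:
    case EXPR_MUL:
        if (!try_fold(expr->left, &l) || !try_fold(expr->right, &r))
            return false;
        *out = (expr->kind == EXPR_ADD) ? l + r : l * r;
        return true;
    case EXPR_INFOSET_REF:
    default:
        /* Data is not available at compile time, so folding fails. */
        return false;
    }
}
```

An expression built only from literals folds to a single value, while any expression containing an infoset reference is left for evaluation at runtime.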
...
Ultimately, the DPath compiler produces a CompiledExpression object which is implemented either as a ConstantExpression (when constant folding worked), or a RuntimeExpressionDPath object which contains the CompiledDPath.
Introducing additional backends beyond runtime1 requires introducing a new class ExpressionRuntimeGenerator. The compile() method of DFDLPathExpressionParser currently returns a CompiledExpression which is a daffodil-runtime1 object. We need the compile() method to instead return an ExpressionRuntimeGenerator which subsequently can be called for the runtime1 case to produce a CompiledExpression object. The RecipeOp classes currently have a run() method. This must be refactored so that the run() method becomes part of a Runtime 1 data structure, and alternate backends can have their own realizations. Effectively each RecipeOp becomes a generator of a "real" runtime1 RecipeOp, or of that of some other backend.
...
Code Generator Design
Simplifying Assumptions
DFDL allows conforming subsets of features. The Code Generator subset will be, roughly, the smallest possible conforming subset of DFDL.
The table below is derived from Section 21 of the DFDL specification 1.0, with a third column added specifying the implementation goal for code generation.
Feature | Detection | Implemented in Code Generator? |
Validation | External switch | No |
Named Formats | dfdl:defineFormat or dfdl:ref | Yes |
Choices | xs:choice in xsd | Not initially. Will be added with restrictions. |
Arrays where size not known in advance | dfdl:occursCountKind 'implicit', 'parsed', 'stopValue' | Not initially. Will be added with restrictions. |
Expressions | Use of a DFDL expression in any property value | Not initially. Will be added with restrictions. |
End of parent | dfdl:lengthKind "endOfParent" | No |
Simple type restrictions | xs:simpleType in xsd | Tolerated. (Requires no work in a backend which already implements the underlying primitive simple type of such derivations, but the additional facets such types can provide are not checked when there is no validation.) |
Text representation for types other than String | dfdl:representation "text" for Number, Calendar or Boolean types | No. |
Delimiters | dfdl:separator <> "" or dfdl:initiator <> "" or dfdl:terminator <> "" or dfdl:lengthKind "delimited" | No. |
Nils | XSDL nillable 'true' in xsd | No. |
Defaults | XSDL default or fixed in xsd | No. |
Bi-Directional text. | dfdl:textBiDi 'yes' | No. (Note: This is being dropped from DFDL v1.0 because there are no implementations as yet. ) |
Lengths in Bits | dfdl:alignmentUnits 'bits' or dfdl:lengthUnits 'bits' | Initially No. Eventually Yes. |
Delimited lengths and representation binary element | dfdl:representation 'binary' (or implied binary) and dfdl:lengthKind 'delimited' | No |
Regular expressions | dfdl:lengthKind 'pattern', dfdl:assert with dfdl:testKind 'pattern', dfdl:discriminator with dfdl:testKind 'pattern' | No |
Zoned numbers | dfdl:textNumberRep 'zoned' | No |
IBM 390 packed numbers | dfdl:binaryNumberRep 'packed' | No |
IBM 390 packed calendars | dfdl:binaryCalendarRep 'packed' | No |
IBM 390 floats | dfdl:binaryFloatRep 'ibm390Hex' | No |
Unordered sequences | dfdl:sequenceKind 'unordered' | No |
Floating elements | dfdl:floating 'yes' | No |
dfdl functions in expression language | dfdl:functions in expression | Not initially. |
Hidden groups | dfdl:hiddenGroupRef <> '' | Not initially. Eventually should be added. (Is used with Calculated Values feature.) |
Calculated values | dfdl:inputValueCalc <> '' or dfdl:outputValueCalc <> '' | Not initially. Eventually should be added. |
Escape schemes | dfdl:defineEscapeScheme in xsd | No |
Extended encodings | Any dfdl:encoding value beyond the core list | No |
Asserts | dfdl:assert in xsd | No |
Discriminators | dfdl:discriminator in xsd | No |
Prefixed lengths | dfdl:lengthKind 'prefixed' | No |
Variables | dfdl:defineVariable, dfdl:newVariableInstance, dfdl:setVariable, or variables in the DFDL expression language. (Note that variables as a feature depend on the Expressions feature.) | No |
BCD calendars | dfdl:binaryCalendarRep "bcd" | No |
BCD numbers | dfdl:binaryNumberRep "bcd" | No |
Multiple schemas | xs:include or xs:import in xsd | Yes. (Requires no work in a runtime backend.) |
IBM 4690 packed numbers | dfdl:binaryNumberRep "ibm4690Packed" | No |
IBM 4690 packed calendars | dfdl:binaryCalendarRep "ibm4690Packed" | No |
DFDL Byte Value Entities | Use of %#r syntax in a DFDL String Literal other than the dfdl:fillByte property | No |
DFDL Standard Character Set Encodings | dfdl:encoding name begins with "X-DFDL-". | No |
Bit Order - Least Significant Bit First | dfdl:bitOrder with value 'leastSignificantBitFirst' | No |
...
Additional characteristics we expect Code Generators to have, which simplify the implementation:
Each parse operation consumes data from an input stream and produces a data structure. This data structure is not produced incrementally, but all at once.
Rationale: This eliminates the technological hurdles of streaming parsing, which is really only needed for massively large data objects - large enough that bringing them into memory as an infoset object is problematic.
Each unparse operation consumes one entire, fully populated data structure and produces data to an output stream.
Rationale: This massively simplifies unparsing by allowing expression evaluation to always assume the entire "infoset" object is already constructed. Expression evaluation never needs to support streaming, that is, be suspended waiting for additional infoset events to arrive.
I/O is byte-centric.
Data is byte-centric. That is, no element can have a size that is not a whole number of bytes. Alignment is always 1 byte.
Data can be big or little endian.
Character sets are all byte-oriented. Their code units are 8-bit bytes minimum.
Rationale: This set of constraints ensures ordinary C I/O supplies most of the I/O layer natively.
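As a rough illustration of why these constraints keep the I/O layer small, the sketch below reads and writes whole-byte, byte-aligned integers in either byte order using nothing beyond plain C. The function names are hypothetical and are not taken from the generated code or its runtime library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reads a 4-byte unsigned integer, most significant byte first. */
uint32_t read_u32_be(const uint8_t *buf)
{
    assert(buf != NULL);
    return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];
}

/* Reads a 4-byte unsigned integer, least significant byte first. */
uint32_t read_u32_le(const uint8_t *buf)
{
    assert(buf != NULL);
    return ((uint32_t)buf[3] << 24) | ((uint32_t)buf[2] << 16) |
           ((uint32_t)buf[1] << 8) | (uint32_t)buf[0];
}

/* Reads a 2-byte big-endian integer stored in 2's complement form. */
int16_t read_i16_be(const uint8_t *buf)
{
    uint16_t raw;

    assert(buf != NULL);
    raw = (uint16_t)(((uint16_t)buf[0] << 8) | (uint16_t)buf[1]);
    /* Map 0x8000..0xFFFF onto -32768..-1 without relying on
     * implementation-defined narrowing conversions. */
    return (raw >= 0x8000u) ? (int16_t)((int32_t)raw - 0x10000) : (int16_t)raw;
}

/* Writes a 4-byte unsigned integer, most significant byte first. */
void write_u32_be(uint32_t value, uint8_t *buf)
{
    assert(buf != NULL);
    buf[0] = (uint8_t)(value >> 24);
    buf[1] = (uint8_t)(value >> 16);
    buf[2] = (uint8_t)(value >> 8);
    buf[3] = (uint8_t)value;
}
```

Because every element starts on a byte boundary and occupies a whole number of bytes, no bit-level buffering or shifting across element boundaries is needed.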
...
Only lengthKind 'explicit' or 'implicit' for simple types, and only lengthKind 'implicit' for complex types.
Only types boolean, long, int, short, byte, unsignedLong, unsignedInt, unsignedShort, unsignedByte, float, double, hexBinary, and string are supported.
Leaves out decimal, integers greater than 64 bits long, and date/time related types.
The dfdl:representation is always 'binary'. No text numbers are supported.
As the dfdl:binaryNumberRep is always 'binary', integers are fixed-length 2’s complement.
When arrays and choices are added, note that allowing only occursCountKind="expression", and choices with only dfdl:choiceDispatchKey and dfdl:choiceBranchKey, implies that no backtracking/discrimination is required.
Rationale: This and requiring only dfdl:occursCountKind='expression' means there are no points of uncertainty, so there is no backtracking.
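The absence of backtracking can be sketched as follows. The Message layout, the tag values, and parse_message() are all hypothetical, chosen only to show that when a previously parsed value selects the branch directly, the parser either takes exactly one branch or fails outright.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical message: a tag byte selects exactly one of two branches. */
typedef struct {
    uint8_t tag; /* plays the role of the dfdl:choiceDispatchKey value */
    union {
        uint8_t small;  /* branch key 1: one byte */
        uint16_t large; /* branch key 2: two bytes, big-endian */
    } body;
} Message;

/* Parses one Message from buf.
 * Returns the number of bytes consumed, or 0 on a (fatal) parse error. */
size_t parse_message(const uint8_t *buf, size_t len, Message *out)
{
    assert(buf != NULL && out != NULL);
    if (len < 1)
        return 0;
    out->tag = buf[0];
    switch (out->tag) {
    case 1:
        if (len < 2)
            return 0;
        out->body.small = buf[1];
        return 2;
    case 2:
        if (len < 3)
            return 0;
        out->body.large = (uint16_t)(((uint16_t)buf[1] << 8) | (uint16_t)buf[2]);
        return 3;
    default:
        /* No branch key matches. There is no alternative to try next,
         * so this is simply an error. */
        return 0;
    }
}
```

The same reasoning applies to arrays: with an occurrence count computed from already-parsed data, a loop runs a known number of times and never needs to speculate about where the array ends.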
...
Properties that end up needed, but shouldn't be (e.g., anything about text numbers, anything about date/time), are bugs in Daffodil that should be reported. An include-file DFDL format definition should hide these from users, so they are not distracting.
Phases
The above restrictions on the features suggest dividing up the implementation of Code Generators into two distinct phases:
- Phase 1: (aka Runtime2P1) No expressions. All lengths are fixed. All arrays have fixed length.
- Phase 2: (aka Runtime2P2) Adding the DFDL expression language, lengthKind 'explicit', occursCountKind 'expression'.
Goals
- Use the code generation library contributed by Julian Feinauer so as to have the possibility of Java, C++, and Python backends.
- Initial focus is a backend where the Infoset is composed from C struct definitions. The C struct definitions are part of the generated code, which is output as one or multiple text files.
- DPath expressions (when implemented) compile into native C language expressions that navigate Infoset structs the way handwritten code would.
- The amount of runtime-library code should produce a minimum footprint.
- Satisfy the requirements that caused the PLC4X project to create their own MSpec data format language. (Alternatively, Daffodil with a code generator backend should be a good target for MSpec compilation.)
With one exception: DFDL is still going to be XML Schema based. Changing the syntax of the DFDL language is out of scope, as that's a front-end project. This is a runtime/back-end project.
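To make the struct-based Infoset goal concrete, here is a minimal sketch under stated assumptions: the Header and Image structs and the image_pixel_count() function are invented for illustration and do not reflect the layout or naming that the code generator actually emits. The point is only that an expression over the infoset can become ordinary member access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical infoset for a schema with a header and a fixed-size payload. */
typedef struct {
    uint16_t width;
    uint16_t height;
} Header;

typedef struct {
    Header header;
    uint8_t pixels[4];
} Image;

/* What a DPath expression such as { ../header/width * ../header/height }
 * might compile to: plain member access, with no expression interpreter
 * involved at runtime. */
uint32_t image_pixel_count(const Image *image)
{
    assert(image != NULL);
    return (uint32_t)image->header.width * (uint32_t)image->header.height;
}
```

A handwritten parser for the same format would likely navigate the data in much the same way, which is the property the goals above ask for.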
Random Implementation Notes
ParseError and UnparseError must be supported. Both are always fatal as there are no points-of-uncertainty/backtracking.
RuntimeSDE ??
TBD: Can runtime SDEs occur? We may have eliminated all possibilities for them.
PState and UState (state of parser/unparser) and mutable data structures reachable from them are thread specific. All other data structures are shared across threads and immutable or thread-safe.
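The notes on fatal errors and thread-specific state could be combined along these lines. This is a sketch only: the fields of this PState struct and the pstate_read_u8() helper are assumptions for illustration, not the definitions used by any Daffodil backend.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parser state. Each thread would own its own instance,
 * for example as a local variable, so no locking is needed. */
typedef struct {
    const uint8_t *data; /* input bytes; may be shared because never modified */
    size_t length;       /* number of bytes in data */
    size_t position;     /* index of the next byte to read */
    const char *error;   /* NULL until the first parse error, which is fatal */
} PState;

/* Reads one byte. On end of data it records an error and returns 0.
 * Once an error is recorded, every later read is a no-op returning 0,
 * because there is nothing to backtrack to. */
uint8_t pstate_read_u8(PState *ps)
{
    assert(ps != NULL);
    if (ps->error != NULL)
        return 0;
    if (ps->position >= ps->length) {
        ps->error = "Parse error: unexpected end of data";
        return 0;
    }
    return ps->data[ps->position++];
}
```

Callers can run a whole sequence of reads and check the error field once at the end, since the first error makes all later reads harmless.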
TDML-style tests should be easily created. This can use Scala to take advantage of XML syntax in the language, so that schemas can be created in the test files.