
...

type="xs:anyURI" and dfdlx:objectKind

DFDL is extended to allow simple types to have the xs:anyURI type. Elements with this type will be treated as BLOB or CLOB objects. The dfdlx:objectKind property is added to define what type of object it is. Valid values for this property are "bytes" for binary large objects and "characters" for character large objects.

An example of this usage in a DFDL schema may look something like this:

...

With the 1024 bytes of data being written to a file at location /path/to/blob/data.

For this initial proposal, the BLOB URI will always use the file scheme and must be absolute. Although this may be a restrictive limitation for some use cases, the flexibility and generality of URIs allow for future enhancement to support different or even custom schemes if needed.
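The two constraints on the BLOB URI can be illustrated with a minimal Java sketch. The class and method names are hypothetical and are not part of Daffodil. Reading "absolute" as "has a scheme and a hierarchical path" is an assumption made for this example.

```java
import java.net.URI;

// Hypothetical helper, not part of Daffodil: checks the constraints stated above.
final class BlobUriCheck {
    /**
     * Returns true only for URIs that have a scheme (absolute), are hierarchical
     * (not opaque, so "file:relative/path" is rejected), and use the "file" scheme.
     * Treating opaque file URIs as invalid is an assumption of this sketch.
     */
    static boolean isValidBlobUri(URI uri) {
        return uri.isAbsolute()
            && !uri.isOpaque()
            && "file".equalsIgnoreCase(uri.getScheme());
    }
}
```

With this check, file:///path/to/blob/data is accepted, while a bare relative path or an http URI is rejected.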

One benefit of this proposal is its simplicity and non-reliance on other DFDL extensions (e.g. one does not need to implement the DFDL layer extension to support this).

Regarding compatibility, any implementations that do not support this extension will likely error with an unsupported xs:anyURI type. However, because the other syntax and behavior are very similar to types with xs:hexBinary, modifications to switch from xs:anyURI to xs:hexBinary should be minimal.

Daffodil API

With a new simple type defined, some changes to the API are needed to specify where Daffodil should write these new BLOB files. A likely use case is a need to define a different BLOB output directory for each call to parse(). Thus, changes to the API must be made to pass the output directory either directly to the parse() function or to a parameter already passed to the parse function. Since the InfosetOutputter is related to parse output, and the BLOB file is a sort of output, it makes the most sense for the definitions that control BLOB file output to be added to the InfosetOutputter.

...

Note that no changes to the unparse() API are required, since the BLOB URI provides all the necessary information to retrieve files containing BLOB data.

Schema Compilation

...

  1. As with hexBinary, determine the starting bitPosition and length of the hexBinary content
  2. Create a new BLOB file using directory/prefix/suffix information set in the InfosetOutputter.
  3. Open the newly created file using a FileOutputStream. If opening of the file fails, throw a Schema Definition Error.
  4. Read length bytes of data from the ParseState dataInputStream and write them out to the FileOutputStream. Chunk the reads into smaller byte lengths to minimize total memory required and to support >2GB of data. If at any point no more bytes are available, throw a PENotEnoughBits parse error. If there is an IOException, throw a Schema Definition Error.
  5. Close the file stream.
  6. Set the value of the current element to the URI of the File.
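The parse steps above can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the class name, method name, and chunk size are hypothetical. A plain InputStream stands in for the ParseState dataInputStream, and ordinary Java exceptions stand in for the PENotEnoughBits parse error and the Schema Definition Error. The starting position from step 1 is assumed to be handled by the caller.

```java
import java.io.EOFException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the parse steps; not Daffodil's actual implementation.
final class BlobParseSketch {
    private static final int CHUNK_SIZE = 8192; // assumed chunk size

    /**
     * Copies exactly `length` bytes from `data` into a newly created file in `dir`
     * and returns that file's URI. EOFException stands in for PENotEnoughBits;
     * any other IOException stands in for a Schema Definition Error.
     */
    static URI writeBlob(InputStream data, long length, Path dir,
                         String prefix, String suffix) throws IOException {
        // Step 2: create the BLOB file from directory/prefix/suffix information.
        Path blob = Files.createTempFile(dir, prefix, suffix);
        // Steps 3 and 5: open the file; try-with-resources closes it afterwards.
        try (OutputStream out = new FileOutputStream(blob.toFile())) {
            byte[] buf = new byte[CHUNK_SIZE];
            long remaining = length; // long, so more than 2GB can be copied
            // Step 4: copy in chunks so memory use stays bounded.
            while (remaining > 0) {
                int want = (int) Math.min(buf.length, remaining);
                int got = data.read(buf, 0, want);
                if (got < 0) {
                    throw new EOFException(remaining + " more bytes were needed");
                }
                out.write(buf, 0, got);
                remaining -= got;
            }
        }
        // Step 6: the element value becomes the URI of the file.
        return blob.toUri();
    }
}
```

Note that this sketch leaves a partially written file behind when the data runs out. Removing such files is a separate concern.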

Additionally, logic must be created to remove BLOB files if Daffodil backtracks past an already created BLOB. This can be handled by storing the list of BLOB files in the PState and deleting the appropriate files in the list before resetting to an earlier state.
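One possible shape for this bookkeeping is a list with mark and reset operations, sketched below in Java. The BlobTracker class and its methods are hypothetical. How such a list would actually be stored in the PState and tied to its existing mark/reset mechanism is not specified here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of BLOB bookkeeping for backtracking; not Daffodil code.
final class BlobTracker {
    private final List<Path> blobs = new ArrayList<>();

    /** Records a BLOB file created during parsing. */
    void record(Path blob) {
        blobs.add(blob);
    }

    /** Returns a mark describing the current state (the number of BLOBs so far). */
    int mark() {
        return blobs.size();
    }

    /** Deletes every BLOB created after `mark` and forgets it. */
    void resetTo(int mark) throws IOException {
        while (blobs.size() > mark) {
            Path p = blobs.remove(blobs.size() - 1);
            Files.deleteIfExists(p);
        }
    }

    /** Returns a copy of the BLOB files that are still live. */
    List<Path> liveBlobs() {
        return new ArrayList<>(blobs);
    }
}
```

A caller would take a mark before a point of uncertainty and call resetTo(mark) when backtracking, so only files created after the mark are removed.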

...

  1. Get the URI from the infoset and the file length. If the length cannot be determined, throw an UnparseError.
  2. As with hexBinary, determine the length of the hexBinary content, and error if the BLOB file length is larger than the content length.
  3. Open the File using a FileInputStream. If opening of the file fails, throw an UnparseError.
  4. Read bytes from the FileInputStream and write them to the UState dataOutputStream. Chunk the reads into smaller byte lengths to minimize total memory required and to support >2GB of data. If at any point there is an IOException, throw an UnparseError.
  5. As with hexBinary, write skip bits if the content length is not filled completely.
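The unparse steps above can be sketched in Java as follows. The class and method names are hypothetical. A plain OutputStream stands in for the UState dataOutputStream, and IOException stands in for UnparseError. Lengths are treated as whole bytes, and zero bytes stand in for skip bits; both are simplifying assumptions.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

// Hypothetical sketch of the unparse steps; not Daffodil's actual implementation.
final class BlobUnparseSketch {
    private static final int CHUNK_SIZE = 8192; // assumed chunk size

    /**
     * Writes the file named by `blobUri` to `out`, then pads with zero bytes up to
     * `contentLength`. IOException stands in for UnparseError in this sketch.
     */
    static void unparseBlob(URI blobUri, long contentLength, OutputStream out)
            throws IOException {
        // Step 1: resolve the URI and determine the file length.
        File file = new File(blobUri);
        if (!file.isFile()) {
            throw new IOException("cannot determine length of " + blobUri);
        }
        long fileLength = file.length();
        // Step 2: the BLOB must fit within the content length.
        if (fileLength > contentLength) {
            throw new IOException("BLOB is larger than the content length");
        }
        long written = 0;
        // Steps 3 and 4: open the file and copy it in chunks.
        try (InputStream in = new FileInputStream(file)) {
            byte[] buf = new byte[CHUNK_SIZE];
            int got;
            while ((got = in.read(buf)) != -1) {
                out.write(buf, 0, got);
                written += got;
            }
        }
        // Step 5: fill the remainder (zero bytes stand in for skip bits).
        for (long i = written; i < contentLength; i++) {
            out.write(0);
        }
    }
}
```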

...

This proposal adds the restriction that expression access to the data of a BLOB element is not allowed. This limitation is really for practical purposes. Presumably, the xs:anyURI type is only to be used because the data is very large or meaningless, and so accessing the data is unnecessary. This restriction minimizes complexity since expressions do not need to worry about converting blobs to byte arrays or something else. If it is later determined that such a feature is needed, this restriction may be lifted. Any access to the data of a BLOB will result in a Schema Definition Error during schema compilation.

...

  1. Use the new API to specify a temp directory for BLOBs to be stored
      
  2. Perform type-aware comparisons for the xs:anyURI type, similar to what we do now for xs:date, xs:dateTime, and xs:time. Type awareness will be enabled by using the xsi:type attribute on the expected infoset, since Daffodil does not yet support adding xsi:type information to the actual infoset. An example looks something like:

    <tdml:dfdlInfoset>
      <data xsi:type="dfdlx:blob">path/to/blob/data</data>
    </tdml:dfdlInfoset>

    During type aware comparisons, the TDML Runner will extract and modify the path (e.g. find the file and convert it to absolute in the infoset) to be suitable for use in logic similar to finding files using the type="file" attribute for expected infosets. Once the expected file is found, it will compare the contents of that file with the contents of the URI specified in the actual infoset and report any differences as usual.
      

  3. After a test completes, delete all BLOB files listed in the InfosetOutputter
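The content comparison described in step 2 could look roughly like the Java sketch below. The class and method names are hypothetical and this is not TDML Runner code. It assumes the expected file has already been located and that the actual infoset holds an absolute file URI.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch of a BLOB content comparison; not TDML Runner code.
final class BlobCompare {
    /**
     * Returns the byte offset of the first difference between the expected file and
     * the file named by the actual infoset URI, or -1 if the contents are identical.
     * If one file is a prefix of the other, the shorter file's length is returned.
     */
    static long firstMismatch(Path expected, URI actualUri) throws IOException {
        Path actual = Paths.get(actualUri);
        try (InputStream e = new BufferedInputStream(Files.newInputStream(expected));
             InputStream a = new BufferedInputStream(Files.newInputStream(actual))) {
            long pos = 0;
            while (true) {
                int eb = e.read();
                int ab = a.read();
                if (eb != ab) {
                    return pos; // differing byte, or one stream ended early
                }
                if (eb == -1) {
                    return -1; // both streams ended together: identical
                }
                pos++;
            }
        }
    }
}
```

Streaming both files keeps memory use flat regardless of BLOB size, which matters if test BLOBs are large.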

...