...

This also eliminates the limitation on object size.

The Type of BLOBs is xs:string

Basic Blob Requirements

The basic requirement has almost nothing to do with DFDL.

We want to represent an image file in XML, except for the BLOB of compressed image data, which we want to reliably incorporate by reference.

So instead of

Code Block
<?xml version="1.0" ?>
<someImage>
  <lat>44.9090</lat>
  <lon>70.2929</lon>
  <img>
   098fad0965edcab...giant.hexbinary.string...many megs or gigs in size
  </img>
</someImage>

Instead, the bytes corresponding to the image data go in a separate file, "img.dat", and the infoset becomes

Code Block
<?xml version="1.0" ?>
<someImage>
  <lat>44.9090</lat>
  <lon>70.2929</lon>
  ... some way of saying img.dat blob goes here...
</someImage>

A few requirements:

  1. The document must still be validated relative to its DFDL schema/XML Schema - so the BLOB must be content that can be validated. That suggests it is an element. This validation does not have to touch or even verify the existence of the BLOB file.
    1. This means the element must be expressed in DFDL's subset of XML Schema. Hence, it is not an element with attributes, as attributes aren't part of the DFDL schema language.
  2. The BLOB must be able to refer to a region of bytes within a file. This is so that DFDL can be used to identify the location of the BLOB in a file being parsed, without having to copy or bring into memory the BLOB data. Rather, the Infoset can contain a BLOB that identifies the original file, and the location within it.
    1. Note: This is a special case of a general capability for any element in a DFDL schema - users may want to know its exact starting position and length, measured in bits or bytes, if only for trace/debug or verification purposes.
  3. It should not require Daffodil to be used to manipulate these XML files that contain BLOB references. Interpreting the BLOB information should not require information bases that are maintained by Daffodil libraries (e.g., mappings from GUIDs to files)
    1. We may want to provide a convenient Scala/Java library for this; it should not be bundled into the Daffodil libraries, but should be easily isolated.

One concrete suggestion is:

Code Block
<?xml version="1.0" ?>
<someImage>
  <lat>44.9090</lat>
  <lon>70.2929</lon>
  <img><BLOB daf:BLOB="true">../blobs/img.dat?offset0b=0;kind=raw</BLOB></img>
</someImage>

In the above we've introduced an element named BLOB that takes a special URI, which can be absolute or relative and identifies the blob data. The offset0b keyword is a zero-based byte offset into the file where the BLOB data starts. The "0b" suffix on the name indicates that it is zero-based, to distinguish it from normal XML conventions, which are 1-based. The value of offset0b defaults to 0. An optional length=N keyword would constrain the length of the BLOB data, and kind=raw indicates that the data is not encoded or compressed in any way. (kind=raw would be the default.)

This URL would be parsed by conventional URL libraries. The part after the "?", called the query, is a ";"-separated list of pairs of "keyword=value" form.
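As a rough illustration of the parsing described above, the query of the proposed BLOB URI can be pulled apart with the standard java.net.URI class and simple string splitting. This is a sketch, not part of any Daffodil API; the keyword names and defaults (offset0b=0, kind=raw) are taken from the proposal above, and the class name BlobUriParser is hypothetical.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class BlobUriParser {
    // Parses a BLOB URI such as "../blobs/img.dat?offset0b=0;kind=raw"
    // into a keyword->value map, applying the proposal's defaults.
    public static Map<String, String> parseQuery(String uriString) {
        Map<String, String> params = new HashMap<>();
        params.put("offset0b", "0"); // proposal default: start of file
        params.put("kind", "raw");   // proposal default: no encoding/compression
        String query = URI.create(uriString).getQuery(); // part after "?"
        if (query != null) {
            for (String pair : query.split(";")) {       // ";"-separated pairs
                String[] kv = pair.split("=", 2);
                if (kv.length == 2) params.put(kv[0], kv[1]);
            }
        }
        return params;
    }
}
```

Note that java.net.URI accepts relative references such as "../blobs/img.dat?...", so the same code handles absolute and relative BLOB URIs.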

BLOBs as Layers

A DFDL schema using a BLOB would look, for example, like this:

Code Block
<element name="img" >
  <complexType>
    <sequence>
      <element name="BLOB" daf:layerBoundaryMark="[END-IMAGE]"
         type="daf:URI4BLOB" daf:layerTransform="daf:BLOB" daf:layerLengthKind="boundaryMark"/>
    </sequence>
  </complexType>
</element>

A schema containing daf:URI4BLOB would be provided and would contain roughly:

Code Block
<simpleType name="URI4BLOB" dfdl:encoding="utf-8">
  <restriction base="xs:string">
     <pattern value="..regex for these URIs.."/>
  </restriction>
</simpleType>

Here we see that a BLOB is actually created by way of a layering. The BLOB layer implements isolation of the BLOB contents and, when parsing, produces bytes containing the URI in UTF-8 encoding. When unparsing, the layer transform takes the URI and obtains the corresponding bytes by opening the URI to obtain a Java InputStream. We think of BLOBs as a replacement for hexBinary-type objects. However, it is more sensible to model a BLOB in a DFDL schema as a string; the DFDL BLOB-related properties indicate that this string will not be the data itself, but a BLOB handle/URI that can be used to access the BLOB data.

BLOB Use Cases

There are a few different use cases. The variations have to do with how the BLOB data is accessed, over what time span it is accessible, and when resources can be reclaimed.

...

The blob handle can be opened, to get an input stream, and the bytes in it read like any Java InputStream.

The parser must be run in a BLOB='persistent' mode (API TBD) that tells it to create permanent URIs and never to release/delete the underlying resources in any automatic way.

An API provides the ability to create a blob handle together with a Java OutputStream (the two are created simultaneously), which can then be opened, written, and closed/flushed; the blob handle can then be used as a replacement for an input blob handle.
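To make the handle/stream pairing concrete, here is a minimal in-memory sketch of the idea. The class name BlobHandle and its methods are hypothetical, not a real Daffodil API, and a real implementation would presumably back the handle with a file or the original input rather than a heap buffer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

public class BlobHandle {
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

    // The handle and its OutputStream are created simultaneously;
    // the caller writes the BLOB bytes and then closes/flushes.
    public OutputStream outputStream() {
        return buffer;
    }

    // After writing completes, the same handle serves as an input
    // blob handle: opening it yields the bytes that were written.
    public InputStream open() {
        return new ByteArrayInputStream(buffer.toByteArray());
    }
}
```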

...

A non-native attribute daf:BLOB='true' is the XML Infoset representation's way of indicating a BLOB-valued element. The blob handle is the value of the element.

The lifetime of the BLOB resources (typically files) is not controlled in any way here any more than the lifetime of the original file.

Single Process, Single Thread, SAX-style Event, Stateless

...

The parser is generating SAX-style infoset events for the start and end of each element.

BLOBs are processed in a streaming mode (API call to set this TBD).

To process BLOB contents, the application's startElement() method would simply have to check for a blob (by calling the isBLOB() method, which is part of the extended API of an event handler).

(TBD: alternatively, we could require the handlers to be special blob-aware handlers with a startBLOBElement() method and an endBLOBElement() method. This is potentially lower overhead.) An extended API (beyond regular SAX) uses a new method of a SAX-event-handler object which is specific to delivering a BLOB event. This BLOB event method is passed an open java.io.InputStream from which the BLOB data can be read. The BLOB handle is invisible from this API. (Maybe it should be provided for diagnostic-message purposes?)

The lifetime of this BLOB input stream is only until the SAX-style event callback returns. At that point the resources/storage can be reclaimed.
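A blob-aware handler along the lines described above might look like the following. This is a sketch of the shape of such an interface, not the actual Daffodil SAX extension; the interface and method names are hypothetical. The key point is that the BLOB event receives an open InputStream that is only valid for the duration of the callback.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical blob-aware event handler. The parser would call
// blobValue() instead of delivering the value as characters, and
// would close (and possibly reclaim) the stream when it returns.
public interface BlobAwareHandler {
    void startElement(String name);

    void endElement(String name);

    // Called for a BLOB-valued element. The stream must be fully
    // consumed (or ignored) before this method returns; it is not
    // valid afterward.
    void blobValue(String name, InputStream blobData) throws IOException;
}
```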

...

Implementation Note: For unparsing, DirectOrBufferedDataOutputStream may need to grow a special form of BufferedDataOutputStream which is a BLOB. There is no point in double-buffering a BLOB; the BLOB object itself is very much a buffer. We simply need to know how to recombine its data into the DirectDataOutputStream at the right point in time.

Motivation for BLOB Feature for DFDL

Data objects larger than what a single JVM object can store (e.g., video or images) may have to be represented in the Infoset by a proxy object. Standard streaming-style events normally produce simple values as regular objects representing the value. If a simple value is larger than a single JVM object can store, then a streaming API to access the value is needed.

BLOBs in the DFDL Infoset

The DFDL Infoset doesn't really specify what the [value] member is for a hexBinary object - that is, it does not specify what the API is for accessing this value. Currently it is Array[Byte], but we can provide other abstractions. Also, the [value] member for type xs:string is assumed to be a java.lang.String, but we can provide other abstractions. Hence, the problem of BLOB objects is different depending on how the infoset is being accessed.

These handle objects would support the ability to open and access the contents of these large objects as a java.nio.Channel or java.io.InputStream (for hexBinary), or a java.io.Reader (for String). For unparsing, channels or the symmetric use of java.io.OutputStream or java.io.Writer are the basic mechanisms.

BLOBs in XML

...


Implementation Concerns

  • Want to avoid copying the BLOB data to a file when possible.
  • Want to take advantage of the ability to store offsets+lengths for BLOBs as locations in the original input file.
  • Ideally, a BLOB needn't be bytes of the original file, but could be bytes inside a layer, such as a compressed region
    • Ideally, such a BLOB could be accessed without having to decompress everything in advance into memory or a file. I.e., such a BLOB could be streamed from its layer, and the decompressing would happen as part of accessing it. 
    • This implies that while an offset may be known, a length may not be.
    • This implies that while an offset may be known, it is not necessarily a file offset, but an offset within some layer-stream.
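For the simple kind=raw case above, resolving a BLOB reference to its bytes without copying the whole file can be done with a positional read on a java.nio FileChannel. This is only a sketch under that assumption (the class name is hypothetical); a layered BLOB, e.g. inside a compressed region, would instead stream through the layer's decoding as just described, and might not have a known length at all.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class RawBlobRegion {
    // Reads `length` bytes starting at the zero-based byte offset
    // `offset0b`, without reading or copying the rest of the file.
    public static byte[] read(Path file, long offset0b, int length) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(length);
            // Positional read: does not depend on or disturb any
            // shared channel position.
            ch.read(buf, offset0b);
            return buf.array();
        }
    }
}
```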

...