
(This page is a work in progress.)

See also: https://rmw42.wordpress.com/2018/05/23/growing-dfdls/ – Russ Williams' ideas related to BLOBs and to solving related problems.

DFDL needs an extension that allows data much larger than memory to be manipulated.

A variety of data formats, such as those for image and video files, consist of fields of what is effectively metadata, surrounding large blocks of compressed image or video data.

An important use case for DFDL is to expose this metadata for easy use, and to provide access to the large data via a streaming mechanism akin to opening a file.

In relational database systems, BLOB (Binary Large Object) and CLOB (Character Large Object) are the types used when the data row returned from an SQL query will not contain the actual value data, but rather a handle that can be used to open/read/write/close the BLOB or CLOB.
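
For comparison, here is a minimal sketch of that handle pattern using the standard java.sql API (the table and column names are made up for illustration):

    import java.io.InputStream;
    import java.sql.Blob;
    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;

    class JdbcBlobExample {
        // The result row carries a handle (java.sql.Blob), not the bytes;
        // the data is streamed on demand via the handle.
        static void readVideo(Connection conn) throws Exception {
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT video FROM media WHERE id = 1")) {
                if (rs.next()) {
                    Blob blob = rs.getBlob("video");            // the handle
                    try (InputStream in = blob.getBinaryStream()) {
                        // stream the BLOB contents here, in bounded memory
                    }
                    blob.free();                                // release the handle
                }
            }
        }
    }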

DFDL needs at least BLOB capability. This would enable processing of images or video of arbitrary size without ever needing to hold all the data in memory.

It would also eliminate the limit on maximum object size.

The Type of BLOBs is xs:string

We think of BLOBs as a replacement for hexBinary-typed objects. However, it is more sensible to model a BLOB in a DFDL schema as a string; the DFDL BLOB-related properties indicate that this string is not the data itself, but a BLOB handle/URI that can be used to access the BLOB data.

BLOB Use Cases

There are a few different use cases. The variations have to do with how the BLOB data is accessed, over what time span it is accessible, and when resources can be reclaimed.

Image Filtering In A Process Pipeline

The parser produces an infoset containing a durable blob handle. This blob handle provides access to the blob data even after the parser has terminated and the process has exited.

The blob handle can be opened to get an input stream, and the bytes in it read like any Java InputStream.

An API provides the ability to create a blob handle for a Java OutputStream (the two can be created simultaneously), which can then be opened, written, and closed/flushed; the blob handle can then be used as a replacement for an input blob handle.

The notion here is that one opens and reads from the input blob handle, processes the data, and, if the data was modified, supplies a replacement blob handle on output.
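
A sketch of such a pipeline stage, using a hypothetical handle API (BlobHandle, BlobFactory, createPair, and openStream are illustrative names for this proposal, not an existing Daffodil API):

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    // Hypothetical handle API, for illustration only.
    interface BlobHandle {
        InputStream openStream() throws IOException;
    }

    interface BlobFactory {
        // Creates a new blob handle and its OutputStream simultaneously.
        Pair createPair() throws IOException;
        interface Pair {
            BlobHandle handle();
            OutputStream outputStream();
        }
    }

    class FilterStage {
        // Read the input BLOB, transform it, and return a replacement handle
        // to substitute into the output infoset.
        static BlobHandle filter(BlobHandle input, BlobFactory blobs) throws IOException {
            BlobFactory.Pair out = blobs.createPair();
            try (InputStream in = input.openStream();
                 OutputStream os = out.outputStream()) {
                byte[] buf = new byte[64 * 1024];
                for (int n; (n = in.read(buf)) != -1; ) {
                    // ... transform buf[0..n) here ...
                    os.write(buf, 0, n);
                }
            } // closing/flushing makes the new handle usable as an input handle
            return out.handle();
        }
    }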

The unparser consumes an infoset containing a blob handle, reads the data from it, and writes that data as the "contents" of the corresponding element.

The parser and unparser are independent processes that do not necessarily overlap in time. Their only communication is through the blob handle. Hence, the blob objects are allocated at the system level and are not part of the state of either the parser or the unparser. (E.g., they could be files.)

A blob handle survives a reboot of the computer system - its state is durable. If you write out the infoset from a parse of data as an XML text file, then reboot the computer, you can read that XML text file, find the BLOB handles within it, and open them.

A blob handle is some opaque URI, supporting the openStream API.
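
For example, if handles are serialized as URIs resolvable by the JVM (e.g., file: URIs, per the note above that blobs could be files), opening one might be as simple as this sketch:

    import java.io.InputStream;
    import java.net.URI;

    class OpenBlob {
        // The handle is an opaque URI; any scheme the JVM can resolve works.
        static InputStream openBlob(String blobHandle) throws Exception {
            return URI.create(blobHandle).toURL().openStream();
        }
    }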

Each BLOB must be explicitly discarded. A convenience API might walk an entire infoset (as XML), and discard each BLOB found.

In the XML representation of the Infoset, a BLOB-valued element carries the non-native attribute daf:BLOB='true'. The blob handle is the VALUE of the element.
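
A sketch of that convenience cleanup, assuming handles are file: URIs and using a made-up daf namespace URI (the real namespace and discard mechanism would be implementation-defined):

    import java.net.URI;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    class BlobCleanup {
        static final String DAF_NS = "urn:example:daf"; // hypothetical namespace URI

        // Walk the infoset XML (parsed with a namespace-aware DocumentBuilder)
        // and discard every BLOB found.
        static void discardBlobs(Element e) throws Exception {
            if ("true".equals(e.getAttributeNS(DAF_NS, "BLOB"))) {
                String handle = e.getTextContent().trim();           // the element VALUE is the handle
                Files.deleteIfExists(Paths.get(URI.create(handle))); // assumes file: URIs
            }
            NodeList kids = e.getChildNodes();
            for (int i = 0; i < kids.getLength(); i++) {
                Node kid = kids.item(i);
                if (kid instanceof Element) discardBlobs((Element) kid);
            }
        }
    }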


Single Process, Single Thread, SAX-style Event, Stateless

In this case, a single process, with code written in Scala/Java, performs parse, transform, and unparse of data. The code is single-threaded.

The parser is generating SAX-style infoset events for the start and end of each element.

An extended API (beyond regular SAX) adds a new method, specific to delivering a BLOB event, to a SAX-event-handler object. This BLOB event method is passed an open java.io.InputStream from which the BLOB data can be read. The BLOB handle is invisible in this API. (Maybe it should be provided for diagnostic message purposes?)

The lifetime of this BLOB input stream is only until the SAX-style event callback returns. At that point the resources/storage can be reclaimed.

So the parser side of the BLOB API is that the parser calls the SAX-style event handler's BLOB method, handing it an open input stream.

The unparser side of the BLOB API is to be such a SAX-style event handler and implement the BLOB method, reading data from the open input stream and unparsing it.
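
A minimal sketch of such an extended handler (the interface name and the blob method are hypothetical extensions, not part of standard SAX):

    import java.io.IOException;
    import java.io.InputStream;
    import org.xml.sax.ContentHandler;

    // Hypothetical extension of a SAX ContentHandler with a BLOB event method.
    // The parser calls blob() for a BLOB element, passing an open stream that
    // is valid only until the callback returns.
    interface BlobContentHandler extends ContentHandler {
        void blob(InputStream blobData) throws IOException;
    }

    // Handler sketch that consumes the BLOB bytes in bounded memory.
    // (Declared abstract so the ordinary SAX callback methods can be omitted.)
    abstract class StreamingBlobHandler implements BlobContentHandler {
        @Override
        public void blob(InputStream blobData) throws IOException {
            byte[] buf = new byte[64 * 1024];
            for (int n; (n = blobData.read(buf)) != -1; ) {
                // process buf[0..n), e.g., hand it to the unparser's output
            }
            // do not retain blobData: its storage may be reclaimed after return
        }
    }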

In this use case, the DFDL schema element corresponding to the BLOB object must carry an explicit BLOB annotation (an extension to DFDL v1.0) indicating that it is to be treated as a BLOB, and that its 'value' is a BLOB handle (which could be a BLOB URI).

However, in this case, if the BLOB handle is output as text (e.g., by printing the resulting XML instead of unparsing it), it just serves to document that the BLOB was skipped over.

It is possible to parse and unparse an arbitrarily large image file in only finite memory using this API, so long as the image file format is streamable. 

Implementation Note: For unparsing, DirectOrBufferedDataOutputStream may need to grow a special form of BufferedDataOutputStream which is a BLOB. There is no point in double-buffering a BLOB; the BLOB object itself is very much a buffer. We simply need to know how to recombine its data into the DirectDataOutputStream at the right point in time.

Motivation for BLOB Feature for DFDL

Data objects larger than a single JVM object can store (e.g., video or images) may have to be represented in the Infoset by a proxy object. Standard streaming-style events normally produce simple values as regular objects representing the value. If a simple value is larger than a single JVM object can store, then a streaming API to access the value is needed.

BLOBs in the DFDL Infoset

The DFDL Infoset doesn't really specify what the [value] member is for a hexBinary object - that is, it does not specify the API for accessing this value. Currently it is Array[Byte], but we could provide other abstractions. Also, the [value] member for type xs:string is assumed to be a java.lang.String, but we could provide other abstractions. Hence, the problem of BLOB objects differs depending on how the infoset is being accessed.

These handle objects would support the ability to open and access the contents of these large objects as a java.nio.Channel or java.io.InputStream (for hexBinary), or a java.io.Reader (for String). For unparsing, channels, or the symmetric use of java.io.OutputStream or java.io.Writer, are the basic mechanisms.
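
One possible shape for such handle objects, mirroring the abstractions just named (interface and method names are hypothetical):

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.io.Reader;
    import java.io.Writer;
    import java.nio.channels.ReadableByteChannel;

    // Read-side handle for hexBinary-like large values.
    interface BinaryLargeObject {
        InputStream openInputStream() throws IOException;
        ReadableByteChannel openChannel() throws IOException;
    }

    // Read-side handle for string-like large values.
    interface CharacterLargeObject {
        Reader openReader() throws IOException;
    }

    // Symmetric write-side abstractions for unparsing.
    interface LargeObjectSink {
        OutputStream openOutputStream() throws IOException;
        Writer openWriter() throws IOException;
    }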

BLOBs in XML

When projecting the DFDL infoset into XML, these handle objects would have to show up as the XML serialization of the handle object, with usable members so that other software can access the data the handle refers to. One example: the handle contains a fileName or URI, an offset (type Long) into it, a length (type Long), and possibly the first N bytes/characters of the data.
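
For illustration, the serialized handle might carry fields like these (a sketch with made-up names, not a defined format):

    // Hypothetical fields of a serialized BLOB handle (Java 16+ record syntax).
    public record BlobHandleInfo(
        String uri,      // file name or URI of the backing storage
        long offset,     // byte offset of the BLOB data within that storage
        long length,     // length in bytes (may be unknown for layered data)
        byte[] prefix    // optionally, the first N bytes for preview/diagnostics
    ) {}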

Implementation Concerns

  • Want to avoid copying the BLOB data to a file when possible.
  • Want to take advantage of the ability to store offsets+lengths for BLOBs as locations in the original input file.
  • Ideally, a BLOB needn't be bytes of the original file, but could be bytes inside a layer, such as a compressed region.
    • Ideally, such a BLOB could be accessed without having to decompress everything in advance into memory or a file. I.e., such a BLOB could be streamed from its layer, and the decompression would happen as part of accessing it (see the sketch after this list).
    • This implies that while an offset may be known, a length may not be.
    • This implies that while an offset may be known, it is not necessarily a file offset, but an offset within some layer-stream.
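
A sketch of streaming a BLOB out of a compressed layer, assuming a gzip layer for concreteness (method and parameter names are illustrative; DFDL layers are not limited to gzip):

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.GZIPInputStream;

    class LayeredBlobAccess {
        // 'layerStart' is a file offset where the compressed region begins;
        // 'blobOffsetInLayer' is an offset within the *decompressed* layer stream.
        static InputStream openBlobInGzipLayer(InputStream file,
                                               long layerStart,
                                               long blobOffsetInLayer) throws IOException {
            file.skipNBytes(layerStart);                   // seek to the compressed region
            InputStream layer = new GZIPInputStream(file); // decompress on the fly
            layer.skipNBytes(blobOffsetInLayer);           // offset is within the layer stream
            return layer; // caller streams the BLOB from here; its length may be unknown
        }
    }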

