Data Lake Sink

When using the Data Lake sink, the incoming events are stored in an InfluxDB.

Implementation

org.apache.streampipes.sinks.internal.jvm.datalake

The concrete implementation comprises a Data Lake class, a Data Lake Controller class, a Data Lake InfluxDB Client class and a Data Lake Parameters class. The code is basically the same as for the InfluxDB sink (org.apache.streampipes.sinks.databases.jvm.influxdb).

Data Lake Parameters Class

The parameter class defines the necessary parameters for the configuration of the sink.

parameter	description
influxDbHost	hostname/URL of the InfluxDB instance. (including http(s)://)
influxDbPort	port of the InfluxDB instance
databaseName	name of the database where events will be stored
measureName	name of the Measurement where events will be stored (will be created if it does not exist)
user	username for the InfluxDB server
password	password for the InfluxDB server
timestampField	field which contains the required timestamp (field type = http://schema.org/DateTime)
batchSize	indicates how many events are written into a buffer, before they are written to the database
flushDuration	maximum waiting time for the buffer to fill the Buffer size before it will be written to the database in ms
dimensionProperties	list containing the tag fields (scope = dimension property)

Data Lake Controller Class

In controller class, the model is declared for viewing and configuration in Pipeline Editor, and initializes sink on invocation of pipeline.

The measurement name and the timestamp field are derived from user input, the remaining parameters (except batch size and flush duration) from org.apache.streampipes.sinks.internal.jvm.config.SinksInternalJvmConfig. Batch size is fixed to 2000 events and flush duration is set to 500 ms.

Data Lake Class

The data lake class itself essentially controls the saving of events to the database. For this purpose, it uses the Data Lake InfluxDB Client.

method name	description
onInvocation	starting the DataLakeInfluxDbClient, registering and initializing new measurement series in InfluxDB
onEvent	adding empty label field to incoming event and storing event in database
onDetach	stopping the DataLakeInfluxDbClient

Image data, unlike events, is not stored directly in database but as Image files in a corresponding directory (writeToImageFile).
In addition, the class contains two utility methods (registerAtDataLake and prepareString)

Data Lake InfluxDB Client Class

Client class that connects to InfluxDB and writes events directly to database. Uses the Data Lake Parameters described above.

method name	description
validate	checks whether the influxDbHost is valid
connect	connects to the InfluxDB server, sets the database and initializes the batch-behaviour
databaseExists	checks whether the given database exists
createDatabase	creates a new database with the given name
save	saves an event to the connnected InfluxDB database
stop	shuts down the connection to the InfluxDB server

TODO:
validate(): use validation method (org.apache.commons.validator.routines.InetAddressValidator) instead of regex check

REST API

DataLakeNoUserResourceV3

org.apache.streampipes.rest.impl.datalake

This class contains the basic interface definition for setting up a new measurement series in Data Lake and calls the underlying methods of org.apache.streampipes.dataexplorer.DataLakeNoUserManagementV3. Usage of the related API calls does not require any authentification with valid username and password.

method name	request type	path	description
addDataLake	POST	/{measure}	adds new measurement series with specified measure name and related event properties (column names) in InfluxDB

TODO:
[STREAMPIPES-348]: fix issue with special characters in user-defined measure name
add authentication obligation to addDataLake method

DataLakeResourceV3

org.apache.streampipes.ps

This class contains the extended interface definition and calls the underlying methods of org.apache.streampipes.dataexplorer.DataLakeManagementV3 and org.apache.streampipes.dataexplorer.utils.DataExplorerUtils when invoked. Usage of below mentioned API calls requires authentification with valid username and password.

method name	request type	path	description
getPage	GET	/data/{index}/paging	returns pages with predefined number of events per page of a specific measurement series from InfluxDB
getAllInfos	GET	/info	returns list with ids of all existing measurement series (including event schema) from InfluxDB
getAllData	GET	/data/{index} /data/{index}/last/{value}/{unit} /data/{index}/{startdate}/{enddate}	returns all stored events of a specific mesurement series from InfluxDB returns an aggregated set of all stored events of a specific mesurement series from InfluxDB returns all stored events within the specified time frame of a specific mesurement series from InfluxDB
getAllDataGrouping	GET	/data/{index}/{startdate}/{enddate}/grouping/{groupingTag}	returns all events within a specified time frame of a specific mesurement series grouped by a specific tag from InfluxDB
removeAllData	DELETE	/data/delete/all	removes all stored events from InfluxDB
downloadData	GET	/data/{index}/download /data/{index}/{startdate}/{enddate}/download	downloads all events of a specific mesurement series from InfluxDB in desired format downloads all events within a specified time frame of a specific mesurement series from InfluxDB in desired format
getImage	GET	/data/image/{route}/file	returns png image from file route
saveImageCoco	POST	/data/image/{route}/coco	stores image as file at file route
getImageCoco	GET	/data/image/{route}/coco	returns image at file route as application/json
labelData	POST	/data/{index}/{startdate}/{enddate}/labeling/{column}/{timestampColumn}?label=	updates label in specified column for all events within specified time frame to provided label value

TODO:
fix export of data from data lake, which currently returns two timestamp fields
extend aggregation functionality to support non-numeric values (e.g. strings → majority vote) add the possibility to specify an aggregation function
in general: alignment of the single endpoint definitions and consideration of the extensions below

Ideas for possible adaptations and extensions of the REST API

In addition to the TODOs listed above in the text, the following adjustments and enhancements might be worth considering. Thereby, it is important that the implementation of the endpoints is as independent as possible from the technology of the data lake (e.g. avoiding InfluxDB-specific methods).

Extension of the remove endpoint by the capability to
- selectively delete an individual measurement series
- delete measurements of a measurement series within a specific time interval or before a specific date
Adding an edit endpoint for adjusting data lake specific properties such as retention time.

Both extensions could be included in a kind of data management tool in the UI within an admin view (e.g. in the manner of the pipeline view).

Another possible adaptation would be the comprehensive implementation of an append-only approach for time series data. In particular, the functionality of the labelData method would have to be adapted here, which currently works with updates of existing DB entries.

References:

Apache StreamPipes Documentation: https://streampipes.apache.org/docs/docs/pe/org.apache.streampipes.sinks.internal.jvm.datalake/

functionality	request type	path	query parameters
listing all existing measurement series (incl. event schema)	GET	/list	-
getting measurements from specified measurement series	GET	/data/{index}	startDate: optional start date, if not specified: first element endDate: optional end date, if not specified: last element groupBy: optional grouping tags aggregationFunction: optional aggregation function (e.g. mean) timeInterval: optional time interval for aggregation (e.g. 1m - one minute), if specified: performing group by time
getting specific page with predefined number of events per page of a specific measurement series	GET	/data/{index}/paging	page: page number itemsPerPage: data points per page
downloading events of a specific mesurement series from in desired format	GET	/data/{index}/download	format: default "csv" download format (csv, json) startDate: optional start date, if not specified: first element endDate: optional end date, if not specified: last element
removing measurements from a specific measurement series	DELETE	/data/{index}/delete	startDate: optional start date, if not specified: first element endDate: optional end date, if not specified: last element
removing any or a specific measurement series	DELETE	/clear	index: optional index of measurement series
labeling of measurements	POST	/data/{index}/labeling	startDate: start date, if not specified: first element endDate: end date, if not specified: last element column: label column label: class label
specifying configuration parameters of data lake	POST	/edit	parameter_name = value (e.g. retention_time)
getting a list with all available files (e.g. key + metadata)	GET	/files/list	type: optional file type (image, COCO)
getting / saving a specific file (addressable via key)	GET / POST	/files	key
removing a specific file (addressable via key)	DELETE	/files/delete	key

My comments	request type	path	query parameters	Payload
	GET / DELETE	/measurement	-
	GET / DELTE	/measurement/{index}
How many measurements it will return? all? That can be a lot ;) What you think about a "limit" query param?	GET	/measurement/{index}/data	startDate: optional endDate: optional groupBy: optional aggregationFunction: timeInterval: optional
	GET	/measurement/{index}/data/page	page: itemsPerPage:
What is the difference to /measurement/{index}/data ? Maybe it is possible to use the same API, just add the format query param?	GET / DELETE	/measurement/{index}/data/download	format: startDate: optional endDate: optional
	POST / GET / DELETE	/measurement/{index}/label		startDate: endDate: column: label:
	GET	/measurement/{index}/label
	PUT (Updating)	/configuration		Pair-Values e.g. { retention_time: 10, ...}
	GET	/configuration
	GET / DELETE / POST	/file	type: optional
	GET/DELETE	/files/{key}

functionality	request type	path	query parameters	payload
listing all existing measurement series (incl. event schema)	GET	/measurements	-
removing all existing measurement series	DELETE	/measurements	-
getting data points of a specific measurement series	GET	/measurements/{measurementID}	slicing data by timestamp criterion startDate start date, if not specified: first element endDate end date, if not specified: last element paging page: page number itemsPerPage / limit: limit rows per page offset: specify offset (time_offset) grouping groupBy: grouping tags (comma-separated) aggregationFunction: aggregation function (e.g. mean) timeInterval: time interval for aggregation (e.g. 1m - one minute), if specified: performing group by time downloading can be combined with slicing operators format data format (csv, json)
removing a specific measurement series or selected measurements from that series	DELETE	/measurements/{measurementID}	startDate: optional start date, if not specified: first element endDate: optional end date, if not specified: last element
labeling of measurements	POST	/measurements/{measurementID}/labeling	startDate: start date, if not specified: first element endDate: end date, if not specified: last element	column: label column label: class label
specifying configuration parameters for a specific measurement series	POST	/measurements/{measurementID}/configuration	-	key/value-pairs:{parameter_name: value}
getting current configuration parameters for a specific measurement series	GET	/measurements/{measurementID}/configuration	-

Page tree

Implementation

Data Lake Parameters Class

Data Lake Controller Class

Data Lake Class

Data Lake InfluxDB Client Class

REST API

DataLakeNoUserResourceV3

DataLakeResourceV3

Ideas for possible adaptations and extensions of the REST API

9 Comments

Johannes Tex

Data Lake Controller Class

Data Lake Controller Class

Daniel Ebi

Dominik Riemer

Daniel Ebi

Suggestion for the revised endpoint definitions

Johannes Tex

Dominik Riemer

Daniel Ebi

Dominik Riemer

Johannes Tex