...

Info
titleVersion information

Druid integration was introduced in Hive 2.2.0 (HIVE-14217). Initially it was compatible with Druid 0.9.1.1, the latest stable release of Druid at that date.

Introduction

This page documents the work done for the integration between Druid and Hive, which was started in HIVE-14217.

...

Future work after this first step is listed in HIVE-14473. It includes, among others, the possibility to use Create Table As Select (CTAS) statements to create datasources in Druid from Hive (HIVE-14474). If you want to collaborate on this effort, a list of remaining issues can be found at the end of this document.

...

For the running examples, we use the wikiticker dataset included in the quickstart tutorial of Druid.

Discovery and management of Druid datasources from Hive

First we focus on the discovery and management of Druid datasources from Hive.

Create tables linked to existing Druid datasources

Assume that we have already indexed the wikiticker dataset mentioned previously in Druid, and the address of the Druid broker is 10.5.0.10:8082.
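As a sketch of what creating such a linked table might look like (the `hive.druid.broker.address.default` property and the `druid.datasource` table property shown here follow the storage handler's naming conventions, but verify them against your Hive version):

Code Block
sql
-- Point Hive at the Druid broker (property name assumed; adjust to your setup)
SET hive.druid.broker.address.default=10.5.0.10:8082;

-- Create an external table backed by the existing wikiticker datasource;
-- the "druid.datasource" table property names the datasource to link to
CREATE EXTERNAL TABLE druid_table_wikiticker
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker");

Note that no column list is given: the schema is inferred from the Druid datasource itself, which is why the column groups described below appear automatically.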

...

We can see there are three different groups of columns corresponding to the Druid categories: the timestamp column (__time) mandatory in Druid, the dimension columns (whose type is STRING), and the metrics columns (all the rest).

Create tables linked to new Druid datasources

If we want to manage the data in the Druid datasources from Hive, there are multiple possible scenarios.

For instance, we might want to create an empty table backed by Druid using a CREATE TABLE statement and then append and overwrite data using INSERT and INSERT OVERWRITE Hive statements, respectively.

Code Block
sql
CREATE TABLE druid_table_1
(`__time` TIMESTAMP, `dimension1` STRING, `dimension2` STRING, `metric1` INT, `metric2` FLOAT)
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler';

Observe that we still create three different groups of columns corresponding to the Druid categories: the timestamp column (__time) mandatory in Druid, the dimension columns (whose type is STRING), and the metrics columns (all the rest).

Another possible scenario is that our data is stored in Hive tables and we want to preprocess it and create Druid datasources from Hive to accelerate our SQL query workload. We can do that by executing a Create Table As Select (CTAS) statement, as sketched below.

Code Block
sql
CREATE TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
AS
<select `timecolumn` as `__time`, `dimension1`, `dimension2`, `metric1`, `metric2`....>;
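As a concrete sketch of such a CTAS statement (the source table `login_events` and its column names are made up for illustration; the column aliases mirror the schema used in the previous example):

Code Block
sql
CREATE TABLE druid_table_2
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
AS
SELECT CAST(`logtime` AS timestamp) AS `__time`,  -- Druid requires a timestamp column named __time
       `username`   AS `dimension1`,
       `country`    AS `dimension2`,
       `duration`   AS `metric1`,
       `bytes_sent` AS `metric2`
FROM login_events;

Observe that the select clause must expose the mandatory `__time` column, typically by casting or renaming an existing time column from the source table.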

Insert data into Druid datasources from Hive

Once the table is created, we can append data to the Druid datasource using INSERT statements, or replace its contents using INSERT OVERWRITE.
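A sketch of both operations against the `druid_table_1` table created above (the source table `new_events` and its columns are assumptions for illustration):

Code Block
sql
-- Append new rows to the Druid datasource
INSERT INTO druid_table_1
SELECT CAST(`logtime` AS timestamp), `dimension1`, `dimension2`, `metric1`, `metric2`
FROM new_events;

-- Replace the contents of the datasource entirely
INSERT OVERWRITE TABLE druid_table_1
SELECT CAST(`logtime` AS timestamp), `dimension1`, `dimension2`, `metric1`, `metric2`
FROM new_events;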

Drop Hive tables associated to Druid datasources

When a Hive table associated with a Druid datasource is no longer needed, it can be dropped like any other Hive table.
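A minimal sketch (the exact effect on the backing datasource depends on whether the table is managed or external, so verify this behavior for your Hive version):

Code Block
sql
-- Drop the Hive table; for a managed table this is also expected to remove
-- the associated Druid datasource, while for an external table only the
-- Hive metadata is dropped (behavior assumed; verify for your version)
DROP TABLE druid_table_1;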

Querying Druid from Hive

Once we have created our first table stored in Druid using the DruidStorageHandler, we are ready to execute our queries against Druid.
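For instance, a simple aggregation over the `druid_table_1` table created earlier might look as follows (a sketch; Hive can push much of this work down to Druid as a native Druid query, though how much is pushed depends on the query and the Hive version):

Code Block
sql
-- Aggregate a metric per dimension value over the Druid-backed table
SELECT `dimension1`, SUM(`metric1`) AS total_metric1
FROM druid_table_1
GROUP BY `dimension1`
ORDER BY total_metric1 DESC
LIMIT 10;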

...