...

The initial implementation, started in HIVE-14217, focused on 1) enabling the discovery from Hive of data that is already stored in Druid, and 2) being able to query that data, making use of Druid's advanced querying capabilities. For instance, we have put special emphasis on pushing as much computation as possible to Druid, and on recognizing the types of queries for which Druid is especially efficient, e.g. timeseries or topN queries.
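As an illustration, the following queries have the shapes that can be recognized and pushed to Druid as a timeseries query and a topN query, respectively. This is a sketch: the table and column names are hypothetical, and the exact time-extraction syntax depends on the Hive version.

Code Block
sql
-- Aggregation at a fixed time granularity: can be pushed as a Druid timeseries query
SELECT FLOOR(`__time` TO MONTH) AS `month`, SUM(`metric1`) AS `total`
FROM druid_table_1
GROUP BY FLOOR(`__time` TO MONTH);

-- Grouping on a single dimension with ORDER BY/LIMIT: can be pushed as a Druid topN query
SELECT `dimension1`, SUM(`metric1`) AS `total`
FROM druid_table_1
GROUP BY `dimension1`
ORDER BY `total` DESC
LIMIT 10;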

...

We can see there are three different groups of columns corresponding to the Druid categories: the timestamp column (__time) mandatory in Druid, the dimension columns (whose type is STRING), and the metrics columns (all the rest).
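For instance, a table over an existing Druid datasource can be registered from Hive roughly as follows (a sketch; wikiticker is the sample datasource name used in this document):

Code Block
sql
CREATE EXTERNAL TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker");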

Create Druid datasources from Hive

...

If we want to manage the data in the Druid datasources from Hive, there are multiple possible scenarios.

...

Code Block
sql
CREATE TABLE druid_table_1
(`__time` TIMESTAMP, `dimension1` STRING, `dimension2` STRING, `metric1` INT, `metric2` FLOAT)
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler';

Observe that we still create three different groups of columns corresponding to the Druid categories: the timestamp column (__time) mandatory in Druid, the dimension columns (whose type is STRING), and the metrics columns (all the rest).

Another possible scenario is that our data is stored in Hive tables and we want to preprocess it and create Druid datasources from Hive to accelerate our SQL query workload. We can do that by executing a Create Table As Select (CTAS) statement, as in the following example.

Code Block
sql
CREATE TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
AS
<select `timecolumn` as `__time`, `dimension1`, `dimension2`, `metric1`, `metric2`....>;
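To make the template above concrete, such a CTAS could look roughly as follows; the source table src_table and all column names are hypothetical:

Code Block
sql
CREATE TABLE druid_table_2
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
AS
SELECT CAST(`timecolumn` AS TIMESTAMP) AS `__time`,
       `dimension1`, `dimension2`, `metric1`, `metric2`
FROM src_table;  -- hypothetical source table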

Insert data into Druid datasources from Hive

Assume that we have already indexed the wikiticker dataset mentioned previously, and that the address of the Druid broker is 10.5.0.10:8082. The integration includes, among others, the possibility of using Create Table As Select (CTAS) statements to create datasources in Druid from Hive (HIVE-14474).

Drop Hive tables associated with Druid datasources


Further, note that if we do not use EXTERNAL tables, we do not specify a value for the druid.datasource property; Hive automatically uses the fully qualified name of the table to create the corresponding datasource with the same name.

INSERT, INSERT OVERWRITE and DROP statements

These statements are supported for Hive managed tables (not external) backed by Druid.
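A sketch of these statements against such a managed table; the staging table name and columns are hypothetical:

Code Block
sql
-- Append rows to the datasource backing the table
INSERT INTO TABLE druid_table_1
SELECT `__time`, `dimension1`, `dimension2`, `metric1`, `metric2`
FROM staging_table;

-- Replace the contents of the datasource
INSERT OVERWRITE TABLE druid_table_1
SELECT `__time`, `dimension1`, `dimension2`, `metric1`, `metric2`
FROM staging_table;

-- Drop the table; for a managed table this also removes the backing Druid datasource
DROP TABLE druid_table_1;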

Querying Druid from Hive

...