...
Assume that we have already indexed the wikiticker dataset mentioned previously, and the address of the Druid broker is 10.5.0.10:8082. Thus
First, you need to set the Hive property hive.druid.broker.address.default
in your configuration file to point to the broker address:
Code Block | ||||
---|---|---|---|---|
| ||||
SET hive.druid.broker.address.default=10.5.0.10:8082; |
Then, to create a table that we can query from Hive, we we execute the following statement in Hive:
Code Block | ||||
---|---|---|---|---|
| ||||
CREATE EXTERNAL TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.address" = "10.5.0.10:8082", "druid.datasource" = "wikiticker"); |
Observe that you need to specify the broker address and datasource as TBLPROPERTIES using the druid.address
and the druid.datasource
properties, respectively property. Further, observe that the table needs to be created as EXTERNAL, as data is stored in Druid. The table is just a logical entity that we will use to express our queries, but there is no data movement when we create the table. In fact, what happened under the hood when you execute that statement, is that Hive sends a segment metadata query to Druid in order to discover the schema (columns and their types) of the data source. Retrieval of other information that might be useful such as statistics e.g. number of rows, is in our roadmap, but it is not supported yet. Finally, note that if we change the Hive property value for the default broker address, queries on this table will automatically run against the new broker address, as the address is not stored with the table.
If we execute a DESCRIBE statement, we can actually see the information about the table:
Code Block |
---|
hive> DESCRIBE FORMATTED druid_table_1;
OK
# col_name data_type comment
__time timestamp from deserializer
added bigint from deserializer
channel string from deserializer
cityname string from deserializer
comment string from deserializer
count bigint from deserializer
countryisocode string from deserializer
countryname string from deserializer
deleted bigint from deserializer
delta bigint from deserializer
isanonymous string from deserializer
isminor string from deserializer
isnew string from deserializer
isrobot string from deserializer
isunpatrolled string from deserializer
metrocode string from deserializer
namespace string from deserializer
page string from deserializer
regionisocode string from deserializer
regionname string from deserializer
user string from deserializer
user_unique string from deserializer
# Detailed Table Information
Database: druid
Owner: user1
CreateTime: Thu Aug 18 19:09:10 BST 2016
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs:/tmp/user1/hive/warehouse/druid.db/druid_table_1
Table Type: EXTERNAL_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\"}
EXTERNAL TRUE
druid.address 10.5.0.10:8082
druid.datasource wikiticker
numFiles 0
numRows 0
rawDataSize 0
storage_handler org.apache.hadoop.hive.druid.DruidStorageHandler
totalSize 0
transient_lastDdlTime 1471543750
# Storage Information
SerDe Library: org.apache.hadoop.hive.druid.serde.DruidSerDe
InputFormat: null
OutputFormat: null
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.111 seconds, Fetched: 55 row(s) |
We can see there are three different groups of columns corresponding to the Druid categories: the timestamp column (__time
) mandatory in Druid, the dimension columns (whose type is STRING), and the metrics columns (all the rest).
If you do not want to include the broker address for the creation of every Druid table, you can also set the Hive property hive.druid.address.default
to point to the broker address:
...
SET hive.druid.address.default=10.5.0.10:8082;
Then, to create a table similar to the previous one, we would execute the following statement:
...
CREATE EXTERNAL TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker");
In this case, the address is not stored with the table. Thus, if we change the Hive property value for the default broker address, queries on this table will automatically run against the new broker address.
Querying Druid from Hive
...