This is an incomplete example.

This is a rough example of how to use Apache Flume and Apache Hive together to consolidate many small Flume files into larger files that are loaded into a Hive warehouse and are queryable via Hive's SQL dialect.

Here is a Flume configuration that generates events and feeds them to a collector. count-forever is a script that emits a line with an incrementing value, producing a new event every 100ms. The collector buckets events by host and date. The filenames start with "data", get a unique suffix (the 'rolltag'), and are written in "raw" (body only) output format.

counter : exec("count-forever 100") | agentE2ESink("collector");
collector : collectorSource | collector(30000) { escapedCustomDfs("hdfs://nn/user/jon/rawhosts/%{host}/%Y%m%d/","data%{rolltag}","raw") };

If you add UUID data to your application logs, for example by using mod_unique_id in the Apache web server, deduping becomes trivial.
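As a rough sketch of what that looks like in Hive (the table and column names here, including the uuid column, are assumptions for illustration and are not part of the example above), the UUID becomes the dedup key:

-- hypothetical raw table whose events carry a per-line UUID
create table raw_events(uuid STRING, msg STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';

-- keep one row per UUID; duplicate deliveries from Flume's end-to-end retries collapse away
create table raw_events_deduped(uuid STRING, msg STRING);
insert overwrite table raw_events_deduped
select uuid, max(msg) from raw_events group by uuid;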

Let's say the source host is box1.example.com and the events were collected on 2011-02-23, so the collector wrote them under /user/jon/rawhosts/box1.example.com/20110223.

create table counts(count INT, msg STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';

-- this moves the file
load data inpath '/user/jon/rawhosts/box1.example.com/20110223' into table counts;
-- alternatively you could use the 'alter' DDL command.

-- create a new table for the results and do a distinct.
create table count_distinct(count INT, msg STRING);
insert overwrite table count_distinct
select distinct count, msg from counts;
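
A quick sanity check on the dedup (using the two tables defined above) is to compare row counts; the difference is the number of duplicate events that were dropped:

select count(1) from counts;
select count(1) from count_distinct;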

TODO: talk about how to process data in Flume's JSON output format.

Some caveats to be aware of: LOAD DATA moves the files rather than copying them, and the result still consists of many small files. The next draft will probably use 'alter' and create a new table to hold the deduped data.
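
One possible shape for that next draft (a sketch only; the external table, its name, and the partition columns are assumptions, not something the example above defines) is to point a partitioned external table at the collector's output directories so no files are moved, then rewrite the deduped rows, which also consolidates the small files into fewer, larger ones:

-- hypothetical external table over the raw collector output; no files are moved
create external table raw_counts(count INT, msg STRING)
  partitioned by (host STRING, ds STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
  LOCATION '/user/jon/rawhosts';

-- register one host/day bucket written by the collector as a partition
alter table raw_counts add partition (host='box1.example.com', ds='20110223')
  location '/user/jon/rawhosts/box1.example.com/20110223';

-- rewriting via insert overwrite both dedupes and merges the many small files
insert overwrite table count_distinct
select distinct count, msg from raw_counts
where host='box1.example.com' and ds='20110223';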
