This is a rough example of how to use Apache Flume and Apache Pig together to consolidate many small Flume output files into larger files.

Here is a Flume configuration that generates events and feeds them to a collector. count-forever is a script that emits a line with an incrementing counter, producing a new event every 100 ms. The collector buckets events by host and date. The filenames start with "data" and get a unique suffix (the 'rolltag'), and the files are written in "raw" (body-only) output format.

counter : exec("count-forever 100") | agentE2ESink("collector");
collector : collectorSource | collector(30000) { escapedCustomDfs("hdfs://nn/user/jon/rawhosts/%{host}/%Y%m%d/","data%{rolltag}","raw") };

If you add UUID data to your application logs, for example with mod_unique_id in the Apache web server, deduping becomes trivial: two deliveries of the same event carry the same UUID and collapse under DISTINCT, while different events that happen to have identical content still differ by UUID and are both kept.

Let's say the source host is box1.example.com:

-- load the raw collector output for one host and day
raw = LOAD 'rawhosts/box1.example.com/20110223/' USING PigStorage();
-- DISTINCT drops exact duplicate lines (identical UUIDs included)
dist = DISTINCT raw;
-- write the deduplicated data back to HDFS
STORE dist INTO '/user/jon/pig/distinct' USING PigStorage();
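
DISTINCT forces a reduce phase, and the number of reducers determines how many output files Pig writes; that reduce is also what consolidates the small files. A minimal variant of the script above (the PARALLEL clause is standard Pig; the reducer count of 1 is just an illustration) that funnels everything into a single large file:

raw = LOAD 'rawhosts/box1.example.com/20110223/' USING PigStorage();
-- PARALLEL 1 runs the reduce with a single reducer, so the
-- deduplicated output lands in one consolidated file
dist = DISTINCT raw PARALLEL 1;
STORE dist INTO '/user/jon/pig/distinct' USING PigStorage();

For larger volumes, pick a reducer count that yields output files near your HDFS block size rather than forcing everything through one reducer.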

This addresses three common data ingestion problems: small files, duplicate events, and bucketing data into date-related groups.
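
Since the collector already buckets output by host and date, Hadoop glob patterns in the LOAD path let a single Pig job consolidate a whole day across every host. A hypothetical sketch (the output path distinct-20110223 is made up for illustration):

-- '*' matches every host directory for 2011-02-23
all_raw = LOAD 'rawhosts/*/20110223/' USING PigStorage();
all_dist = DISTINCT all_raw;
STORE all_dist INTO '/user/jon/pig/distinct-20110223' USING PigStorage();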

That is it.
