Using HCatalog
HCatalog graduated from the Apache incubator and merged with the Hive project on March 26, 2013.
...
Joe in data acquisition uses distcp to get data onto the grid.
No Format |
---|
hadoop distcp file:///file.dat hdfs://data/rawevents/20100819/data
hcat -e "alter table rawevents add partition (ds='20100819') location 'hdfs://data/rawevents/20100819/data'"
|
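The two commands above hard-code the date. A minimal sketch of parameterizing the partition statement is shown here; the helper name `partition_ddl` is hypothetical, and the table name, partition key, and path layout are simply copied from the example, not prescribed by HCatalog.

```shell
# Hypothetical helper: print the DDL that registers one day's partition.
# Table name, partition key (ds), and path layout follow the example above.
partition_ddl() {
  day="$1"
  printf "alter table rawevents add partition (ds='%s') location 'hdfs://data/rawevents/%s/data'\n" "$day" "$day"
}

# Usage (needs a Hadoop/HCatalog installation, so it is left commented out):
# hadoop distcp file:///file.dat "hdfs://data/rawevents/20100819/data"
# hcat -e "$(partition_ddl 20100819)"
```

Keeping the statement in one function means the copy step and the registration step cannot drift apart on the date.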
...
Without HCatalog, Joe must manually tell Sally when data is available, or Sally must poll HDFS.
No Format |
---|
A = load '/data/rawevents/20100819/data' as (alpha:int, beta:chararray, ...);
B = filter A by bot_finder(zeta) == 0;
...
store Z into 'data/processedevents/20100819/data';
|
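The polling alternative mentioned above might look roughly like the following sketch. It assumes the standard `hadoop fs -test -e` existence check; the function name, the `CHECK_CMD` override, the interval, and the retry limit are all hypothetical choices made for illustration.

```shell
# Poll until a path exists or the attempt budget runs out.
# CHECK_CMD defaults to `hadoop fs -test -e`; the function name, interval,
# and retry limit are hypothetical choices for this sketch.
wait_for_path() {
  path="$1"
  interval="${2:-60}"   # seconds between checks
  max_tries="${3:-10}"  # give up after this many failed checks
  tries=0
  # CHECK_CMD is intentionally unquoted so it splits into command + arguments.
  until ${CHECK_CMD:-hadoop fs -test -e} "$path"; do
    tries=$((tries + 1))
    if [ "$tries" -ge "$max_tries" ]; then
      return 1
    fi
    sleep "$interval"
  done
  return 0
}

# Usage (script name is a placeholder; left commented out):
# wait_for_path /data/rawevents/20100819/data 300 12 && pig process_events.pig
```

Polling like this costs repeated NameNode requests and adds up to one interval of latency, which is the overhead the notification approach avoids.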
With HCatalog, a JMS message is sent when the data is available, and the Pig job can then be started.
No Format |
---|
A = load 'rawevents' using org.apache.hive.hcatalog.pig.HCatLoader();
B = filter A by date == '20100819' and bot_finder(zeta) == 0;
...
store Z into 'processedevents' using org.apache.hive.hcatalog.pig.HCatStorer('date=20100819');
|
...
Without HCatalog, Robert must alter the table to add the required partition.
No Format |
---|
alter table processedevents add partition (date='20100819') location 'hdfs://data/processedevents/20100819/data';
select advertiser_id, count(clicks)
from processedevents
where date = '20100819'
group by advertiser_id;
|
With HCatalog, Robert does not need to modify the table structure.
No Format |
---|
select advertiser_id, count(clicks)
from processedevents
where date = '20100819'
group by advertiser_id;
|
...
WebHCat is a REST API for HCatalog. (REST stands for "representational state transfer", a style of API based on HTTP verbs.) WebHCat was originally named Templeton. For more information, see the WebHCat manual.
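As a rough illustration of what a WebHCat request looks like, the sketch below builds a request URL under the `templeton/v1` prefix. The helper name `webhcat_url` is hypothetical, and the host, port (50111 is assumed here as the usual default), and user name are placeholders; consult the WebHCat manual for the resources your version supports.

```shell
# Hypothetical helper: build a WebHCat URL.
# Host, port, and user name are placeholders for illustration.
webhcat_url() {
  host="$1"      # host:port of the WebHCat server
  resource="$2"  # path below templeton/v1
  user="$3"      # value for the user.name query parameter
  printf 'http://%s/templeton/v1/%s?user.name=%s\n' "$host" "$resource" "$user"
}

# Usage (requires a running WebHCat server, so it is left commented out):
# curl -s "$(webhcat_url localhost:50111 status sally)"
# curl -s "$(webhcat_url localhost:50111 ddl/database/default/table/rawevents sally)"
```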