...
Code Block |
---|
CREATE TABLE raw (line STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'; CREATE TABLE raw_sequence (line STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' STORED AS SEQUENCEFILE; LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw; SET hive.exec.compress.output=true; SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK (see below) INSERT OVERWRITE TABLE raw_sequence SELECT LINE FROM raw; INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw; -- The previous line did not work for me, but this does. |
The value for io.seqfile.compression.type determines how the compression is performed. If you set it to RECORD you will get as many output files as the number of map/reduce jobs. If you set it to BLOCK, you will get as many output files as there were input files. There is a tradeoff involved here – large number of output files => more parellel map jobs => lower compression ratioRecord compresses each value individually while BLOCK buffers up 1MB (default) before doing compression.