Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Code Block
CREATE TABLE raw (line STRING)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';

CREATE TABLE raw_sequence (line STRING)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
   STORED AS SEQUENCEFILE;

LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw;

SET hive.exec.compress.output=true; 
SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK (see below)
INSERT OVERWRITE TABLE raw_sequence SELECT LINE FROM raw;

INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw; -- The previous line did not work for me, but this does.

The value for io.seqfile.compression.type determines how the compression is performed. If you set it to RECORD you will get as many output files as the number of map/reduce jobs. If you set it to BLOCK, you will get as many output files as there were input files. There is a tradeoff involved here – large number of output files => more parellel map jobs => lower compression ratioRecord compresses each value individually while BLOCK buffers up 1MB (default) before doing compression.