
ORC Files


ORC File Format

Introduced in Hive version 0.11.0.

The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.

...

Index data includes min and max values for each column and the row positions within each column. (A bit field or bloom filter could also be included.) Row index entries provide offsets that enable seeking to the right compression block and byte within a decompressed block. Note that ORC indexes are used only for the selection of stripes and row groups, not for answering queries.

Having relatively frequent row index entries enables row-skipping within a stripe for rapid reads, despite large stripe sizes. By default, rows can be skipped in groups of 10,000.
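For Hive to actually consult these row indexes at read time, predicate pushdown must be enabled. A minimal sketch, assuming a session-level setting (verify the flag against your Hive version):

-- Enable predicate pushdown so ORC min/max indexes are used to skip
-- stripes and row groups during table scans.
SET hive.optimize.index.filter=true;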

With the ability to skip large sets of rows based on filter predicates, you can sort a table on its secondary keys to achieve a big reduction in execution time. For example, if the primary partition is transaction date, the table can be sorted on state, zip code, and last name. Then looking for records in one state will skip the records of all other states.
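A minimal sketch of this layout, assuming hypothetical table and column names (none of these identifiers come from the original text):

-- Partition on the primary key (transaction date), then sort the rows
-- written into each partition on the secondary keys. Sorted data keeps
-- the min/max ranges in each row group narrow, so a predicate such as
-- state = 'CA' can skip most row groups via the ORC row index.
create table transactions (
  state string,
  zip int,
  last_name string,
  amount double
) partitioned by (tx_date string) stored as orc;

insert overwrite table transactions partition (tx_date = '2015-01-01')
select state, zip, last_name, amount
from staging_transactions
sort by state, zip, last_name;

Note that SORT BY orders rows within each writer rather than globally, which is enough to keep the row groups within each output file well clustered.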

A complete specification of the format is given in the ORC specification (https://orc.apache.org/specification/).

HiveQL Syntax

File formats are specified at the table (or partition) level. You can specify the ORC file format with HiveQL statements such as these:

  • CREATE TABLE ... STORED AS ORC
  • ALTER TABLE ... [PARTITION partition_spec] SET FILEFORMAT ORC
  • SET hive.default.fileformat=Orc

The parameters are all placed in the TBLPROPERTIES clause (see Create Table). They are:

Key                      | Default    | Notes
orc.compress             | ZLIB       | high level compression (one of NONE, ZLIB, SNAPPY)
orc.compress.size        | 262,144    | number of bytes in each compression chunk
orc.stripe.size          | 67,108,864 | number of bytes in each stripe
orc.row.index.stride     | 10,000     | number of rows between index entries (must be >= 1,000)
orc.create.index         | true       | whether to create row indexes
orc.bloom.filter.columns | ""         | comma-separated list of column names for which a bloom filter should be created
orc.bloom.filter.fpp     | 0.05       | false positive probability for bloom filter (must be >0.0 and <1.0)

For example, creating an ORC stored table without compression:

create table Addresses (
  name string,
  street string,
  city string,
  state string,
  zip int
) stored as orc tblproperties ("orc.compress"="NONE");
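
As a further sketch, the same TBLPROPERTIES mechanism can combine several of the parameters above, here Snappy compression plus a bloom filter on one column (table and column names are hypothetical):

create table contacts (
  name string,
  email string
) stored as orc tblproperties (
  "orc.compress" = "SNAPPY",
  "orc.bloom.filter.columns" = "email",
  "orc.bloom.filter.fpp" = "0.05"
);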


Version 0.14.0+: CONCATENATE

ALTER TABLE table_name [PARTITION partition_spec] CONCATENATE can be used to merge small ORC files into a larger file, starting in Hive 0.14.0. The merge happens at the stripe level, which avoids decompressing and decoding the data.
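For instance (table and partition names are hypothetical):

-- Merge the small ORC files in one partition; stripes are copied as-is,
-- with no decompression or re-encoding of the data.
alter table transactions partition (tx_date = '2015-01-01') concatenate;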

Serialization and Compression

...

The codec can be Snappy, Zlib, or none.
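As a hedged sketch (table name is hypothetical), the codec for an existing table can be switched through the orc.compress table property; only files written after the change use the new codec:

-- Existing files keep their original codec; new writes use Snappy.
alter table transactions set tblproperties ("orc.compress" = "SNAPPY");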

ORC File Dump Utility

The ORC file dump utility analyzes ORC files. To invoke it, use this command:

// Hive version 0.11 through 0.14:
hive --orcfiledump <location-of-orc-file>
 
// Hive version 1.1.0 and later:
hive --orcfiledump [-d] [--rowindex <col_ids>] <location-of-orc-file>
 
// Hive version 1.2.0 and later:
hive --orcfiledump [-d] [-t] [--rowindex <col_ids>] <location-of-orc-file>
 
// Hive version 1.3.0 and later:
hive --orcfiledump [-j] [-p] [-d] [-t] [--rowindex <col_ids>] [--recover] [--skip-dump] 
    [--backup-path <new-path>] <location-of-orc-file-or-directory>

Specifying -d in the command will cause it to dump the ORC file data rather than the metadata (Hive 1.1.0 and later).

Specifying --rowindex with a comma separated list of column ids will cause it to print row indexes for the specified columns, where 0 is the top level struct containing all of the columns and 1 is the first column id (Hive 1.1.0 and later).

Specifying -t in the command will print the timezone id of the writer.

Specifying -j in the command will print the ORC file metadata in JSON format. To pretty print the JSON metadata, add -p to the command.

Specifying --recover in the command will recover a corrupted ORC file generated by Hive streaming.

Specifying --skip-dump along with --recover will perform recovery without dumping metadata.

Specifying --backup-path with a new-path will let the recovery tool move corrupted files to the specified backup path (default: /tmp).

<location-of-orc-file> is the URI of the ORC file.

<location-of-orc-file-or-directory> is the URI of the ORC file or directory. From Hive 1.3.0 onward, this URI can be a directory containing ORC files.
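For example, a hypothetical invocation that dumps metadata along with the row indexes of columns 1 and 2 (the file path is illustrative):

hive --orcfiledump --rowindex 1,2 /user/hive/warehouse/addresses/000000_0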

ORC Configuration Parameters

The ORC configuration parameters are described in Hive Configuration Properties – ORC File Format.

ORC Format Specification

The ORC specification has moved to the ORC project: https://orc.apache.org/specification/