
ORC Files


ORC File Format

Introduced in Hive version 0.11.0.

The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.

...

Index data includes min and max values for each column and the row positions within each column. (A bit field or bloom filter could also be included.) Row index entries provide offsets that enable seeking to the right compression block and byte within a decompressed block. Note that ORC indexes are used only for the selection of stripes and row groups, not for answering queries.

Having relatively frequent row index entries enables row-skipping within a stripe for rapid reads, despite large stripe sizes. By default, rows can be skipped in groups of 10,000.
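For Hive to actually consult these row indexes at read time, predicate pushdown must be enabled. A minimal sketch, assuming a session-level setting (verify the flag against your Hive version):

-- Enable predicate pushdown so ORC min/max indexes are used to skip
-- stripes and row groups during table scans.
SET hive.optimize.index.filter=true;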

With the ability to skip large sets of rows based on filter predicates, you can sort a table on its secondary keys to achieve a big reduction in execution time. For example, if the primary partition is transaction date, the table can be sorted on state, zip code, and last name. Then looking for records in one state will skip the records of all other states.
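A minimal sketch of this layout, assuming hypothetical table and column names (none of these identifiers come from the original text):

-- Partition on the primary key (transaction date), then sort the rows
-- written into each partition on the secondary keys. Sorted data keeps
-- the min/max ranges in each row group narrow, so a predicate such as
-- state = 'CA' can skip most row groups via the ORC row index.
create table transactions (
  state string,
  zip int,
  last_name string,
  amount double
) partitioned by (tx_date string) stored as orc;

insert overwrite table transactions partition (tx_date = '2015-01-01')
select state, zip, last_name, amount
from staging_transactions
sort by state, zip, last_name;

Note that SORT BY orders rows within each writer rather than globally, which is enough to keep the row groups within each output file well clustered.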

A complete specification of the format is given in the ORC specification (https://orc.apache.org/specification/).

HiveQL Syntax

File formats are specified at the table (or partition) level. You can specify the ORC file format with HiveQL statements such as these:

  • CREATE TABLE ... STORED AS ORC
  • ALTER TABLE ... [PARTITION partition_spec] SET FILEFORMAT ORC
  • SET hive.default.fileformat=Orc

The parameters are all placed in the TBLPROPERTIES clause (see Create Table). They are:

Key                      | Default    | Notes
orc.compress             | ZLIB       | high level compression (one of NONE, ZLIB, SNAPPY)
orc.compress.size        | 262,144    | number of bytes in each compression chunk
orc.stripe.size          | 67,108,864 | number of bytes in each stripe
orc.row.index.stride     | 10,000     | number of rows between index entries (must be >= 1,000)
orc.create.index         | true       | whether to create row indexes
orc.bloom.filter.columns | ""         | comma-separated list of column names for which a bloom filter should be created
orc.bloom.filter.fpp     | 0.05       | false positive probability for bloom filter (must be >0.0 and <1.0)

For example, creating an ORC stored table without compression:

create table Addresses (
  name string,
  street string,
  city string,
  state string,
  zip int
) stored as orc tblproperties ("orc.compress"="NONE");
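
As a further sketch, the same TBLPROPERTIES mechanism can combine several of the parameters above, here Snappy compression plus a bloom filter on one column (table and column names are hypothetical):

create table contacts (
  name string,
  email string
) stored as orc tblproperties (
  "orc.compress" = "SNAPPY",
  "orc.bloom.filter.columns" = "email",
  "orc.bloom.filter.fpp" = "0.05"
);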


Version 0.14.0+: CONCATENATE

ALTER TABLE table_name [PARTITION partition_spec] CONCATENATE can be used to merge small ORC files into a larger file, starting in Hive 0.14.0. The merge happens at the stripe level, which avoids decompressing and decoding the data.
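For instance (table and partition names are hypothetical):

-- Merge the small ORC files in one partition; stripes are copied as-is,
-- with no decompression or re-encoding of the data.
alter table transactions partition (tx_date = '2015-01-01') concatenate;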

Serialization and Compression

...

The codec can be Snappy, Zlib, or none.
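As a hedged sketch (table name is hypothetical), the codec for an existing table can be switched through the orc.compress table property; only files written after the change use the new codec:

-- Existing files keep their original codec; new writes use Snappy.
alter table transactions set tblproperties ("orc.compress" = "SNAPPY");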

ORC File Dump Utility

The ORC file dump utility analyzes ORC files. To invoke it, use this command:

// Hive version 0.11 through 0.14:
hive --orcfiledump <location-of-orc-file>
 
// Hive version 1.1.0 and later:
hive --orcfiledump [-d] [--rowindex <col_ids>] <location-of-orc-file>
 
// Hive version 1.2.0 and later:
hive --orcfiledump [-d] [-t] [--rowindex <col_ids>] <location-of-orc-file>
 
// Hive version 1.3.0 and later:
hive --orcfiledump [-j] [-p] [-d] [-t] [--rowindex <col_ids>] [--recover] [--skip-dump] 
    [--backup-path <new-path>] <location-of-orc-file-or-directory>

Specifying -d in the command will cause it to dump the ORC file data rather than the metadata (Hive 1.1.0 and later).

Specifying --rowindex with a comma separated list of column ids will cause it to print row indexes for the specified columns, where 0 is the top level struct containing all of the columns and 1 is the first column id (Hive 1.1.0 and later).

Specifying -t in the command will print the timezone id of the writer.

Specifying -j in the command will print the ORC file metadata in JSON format. To pretty print the JSON metadata, add -p to the command.

Specifying --recover in the command will recover a corrupted ORC file generated by Hive streaming.

Specifying --skip-dump along with --recover will perform recovery without dumping metadata.

Specifying --backup-path with a new-path will let the recovery tool move corrupted files to the specified backup path (default: /tmp).

<location-of-orc-file> is the URI of the ORC file.

<location-of-orc-file-or-directory> is the URI of the ORC file or directory. From Hive 1.3.0 onward, this URI can be a directory containing ORC files.
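For example, a hypothetical invocation that dumps metadata along with the row indexes of columns 1 and 2 (the file path is illustrative):

hive --orcfiledump --rowindex 1,2 /user/hive/warehouse/addresses/000000_0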

ORC Configuration Parameters

The ORC configuration parameters are described in Hive Configuration Properties – ORC File Format.

ORC Format Specification

The ORC specification has moved to the ORC project: https://orc.apache.org/specification/