Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Document list bucketing

...

Code Block
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [SKEWED BY (col_name, col_name, ...) ON ([(col_value, col_value, ...), ...|col_value, col_value, ...]) (Note: only available starting with 0.10.0)]
  [
   [ROW FORMAT row_format] [STORED AS file_format]
   | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  (Note:  only available starting with 0.6.0)
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)]  (Note:  only available starting with 0.6.0)
  [AS select_statement]  (Note: this feature is only available starting with 0.5.0, and is not supported when creating external tables.)

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  LIKE existing_table_or_view_name
  [LOCATION hdfs_path]

data_type
  : primitive_type
  | array_type
  | map_type
  | struct_type
  | union_type (Note: Only available starting with Hive 0.7.0) - needs documentation

primitive_type
  : TINYINT
  | SMALLINT
  | INT
  | BIGINT
  | BOOLEAN
  | FLOAT
  | DOUBLE
  | STRING
  | BINARY (Note: Only available starting with Hive 0.8.0)
  | TIMESTAMP (Note: Only available starting with Hive 0.8.0)

array_type
  : ARRAY < data_type >

map_type
  : MAP < primitive_type, data_type >

struct_type
  : STRUCT < col_name : data_type [COMMENT col_comment], ...>

union_type
   : UNIONTYPE < data_type, data_type, ... >

row_format
  : DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char]
        [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
  | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]

file_format:
  : SEQUENCEFILE
  | TEXTFILE
  | RCFILE     (Note: only available starting with 0.6.0)
  | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname

...

Use STORED BY to create a non-native table, for example in HBase. See StorageHandlers for more information on this option.

Partitioned tables

Partitioned tables can be created using the PARTITIONED BY clause. A table can have one or more partition columns and a separate data directory is created for each distinct value combination in the partition columns. Further, tables or partitions can be bucketed using CLUSTERED BY columns, and data can be sorted within that bucket via SORT BY columns. This can improve performance on certain kinds of queries.

...

In contrast, the statement above creates a new empty_key_value_store table whose definition exactly matches the existing key_value_store in all particulars other than table name. The new table contains no rows.

Inserting Data Into Bucketed Tables

The CLUSTERED BY and SORTED BY creation commands do not affect how data is inserted into a table – only how it is read. This means that users must be careful to insert data correctly by specifying the number of reducers to be equal to the number of buckets, and using CLUSTER BY and SORT BY commands in their query.

There is also an example of creating and populating bucketed tables.

Skewed tables

Info
titleVersion information

As of Hive 0.10.0 (HIVE-3072 & HIVE-3073

Info
titleDesign document

Read the design document for more information

This feature can be used to improve performance for tables where one or more columns have skewed values. By specifying the values that appear very often (heavy skew) Hive will split those out into separate files automatically and take this fact into account during queries so that it can skip (or include) whole files if possible.

This can be specified on a per-table level during table creation.

This is an example where we have one column with three skewed values:

Code Block

CREATE TABLE list_bucket_single (key STRING, value STRING) SKEWED BY (key) ON (1,5,6);

And here is an example of a table with two skewed columns:

Code Block

CREATE TABLE list_bucket_multiple (col1 STRING, col2 int, col3 STRING) SKEWED BY (col1, col2) ON (('s1',1), ('s3',3), ('s13',13), ('s78',78));

Drop Table

Code Block
DROP TABLE [IF EXISTS] table_name

...