...
Table of Content Zone | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Create Table
CREATE TABLE creates a table with the given name. An error CREATE TABLE creates a table with the given name. An error is thrown if a table or view with the same name already exists. You can use IF NOT EXISTS to skip the error.
See Alter Table below for more information about table comments, table properties, and SerDe properties. See Type System and Hive Data Types for details about the primitive and complex data types. Managed and External TablesBy default Hive creates managed tables, where files, metadata and statistics are managed by internal Hive processes. For details on the differences between managed and external table see Managed vs. External Tables. Storage Formats
Hive supports built-in and custom-developed file formats. See CompressedStorage for details on compressed table storage.
|
Row Format | Description | ||||||
---|---|---|---|---|---|---|---|
RegEx | Stored as plain text file, translated by Regular Expression. The following example defines a table in the default Apache Weblog format. CREATE TABLE apachelog ( host STRING, identity STRING, user STRING, time STRING, request STRING, status STRING, size STRING, referer STRING, agent STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ( "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?" ) STORED AS TEXTFILE; | ||||||
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS TEXTFILE | Stored as plain text file in JSON format. The JsonSerDe for JSON files is available in Hive 0.12 and later. In some distributions, a reference to hive-hcatalog-core.jar is required. ADD JAR /usr/lib/hive-hcatalog/lib/hive-hcatalog-core.jar; TABLE my_table(a string, b bigint , ...) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS TEXTFILE; The JsonSerDe was moved to Hive from HCatalog and before it was in hive-contrib project. It was added to the Hive distribution by HIVE-4895. An Amazon SerDe is available at s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar for releases prior to 0.12.0.The JsonSerDe for JSON files is available in Hive 0.12 and later. Starting in Hive 3.0.0, JsonSerDe is added to Hive Serde as "org.apache.hadoop.hive.serde2.JsonSerDe" (HIVE-19211).
Or
| ||||||
CSV/TSV ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS TEXTFILE | Stored as plain text file in CSV / TSV format. The CSVSerde is available in Hive 0.14 and greater. The following example creates a TSV (Tab-separated) file.
'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( "separatorChar" = "\t" , "quoteChar" = "'" , "escapeChar" = "\\" ) STORED AS TEXTFILE; Default properties for SerDe is Comma-Separated (CSV) file DEFAULT_ESCAPE_CHARACTER \ DEFAULT_QUOTE_CHARACTER " DEFAULT_SEPARATOR , This SerDe works for most CSV data, but does not handle embedded newlines. To use the SerDe, specify the fully qualified class name org.apache.hadoop.hive.serde2.OpenCSVSerde. Documentation is based on original documentation at https://github.com/ogrodnek/csv-serde. This SerDe treats all columns to be of type String. Even if you create a table with non-string column types using this SerDe, the DESCRIBE TABLE output would show string column type. The type information is retrieved from the SerDe. To convert columns to the desired type in a table, you can create a view over the table that does the CAST to the desired type. The CSV SerDe is based on https://github.com/ogrodnek/csv-serde, and was added to the Hive distribution in HIVE-7777. The CSVSerde has been built and tested against Hive 0.14 and later, and uses Open-CSV 2.3 which is bundled with the Hive distribution.For general information about SerDes, see Hive SerDe in the Developer Guide. Also see SerDe for details about input and output processing. |
Partitioned Tables
Partitioned tables can be created using the PARTITIONED BY clause. A table can have one or more partition columns and a separate data directory is created for each distinct value combination in the partition columns. Further, tables or partitions can be bucketed using CLUSTERED BY columns, and data can be sorted within that bucket via SORT BY columns. This can improve performance on certain kinds of queries.
If, when creating a partitioned table, you get this error: "FAILED: Error in semantic analysis: Column repeated in partitioning columns," it means you are trying to include the partitioned column in the data of the table itself. You probably really do have the column defined. However, the partition you create makes a pseudocolumn on which you can query, so you must rename your table column to something else (that users should not query on!).
For example, suppose your original unpartitioned table had three columns: id, date, and name.
Code Block | ||
---|---|---|
| ||
id int, date date, name varchar |
Now you want to partition on date. Your Hive definition could use "dtDontQuery" as a column name so that "date" can be used for partitioning (and querying).
Code Block | ||
---|---|---|
| ||
create table table_name ( id int, dtDontQuery string, name string ) partitioned by (date string) |
Now your users will still query on "where date = '...'
" but the second column dtDontQuery will hold the original values.
Here's an example statement to create a partitioned table:
Code Block | ||
---|---|---|
| ||
CREATE TABLE page_view(viewTime INT, userid BIGINT, page_url STRING, referrer_url STRING, ip STRING COMMENT 'IP Address of the User') COMMENT 'This is the page view table' PARTITIONED BY(dt STRING, country STRING) STORED AS SEQUENCEFILE; |
The statement above creates the page_view table with viewTime, userid, page_url, referrer_url, and ip columns (including comments). The table is also partitioned and data is stored in sequence files. The data format in the files is assumed to be field-delimited by ctrl-A and row-delimited by newline.
Code Block | ||
---|---|---|
| ||
CREATE TABLE page_view(viewTime INT, userid BIGINT, page_url STRING, referrer_url STRING, ip STRING COMMENT 'IP Address of the User') COMMENT 'This is the page view table' PARTITIONED BY(dt STRING, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' STORED AS SEQUENCEFILE; |
The above statement lets you create the same table as the previous table.
In the previous examples the data is stored in <hive.metastore.warehouse.dir>/page_view. Specify a value for the key hive.metastore.warehouse.dir
in the Hive config file hive-site.xml.
External Tables
The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use a default location for this table. This comes in handy if you already have data generated. When dropping an EXTERNAL table, data in the table is NOT deleted from the file system. Starting Hive 4.0.0 (
Jira | ||||||
---|---|---|---|---|---|---|
|
An EXTERNAL table points to any HDFS location for its storage, rather than being stored in a folder specified by the configuration property hive.metastore.warehouse.dir
.
Code Block | ||
---|---|---|
| ||
CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT, page_url STRING, referrer_url STRING, ip STRING COMMENT 'IP Address of the User', country STRING COMMENT 'country of origination') COMMENT 'This is the staging page view table' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054' STORED AS TEXTFILE LOCATION '<hdfs_location>'; |
You can use the above statement to create a page_view table which points to any HDFS location for its storage. But you still have to make sure that the data is delimited as specified in the CREATE statement above.
For another example of creating an external table, see Loading Data in the Tutorial.
Create Table As Select (CTAS)
Tables can also be created and populated by the results of a query in one create-table-as-select (CTAS) statement. The table created by CTAS is atomic, meaning that the table is not seen by other users until all the query results are populated. So other users will either see the table with the complete results of the query or will not see the table at all.
There are two parts in CTAS, the SELECT part can be any SELECT statement supported by HiveQL. The CREATE part of the CTAS takes the resulting schema from the SELECT part and creates the target table with other table properties such as the SerDe and storage format.
Starting with Hive 3.2.0, CTAS statements can define a partitioning specification for the target table (HIVE-20241).
CTAS has these restrictions:
- The target table cannot be an external table.
- The target table cannot be a list bucketing table.
Code Block | ||
---|---|---|
| ||
CREATE TABLE new_key_value_store ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe" STORED AS RCFile AS SELECT (key % 1024) new_key, concat(key, value) key_value_pair FROM key_value_store SORT BY new_key, key_value_pair; |
The above CTAS statement creates the target table new_key_value_store with the schema (new_key DOUBLE, key_value_pair STRING) derived from the results of the SELECT statement. If the SELECT statement does not specify column aliases, the column names will be automatically assigned to _col0, _col1, and _col2 etc. In addition, the new target table is created using a specific SerDe and a storage format independent of the source tables in the SELECT statement.
Starting with Hive 0.13.0, the SELECT statement can include one or more common table expressions (CTEs), as shown in the SELECT syntax. For an example, see Common Table Expression.
Being able to select data from one table to another is one of the most powerful features of Hive. Hive handles the conversion of the data from the source format to the destination format as the query is being executed.
Create Table Like
The LIKE form of CREATE TABLE allows you to copy an existing table definition exactly (without copying its data). In contrast to CTAS, the statement below creates a new empty_key_value_store table whose definition exactly matches the existing key_value_store in all particulars other than table name. The new table contains no rows.
Code Block |
---|
CREATE TABLE empty_key_value_store LIKE key_value_store [TBLPROPERTIES (property_name=property_value, ...)]; |
Before Hive 0.8.0, CREATE TABLE LIKE view_name would make a copy of the view. In Hive 0.8.0 and later releases, CREATE TABLE LIKE view_name creates a table by adopting the schema of view_name (fields and partition columns) using defaults for SerDe and file formats.
Bucketed Sorted Tables
Code Block | ||
---|---|---|
| ||
CREATE TABLE page_view(viewTime INT, userid BIGINT, page_url STRING, referrer_url STRING, ip STRING COMMENT 'IP Address of the User') COMMENT 'This is the page view table' PARTITIONED BY(dt STRING, country STRING) CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' COLLECTION ITEMS TERMINATED BY '\002' MAP KEYS TERMINATED BY '\003' STORED AS SEQUENCEFILE; |
In the example above, the page_view table is bucketed (clustered by) userid and within each bucket the data is sorted in increasing order of viewTime. Such an organization allows the user to do efficient sampling on the clustered column - in this case userid. The sorting property allows internal operators to take advantage of the better-known data structure while evaluating queries, also increasing efficiency. MAP KEYS and COLLECTION ITEMS keywords can be used if any of the columns are lists or maps.
The CLUSTERED BY and SORTED BY creation commands do not affect how data is inserted into a table – only how it is read. This means that users must be careful to insert data correctly by specifying the number of reducers to be equal to the number of buckets, and using CLUSTER BY and SORT BY commands in their query.
There is also an example of creating and populating bucketed tables.
Skewed Tables
Info | ||
---|---|---|
| ||
As of Hive 0.10.0 (HIVE-3072 and HIVE-3649). See HIVE-3026 for additional JIRA tickets that implemented list bucketing in Hive 0.10.0 and 0.11.0. |
Info | ||
---|---|---|
| ||
Read the Skewed Join Optimization and List Bucketing design documents for more information. |
This feature can be used to improve performance for tables where one or more columns have skewed values. By specifying the values that appear very often (heavy skew) Hive will split those out into separate files (or directories in case of list bucketing) automatically and take this fact into account during queries so that it can skip or include the whole file (or directory in case of list bucketing) if possible.
This can be specified on a per-table level during table creation.
The following example shows one column with three skewed values, optionally with the STORED AS DIRECTORIES clause which specifies list bucketing.
Code Block | ||
---|---|---|
| ||
CREATE TABLE list_bucket_single (key STRING, value STRING) SKEWED BY (key) ON (1,5,6) [STORED AS DIRECTORIES]; |
And here is an example of a table with two skewed columns.
Code Block | ||
---|---|---|
| ||
CREATE TABLE list_bucket_multiple (col1 STRING, col2 int, col3 STRING) SKEWED BY (col1, col2) ON (('s1',1), ('s3',3), ('s13',13), ('s78',78)) [STORED AS DIRECTORIES]; |
For corresponding ALTER TABLE statements, see Alter Table Skewed or Stored as Directories below.
Temporary Tables
Info | ||
---|---|---|
| ||
As of Hive 0.14.0 (HIVE-7090). |
A table that has been created as a temporary table will only be visible to the current session. Data will be stored in the user's scratch directory, and deleted at the end of the session.
If a temporary table is created with a database/table name of a permanent table which already exists in the database, then within that session any references to that table will resolve to the temporary table, rather than to the permanent table. The user will not be able to access the original table within that session without either dropping the temporary table, or renaming it to a non-conflicting name.
Temporary tables have the following limitations:
- Partition columns are not supported.
- No support for creation of indexes.
Starting in Hive 1.1.0 the storage policy for temporary tables can be set to memory
, ssd
, or default
with the hive.exec.temporary.table.storage configuration parameter (see HDFS Storage Types and Storage Policies).
Code Block | ||
---|---|---|
| ||
CREATE TEMPORARY TABLE list_bucket_multiple (col1 STRING, col2 int, col3 STRING); |
Transactional Tables
Info | ||
---|---|---|
| ||
As of Hive 4.0 (HIVE-18453). |
A table that supports operations with ACID semantics. See this for more details about transactional tables.
Code Block | ||
---|---|---|
| ||
CREATE TRANSACTIONAL TABLE transactional_table_test(key string, value string) PARTITIONED BY(ds string) STORED AS ORC; |
Constraints
Info | ||
---|---|---|
| ||
As of Hive 2.1.0 (HIVE-13290). |
Hive includes support for non-validated primary and foreign key constraints. Some SQL tools generate more efficient queries when constraints are present. Since these constraints are not validated, an upstream system needs to ensure data integrity before it is loaded into Hive.
Code Block | ||
---|---|---|
| ||
create table pk(id1 integer, id2 integer, primary key(id1, id2) disable novalidate); create table fk(id1 integer, id2 integer, constraint c1 foreign key(id1, id2) references pk(id2, id1) disable novalidate); disable novalidate); |
Info | ||
---|---|---|
| ||
As of Hive 3.0.0 (HIVE-16575, HIVE-18726, HIVE-18953). |
Hive includes support for UNIQUE, NOT NULL, DEFAULT and CHECK constraints. Beside UNIQUE all three type of constraints are enforced.
Code Block | ||
---|---|---|
| ||
create table constraints1(id1 integer UNIQUE disable novalidate, id2 integer NOT NULL,
usr string DEFAULT current_user(), price double CHECK (price > 0 AND price <= 1000));
create table constraints2(id1 integer, id2 integer,
constraint c1_unique UNIQUE(id1) disable novalidate);
create table constraints3(id1 integer, id2 integer,
constraint c1_check CHECK(id1 + id2 > 0));
|
DEFAULT on complex data types such as map, struct, array is not supported.
Drop Table
Code Block |
---|
DROP TABLE [IF EXISTS] table_name [PURGE]; -- (Note: PURGE available in Hive 0.14.0 and later) |
DROP TABLE removes metadata and data for this table. The data is actually moved to the .Trash/Current directory if Trash is configured (and PURGE is not specified). The metadata is completely lost.
When dropping an EXTERNAL table, data in the table will NOT be deleted from the file system. Starting Hive 4.0.0 (
Jira | ||||||
---|---|---|---|---|---|---|
|
When dropping a table referenced by views, no warning is given (the views are left dangling as invalid and must be dropped or recreated by the user).
Otherwise, the table information is removed from the metastore and the raw data is removed as if by 'hadoop dfs -rm'. In many cases, this results in the table data being moved into the user's .Trash folder in their home directory; users who mistakenly DROP TABLEs may thus be able to recover their lost data by recreating a table with the same schema, recreating any necessary partitions, and then moving the data back into place manually using Hadoop. This solution is subject to change over time or across installations as it relies on the underlying implementation; users are strongly encouraged not to drop tables capriciously.
Info | ||
---|---|---|
| ||
The PURGE option is added in version 0.14.0 by HIVE-7100. |
If PURGE is specified, the table data does not go to the .Trash/Current directory and so cannot be retrieved in the event of a mistaken DROP. The purge option can also be specified with the table property auto.purge (see TBLPROPERTIES above).
In Hive 0.7.0 or later, DROP returns an error if the table doesn't exist, unless IF EXISTS is specified or the configuration variable hive.exec.drop.ignorenonexistent is set to true.
See the Alter Partition section below for how to drop partitions.
Truncate Table
Info | ||
---|---|---|
| ||
As of Hive 0.11.0 (HIVE-446). |
Code Block |
---|
TRUNCATE TABLE table_name [PARTITION partition_spec]; partition_spec: : (partition_column = partition_col_value, partition_column = partition_col_value, ...) |
Removes all rows from a table or partition(s). The rows will be trashed if the filesystem Trash is enabled, otherwise they are deleted (as of Hive 2.2.0 with HIVE-14626). Currently the target table should be native/managed table or an exception will be thrown. User can specify partial partition_spec for truncating multiple partitions at once and omitting partition_spec will truncate all partitions in the table.
Starting with HIVE 2.3.0 (HIVE-15880) if the table property "auto.purge" (see TBLPROPERTIES above) is set to "true" the data of the table is not moved to Trash when a TRUNCATE TABLE command is issued against it and cannot be retrieved in the event of a mistaken TRUNCATE. This is applicable only for managed tables (see managed tables). This behavior can be turned off if the "auto.purge" property is unset or set to false for a managed table.
...
Table of Content Zone | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Alter table statements enable you to change the structure of an existing table. You can add columns/partitions, change SerDe, add table and SerDe properties, or rename the table itself. Similarly, alter table partition statements allow you change the properties of a specific partition in the named table. Alter TableRename Table
This statement lets you change the name of a table to a different name. As of version 0.6, a rename on a managed table moves its HDFS location. Rename has been changed as of version 2.2.0 (HIVE-14909) so that a managed table's HDFS location is moved only if the table is created without a LOCATION clause and under its database directory. Hive versions prior to 0.6 just renamed the table in the metastore without moving the HDFS location. Alter Table Properties
You can use this statement to add your own metadata to the tables. Currently last_modified_user, last_modified_time properties are automatically added and managed by Hive. Users can add their own properties to this list. You can do DESCRIBE EXTENDED TABLE to get this information. For more information, see the TBLPROPERTIES clause in Create Table above. Alter Table CommentTo change the comment of a table you have to change the
Add SerDe Properties
These statements enable you to change a table's SerDe or add user-defined metadata to the table's SerDe object. The SerDe properties are passed to the table's SerDe when it is being initialized by Hive to serialize and deserialize data. So users can store any information required for their custom SerDe here. Refer to the SerDe documentation and Hive SerDe in the Developer Guide for more information, and see Row Format, Storage Format, and SerDe above for details about setting a table's SerDe and SERDEPROPERTIES in a CREATE TABLE statement. Note that both
Alter Table Storage Properties
These statements change the table's physical storage properties. NOTE: These commands will only modify Hive's metadata, and will NOT reorganize or reformat existing data. Users should make sure the actual data layout conforms with the metadata definition. Alter Table Skewed or Stored as Directories
A table's SKEWED and STORED AS DIRECTORIES options can be changed with ALTER TABLE statements. See Skewed Tables above for the corresponding CREATE TABLE syntax. Alter Table Skewed
The STORED AS DIRECTORIES option determines whether a skewed table uses the list bucketing feature, which creates subdirectories for skewed values. Alter Table Not Skewed
The NOT SKEWED option makes the table non-skewed and turns off the list bucketing feature (since a list-bucketing table is always skewed). This affects partitions created after the ALTER statement, but has no effect on partitions created before the ALTER statement. Alter Table Not Stored as Directories
This turns off the list bucketing feature, although the table remains skewed. Alter Table Set Skewed Location
This changes the location map for list bucketing. Alter Table Constraints
Table constraints can be added or removed via ALTER TABLE statements.
Additional Alter Table StatementsSee Alter Either Table or Partition below for more DDL statements that alter tables. Alter PartitionPartitions can be added, renamed, exchanged (moved), dropped, or (un)archived by using the PARTITION clause in an ALTER TABLE statement, as described below. To make the metastore aware of partitions that were added directly to HDFS, you can use the metastore check command (MSCK) or on Amazon EMR you can use the RECOVER PARTITIONS option of ALTER TABLE. See Alter Either Table or Partition below for more ways to alter partitions.
Add Partitions
You can use ALTER TABLE ADD PARTITION to add partitions to a table. Partition values should be quoted only if they are strings. The location must be a directory inside of which data files reside. (ADD PARTITION changes the table metadata, but does not load data. If the data does not exist in the partition's location, queries will not return any results.) An error is thrown if the partition_spec for the table already exists. You can use IF NOT EXISTS to skip the error.
Specifically, the following example will FAIL silently and without error in Hive 0.7, and all queries will go only to dt='2008-08-08' partition, no matter which partition you specify.
In Hive 0.8 and later, you can add multiple partitions in a single ALTER TABLE statement as shown in the previous example. In Hive 0.7, if you want to add many partitions you should use the following form:
Dynamic PartitionsPartitions can be added to a table dynamically, using a Hive INSERT statement (or a Pig STORE statement). See these documents for details and examples:
Rename Partition
This statement lets you change the value of a partition column. One of use cases is that you can use this statement to normalize your legacy partition column value to conform to its type. In this case, the type conversion and normalization are not enabled for the column values in old partition_spec even with property hive.typecheck.on.insert set to true (default) which allows you to specify any legacy data in form of string in the old partition_spec. Exchange PartitionPartitions can be exchanged (moved) between tables.
This statement lets you move the data in a partition from a table to another table that has the same schema and does not already have that partition. Discover PartitionsAutomatically discovers and synchronizes the metadata of the partition in Hive Metastore. When External Partitioned Tables are created, "discover.partitions"="true" table property gets automatically added. For managed partitioned tables, "discover.partitions" table property can be manually added. When Hive Metastore Service (HMS) is started in remote service mode, a background thread (PartitionManagementTask) gets scheduled periodically every 300s (configurable via metastore.partition.management.task.frequency config) that looks for tables with "discover.partitions" table property set to true and performs msck repair in sync mode. If the table is a transactional table, then Exclusive Lock is obtained for that table before performing msck repair. With this table property, "MSCK REPAIR TABLE table_name SYNC PARTITIONS" is no longer required to be run manually.
Partition RetentionTable property "partition.retention.period" can now be specified for partitioned tables with a retention interval. When a retention interval is specified, the background thread running in HMS (refer Discover Partitions section), will check the age (creation time) of the partition and if the partition's age is older than the retention period, it will be dropped. Dropping partitions after retention period will also delete the data in that partition. For example, if an external partitioned table with 'date' partition is created with table properties "discover.partitions"="true" and "partition.retention.period"="7d" then only the partitions created in last 7 days are retained.
Recover Partitions (MSCK REPAIR TABLE)Hive stores a list of partitions for each table in its metastore. If, however, new partitions are directly added to HDFS (say by using However, users can run a metastore check command with the repair table option:
which will update metadata about partitions to the Hive metastore for partitions for which such metadata doesn't already exist. The default option for MSC command is ADD PARTITIONS. With this option, it will add any partitions that exist on HDFS but not in metastore to the metastore. The DROP PARTITIONS option will remove the partition information from metastore, that is already removed from HDFS. The SYNC PARTITIONS option is equivalent to calling both ADD and DROP PARTITIONS. See HIVE-874 and HIVE-17824 for more details. When there is a large number of untracked partitions, there is a provision to run MSCK REPAIR TABLE batch wise to avoid OOME (Out of Memory Error). By giving the configured batch size for the property hive.msck.repair.batch.size it can run in the batches internally. The default value of the property is zero, it means it will execute all the partitions at once. MSCK command without the REPAIR option can be used to find details about metadata mismatch metastore. The equivalent command on Amazon Elastic MapReduce (EMR)'s version of Hive is:
Starting with Hive 1.3, MSCK will throw exceptions if directories with disallowed characters in partition values are found on HDFS. Use hive.msck.path.validation setting on the client to alter this behavior; "skip" will simply skip the directories. "ignore" will try to create partitions anyway (old behavior). This may or may not work. Drop Partitions
You can use ALTER TABLE DROP PARTITION to drop a partition for a table. This removes the data and metadata for this partition. The data is actually moved to the .Trash/Current directory if Trash is configured, unless PURGE is specified, but the metadata is completely lost (see Drop Table above).
For tables that are protected by NO_DROP CASCADE, you can use the predicate IGNORE PROTECTION to drop a specified partition or set of partitions (for example, when splitting a table between two Hadoop clusters):
The above command will drop that partition regardless of protection stats.
If PURGE is specified, the partition data does not go to the .Trash/Current directory and so cannot be retrieved in the event of a mistaken DROP:
The purge option can also be specified with the table property auto.purge (see TBLPROPERTIES above). In Hive 0.7.0 or later, DROP returns an error if the partition doesn't exist, unless IF EXISTS is specified or the configuration variable hive.exec.drop.ignorenonexistent is set to true.
(Un)Archive Partition
Archiving is a feature to moves a partition's files into a Hadoop Archive (HAR). Note that only the file count will be reduced; HAR does not provide any compression. See LanguageManual Archiving for more information Alter Either Table or PartitionAlter Table/Partition File Format
This statement changes the table's (or partition's) file format. For available file_format options, see the section above on CREATE TABLE. The operation only changes the table metadata. Any conversion of existing data must be done outside of Hive. Alter Table/Partition Location
Alter Table/Partition Touch
TOUCH reads the metadata, and writes it back. This has the effect of causing the pre/post execute hooks to fire. An example use case is if you have a hook that logs all the tables/partitions that were modified, along with an external script that alters the files on HDFS directly. Since the script modifies files outside of hive, the modification wouldn't be logged by the hook. The external script could call TOUCH to fire the hook and mark the said table or partition as modified. Also, it may be useful later if we incorporate reliable last modified times. Then touch would update that time as well. Note that TOUCH doesn't create a table or partition if it doesn't already exist. (See Create Table.) Alter Table/Partition Protections
Protection on data can be set at either the table or partition level. Enabling NO_DROP prevents a table from being dropped. Enabling OFFLINE prevents the data in a table or partition from being queried, but the metadata can still be accessed. If any partition in a table has NO_DROP enabled, the table cannot be dropped either. Conversely, if a table has NO_DROP enabled then partitions may be dropped, but with NO_DROP CASCADE partitions cannot be dropped either unless the drop partition command specifies IGNORE PROTECTION. Alter Table/Partition Compact
In general you do not need to request compactions when Hive transactions are being used, because the system will detect the need for them and initiate the compaction. However, if compaction is turned off for a table or you want to compact the table at a time the system would not choose to, ALTER TABLE can initiate the compaction. By default the statement will enqueue a request for compaction and return. To watch the progress of the compaction, use SHOW COMPACTIONS. As of Hive 2.2.0 "AND WAIT" may be specified to have the operation block until compaction completes. The compaction_type can be MAJOR or MINOR. See the Basic Design section in Hive Transactions for more information. Alter Table/Partition Concatenate
If the table or partition contains many small RCFiles or ORC files, then the above command will merge them into larger files. In case of RCFile the merge happens at block level whereas for ORC files the merge happens at stripe level thereby avoiding the overhead of decompressing and decoding the data. Alter Table/Partition Update columns
Tables that have serdes which self-describe the table schema may have different schemas in reality and the ones stored in Hive Metastore. For example when a user creates an Avro stored table using a schema url or schema literal, the schema will be inserted into HMS and then will never be changed in HMS regardless of url or literal changes within the serde. This can lead to problems especially when integrating with other Apache components. The update columns feature provides a way for the user to let any schema changes made in the serde to be synced into HMS. It works on both the table and the partitions levels, and obviously only for tables whose schema is not tracked by HMS (see metastore.serdes.using.metastore.for.schema). Using the command on these latter serde types will result in error. Alter ColumnRules for Column NamesColumn names are case insensitive.
Change Column Name/Type/Position/Comment
This command will allow users to change a column's name, data type, comment, or position, or an arbitrary combination of them. The PARTITION clause is available in Hive 0.14.0 and later; see Upgrading Pre-Hive 0.13.0 Decimal Columns for usage. A patch for Hive 0.13 is also available (see HIVE-7971). The CASCADE|RESTRICT clause is available in Hive 1.1.0. ALTER TABLE CHANGE COLUMN with CASCADE command changes the columns of a table's metadata, and cascades the same change to all the partition metadata. RESTRICT is the default, limiting column change only to table metadata.
Add/Replace Columns
ADD COLUMNS lets you add new columns to the end of the existing columns but before the partition columns. This is supported for Avro backed tables as well, for Hive 0.14 and later. REPLACE COLUMNS removes all existing columns and adds the new set of columns. This can be done only for tables with a native SerDe (DynamicSerDe, MetadataTypedColumnsetSerDe, LazySimpleSerDe and ColumnarSerDe). Refer to Hive SerDe for more information. REPLACE COLUMNS can also be used to drop columns. For example, " The PARTITION clause is available in Hive 0.14.0 and later; see Upgrading Pre-Hive 0.13.0 Decimal Columns for usage. The CASCADE|RESTRICT clause is available in Hive 1.1.0. ALTER TABLE ADD|REPLACE COLUMNS with CASCADE command changes the columns of a table's metadata, and cascades the same change to all the partition metadata. RESTRICT is the default, limiting column changes only to table metadata.
Partial Partition SpecificationAs of Hive 0.14 (HIVE-8411), users are able to provide a partial partition spec for certain above alter column statements, similar to dynamic partitioning. So rather than having to issue an alter column statement for each partition that needs to be changed:
... you can change many existing partitions at once using a single ALTER statement with a partial partition specification:
Similar to dynamic partitioning, hive.exec.dynamic.partition must be set to true to enable use of partial partition specs during ALTER PARTITION. This is supported for the following operations:
|
...