= Hive Data Manipulation Language =
There are two primary ways of modifying data in Hive:
Loading files into tables
Hive does not do any transformation while loading data into tables. Load operations are currently pure copy/move operations that move datafiles into locations corresponding to Hive tables.
Syntax
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
Synopsis
Load operations are current pure copy/move operations that move datafiles into locations corresponding to Hive tables.
- filepath can be a
- relative path, eg:
project/data1
- absolute path, eg:
/user/hive/project/data1
- a full URI with scheme and (optionally) an authority, eg:
hdfs://namenode:9000/user/hive/project/data1
- relative path, eg:
- The target being loaded to can be a table or a partition. If the table is partitioned, then one must specify a specific partition of the table by specifying values for all of the partitioning columns.
- filepath can refer to a file (in which case hive will move the file into the table) or it can be a directory (in which case hive will move all the files within that directory into the table). In either case filepath addresses a set of files.
- If the keyword LOCAL is specified, then:
- the load command will look for filepath in the local file system. If a relative path is specified - it will be interpreted relative to the current directory of the user. User can specify a full URI for local files as well - for example:
file:///user/hive/project/data1
- the load command will try to copy all the files addressed by filepath to the target filesystem. The target file system is inferred by looking at the location attribute of the table. The copied data files will then be moved to the table.
- the load command will look for filepath in the local file system. If a relative path is specified - it will be interpreted relative to the current directory of the user. User can specify a full URI for local files as well - for example:
- If the keyword LOCAL is not specified, then Hive will either use the full URI of filepath if one is specified. Otherwise the following rules are applied:
- If scheme or authority are not specified, Hive will use the scheme and authority from hadoop configuration variable
fs.default.name
that specifies the Namenode URI. - If the path is not absolute - then Hive will interpret it relative to
/user/<username>
- Hive will move the files addressed by filepath into the table (or partition)
- If scheme or authority are not specified, Hive will use the scheme and authority from hadoop configuration variable
- if the OVERWRITE keyword is used then the contents of the target table (or partition) will be deleted and replaced with the files referred to by filepath. Otherwise the files referred by filepath will be added to the table.
- Note that if the target table (or partition) already has a file whose name collides with any of the filenames contained in filepath - then the existing file will be replaced with the new file.
Notes
- Note that if the target table (or partition) already has a file whose name collides with any of the filenames contained in filepath - then the existing file will be replaced with the new file.
- filepath cannot contain subdirectories.
- If we are not using the keyword LOCAL - filepath must refer to files within the same filesystem as the table (or partition's) location.
- Hive does some minimal checks to make sure that the files being loaded match the target table. Currently it checks that if the table is stored in sequencefile format - that the files being loaded are also sequencefiles and vice versa.
- Please read Hive-CompressedStorage if your datafile is compressed
Inserting data into Hive Tables from queries
Query Results can be inserted into tables by using the insert clause
Syntax
Standard syntax: INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement Hive extension (multiple inserts): FROM from_statement INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 [INSERT OVERWRITE TABLE tablename2 [PARTITION ...] select_statement2] ... Hive extension (dynamic partition inserts): INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement
Synopsis
- Inserts can be done to a table or a partition. If the table is partitioned, then one must specify a specific partition of the table by specifying values for all of the partitioning columns.
- Multiple insert clauses (also known as Multi Table Insert) can be specified in the same query
- The output of each of the select statements is written to the chosen table (or partition). Currently the OVERWRITE keyword is mandatory and implies that the contents of the chosen table or partition are replaced with the output of corresponding select statement.
- The output format and serialization class is determined by the table's metadata (as specified via DDL commands on the table)
- In the dynamic partition inserts, users can give partial partition specification, which means you just specify the list of partition column names in the PARTITION clause. The column values are optional. If a partition column value is given, we call this static partition, otherwise dynamic partition. Each dynamic partition column has a corresponding input column from the select statement. This means that the dynamic partition creation is determined by the value of the input column.
Notes
- Multi Table Inserts minimize the number of data scans required. Hive can insert data into multiple tables by scanning the input data just once (and applying different query operators) to the input data.
Writing data into filesystem from queries
Query results can be inserted into filesystem directories by using a slight variation of the syntax above:
Syntax
Standard syntax: INSERT OVERWRITE [LOCAL] DIRECTORY directory1 SELECT ... FROM ... Hive extension (multiple inserts): FROM from_statement INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1 [INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...
Synopsis
- directory can be full URI. If scheme or authority are not specified, Hive will use the scheme and authority from hadoop configuration variable
fs.default.name
that specifies the Namenode URI. - if LOCAL keyword is used - then Hive will write data to the directory on the local file system.
- Data written to the filesystem is serialized as text with columns separated by ^A and rows separated by newlines. If any of the columns are not of primitive type - then those columns are serialized to JSON format.
Notes
- Insert statements to directories, local directories and tables (or partitions) can all be used together within the same query.
- Inserts to HDFS filesystem directories is the best way to extract large amounts of data from Hive. Hive can write to HDFS directories in parallel from within a map-reduce job.