Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Hive is a data warehousing infrastructure based on Apache Hadoop. Hadoop provides massive scale out and fault tolerance capabilities for data storage and processing (using the map-reduce programming paradigm) on commodity hardware.

Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called Hive QL, which is based on SQL and which enables users familiar with SQL to do ad-hoc querying, summarization and data analysis easily. At the same time, Hive QL also allows traditional map/reduce programmers to be able to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.

What Hive Is NOT

...

analysis easily. At the same time, Hive SQL also gives users multiple places to integrate their own functionality to do custom analysis, such as User Defined Functions (UDFs).  

What Hive Is NOT

Hive is not designed for online transaction processing and does not offer real-time queries and row level updates. .  It is best used for batch jobs over large sets of immutable data (like web logs)traditional data warehousing tasks.

In the following sections we provide a tutorial on the capabilities of the system. We start by describing the concepts of data types, tables and partitions (which are very similar to what you would find in a traditional relational DBMS) and then illustrate the capabilities of the QL language Hive's SQL with the help of some examples.

...

  • Databases: Namespaces that separate tables and other data units from naming confliction, views, and function to avoid naming conflicts.  Databases can also be used to enforce security for a user or group of users.
  • Tables: Homogeneous units of data which have the same schema. An example of a table could be page_views table, where each row could comprise of the following columns (schema):
    • timestamp - which is of INT type that corresponds to a unix timestamp of when the page was viewed.
    • userid - which is of BIGINT type that identifies the user who viewed the page.
    • page_url - which is of STRING type that captures the location of the page.
    • referer_url - which is of STRING that captures the location of the page from where the user arrived at the current page.
    • IP - which is of STRING type that captures the IP address from where the page request was made.
  • Partitions: Each Table can have one or more partition Keys which determines how the data is stored. Partitions - apart from being storage units - also allow the user to efficiently identify the rows that satisfy a certain criteria. For example, a date_partition of type STRING and country_partition of type STRING. Each unique value of the partition keys defines a partition of the Table. For example all "US" data from "2009-12-23" is a partition of the page_views table. Therefore, if you run analysis on only the "US" data for 2009-12-23, you can run that query only on the relevant partition of the table thereby speeding up the analysis significantly. Note however, that just because a partition is named 2009-12-23 does not mean that it contains all or only data from that date; partitions are named after dates for convenience but it is the user's job to guarantee the relationship between partition name and data content!). Partition columns are virtual columns, they are not part of the data itself but are derived on load.
  • Buckets (or Clusters): Data in each partition may in turn be divided into Buckets based on the value of a hash function of some column of the Table. For example the page_views table may be bucketed by userid, which is one of the columns, other than the partitions columns, of the page_view table. These can be used to efficiently sample the data.

...

  • Types are associated with the columns in the tables. The following Primitive types are supported:
  • Integers
    • TINYINT - 1 byte integer
    • SMALLINT - 2 byte integer
    • INT - 4 byte integer
    • BIGINT - 8 byte integer
  • Boolean type
    • BOOLEAN - TRUE/FALSE
  • Floating point numbers
    • FLOAT - single precision
    • DOUBLE - Double precision
  • Fixed point numbers
    • DECIMAL - a fixed point value of user defined scale and precision
  • String typetypes
    • STRING - sequence of characters in a specified character set
    • VARCHAR - sequence of characters in a specified character set with a maximum length
    • CHAR - sequence of characters in a specified character set with a defined length
  • Date and time types
    • TIMESTAMP - a specific point in time, up to nanosecond precision
    • DATE - a date
  • Binary types
    • BINARY - a sequence of bytes

The Types are organized in the following hierarchy (where the parent is a super type of all the children instances):

...

Language Capabilities

Hive query language's SQL provides the basic SQL like operationsSQL operations. These operations work on tables or partitions. These operations are:

  • Ability to filter rows from a table using a where clause.
  • Ability to select certain columns from the table using a select clause.
  • Ability to do equi-joins between two tables.
  • Ability to evaluate aggregations on multiple "group by" columns for the data stored in a table.
  • Ability to store the results of a query into another table.
  • Ability to download the contents of a table to a local (e.g., nfs) directory.
  • Ability to store the results of a query in a hadoop dfs directory.
  • Ability to manage tables and partitions (create, drop and alter).
  • Ability to plug in custom scripts in the language of choice for custom map/reduce jobs.

Usage and Examples

NOTE: Many of the following examples are out of date.  More up to date information can be found in the LanguageManual.  

The following examples highlight some salient features of the system. A detailed set of query test cases can be found at Hive Query Test Cases and the corresponding results can be found at Query Test Case Results.

...