
...

In order of granularity, Hive data is organized into:

  • Databases: Namespaces that keep the names of tables and other data units from conflicting.
  • Tables: Homogeneous units of data which have the same schema. An example of a table could be the page_views table, where each row could consist of the following columns (schema):
    • timestamp - which is of INT type that corresponds to a unix timestamp of when the page was viewed.
    • userid - which is of BIGINT type that identifies the user who viewed the page.
    • page_url - which is of STRING type that captures the location of the page.
    • referer_url - which is of STRING type that captures the location of the page from where the user arrived at the current page.
    • IP - which is of STRING type that captures the IP address from where the page request was made.
  • Partitions: Each Table can have one or more partition keys, which determine how the data is stored. Partitions, apart from being storage units, also allow the user to efficiently identify the rows that satisfy certain criteria. For example, a table might have a date_partition of type STRING and a country_partition of type STRING. Each unique combination of partition key values defines a partition of the Table. For example, all "US" data from "2009-12-23" is a partition of the page_views table. Therefore, if you run analysis on only the "US" data for 2009-12-23, you can run that query only on the relevant partition of the table, thereby speeding up the analysis significantly. Note, however, that just because a partition is named 2009-12-23 does not mean that it contains all or only data from that date; partitions are named after dates for convenience, but it is the user's job to guarantee the relationship between partition name and data content! Partition columns are virtual columns: they are not part of the data itself but are derived on load.
  • Buckets (or Clusters): Data in each partition may in turn be divided into Buckets based on the value of a hash function of some column of the Table. For example, the page_views table may be bucketed by userid, which is one of the columns, other than the partition columns, of the page_views table. Buckets can be used to efficiently sample the data.
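
The layout described above can be modeled in a few lines of plain Python. This is only a sketch of the idea, not Hive code: the bucket count, the use of modulo as the hash function, the sample rows, and the function names load, scan and sample are all assumptions made for illustration.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, not taken from the text


def bucket_for(userid, num_buckets=NUM_BUCKETS):
    """Pick a bucket from a hash of the bucketing column.

    Plain modulo stands in for the hash function (an assumption).
    """
    return userid % num_buckets


def load(rows):
    """Group rows by partition key values, then by bucket inside each partition."""
    table = defaultdict(lambda: defaultdict(list))
    for row in rows:
        # Partition keys are supplied on load and are not stored with the row data.
        partition = (row["date_partition"], row["country_partition"])
        data = {k: v for k, v in row.items() if not k.endswith("_partition")}
        table[partition][bucket_for(data["userid"])].append(data)
    return table


def scan(table, date, country):
    """Read only the matching partition; other partitions are never visited."""
    buckets = table.get((date, country), {})
    return [row for bucket in buckets.values() for row in bucket]


def sample(table, date, country, bucket):
    """Read a single bucket of one partition as a cheap sample."""
    return list(table.get((date, country), {}).get(bucket, []))


rows = [
    {"userid": 1, "page_url": "/home", "date_partition": "2009-12-23", "country_partition": "US"},
    {"userid": 6, "page_url": "/cart", "date_partition": "2009-12-23", "country_partition": "US"},
    {"userid": 2, "page_url": "/home", "date_partition": "2009-12-23", "country_partition": "CA"},
    {"userid": 5, "page_url": "/help", "date_partition": "2009-12-24", "country_partition": "US"},
]

page_views = load(rows)
us_rows = scan(page_views, "2009-12-23", "US")
us_sample = sample(page_views, "2009-12-23", "US", bucket_for(6))
```

Here scan looks up one (date, country) key instead of filtering every row, which is the saving a partitioned query gets, and sample reads one bucket rather than the whole partition.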

...

This type hierarchy defines how the types are implicitly converted in the query language. Implicit conversion is allowed from a child type to an ancestor. So when a query expression expects type1 and the data is of type2, type2 is implicitly converted to type1 if type1 is an ancestor of type2 in the type hierarchy. Apart from these fundamental rules for implicit conversion based on the type system, Hive also allows one special-case conversion:

  • <STRING> to <DOUBLE>

Explicit type conversion can be done using the cast operator as shown in the Built In Functions section below.
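
The ancestor rule can be sketched as a walk up a parent map. The snippet below is a plain-Python model, not Hive code, and the numeric chain in PARENT is an assumed fragment used only for illustration; the authoritative hierarchy is the one defined in this document. The names PARENT, SPECIAL_CASES, is_ancestor and can_convert_implicitly are hypothetical.

```python
# Assumed fragment of a type hierarchy: each type maps to its parent.
# Illustrative only; consult the hierarchy defined in the document itself.
PARENT = {
    "TINYINT": "SMALLINT",
    "SMALLINT": "INT",
    "INT": "BIGINT",
    "BIGINT": "FLOAT",
    "FLOAT": "DOUBLE",
    "DOUBLE": None,
}

# The special case named in the text: STRING may be converted to DOUBLE.
SPECIAL_CASES = {("STRING", "DOUBLE")}


def is_ancestor(ancestor, child):
    """Return True if `ancestor` is reachable by walking up from `child`."""
    current = PARENT.get(child)
    while current is not None:
        if current == ancestor:
            return True
        current = PARENT.get(current)
    return False


def can_convert_implicitly(expected, actual):
    """type2 (actual) converts to type1 (expected) if type1 is an ancestor of type2."""
    return (
        expected == actual
        or is_ancestor(expected, actual)
        or (actual, expected) in SPECIAL_CASES
    )
```

Under these assumptions an INT value is accepted where a DOUBLE is expected, but not the other way round; a conversion in that direction would need an explicit cast.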

...

  • Structs: the elements within the type can be accessed using the DOT (.) notation. For example, for a column c of type STRUCT {a INT; b INT}, the a field is accessed by the expression c.a
  • Maps (key-value tuples): The elements are accessed using ['element name'] notation. For example, in a map M comprising a mapping from 'group' -> gid, the gid value can be accessed using M['group']
  • Arrays (indexable lists): The elements in the array have to be of the same type. Elements can be accessed using the [n] notation where n is an index (zero-based) into the array. For example, for an array A having the elements ['a', 'b', 'c'], A[1] returns 'b'.
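
The three access notations have close analogues in plain Python, which makes for a quick sanity check of the examples above. This is an analogy rather than Hive code; the class name C, the field values, and the gid value 42 are made up for illustration.

```python
from typing import NamedTuple


class C(NamedTuple):
    """Stands in for a column c of type STRUCT {a INT; b INT}."""
    a: int
    b: int


c = C(a=1, b=2)        # field values are hypothetical
M = {"group": 42}      # 42 is a made-up gid
A = ["a", "b", "c"]    # all elements share one type

struct_field = c.a         # DOT notation
map_value = M["group"]     # ['element name'] notation
array_element = A[1]       # zero-based [n] notation
```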

Using the primitive types and the constructs for creating complex types, types with arbitrary levels of nesting can be created. For example, a type User may comprise the following fields:

...

|| Operator || Operand types || Description ||
| A[n] | A is an Array and n is an int | Returns the nth element in the array A. The first element has index 0, e.g. if A is an array comprising ['foo', 'bar'] then A[0] returns 'foo' and A[1] returns 'bar' |
| M[key] | M is a Map<K, V> and key has type K | Returns the value corresponding to the key in the map, e.g. if M is a map comprising {'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'} then M['all'] returns 'foobar' |
| S.x | S is a struct | Returns the x field of S, e.g. for struct foobar {int foo, int bar}, foobar.foo returns the integer stored in the foo field of the struct |

...

|| Return Type || Aggregation Function Name (Signature) || Description ||
| BIGINT | count(*), count(expr), count(DISTINCT expr[, expr...]) | count(*) - Returns the total number of retrieved rows, including rows containing NULL values; count(expr) - Returns the number of rows for which the supplied expression is non-NULL; count(DISTINCT expr[, expr]) - Returns the number of rows for which the supplied expression(s) are unique and non-NULL |
| DOUBLE | sum(col), sum(DISTINCT col) | Returns the sum of the elements in the group or the sum of the distinct values of the column in the group |
| DOUBLE | avg(col), avg(DISTINCT col) | Returns the average of the elements in the group or the average of the distinct values of the column in the group |
| DOUBLE | min(col) | Returns the minimum value of the column in the group |
| DOUBLE | max(col) | Returns the maximum value of the column in the group |
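
The difference between the three count forms, and between the plain and DISTINCT variants of sum and avg, is easiest to see on a small group. The following is a plain-Python model of the descriptions in the table, not Hive code. None stands in for NULL, the sample values are made up, and skipping NULLs in sum, avg, min and max is an assumption, since the table only spells out NULL handling for count.

```python
# One group of column values; None models a NULL.
values = [10, 20, 20, None]

non_null = [v for v in values if v is not None]
distinct = set(non_null)

count_star = len(values)          # count(*): every row, NULLs included
count_expr = len(non_null)        # count(expr): rows where expr is non-NULL
count_distinct = len(distinct)    # count(DISTINCT expr): unique non-NULL values

# Assumption: NULLs are skipped by the remaining aggregates.
sum_all = float(sum(non_null))            # sum(col)
sum_distinct = float(sum(distinct))       # sum(DISTINCT col)
avg_all = sum_all / count_expr            # avg(col)
avg_distinct = sum_distinct / count_distinct  # avg(DISTINCT col)
min_value = float(min(non_null))          # min(col)
max_value = float(max(non_null))          # max(col)
```

The duplicate 20 is counted twice by sum(col) and once by sum(DISTINCT col), which is why the two sums and the two averages differ on this group.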

...