Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Code Block
  INSERT OVERWRITE TABLE pv_gender_agg
  SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip)
  FROM pv_users
  GROUP BY pv_users.gender;

Select statement and group by clause

When using group by clause, the select statement can only include columns included in the group by clause. Of course, you can have as many aggregation functions (e.g. count) in the select statement as well.
Let's take a simple example

Code Block

CREATE TABLE t1(a INTEGER, b INTGER);

A group by query on the above table could look like:

Code Block

SELECT
   a,
   sum(b)
FROM
   t1
GROUP BY
   a;

The above query works because the select clause contains a (the group by key) and an aggregation function (sum(b)).

However, the query below DOES NOT work:

Code Block

SELECT
   a,
   b
FROM
   t1
GROUP BY
   a;

This is because the select clause has an additional column (b) that is not included in the group by clause (and it's not an aggregation function either). This is because, if the table t1 looked like:

Code Block

a    b
------
100  1
100  2
100  3

Since the grouping is only done on a, what value of b should Hive display for the group a=100? One can argue that it should be the first value or the lowest value but we all agree that there are multiple possible options. Hive does away with this guessing by making it invalid SQL (HQL, to be precise) to have a column in the select clause that is not included in the group by clause.

Advanced Features

Multi-Group-By Inserts

...