Table of Contents |
---|
Group By Syntax
Code Block |
---|
groupByClause: GROUP BY groupByExpression (, groupByExpression)*
groupByExpression: expression
groupByQuery: SELECT expression (, expression)* FROM src groupByClause?
|
...
In order to count the number of rows in a table:
Code Block |
---|
SELECT COUNT(*) FROM table2;
|
...
In order to count the number of distinct users by gender one could write the following query:
Code Block |
---|
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count (DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;
|
Multiple aggregations can be done at the same time, however, no two aggregations can have different DISTINCT columns. For example, the following is possible because count(DISTINCT) and sum(DISTINCT) specify the same column:
Code Block |
---|
INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(*), sum(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;
|
...
However, the following query is not allowed. We don't allow multiple DISTINCT expressions in the same query.
Code Block |
---|
INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip)
FROM pv_users
GROUP BY pv_users.gender;
|
...
When using group by clause, the select statement can only include columns included in the group by clause. Of course, you can have as many aggregation functions (e.g. count
) in the select statement as well.
Let's take a simple example
Code Block |
---|
CREATE TABLE t1(a INTEGER, b INTGER);
|
A group by query on the above table could look like:
Code Block |
---|
SELECT
a,
sum(b)
FROM
t1
GROUP BY
a;
|
...
However, the query below DOES NOT work:
Code Block |
---|
SELECT
a,
b
FROM
t1
GROUP BY
a;
|
This is because the select clause has an additional column (b
) that is not included in the group by clause (and it's not an aggregation function either). This is because, if the table t1
looked like:
Code Block |
---|
a b
------
100 1
100 2
100 3
|
...
The output of the aggregations or simple selects can be further sent into multiple tables or even to hadoop dfs files (which can then be manipulated using hdfs utilitites). e.g. if along with the gender breakdown, one needed to find the breakdown of unique page views by age, one could accomplish that with the following query:
Code Block |
---|
FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count(DISTINCT pv_users.userid)
GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum'
SELECT pv_users.age, count(DISTINCT pv_users.userid)
GROUP BY pv_users.age;
|
...
hive.map.aggr controls how we do aggregations. The default is false. If it is set to true, Hive will do the first-level aggregation directly in the map task.
This usually provides better efficiency, but may require more memory to run successfully.
Code Block |
---|
set hive.map.aggr=true;
SELECT COUNT(*) FROM table2;
|
...
Info | ||
---|---|---|
| ||
Grouping sets, CUBE and ROLLUP operators, and the GROUPING__ID function were added in Hive release 0.10.0. |
See Enhanced Aggregation, Cube, Grouping and Rollup for information about these aggregation operators.
...