a
Group By Syntax
groupByClause: GROUP BY groupByExpression (, groupByExpression)* groupByExpression: expression groupByQuery: SELECT expression (, expression)* FROM src groupByClause?
Simple Examples
In order to count the number of rows in a table:
SELECT COUNT(*) FROM table2;
Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT.
In order to count the number of distinct users by gender one could write the following query:
INSERT OVERWRITE TABLE pv_gender_sum SELECT pv_users.gender, count (DISTINCT pv_users.userid) FROM pv_users GROUP BY pv_users.gender;
Multiple aggregations can be done at the same time, however, no two aggregations can have different DISTINCT columns .e.g while the following is possible
INSERT OVERWRITE TABLE pv_gender_agg SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(*), sum(DISTINCT pv_users.userid) FROM pv_users GROUP BY pv_users.gender;
Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT.
However, the following query is not allowed. We don't allow multiple DISTINCT expressions in the same query.
INSERT OVERWRITE TABLE pv_gender_agg SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip) FROM pv_users GROUP BY pv_users.gender;
Advanced Features
Multi-Group-By Inserts
The output of the aggregations or simple selects can be further sent into multiple tables or even to hadoop dfs files (which can then be manipulated using hdfs utilitites). e.g. if along with the gender breakdown, one needed to find the breakdown of unique page views by age, one could accomplish that with the following query:
FROM pv_users INSERT OVERWRITE TABLE pv_gender_sum SELECT pv_users.gender, count(DISTINCT pv_users.userid) GROUP BY pv_users.gender INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum' SELECT pv_users.age, count(DISTINCT pv_users.userid) GROUP BY pv_users.age;
Map-side Aggregation for Group By
hive.map.aggr controls how we do aggregations. The default is false. If it is set to true, Hive will do the first-level aggregation directly in the map task.
This usually provides better efficiency, but may require more memory to run successfully.
set hive.map.aggr=true; SELECT COUNT(*) FROM table2;
Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT.