Page History

...

Table 1 - GROUPING SET queries and the equivalent GROUP BY queries

Aggregate Query with GROUPING SETS	Equivalent Aggregate Query with GROUP BY
SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b GROUPING SETS ( (a,b) )	SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b
SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS ( (a,b), a)	SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b UNION SELECT a, null, SUM( c ) FROM tab1 GROUP BY a
SELECT a,b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS (a,b)	SELECT a, null, SUM( c ) FROM tab1 GROUP BY a UNION SELECT null, b, SUM( c ) FROM tab1 GROUP BY b
SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS ( (a, b), a, b, ( ) )	SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b UNION SELECT a, null, SUM( c ) FROM tab1 GROUP BY a, null UNION SELECT null, b, SUM( c ) FROM tab1 GROUP BY null, b UNION SELECT null, null, SUM( c ) FROM tab1

Grouping__ID function

When aggregates are displayed for a column its value is null. This may conflict in case the column itself has some null values. There needs to be some way to identify NULL in column, which means aggregate and NULL in column, which means value. GROUPING__ID function is the solution to that.

This function returns a bitvector corresponding to whether each column is present or not. For each column, a value of "01" is produced for a row in
the result set if that column has been aggregated in that row, otherwise the value is "10". This can be used to differentiate when there are nulls
in the data.

Consider the following example:

Column1 (key)	Column2 (value)
1	NULL
1	1
2	2
3	3
3	NULL
4	5

The following query:

Code Block

	sql
	sql

SELECT key, value, GROUPING__ID, count(*)

...


FROM T1
GROUP BY key, value WITH ROLLUP;

will have the following results.:

Column 1 (key)	Column 2 (value)	GROUPING__ID	count(*)
NULL	NULL	3	6
1	NULL

0

1	2
1	NULL

1

0	1
1	1	0	1
2	NULL	1	1
2	2	0	1
3	NULL

0

1	2
3	NULL

1

0	1
3	3	0	1
4	NULL	1	1
4	5	0	1

Note that the third column is a bitvector of columns being selected.
For the first row, none of the columns are being selected.
For the second row, only the first column is being selected, which explains the value 1.
For the third row, both the columns are being selected (and the second column happens to be null), which explains the value 0.
For the third row, only the first column is being selected, which explains the value 1.

Grouping function

The grouping function indicates whether an expression in a GROUP BY clause is aggregated or not for a given row. The value 0 represents a column that is part of the grouping set, while the value 1 represents a column that is not part of the grouping set.

Going back to our example above, consider the following query: SELECT

Code Block

	sql
	sql

SELECT key, value, GROUPING__ID,
  grouping(key, value),

...

 grouping(value, key),

...

 grouping(key), grouping(value),
  count(*)

...


FROM T1
GROUP BY key, value WITH ROLLUP;

This query will produce the following results.

Column 1 (key)	Column 2 (value)	GROUPING__ID	grouping(key, value)	grouping(value, key)	grouping(key)	grouping(value)	count(*)
NULL	NULL	3	3	3	1	1	6
1	NULL

0

1

0

1

0

2

0

1	2
1	NULL

1

0

1

0

2

0

1

0	1
1	1	0	0	0	0	0	1
2	NULL	1	1	2	0	1	1
2	2	0	0	0	0	0	1
3	NULL

0

1

0

1

0

2

0

1	2
3	NULL

1

0

1

0

2

0

1

0	1
3	3	0	0	0	0	0	1
4	NULL	1	1	2	0	1	1
4	5	0	0	0	0	0	1

Cubes and Rollups

The general syntax is WITH CUBE/ROLLUP. It is used with the GROUP BY only. CUBE creates a subtotal of all possible combinations of the set of column in its argument. Once we compute a CUBE on a set of dimension, we can get answer to all possible aggregation questions on those dimensions.

...

This parameter decides if hive should add an additional map-reduce job. If the grouping set
cardinality (4 in the example above), is more than this value, a new MR job is added under the
assumption that the orginal group by will reduce the data size.

Grouping__ID function (before Hive 2.3.0)

Grouping__ID function was fixed in Hive 2.3.0, thus behavior before that release is different (this is expected). For each column, the function would return a value of "0" iif that column has been aggregated in that row, otherwise the value is "1".

Hence the following query:

Code Block

	sql
	sql

SELECT key, value, GROUPING__ID, count(*)
FROM T1
GROUP BY key, value WITH ROLLUP;

will have the following results.

Column 1 (key)	Column 2 (value)	GROUPING__ID	count(*)
NULL	NULL	0	6
1	NULL	1	2
1	NULL	3	1
1	1	3	1
2	NULL	1	1
2	2	3	1
3	NULL	1	2
3	NULL	3	1
3	3	3	1
4	NULL	1	1
4	5	3	1

Note that the third column is a bitvector of columns being selected.
For the first row, none of the columns are being selected.
For the second row, only the first column is being selected, which explains the count of 2.
For the third row, both the columns are being selected (and the second column happens to be null), which explains the count of 1.

Space shortcuts

Child pages

Versions Compared

Old Version 32

New Version Current

Key

Grouping__ID function

Grouping function

Cubes and Rollups

Grouping__ID function (before Hive 2.3.0)