Page History

...

Column1 (key)	Column2 (value)
1	NULL
1	1
2	2
3	3
3	NULL
4	5

The following query:

Code Block

	sql
	sql

SELECT key, value, GROUPING__ID, count(*)

...


FROM T1
GROUP BY key, value WITH ROLLUP;

will have the following results.:

Column 1 (key)	Column 2 (value)	GROUPING__ID	count(*)
NULL	NULL	3	6
1	NULL	0	2
1	NULL	1	1
1	1	0	1
2	NULL	1	1
2	2	0	1
3	NULL	0	2
3	NULL	1	1
3	3	0	1
4	NULL	1	1
4	5	0	1

...

This parameter decides if hive should add an additional map-reduce job. If the grouping set
cardinality (4 in the example above), is more than this value, a new MR job is added under the
assumption that the orginal group by will reduce the data size.

Grouping__ID function (before Hive 2.3.0)

Grouping__ID function was fixed in Hive 2.3.0, thus behavior before that release is different (this is expected). For each column, the function would return a value of "0" iif that column has been aggregated in that row, otherwise the value is "1".

Hence the following query:

Code Block

	sql
	sql

SELECT key, value, GROUPING__ID, count(*)
FROM T1
GROUP BY key, value WITH ROLLUP;

will have the following results.

Column 1 (key)	Column 2 (value)	GROUPING__ID	count(*)
NULL	NULL	0	6
1	NULL	1	2
1	NULL	3	1
1	1	3	1
2	NULL	1	1
2	2	3	1
3	NULL	1	2
3	NULL	3	1
3	3	3	1
4	NULL	1	1
4	5	3	1

Note that the third column is a bitvector of columns being selected.
For the first row, none of the columns are being selected.
For the second row, only the first column is being selected, which explains the count of 2.
For the third row, both the columns are being selected (and the second column happens to be null), which explains the count of 1.

Space shortcuts

Child pages

Versions Compared

Old Version 33

New Version 34

Key

Grouping__ID function (before Hive 2.3.0)