Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

In the initial implementation, combiner will be supported only when all projections are either expressions on the group column or expressions on algebraic UDFs. This is because column pruning does not currently discard unused columns within a grouped-bag, and in such cases there will not be data size reduction happening because of the use of in-map combiner.

For the query -

Code Block

l = load 'x'   as (a,b,c); 
g = group l by a;
f = foreach g generate  group, COUNT(l.b);

The existing plan -

Code Block

Map Plan
g: Local Rearrange[tuple]{bytearray}(false) - scope-73
|   |
|   Project[bytearray][0] - scope-74
|
|---f: New For Each(false,false)[bag] - scope-61
    |   |
    |   Project[bytearray][0] - scope-62
    |   |
    |   POUserFunc(org.apache.pig.builtin.COUNT$Initial)[tuple] - scope-63
    |   |
    |   |---Project[bag][1] - scope-64
    |       |
    |       |---Project[bag][1] - scope-65
    |
    |---Pre Combiner Local Rearrange[tuple]{Unknown} - scope-75
        |
        |---l: New For Each(false,false,false)[bag] - scope-47

...

2. Memory Management
Similar to the current strategy of InternalCachedBag, find the average size of the entries and estimate the size held by current number of entries. If the size exceeds the internal cached bag size threshold, it will write a portion of the hashmap entries to output.
In case of multi-query, there will be multiple such bags, and the memory limit will be shared equally between them.