Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: minor edits & reformatting – hasNull flag and bloom filters

...

The goal of the column statistics is that for each column, the writer
records the count and depending on the type other useful fields. For
most of the primitive types, it records the minimum and maximum
values; and for numeric types it additionally stores the sum.
From Hive 1.1.0 onwards, the column statistics will also record if
there are any null value values within the row group by setting the hasNull flag.
The hasNull flag is used by ORC's predicate pushdown to better answer
'IS NULL' queries.

...

Info
titleVersion 1.2.0+: Bloom Filter

Bloom Filters are added to ORC indexes from Hive 1.2.0 onwards.
Predicate pushdown can make use of bloom filters to better prune
the row groups that do not satisfy the filter condition.

The bloom filter indexes consist of a BLOOM_FILTER stream for each
columns column specified through 'orc.bloom.filter.columns' table properties.
A BLOOM_FILTER stream records a bloom filter entry for each row
group
(default to 10,000 rows) in a column. Only the row groups that
qualifies qualify min/max row index evaluation will be evaluated against the
bloom filter index.

Each BloomFilterEntry stores the number of hash functions ('k') used and
the bitset backing the bloom filter. The bitset is serialized as repeated
longs from which the number of bits ('m') for the bloom filter can be derived.
m = bitset.length * 64.

...

Bloom filter internally uses two different hash functions to map a key
to a position in the bit set. For tinyint, smallint, int, bigint, float
and double types, Thomas Wang's 64-bit integer hash function is used.
Floats are converted to IEEE-754 32 bit representation
(using Java's Float.floatToIntBits(float)). Similary, Doubles are
converted to IEEE-754 64 bit representation (using Java's
Double.doubleToLongBits(double)). All these primitive types
are casted cast to long base type before passing it onto being passed on to the hash function.
For strings and binary types, Murmur3 64 bit hash algorithm is used.
The 64 bit variant of Murmur3 considers only the most significant
8 bytes of Murmur3 128-bit algorithm. The 64 bit hashcode generated
from the above algorithms is used as a base to derive 'k' different
hash functionfunctions. We use the idea mentioned in the paper "Less Hashing,
Same Performance: Building a Better Bloom Filter" by Kirsch et. al. to
quickly compute the k hashcodes.

The

...

algorithm

...

for

...

computing

...

k

...

hashcodes

...

and

...

setting

...

the

...

bit

...

position
in

...

a bloom

...

filter

...

is

...

as

...

follows:

...

  1. Get

...

  1. 64

...

  1. bit

...

  1. base

...

  1. hash

...

  1. code

...

  1. from

...

  1. Murmur3

...

  1. or

...

  1. Thomas

...

  1. Wang's

...

  1. hash

...

  1. algorithm.

...

  1. Split

...

  1. the

...

  1. above

...

  1. hashcode

...

  1. into

...

  1. two

...

  1. 32-bit

...

  1. hashcodes

...

  1. (say

...

  1. hash1

...

  1. and

...

  1. hash2).

...

  1. k'th

...

  1. hashcode

...

  1. is

...

  1. obtained

...

  1. by

...

  1. (where

...

  1. k

...

  1. >

...

  1. 0)

...

  1. :
      combinedHash = hash1 + (k * hash2)
  2. If combinedHash is negative flip all the bits:
      combinedHash = ~combinedHash
  3. Bit set position is obtained by performing modulo with m:
      position = combinedHash % m
  4. Set the position in bit set. The LSB 6 bits identifies the long index within bitset
    and bit position within the long uses little endian order.
      bitset[position >>> 6] |= (1L << position);

Bloom filter streams are interlaced with row group indexes. This placementplacement 
makes it convenient to read the bloom filter stream and row index stream
together in single read operation.