Table of Contents |
---|
a
Sampling Syntax
Sampling Bucketized Table
...
In the following example the 3rd bucket out of the 32 buckets of the table source. 's' is the table alias.
Code Block |
---|
SELECT * FROM source TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s; |
Input pruning: Typically, TABLESAMPLE will scan the entire table and fetch the sample. But, that is not very efficient. Instead, the table can be created with a CLUSTERED BY clause which indicates the set of columns on which the table is hash-partitioned/clustered on. If the columns specified in the TABLESAMPLE clause match the columns in the CLUSTERED BY clause, TABLESAMPLE scans only the required hash-partitions of the table.
...
Code Block |
---|
TABLESAMPLE(BUCKET 3 OUT OF 16 ON id)
|
would pick out the 3rd and 19th clusters as each bucket would be composed of (32/16)=2 clusters.
...
Code Block |
---|
TABLESAMPLE(BUCKET 3 OUT OF 64 ON id)
|
would pick out half of the 3rd cluster as each bucket would be composed of (32/64)=1/2 of a cluster.
For information about creating bucketed tables with the CLUSTERED BY clause, see Create Table (especially Bucketed Sorted Tables) and Bucketed Tables.
Block Sampling
It is a feature that is still on trunk and is not yet in any release version.Block sampling is available starting with Hive 0.8. Addressed under JIRA - https://issues.apache.org/jira/browse/HIVE-2121
Code Block |
---|
block_sample: TABLESAMPLE (n PERCENT) |
...
In the following example the input size 0.1% or more will be used for the query.
Code Block |
---|
SELECT * FROM source TABLESAMPLE(0.1 PERCENT) s; |
Sometimes you want to sample the same data with different blocks, you can change this seed number:
Code Block |
---|
set hive.sample.seednumber=<INTEGER>;
|
Or user can specify total length to be read, but it has same limitation with PERCENT sampling. (As of Hive 0.10.0 - https://issues.apache.org/jira/browse/HIVE-3401)
Code Block |
---|
block_sample: TABLESAMPLE (ByteLengthLiteral)
ByteLengthLiteral : (Digit)+ ('b' | 'B' | 'k' | 'K' | 'm' | 'M' | 'g' | 'G')
|
In the following example the input size 100M or more will be used for the query.
Code Block |
---|
SELECT *
FROM source TABLESAMPLE(100M) s;
|
Hive also supports limiting input by row count basis, but it acts differently with above two. First, it does not need CombineHiveInputFormat which means this can be used with non-native tables. Second, the row count given by user is applied to each split. So total row count can be vary by number of input splits. (As of Hive 0.10.0 - https://issues.apache.org/jira/browse/HIVE-3401)
Code Block |
---|
block_sample: TABLESAMPLE (n ROWS)
|
For example, the following query will take the first 10 rows from each input split.
Code Block |
---|
SELECT * FROM source TABLESAMPLE(10 ROWS);
|