Introduction
Objectives
HCatMix is the performance testing framework for HCatalog.
The objectives are:
- Establish baseline performance numbers, in order to monitor changes in performance as new releases are made
- Find out the overhead of using HCatalog
- Understand the limitations of HCatalog (e.g. the number of parallel connections that can be handled by the HCatalog server, the number of partitions that can be added in a pig job, etc.)
In order to meet the above objectives, the following would be measured:
- Time taken for basic use case operations (create table, show database, etc.) with an increasing number of threads.
- The overhead of using the HCatalog loader/storer in pig over the default PigLoader/PigStorer, for various data sizes and numbers of partitions.
Touch points
The following are the touch points where HCatalog code gets executed when it is used from pig.
Other operations in pig don't trigger anything in HCatalog.
- Using the HCatStorer
- Using the HCatLoader
- Using a =filter by= clause on a partitioned table
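For illustration, here is a minimal pig script sketch that exercises all three touch points. The table names, the partition column =action=, and the =org.apache.hcatalog.pig= package prefix are assumptions for the example, not part of the test plan:

<verbatim>
-- Touch point: loading a table through HCatLoader.
views = LOAD 'default.page_views' USING org.apache.hcatalog.pig.HCatLoader();

-- Touch point: a filter by on a partition column of a partitioned table;
-- this is where HCatalog can prune the partitions that get read.
clicks = FILTER views BY action == 1;

-- Touch point: storing into a table through HCatStorer.
STORE clicks INTO 'default.page_views_clicks' USING org.apache.hcatalog.pig.HCatStorer();
</verbatim>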
Implementation
Test setup
Test setup needs to perform two tasks:
- Create HCatalog tables by providing the schema
- Generate data that conforms to the schema to be loaded in HCatalog
Both of these are driven by a configuration file. The following is an example of the setup XML file.
!HCatMix Test Setup Configuration
<verbatim>
<database>
  <tables>
    <table>
      <namePrefix>page_views_1brandnew</namePrefix>
      <dbName>default</dbName>
      <columns>
        <column>
          <name>user</name>
          <type>STRING</type>
          <avgLength>20</avgLength>
          <distribution>zipf</distribution>
          <cardinality>1000000</cardinality>
          <percentageNull>10</percentageNull>
        </column>
        <column>
          <name>timespent</name>
          <type>INT</type>
          <distribution>zipf</distribution>
          <cardinality>5000000</cardinality>
          <percentageNull>25</percentageNull>
        </column>
        <column>
          <name>query_term</name>
          <type>STRING</type>
          <avgLength>5</avgLength>
          <distribution>zipf</distribution>
          <cardinality>10000</cardinality>
          <percentageNull>0</percentageNull>
        </column>
        . . .
      </columns>
      <partitions>
        <partition>
          <name>timestamp</name>
          <type>INT</type>
          <distribution>zipf</distribution>
          <cardinality>1000</cardinality>
        </partition>
        <partition>
          <name>action</name>
          <type>INT</type>
          <distribution>zipf</distribution>
          <cardinality>8</cardinality>
        </partition>
        . . .
      </partitions>
      <instances>
        <instance>
          <size>1000000</size>
          <count>1</count>
        </instance>
        <instance>
          <size>100000</size>
          <count>1</count>
        </instance>
        . . .
      </instances>
    </table>
  </tables>
</database>
</verbatim>
A column/partition has the following details:
- =name=: Name of the column
- =type=: Type of the data (string/int/map etc.)
- =avgLength=: Average length, if the type is string
- =distribution=: Distribution type, either =uniform= or =zipf= (the latter generates data that follows Zipf's law, http://en.wikipedia.org/wiki/Zipf's_law)
- =cardinality=: Size of the sample space
- =percentageNull=: What percentage of the values should be null
The instances section defines how many instances of a table with the same specification are to be created, and the number of rows for each of them.
Test plan
- There would be four table sizes: 100MB, 5GB, 50GB, and 100GB. For each size there would be three tables: one with no partition column, one with 3 partition columns, and one with 5 partition columns. Each cell in the table below gives the number of tables for that combination:
| Data  | Number of partitions | No partition column | 3 partition columns | 5 partition columns |
| 100MB | 100                  | 1                   | 1                   | 1                   |
| 5GB   | 5k                   | 1                   | 1                   | 1                   |
| 50GB  | 50k                  | 1                   | 1                   | 1                   |
| 100GB | 100k                 | 1                   | 1                   | 1                   |
The following pig scripts would be executed and the time taken monitored. The scripts would have only load, store, and =filter by= operations:
1. Load data using PigStorage, then store to an HCat table using !HCatStorer(). Compare this to a pig script that does both the load and the store using PigStorage. This would give the overhead of !HCatStorer():
   - Use dynamic partitioning
   - Use static partitioning
1. Load data from an HCat table using !HCatLoader() and store it using PigStorage(). Compare this to a pig script that does both the load and the store using PigStorage. This would give the overhead of !HCatLoader().
1. Use =filter by= on one of the partition columns of the tables, and compare this with =filter by= on normal files (see the sketch after this list).
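For illustration, the following sketch shows the script pairs being compared. The paths, delimiter, schema, and partition values are all hypothetical, and in practice each comparison would be run as a separate script so that only one load/store pair is timed at a time:

<verbatim>
-- (1) Overhead of HCatStorer: the baseline loads and stores with PigStorage only.
raw = LOAD '/data/page_views' USING PigStorage(',')
    AS (user:chararray, timespent:int, query_term:chararray, timestamp:int, action:int);
STORE raw INTO '/data/page_views_out' USING PigStorage(',');

-- The candidate script does the same load but stores into an HCat table.
-- Dynamic partitioning: no constructor arguments; partition values come from the data.
STORE raw INTO 'default.page_views' USING org.apache.hcatalog.pig.HCatStorer();

-- Static partitioning: partition values are fixed in the constructor, so the
-- partition columns are projected out of the relation first.
no_parts = FOREACH raw GENERATE user, timespent, query_term;
STORE no_parts INTO 'default.page_views'
    USING org.apache.hcatalog.pig.HCatStorer('timestamp=100, action=1');

-- (2) Overhead of HCatLoader: load from the HCat table, store with PigStorage,
-- and compare against the pure PigStorage load/store above.
hcat_in = LOAD 'default.page_views' USING org.apache.hcatalog.pig.HCatLoader();
STORE hcat_in INTO '/data/page_views_copy' USING PigStorage(',');

-- (3) filter by on a partition column (prunes partitions via HCatalog)
-- vs. the same filter on plain files; each would be followed by a store.
part_filtered = FILTER hcat_in BY timestamp == 100;
file_filtered = FILTER raw BY timestamp == 100;
</verbatim>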
Each experiment would be repeated 10 times for all the mentioned tables, and the average time computed.
For the concurrency test, each client would repeatedly issue =show partitions=, and the number of such concurrent clients would be increased from 1 to 255 in steps of 5. The throughput would be recorded.
Questions
- Do these comparisons need to be done on RCFiles vs. text files?
- Do we need a clean setup for each of the experiments, or is generating the data once and then running the tests okay?