Introduction
Objectives
HCatMix is the performance testing framework for HCatalog.
The objectives are:
- Establish baseline performance numbers, in order to monitor changes in performance as new releases are made
- Find out the overhead of using HCatalog
- Understand the limitations of HCatalog (e.g. the number of parallel connections that can be handled by the HCatalog server, the number of partitions that can be added in a pig job, etc.)
In order to meet the above objectives, the following would be measured:
- Time taken for basic use case operations (create table, show database, etc.) with an increasing number of threads.
- The overhead of using the HCatalog loader/storer in pig over the default PigLoader/PigStorer, for various data sizes and numbers of partitions.
Touch points
The following are the touch points where HCatalog code gets executed when it is used from pig.
Other operations in pig don't trigger anything in HCatalog.
- Using the HCatStorer
- Using the HCatLoader
- Using a =filter by= clause on a partitioned table
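For illustration, here is a minimal pig script sketch that exercises all three touch points. The table names, the partition column =action=, and the =org.apache.hcatalog.pig= package prefix are assumptions for the example, not part of the test plan:

<verbatim>
-- Touch point: loading a table through HCatLoader.
views = LOAD 'default.page_views' USING org.apache.hcatalog.pig.HCatLoader();

-- Touch point: a filter by on a partition column of a partitioned table;
-- this is where HCatalog can prune the partitions that get read.
clicks = FILTER views BY action == 1;

-- Touch point: storing into a table through HCatStorer.
STORE clicks INTO 'default.page_views_clicks' USING org.apache.hcatalog.pig.HCatStorer();
</verbatim>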
Implementation
Test setup
Test setup needs to perform two tasks:
- Create HCatalog tables by providing the schema
- Generate data that conforms to the schema to be loaded in HCatalog
Both of these are driven by a configuration file. The following is an example of the setup XML file.
!HCatMix Test Setup Configuration
<verbatim>
<database>
  <tables>
    <table>
      <namePrefix>page_views_1brandnew</namePrefix>
      <dbName>default</dbName>
      <columns>
        <column>
          <name>user</name>
          <type>STRING</type>
          <avgLength>20</avgLength>
          <distribution>zipf</distribution>
          <cardinality>1000000</cardinality>
          <percentageNull>10</percentageNull>
        </column>
        <column>
          <name>timespent</name>
          <type>INT</type>
          <distribution>zipf</distribution>
          <cardinality>5000000</cardinality>
          <percentageNull>25</percentageNull>
        </column>
        <column>
          <name>query_term</name>
          <type>STRING</type>
          <avgLength>5</avgLength>
          <distribution>zipf</distribution>
          <cardinality>10000</cardinality>
          <percentageNull>0</percentageNull>
        </column>
        . . .
      </columns>
      <partitions>
        <partition>
          <name>timestamp</name>
          <type>INT</type>
          <distribution>zipf</distribution>
          <cardinality>1000</cardinality>
        </partition>
        <partition>
          <name>action</name>
          <type>INT</type>
          <distribution>zipf</distribution>
          <cardinality>8</cardinality>
        </partition>
        . . .
      </partitions>
      <instances>
        <instance>
          <size>1000000</size>
          <count>1</count>
        </instance>
        <instance>
          <size>100000</size>
          <count>1</count>
        </instance>
        . . .
      </instances>
    </table>
  </tables>
</database>
</verbatim>
A column/partition has the following details:
- =name=: Name of the column
- =type=: Type of the data (string/int/map etc.)
- =avgLength=: Average length, if the type is string
- =distribution=: Distribution type, either =uniform= or =zipf= (the latter generates data that follows Zipf's law, http://en.wikipedia.org/wiki/Zipf's_law)
- =cardinality=: Size of the sample space
- =percentageNull=: What percentage of the values should be null
The instances section defines how many instances of a table with the same specification are to be created, and the number of rows for each of them.
Test plan
- There would be four table sizes: 100MB, 5GB, 50GB, and 100GB. For each size there would be three tables: one with no partition column, one with 3 partition columns, and one with 5 partition columns. Each cell in the table below gives the number of tables for that combination:
| Data  | Number of partitions | No partition column | 3 partition columns | 5 partition columns |
| 100MB | 100                  | 1                   | 1                   | 1                   |
| 5GB   | 5k                   | 1                   | 1                   | 1                   |
| 50GB  | 50k                  | 1                   | 1                   | 1                   |
| 100GB | 100k                 | 1                   | 1                   | 1                   |
The following pig scripts would be executed and the time taken monitored. The scripts would have only load, store, and =filter by= operations:
1. Load data using PigStorage, then store to an HCat table using !HCatStorer(). Compare this to a pig script that does both the load and the store using PigStorage. This would give the overhead of !HCatStorer():
   - Use dynamic partitioning
   - Use static partitioning
1. Load data from an HCat table using !HCatLoader() and store it using PigStorage(). Compare this to a pig script that does both the load and the store using PigStorage. This would give the overhead of !HCatLoader().
1. Use =filter by= on one of the partition columns of the tables, and compare this with =filter by= on normal files (see the sketch after this list).
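For illustration, the following sketch shows the script pairs being compared. The paths, delimiter, schema, and partition values are all hypothetical, and in practice each comparison would be run as a separate script so that only one load/store pair is timed at a time:

<verbatim>
-- (1) Overhead of HCatStorer: the baseline loads and stores with PigStorage only.
raw = LOAD '/data/page_views' USING PigStorage(',')
    AS (user:chararray, timespent:int, query_term:chararray, timestamp:int, action:int);
STORE raw INTO '/data/page_views_out' USING PigStorage(',');

-- The candidate script does the same load but stores into an HCat table.
-- Dynamic partitioning: no constructor arguments; partition values come from the data.
STORE raw INTO 'default.page_views' USING org.apache.hcatalog.pig.HCatStorer();

-- Static partitioning: partition values are fixed in the constructor, so the
-- partition columns are projected out of the relation first.
no_parts = FOREACH raw GENERATE user, timespent, query_term;
STORE no_parts INTO 'default.page_views'
    USING org.apache.hcatalog.pig.HCatStorer('timestamp=100, action=1');

-- (2) Overhead of HCatLoader: load from the HCat table, store with PigStorage,
-- and compare against the pure PigStorage load/store above.
hcat_in = LOAD 'default.page_views' USING org.apache.hcatalog.pig.HCatLoader();
STORE hcat_in INTO '/data/page_views_copy' USING PigStorage(',');

-- (3) filter by on a partition column (prunes partitions via HCatalog)
-- vs. the same filter on plain files; each would be followed by a store.
part_filtered = FILTER hcat_in BY timestamp == 100;
file_filtered = FILTER raw BY timestamp == 100;
</verbatim>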
Each experiment would be repeated 10 times for all the mentioned tables, and the average time computed.
For the concurrency test, each client would repeatedly issue =show partitions=, and the number of such concurrent clients would be increased from 1 to 255 in steps of 5. The throughput would be recorded.
Questions
- Do these comparisons need to be done on RCFiles vs. text files?
- Do we need a clean setup for each of the experiments, or is generating the data once and then running the tests okay?