...
- Establish baseline performance numbers, in order to monitor changes in performance as new releases are made
- Find out the overhead of using HCatalog
- Understand the limitations of HCatalog (e.g. the number of parallel connections the HCatalog server can handle, the number of partitions that can be added in a Pig job, etc.)

In order to meet the above objectives, the following would be measured:

- Time taken for basic use-case operations (create table, get table, show databases, list partitions, etc.) with an increasing number of concurrent clients.
- The overhead of using the HCatalog loader/storer in Pig over the default `PigStorage` load/store, for various data sizes and numbers of partitions.
Touch points
...
...
- Using the `HCatStorer`
- Using the `HCatLoader`, including the `filter by` clause on a partitioned table
Implementation
Test setup
Test setup needs to perform two tasks:
...
Both of these are driven by a configuration file. The following is an example of a setup XML file.
...
HCatMix Test Setup Configuration
```xml
<database>
  <tables>
    <table>
      <namePrefix>page_views_1brandnew</namePrefix>
      <dbName>default</dbName>
      <columns>
        <column>
          <name>user</name>
          <type>STRING</type>
          <avgLength>20</avgLength>
          <distribution>zipf</distribution>
          <cardinality>1000000</cardinality>
          <percentageNull>10</percentageNull>
        </column>
        <column>
          <name>timespent</name>
          <type>INT</type>
          <distribution>zipf</distribution>
          <cardinality>5000000</cardinality>
          <percentageNull>25</percentageNull>
        </column>
        <column>
          <name>query_term</name>
          <type>STRING</type>
          <avgLength>5</avgLength>
          <distribution>zipf</distribution>
          <cardinality>10000</cardinality>
          <percentageNull>0</percentageNull>
        </column>
        . . .
      </columns>
      <partitions>
        <partition>
          <name>timestamp</name>
          <type>INT</type>
          <distribution>zipf</distribution>
          <cardinality>1000</cardinality>
        </partition>
        <partition>
          <name>action</name>
          <type>INT</type>
          <distribution>zipf</distribution>
          <cardinality>8</cardinality>
        </partition>
        . . .
      </partitions>
      <instances>
        <instance>
          <size>1000000</size>
          <count>1</count>
        </instance>
        <instance>
          <size>100000</size>
          <count>1</count>
        </instance>
        . . .
      </instances>
    </table>
  </tables>
</database>
```
...
- `name`: name of the column
- `type`: type of the data (string/int/map etc.)
- `avgLength`: average length, if the type is string
- `distribution`: distribution type, either `uniform` or `zipf` (to generate data that follows Zipf's law, http://en.wikipedia.org/wiki/Zipf's_law)
- `cardinality`: size of the sample space
- `percentageNull`: what percentage of the values should be null
The `instances` section defines how many instances of a table with the same specification are to be created, and the number of rows in each of them.
...
Tests
Loader/Storer tests
Description
- There would be tables of four sizes: 100MB, 5GB, 50GB and 100GB. For each size there would be three tables: one with no partition columns, one with 3 partition columns and one with 5 partition columns. There are four Loader/Storer tests to find out the overhead of using `HCatLoader()` and `HCatStorer()` for the same data (a Pig Latin sketch of these combinations follows the data table below):
  - Use `PigStorage` to load and `HCatStorer` to store
  - Use `HCatLoader` to load and `HCatStorer` to store
  - Use `HCatLoader` to load and `PigStorage` to store
  - Use `PigStorage` to load and `PigStorage` to store
- The tests are done on the following data:
| Input Data Size | Number of partitions | Tables with no partition columns | Tables with 3 partition columns | Tables with 5 partition columns |
|---|---|---|---|---|
| 100MB | 100 | 1 | 1 | 1 |
| 5GB | 5k | 1 | 1 | 1 |
| 50GB | 50k | 1 | 1 | 1 |
| 100GB | 100k | 1 | 1 | 1 |
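A minimal sketch of the four load/store combinations in Pig Latin is shown below. It assumes a pre-created HCatalog table (called `page_views_part` here, partitioned on a string column `datestamp`) and an HDFS input file; all table, column, and path names are illustrative and not taken from the test suite.

```pig
-- 1. PigStorage load -> HCatStorer store (static partitioning: the partition value is
--    passed to the storer; the target table must already exist in the metastore)
raw = LOAD '/user/hcatmix/page_views.txt' USING PigStorage('\t')
      AS (user:chararray, timespent:int, query_term:chararray);
STORE raw INTO 'page_views_part' USING org.apache.hcatalog.pig.HCatStorer('datestamp=20120101');

-- 2. HCatLoader load -> HCatStorer store (no partition spec, i.e. dynamic partitioning:
--    partition values come from the data; the target table is assumed to have the same schema)
src = LOAD 'page_views_part' USING org.apache.hcatalog.pig.HCatLoader();
STORE src INTO 'page_views_copy' USING org.apache.hcatalog.pig.HCatStorer();

-- 3. HCatLoader load -> PigStorage store
src2 = LOAD 'page_views_part' USING org.apache.hcatalog.pig.HCatLoader();
STORE src2 INTO '/user/hcatmix/out/hcat_to_pig' USING PigStorage('\t');

-- 4. Baseline: PigStorage load -> PigStorage store
base = LOAD '/user/hcatmix/page_views.txt' USING PigStorage('\t')
       AS (user:chararray, timespent:int, query_term:chararray);
STORE base INTO '/user/hcatmix/out/pig_to_pig' USING PigStorage('\t');
```

Comparing the run times of combinations 1, 2 and 3 against baseline 4 isolates the cost of `HCatStorer`, of both together, and of `HCatLoader` respectively.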
The following Pig scripts would be executed and the time taken monitored. The Pig scripts would have only load, store and `filter by` operations:
1. Load data using `PigStorage`, then store to an HCatalog table using `HCatStorer()`. Compare this to a Pig script that does both load and store using `PigStorage`. This would give the overhead of `HCatStorer()`.
   1. Use dynamic partitioning
   1. Use static partitioning
1. Load data from an HCatalog table using `HCatLoader()` and store it using `PigStorage()`. Compare this to a Pig script that does both load and store using `PigStorage`. This would give the overhead of `HCatLoader()`.
1. Use `filter by` on one of the partition columns of the tables, and compare this with `filter by` on normal files (see the sketch below).
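A sketch of the `filter by` comparison in item 3, with the same illustrative names as above:

```pig
-- filter by on a partition column through HCatLoader: the filter on datestamp can be
-- pushed down so that only the matching partitions need to be read
part   = LOAD 'page_views_part' USING org.apache.hcatalog.pig.HCatLoader();
part_f = FILTER part BY datestamp == '20120101';
STORE part_f INTO '/user/hcatmix/out/filter_hcat' USING PigStorage('\t');

-- the same filter by on normal files loaded with PigStorage: every record must be read
raw   = LOAD '/user/hcatmix/page_views.txt' USING PigStorage('\t')
        AS (user:chararray, timespent:int, query_term:chararray, datestamp:chararray);
raw_f = FILTER raw BY datestamp == '20120101';
STORE raw_f INTO '/user/hcatmix/out/filter_pig' USING PigStorage('\t');
```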
Repeat the experiment for all the mentioned tables 10 times each and find the average time.
For the concurrency test, a client would issue `show partitions` repeatedly, and the number of such concurrent clients would be increased from 1 to 255 in steps of 5. The throughput would be recorded.
Questions
...
| Data Size | Number of Partitions |
|---|---|
| 105MB | 0, 300, 600, 900, 1200, 1500, 2000 |
| 1GB | 0, 300 |
| 10GB | 0, 300 |
| 100GB | 0, 300 |
These tests are driven by configuration, and new tests can be added by dropping in new configuration files.
How to run:
- Run the whole test suite: the following will run for all the config files specified in `hcatmix/src/test/resources/hcatmix_load_store_tests.yml`
  `mvn test -Dtest=TestLoadStoreScripts -Phadoop20`
- Run individual tests:
  `mvn test -Dtest=TestLoadStoreScripts -DhcatSpecFile=src/test/resources/performance/100GB_300_parititons.xml -DnumRuns=1 -DnumDataGenMappers=30 -Phadoop20`
  - This will run the load/store test for the file `src/test/resources/performance/100GB_300_parititons.xml` only
  - It will run all four load/store combinations only 1 time
  - While generating data, 30 mappers are used. This can be increased to reflect the number of mappers available on your cluster, to reduce the time taken to generate test data.
- The tests can be run for Hadoop 0.20 or 0.23 based on the Maven profile `hadoop20` or `hadoop23`
- Results (HTML with graphs, and JSON) are in the `target/results` directory
Load Tests
Description
The Hadoop MapReduce framework itself has been used to do the concurrency test: the map phase increases the number of concurrent clients over time and keeps generating statistics every minute. The reduce phase aggregates the statistics of all the maps and outputs them as a function of the increasing number of concurrent clients. Given that MapReduce is used, this tool can scale to any number of parallel clients required for the concurrency test.
Concurrency tests are done for the following API calls:
- List partition
- Get Table
- Add Partition
- List Partition/Add Partition/Get Table together
The test is defined in a properties file:

```properties
# For the following example the number of threads will increase from
# 80 to 2000 over a period of 25 minutes. T25 = 4*20 + (25 - 1)*4*20 = 2000
# The comma-separated task classes which contain the getTable() call
task.class.names=org.apache.hcatalog.hcatmix.load.tasks.HCatGetTable
# The number of map tasks to run
num.mappers=20
# How many threads to add at the end of each fixed interval
thread.increment.count=4
# The interval at which the number of threads is increased
thread.increment.interval.minutes=1
# For how long the map would run
map.runtime.minutes=25
# Extra wait time to let the individual tasks finish
thread.completion.buffer.minutes=1
# The interval at which statistics would be collected
stat.collection.interval.minutes=1
# Input directory where dummy files are created to control the number of mappers
input.dir=/tmp/hcatmix/loadtest/input
# The location where the collected statistics would be stored
output.dir=/tmp/hcatmix/loadtest/output
```
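To spell out the ramp implied by these settings: each of the 20 mappers starts 4 client threads and adds 4 more every minute, so across the cluster the load starts at 4 × 20 = 80 concurrent clients and grows by 80 per minute, reaching 80 + (25 - 1) × 80 = 2000 clients by the end of the 25-minute map run (the T25 formula in the comment above).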
More concurrency tests can be added by adding configuration files and adding a class that implements the `Task` interface.
How to run
- For running the whole suite:
  `mvn test -Dtest=TestHCatalogLoad -DloadTestConfFile=src/main/resources/load/hcat_get_table_load_test.properties -Phadoop20`
- For running one test only:
  `mvn test -Dtest=TestHCatalogLoad -DloadTestConfFile=src/main/resources/load/hcat_get_table_load_test.properties -Phadoop20`
- Results: HTML pages with graphs
Prerequisites
The following environment variables need to be defined:
- `HADOOP_HOME`
- `HCAT_HOME`
- `HADOOP_CONF_DIR`