...
How to
...
Build
...
PXF uses Gradle. For simplicity we've encapsulated it with simple make commands: from the pxf directory, run make to build and test, or make unittest to run only the unit tests. Make sure Java is installed prior to building PXF.
Code Block
# Clone the hawq repository if you haven't previously done so
git clone https://git-wip-us.apache.org/repos/asf/hawq.git
# Head to PXF code
cd hawq/pxf
# Compile & Test PXF
make
# Run only the unit tests
make unittest
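To confirm Java is available before building, you can run a quick check, for example:
Code Block
# Verify a JDK is on the PATH (any JDK version supported by your HAWQ/PXF release)
java -version
echo $JAVA_HOME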
Setup Prerequisites
If you simply need to test PXF, you can do so using the Demo profile, which doesn't require any prerequisites, since all it does is test against static data served by PXF.
Setup Hadoop
Please follow the steps here: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
Note:
- You might need to build Hadoop from source on Red Hat/CentOS 6.x if the downloaded Hadoop package requires a higher glibc version. When that happens, you will probably see the following warning when running start-dfs.sh: "WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform"
- You will also need to set the port for fs.defaultFS to 8020 in etc/hadoop/core-site.xml (the example linked above uses 9000); see the snippet after these notes.
- HDFS is a must, but YARN is optional. YARN is only needed when you want to use it as the global resource manager.
- You must set up passphraseless ssh; otherwise "hawq init cluster" will run into problems in a later step.
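As a sketch of the two notes above, assuming a single-node deployment and that you run these commands from the Hadoop installation directory (adjust paths and hostnames to your environment):
Code Block
# Illustrative: point fs.defaultFS at port 8020 in etc/hadoop/core-site.xml
cat > etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>
EOF
# Illustrative: set up passphraseless ssh to localhost
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys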
Verify that your HDFS deployment works.
Code Block
# start HDFS
start-dfs.sh
# Do some basic tests to make sure HDFS works
echo "test data" >> ./localfile
hadoop fs -mkdir /test
hadoop fs -put ./localfile /test
hadoop fs -ls /
hadoop fs -get /test/localfile ./hdfsfile
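You can also confirm that the HDFS daemons are running; on a single-node setup, jps should list at least a NameNode and a DataNode:
Code Block
# List running Java processes (illustrative check)
jps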
Setup HAWQ (If you use HAWQ)
Please follow the steps here to Setup HAWQ
Setup Hive (If you need Hive)
Hive needs to be installed only if you wish to run HAWQ against Hive tables.
Download any Hive 1.x release from one of the Apache download mirrors: http://www.apache.org/dyn/closer.cgi/hive/
Code Block
# Extract the Hive tarball
tar -xzvf apache-hive-x.y.z-bin.tar.gz
# Set Hadoop path
export HADOOP_HOME=<your hadoop deployment location>
export HADOOP_USER_CLASSPATH_FIRST=true
# Set Hive path
cd apache-hive-x.y.z-bin
export HIVE_HOME=$PWD
# Create HDFS temp directory
$HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
# Create Hive Warehouse
$HADOOP_HOME/bin/hadoop fs -mkdir -p /user/hive/warehouse
$HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
# Set Metastore URI in $HIVE_HOME/conf/hive-site.xml
<configuration>
<property>
<name>hive.metastore.uris</name>
<value>thrift://localhost:9083</value>
</property>
</configuration>
# Start Hive Metastore as a background process
$HIVE_HOME/bin/hive --service metastore &
# Use Hive CLI/Shell
$HIVE_HOME/bin/hive
hive> CREATE TABLE hivetest (id INT, name STRING);
hive> SHOW TABLES;
hive> DESCRIBE hivetest;
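Optionally, you can confirm that the metastore is listening on the port configured above (9083). This is just a quick sanity check and assumes netstat is available on your system:
Code Block
# Illustrative: check that the Hive metastore is listening on port 9083
netstat -an | grep 9083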
Install PXF
Code Block
cd $CODE_BASE/pxf
# Set PXF home directory.
# GPHOME must have been set previously as per the HAWQ Build instructions
export PXF_HOME=$GPHOME/pxf
# If you wish to install PXF for GPDB, please refer to Build PXF for other databases section below instead.
make install
# This creates the necessary artifacts under PXF_HOME
PXF RPM (Optional)
If you prefer installing PXF via RPM on an environment that has the required RPM dependencies installed, you can do the following:
Code Block
# Create rpm artifacts
make rpm
# Install apache tomcat rpm
rpm -ivh distribution/apache-tomcat-*.rpm
# Install the pxf rpms
rpm -ivh build/distribution/pxf*
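To confirm the packages were installed, a quick query of the RPM database is enough (the grep patterns are illustrative; match them against the RPMs you just installed):
Code Block
# Illustrative: list installed Tomcat and PXF packages
rpm -qa | grep -i -e tomcat -e pxf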
Configure PXF
You will see the PXF configuration files in $PXF_HOME/conf.
Update the following files based on your environment and Hadoop directory layout; a sample pxf-env.sh sketch follows this list.
- pxf-env.sh
- Set LD_LIBRARY_PATH to ${HADOOP_HOME}/lib/native
- Set PXF_LOGDIR to ${PXF_HOME}/logs
- Set PXF_RUNDIR to ${PXF_HOME}
- Set PXF_USER to your username
- pxf-log4j.properties
- Set log4j.appender.ROLLINGFILE.File to the expanded path of $PXF_HOME/logs/pxf-service.log. (Don't use the environment variable in this file)
- pxf-private.classpath
- Update the library and configuration paths of Hadoop, Hive, PXF, etc. Use only absolute paths without referring to environment variables
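For illustration, a minimal pxf-env.sh following the conventions above might look like this (all values are examples; substitute your own username and deployment paths):
Code Block
# Example pxf-env.sh settings (illustrative values)
export LD_LIBRARY_PATH=${HADOOP_HOME}/lib/native
export PXF_LOGDIR=${PXF_HOME}/logs
export PXF_RUNDIR=${PXF_HOME}
export PXF_USER=gpadmin   # replace with your username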
Init/Start/Stop PXF
Code Block
# Deploy PXF
$PXF_HOME/bin/pxf init
# If you get an error "WARNING: instance already exists in ..." make sure you clean up pxf-service directory under $PXF_HOME/bin/pxf and rerun init
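# Illustrative cleanup for the warning above: remove the stale pxf-service directory noted in the
# message, then rerun init (verify the path printed by init before deleting anything):
# rm -rf <stale-pxf-service-directory>
# $PXF_HOME/bin/pxf init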
# Create PXF Log Dir
mkdir $PXF_HOME/logs
# Start PXF
$PXF_HOME/bin/pxf start
# Check Status
$PXF_HOME/bin/pxf status
# You can also check if the service is running by using the following request to check API version
curl "localhost:51200/pxf/ProtocolVersion"
# To stop PXF
$PXF_HOME/bin/pxf stop
## Note: If you see a failure
Test PXF
The steps below demonstrate accessing an HDFS file from HAWQ.
Code Block
# Create an HDFS directory for PXF example data files
$HADOOP_HOME/bin/hadoop fs -mkdir -p /data/pxf_examples
# Create a delimited plain text data file named pxf_hdfs_simple.txt:
echo 'Prague,Jan,101,4875.33' > /tmp/pxf_hdfs_simple.txt
echo 'Rome,Mar,87,1557.39' >> /tmp/pxf_hdfs_simple.txt
echo 'Bangalore,May,317,8936.99' >> /tmp/pxf_hdfs_simple.txt
echo 'Beijing,Jul,411,11600.67' >> /tmp/pxf_hdfs_simple.txt
# Add the data file to HDFS:
$HADOOP_HOME/bin/hadoop fs -put /tmp/pxf_hdfs_simple.txt /data/pxf_examples/
#Display the contents of the pxf_hdfs_simple.txt file stored in HDFS:
$HADOOP_HOME/bin/hadoop fs -cat /data/pxf_examples/pxf_hdfs_simple.txt
Now you can access the HDFS file from HAWQ using the HdfsTextSimple profile as shown below.
Code Block
postgres=# CREATE EXTERNAL TABLE pxf_hdfs_textsimple(location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://localhost:51200/data/pxf_examples/pxf_hdfs_simple.txt?PROFILE=HdfsTextSimple')
FORMAT 'TEXT' (delimiter=E',');
postgres=# SELECT * FROM pxf_hdfs_textsimple;
location | month | num_orders | total_sales
---------------+-------+------------+-------------
Prague | Jan | 101 | 4875.33
Rome | Mar | 87 | 1557.39
Bangalore | May | 317 | 8936.99
Beijing | Jul | 411 | 11600.67
(4 rows)
The steps below demonstrate accessing a Hive table from HAWQ.
Code Block
# Create a Hive table to expose our sample data set.
hive> CREATE TABLE sales_info (location string, month string,
number_of_orders int, total_sales double)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS textfile;
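# The pxf_hive_datafile.txt sample file is assumed to be comma-delimited, matching the table
# definition above; for illustration, it could be created from the shell like this:
echo 'Prague,Jan,101,4875.33' > /tmp/pxf_hive_datafile.txt
echo 'Rome,Mar,87,1557.39' >> /tmp/pxf_hive_datafile.txt
echo 'Bangalore,May,317,8936.99' >> /tmp/pxf_hive_datafile.txt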
# Load the pxf_hive_datafile.txt sample data file into the sales_info table you just created:
hive> LOAD DATA LOCAL INPATH '/tmp/pxf_hive_datafile.txt'
INTO TABLE sales_info;
# Perform a query from hive on sales_info to verify that the data was loaded successfully:
hive> SELECT * FROM sales_info;
# Query the table from HAWQ to access the hive table
postgres=# SELECT * FROM hcatalog.default.sales_info;
location | month | num_orders | total_sales
---------------+-------+------------+-------------
Prague | Jan | 101 | 4875.33
Rome | Mar | 87 | 1557.39
Bangalore | May | 317 | 8936.99
...
Build PXF for other databases
PXF can be deployed to different environments and for different databases. Thus it's convenient to tailor the PXF build for specific default configuration parameters, such as the default PXF user and the default log and run directories.
All supported database profiles are stored in hawq/pxf/gradle/profiles. By default, the HAWQ profile is used.
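You can inspect the available profiles directly; for example (the directory contents vary by release, so treat this only as a way to discover the valid DATABASE values):
Code Block
# List the available database profiles
ls hawq/pxf/gradle/profiles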
To build the PXF bundle for GPDB:
Code Block
make install DATABASE=gpdb
...