How to Build
PXF uses Gradle. For simplicity we've encapsulated it with simple make commands. Make sure Java is installed prior to building PXF.
# Clone hawq repository if you haven't previously done git clone https://git-wip-us.apache.org/repos/asf/incubator-hawq.git # Head to PXF code cd incubator-hawq/pxf # Compile & Test PXF make # Simply Run unittest make unittest
Setup Prerequisites
Setup HAWQ and Hadoop
PXF requires HAWQ and Hadoop. Please follow the steps here to Setup HAWQ and refer to Install Hadoop section to setup Hadoop.
Setup Hive
Hive needs to be installed only if you wish to run HAWQ against Hive tables
Download any hive 1.x release from one of the Apache download mirrors http://www.apache.org/dyn/closer.cgi/hive/
# Extract Hive tarbal tar -xzvf apache-hive-x.y.z-bin.tar.gz # Set Hadoop path export HADOOP_HOME=<your hadoop deployment location> export HADOOP_USER_CLASSPATH_FIRST=true # Set Hive path cd apache-hive-x.y.z-bin export HIVE_HOME=$PWD # Create HDFS temp directory $HADOOP_HOME/bin/hadoop fs -mkdir /tmp $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp # Create Hive Warehouse $HADOOP_HOME/bin/hadoop fs -mkdir -p /user/hive/warehouse $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse # Set Metastore URI in $HIVE_HOME/conf/hive-site.xml <configuration> <property> <name>hive.metastore.uris</name> <value>thrift://localhost:9083</value> </property> </configuration> # Start Hive Metastore as a background process $HIVE_HOME/bin/hive --service metastore & # Use Hive CLI/Shell $HIVE_HOME/bin/hive hive> CREATE TABLE hivetest (id INT, name STRING); hive> SHOW TABLES; hive> DESCRIBE hivetest;
Install PXF
cd $CODE_BASE/pxf # Set PXF home directory. # GPHOME must have been set previously as per the HAWQ Build instructions export PXF_HOME=$GPHOME/pxf # Install PXF make install # This would create the necessary artifacts under PXF_HOME
Configure PXF
You will see the PXF configuration files in $PXF_HOME/conf
Update the following files based on your environment and hadoop directly layout.
- pxf-env.sh
- Set LD_LIBRARY_PATH to ${HADOOP_HOME}/lib/native
- Set PXF_LOGDIR to ${PXF_HOME}/logs
- Set PXF_RUNDIR to ${PXF_HOME}
- Set PXF_USER to your username
- pxf-log4j.properties
- Set log4j.appender.ROLLINGFILE.File to the expanded path of $PXF_HOME/logs/pxf-service.log. (Don't use the environment variable in this file)
- pxf-private.classpath
- Update the library and configuration paths of hadoop,hive,pxf, etc. Use only absolute paths without referring to environment variables
- Update the library and configuration paths of hadoop,hive,pxf, etc. Use only absolute paths without referring to environment variables
Init/Start/Stop PXF
# Deploy PXF $PXF_HOME/bin/pxf init # If you get an error "WARNING: instance already exists in ..." make sure you clean up pxf-service directory under $PXF_HOME/bin/pxf and rerun init # Create PXF Log Dir mkdir $PXF_HOME/logs # Start PXF $PXF_HOME/bin/pxf start # Check Status $PXF_HOME/bin/pxf status # You can also check if the service is running by using the following request to check API version curl "localhost:51200/pxf/ProtocolVersion" # To stop PXF $PXF_HOME/bin/pxf stop ## Note: If you see a failure
Test PXF
Below are steps which demonstrates accessing a HDFS file from HAWQ.
# Create an HDFS directory for PXF example data files $HADOOP_HOME/bin/hadoop fs -mkdir -p /data/pxf_examples # Create a delimited plain text data file named pxf_hdfs_simple.txt: echo 'Prague,Jan,101,4875.33 Rome,Mar,87,1557.39 Bangalore,May,317,8936.99 Beijing,Jul,411,11600.67' > /tmp/pxf_hdfs_simple.txt # Add the data file to HDFS: $HADOOP_HOME/bin/hadoop -put /tmp/pxf_hdfs_simple.txt /data/pxf_examples/ #Display the contents of the pxf_hdfs_simple.txt file stored in HDFS: $HADOOP_HOME/bin/hadoop -cat /data/pxf_examples/pxf_hdfs_simple.txt
Now you can access the hdfs file from HAWQ using the HdfsTextSimple profile as shown below.
postgres=# CREATE EXTERNAL TABLE pxf_hdfs_textsimple(location text, month text, num_orders int, total_sales float8) LOCATION ('pxf://localhost:51200/data/pxf_examples/pxf_hdfs_simple.txt?PROFILE=HdfsTextSimple') FORMAT 'TEXT' (delimiter=E','); postgres=# SELECT * FROM pxf_hdfs_textsimple; location | month | num_orders | total_sales ---------------+-------+------------+------------- Prague | Jan | 101 | 4875.33 Rome | Mar | 87 | 1557.39 Bangalore | May | 317 | 8936.99 Beijing | Jul | 411 | 11600.67 (4 rows)
Below are steps which demonstrates accessing a Hive table from HAWQ
# Create a Hive table to expose our sample data set. hive> CREATE TABLE sales_info (location string, month string, number_of_orders int, total_sales double) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS textfile; # Load the pxf_hive_datafile.txt sample data file into the sales_info table you just created: hive> LOAD DATA LOCAL INPATH '/tmp/pxf_hive_datafile.txt' INTO TABLE sales_info; # Perform a query from hive on sales_info to verify that the data was loaded successfully: hive> SELECT * FROM sales_info; # Query the table from HAWQ to access the hive table postgres=# SELECT * FROM hcatalog.default.sales_info location | month | num_orders | total_sales ---------------+-------+------------+------------- Prague | Jan | 101 | 4875.33 Rome | Mar | 87 | 1557.39 Bangalore | May | 317 | 8936.99 ...