Apache Knox Gateway - Getting Started

This guide describes the steps required to install, deploy and validate the Apache Knox Gateway.

Requirements

The following prerequisites must be installed to successfully complete the steps described in this guide.

Java

Java 1.6 or later is required. Use the command below to check the version of Java installed.

java -version

Hadoop

A local installation of a Hadoop cluster is required at this time. Hadoop EC2 cluster and/or Sandbox installations are currently difficult
to access remotely via the Gateway because the Hadoop services in those environments run with internal IP addresses. For the
Gateway to work in these cases it must, at this time, be deployed on the EC2 cluster or Sandbox itself.

The instructions that follow assume that the Gateway is not co-located with the Hadoop clusters themselves and (most importantly) that the
hostnames and IP addresses of the cluster services are accessible from wherever the Gateway happens to be running.

Ensure that the Hadoop cluster has WebHDFS, WebHCat (i.e. Templeton) and Oozie configured, deployed and running.
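
If you want to verify that these services are reachable before involving the Gateway, the checks below can be run with curl directly against the cluster. The hostnames are placeholders and the ports are the common defaults noted later in this guide; adjust both to match your cluster.

curl -i 'http://{namenode-host}:50070/webhdfs/v1/?op=LISTSTATUS&user.name=hdfs'
curl -i 'http://{templeton-host}:50111/templeton/v1/status'
curl -i 'http://{oozie-host}:11000/oozie/v1/admin/status'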

This release of the Apache Knox Gateway has been tested against the Hortonworks Sandbox 1.2 with these changes.

Installation

1. Download and extract the distribution ZIP

Download and extract the knox-{VERSION}.zip file into the installation directory that will contain your {GATEWAY_HOME}.
You can find the downloads for Knox releases on the Apache mirrors.

jar xf knox-{VERSION}.zip

This will create a directory knox-{VERSION} in your current directory.
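
If the jar command is not on your path, the standard unzip utility achieves the same result:

unzip knox-{VERSION}.zip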

2. Enter the {GATEWAY_HOME} directory

cd knox-{VERSION}

The fully qualified name of this directory will be referenced as {GATEWAY_HOME} throughout the remainder of this document.

3. Start the demo LDAP server (ApacheDS)

First, understand that the LDAP server provided here is for demonstration purposes only. To use a different LDAP instance, configure the LDAP
specifics within the topology descriptor for the cluster as described in step 5 below. The assumption is that most users will leverage the demo
LDAP server while evaluating this release and should therefore continue with the instructions here in step 3.

Edit {GATEWAY_HOME}/conf/users.ldif if required and add your users and groups to the file. A sample end user
"bob" has already been included. Note that the passwords in this file are "fictitious" and have nothing to do with
the actual accounts on the Hadoop cluster you are using. There is also a copy of this file in the templates directory that you can use to start
over if necessary.
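
As an illustration, an additional end user entry might look like the sketch below. The DN, base DN and attribute set are assumptions made for the sake of the example; mirror the structure of the entries already present in users.ldif rather than copying this sketch verbatim.

# hypothetical additional user - mirror the attributes of the existing entries
dn: uid=alice,ou=people,dc=hadoop,dc=apache,dc=org
objectclass: top
objectclass: person
objectclass: organizationalPerson
objectclass: inetOrgPerson
cn: Alice
sn: Alice
uid: alice
userPassword: alice-password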

Start the LDAP server, pointing it at the conf directory where it will find the users.ldif file.

java -jar bin/ldap.jar conf &

There are a number of log messages of the form Created null. that can safely be ignored. Take note of the port on which it was started as this needs to match later configuration.
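
If OpenLDAP's ldapsearch client is available, a quick bind and search will confirm the demo server is up. The port, bind DN and base DN below are assumptions based on the demo defaults and the sample users.ldif; adjust them if you changed either.

ldapsearch -h localhost -p 33389 \
    -D 'uid=bob,ou=people,dc=hadoop,dc=apache,dc=org' -w bob-password \
    -b 'dc=hadoop,dc=apache,dc=org' '(uid=bob)'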

4. Start the Gateway server

java -jar bin/server.jar

Take note of the port identified in the logging output as you will need this for accessing the gateway.

The server will prompt you for the master secret (password). This secret is used to protect artifacts used by the gateway server for things like SSL and credential/password aliasing. This secret will have to be entered at startup unless you choose to persist it. Remember this secret and keep it safe. It represents the keys to the kingdom. See the Persisting the Master Secret section for more information.

5. Configure the Gateway with the topology of your Hadoop cluster

Edit the file {GATEWAY_HOME}/deployments/sample.xml

Change the host and port in the urls of the <service> elements for NAMENODE, TEMPLETON and OOZIE services to match your Hadoop cluster
deployment.

The default configuration contains the LDAP URL for an LDAP server. Out of the box, the file is configured to access the demo ApacheDS based LDAP
server with its default configuration, which listens on port 33389. Optionally, you can change the LDAP URL to point at the LDAP server to be used for authentication. This is set via the main.ldapRealm.contextFactory.url property in the <gateway><provider><authentication> section.
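
The fragments below illustrate the two kinds of edits described above. They are sketches only; take the exact element nesting and the real host names from the sample.xml shipped with the release rather than from here.

<!-- authentication provider parameter in the <gateway><provider> section -->
<param>
    <name>main.ldapRealm.contextFactory.url</name>
    <value>ldap://localhost:33389</value>
</param>

<!-- one <service> element per Hadoop service; host and port are placeholders -->
<service>
    <role>NAMENODE</role>
    <url>http://{namenode-host}:50070/webhdfs/v1</url>
</service>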

Save the file. The Gateway server monitors the {GATEWAY_HOME}/deployments directory and reacts to the discovery of a new or changed cluster topology descriptor by provisioning the endpoints and required filter chains to serve the needs of each cluster as described by the topology file. Note that the name of the file, excluding the extension, is also used as the path for that cluster in the URL. So, for example,
the sample.xml file will result in Gateway URLs of the form http://{gateway-host}:{gateway-port}/gateway/sample/namenode/api/v1

6. Test the installation and configuration of your Gateway

Invoke the LISTSTATUS operation on HDFS represented by your configured NAMENODE by using your web browser or curl:

curl -i -k -u bob:bob-password -X GET \
    'https://localhost:8443/gateway/sample/namenode/api/v1/?op=LISTSTATUS'

The above command should produce output along the lines of that shown below. The exact information returned will depend on the contents of HDFS in your Hadoop cluster.

HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 760
Server: Jetty(6.1.26)

{"FileStatuses":{"FileStatus":[
{"accessTime":0,"blockSize":0,"group":"hdfs","length":0,"modificationTime":1350595859762,"owner":"hdfs","pathSuffix":"apps","permission":"755","replication":0,"type":"DIRECTORY"},
{"accessTime":0,"blockSize":0,"group":"mapred","length":0,"modificationTime":1350595874024,"owner":"mapred","pathSuffix":"mapred","permission":"755","replication":0,"type":"DIRECTORY"},
{"accessTime":0,"blockSize":0,"group":"hdfs","length":0,"modificationTime":1350596040075,"owner":"hdfs","pathSuffix":"tmp","permission":"777","replication":0,"type":"DIRECTORY"},
{"accessTime":0,"blockSize":0,"group":"hdfs","length":0,"modificationTime":1350595857178,"owner":"hdfs","pathSuffix":"user","permission":"755","replication":0,"type":"DIRECTORY"}
]}}

For additional information on the WebHDFS, Templeton/WebHCat and Oozie REST APIs, refer to the documentation for the respective Apache projects.

Examples

More examples can be found here.
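
As additional quick checks through the Gateway, the curl invocations below exercise the Templeton and Oozie mappings described later in this document. They are illustrative sketches that assume the same sample topology, user and port used in step 6; the trailing paths are the standard WebHCat and Oozie status endpoints.

curl -i -k -u bob:bob-password -X GET \
    'https://localhost:8443/gateway/sample/templeton/api/v1/status'

curl -i -k -u bob:bob-password -X GET \
    'https://localhost:8443/gateway/sample/oozie/api/v1/admin/status'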

Persisting the Master Secret

The master secret is required to start the server. The gateway instance uses this secret to access its secured artifacts. Keystores, trust stores and credential stores are all protected with the master secret.

You may persist the master secret by supplying the -persist-master switch at startup. This will result in a warning indicating that persisting the secret is less secure than providing it at startup. We do make some provisions in order to protect the persisted password.
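
For example, to persist the master secret at first startup:

java -jar bin/server.jar -persist-master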

The persisted secret is encrypted with 128-bit AES and, where possible, the file permissions are set so that it is accessible only by the user that the gateway is running as.

After persisting the secret, ensure that the file at conf/security/master has the appropriate permissions set for your environment. This is probably the most important layer of defense for the master secret. Do not assume that the encryption is sufficient protection.

A specific user should be created to run the gateway; this will help protect the persisted master file.
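
A minimal sketch of tightening the persisted file, assuming a dedicated service account named knox and the conf/security/master location mentioned above; adjust both to your environment.

chown knox:knox conf/security/master
chmod 600 conf/security/master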

Management of Security Artifacts

There are a number of artifacts that are used by the gateway in ensuring the security of wire level communications, access to protected resources and the encryption of sensitive data. These artifacts can be managed from outside of the gateway instances or generated and populated by the gateway instance itself.

The following is a description of how this is coordinated with both standalone (development, demo, etc) gateway instances and instances as part of a cluster of gateways in mind.

Upon start of the gateway server we:

1. Look for an identity store at conf/security/keystores/gateway.jks. The identity store contains the certificate and private key used to represent the identity of the server for SSL connections and signature creation.

  • If there is no identity store we create one and generate a self-signed
    certificate for use in standalone/demo mode. The certificate is stored
    with an alias of gateway-identity.
  • If an identity store is found then we ensure that it can be loaded
    using the provided master secret and that there is an alias called
    gateway-identity.

2. Look for a credential store at conf/security/keystores/__gateway-credentials.jceks. This credential
store is used to store secrets/passwords that are used by the gateway. For instance, this is where the pass-phrase for accessing the
gateway-identity certificate is kept.

  • If there is no credential store found then we create one and populate it
    with a generated pass-phrase for the alias gateway-identity-passphrase.
    This is coordinated with the population of the self-signed cert into the
    identity-store.
  • If a credential store is found then we ensure that it can be loaded using
    the provided master secret and that the expected aliases have been
    populated with secrets.
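
One way to inspect the stores described above is with the standard JDK keytool; when prompted for a password, supply the master secret. The commands below are a sketch that assumes the default store locations named above.

keytool -list -keystore conf/security/keystores/gateway.jks
keytool -list -storetype JCEKS -keystore conf/security/keystores/__gateway-credentials.jceks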

Upon deployment of a Hadoop cluster topology within the gateway we:

1. Look for a credential store for the topology. For instance, we have a
sample topology that gets deployed out of the box. We look for
conf/security/keystores/sample-credentials.jceks. This topology specific
credential store is used for storing secrets/passwords that are used for
encrypting sensitive data with topology specific keys.

  • If no credential store is found for the topology being deployed then
    one is created for it. Population of the aliases is delegated to the
    configured providers within the system that will require the use of a
    secret for a particular task. They may programmatically set the value
    of the secret or choose to have the value for the specified alias
    generated through the AliasService.
  • If a credential store is found then we ensure that it can be loaded
    with the provided master secret and the configured providers have the
    opportunity to ensure that the aliases are populated and if not to
    populate them.

By leveraging the algorithm described above we can provide a window of
opportunity for management of these artifacts in a number of ways.

1. Using a single gateway instance as a master instance, the artifacts can be
generated or placed into the expected location and then replicated across
all of the slave instances before startup (see the sketch after this list).
2. Using an NFS mount as a central location for the artifacts would provide
a single source of truth without the need to replicate them over the
network. Of course, NFS mounts have their own challenges.
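
As a sketch of the first option, replicating the security artifacts from a master instance to another gateway host might look like the following; the hostname and paths are placeholders.

rsync -av {GATEWAY_HOME}/conf/security/ gateway2.example.com:{GATEWAY_HOME}/conf/security/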

Summary of Secrets to be Managed:

1. Master secret - the same for all gateway instances in a cluster of gateways
2. All security related artifacts are protected with the master secret
3. Secrets used by the gateway itself are stored within the gateway credential
store and are the same across all gateway instances in the cluster of
gateways
4. Secrets used by providers within cluster topologies are stored in topology
specific credential stores and are the same for the same topology across
the cluster of gateway instances. However, they are specific to the
topology - so secrets for one Hadoop cluster are different from those of
another. This allows for fail-over from one gateway instance to another
even when encryption is being used while not allowing the compromise of one
encryption key to expose the data for all clusters.

NOTE: the SSL certificate will need special consideration depending on the
type of certificate. Wildcard certs may be able to be shared across all
gateway instances in a cluster. When certs are dedicated to specific machines
the gateway identity store will not be able to be blindly replicated as
hostname verification problems will ensue. Obviously, trust-stores will need
to be taken into account as well.

Mapping Gateway URLs to Hadoop cluster URLs

The Gateway functions much like a reverse proxy. As such it maintains a
mapping of URLs that are exposed externally by the Gateway to URLs that are
provided by the Hadoop cluster. Examples of mappings for the NameNode and
Templeton are shown below. These mapping are generated from the combination
of the Gateway configuration file (i.e. {GATEWAY_HOME}/conf/gateway-site.xml)
and the cluster topology descriptors
(e.g. {GATEWAY_HOME}/deployments/{cluster-name}.xml).

  • HDFS (NameNode)
    • Gateway: http://{gateway-host}:{gateway-port}/{gateway-path}/{cluster-name}/namenode/api/v1
    • Cluster: http://{namenode-host}:50070/webhdfs/v1
  • WebHCat (Templeton)
    • Gateway: http://{gateway-host}:{gateway-port}/{gateway-path}/{cluster-name}/templeton/api/v1
    • Cluster: http://{templeton-host}:50111/templeton/v1
  • Oozie
    • Gateway: http://{gateway-host}:{gateway-port}/{gateway-path}/{cluster-name}/oozie/api/v1
    • Cluster: http://{templeton-host}:11000/oozie/v1

The values for {gateway-host}, {gateway-port}, {gateway-path} are
provided via the Gateway configuration file
(i.e. {GATEWAY_HOME}/conf/gateway-site.xml).

The value for {cluster-name} is derived from the name of the cluster
topology descriptor (e.g. {GATEWAY_HOME}/deployments/{cluster-name}.xml).

The value for {namenode-host} and {templeton-host} is provided via the
cluster topology descriptor
(e.g. {GATEWAY_HOME}/deployments/{cluster-name}.xml).

Note: The ports 50070, 50111 and 11000 are the defaults for NameNode,
Templeton and Oozie respectively. Their values can also be provided via
the cluster topology descriptor if your Hadoop cluster uses different
ports.
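
For example, assuming the out-of-the-box values used elsewhere in this guide (gateway-host localhost, gateway-port 8443, gateway-path gateway, cluster-name sample), a WebHDFS request maps as follows, with {namenode-host} taken from the topology descriptor:

Gateway: https://localhost:8443/gateway/sample/namenode/api/v1/tmp?op=LISTSTATUS
Cluster: http://{namenode-host}:50070/webhdfs/v1/tmp?op=LISTSTATUS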

Enabling logging

If necessary you can enable additional logging by editing the
log4j.properties file in the conf directory. Changing the rootLogger
value from ERROR to DEBUG will generate a large amount of debug logging.
A number of useful, finer-grained loggers are also provided in the file.
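
For example, if the shipped log4j.properties contains a line of the form shown first below (the appender name stdout is illustrative and may differ in your file), changing only the level enables debug logging:

# before
log4j.rootLogger=ERROR, stdout
# after
log4j.rootLogger=DEBUG, stdout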

Filing bugs

Bugs can be filed using Jira. Please include the results of the command below in the Environment section. Also include the version of Hadoop being used.

java -jar bin/server.jar -version

Trademarks

Apache Knox Gateway, Apache, the Apache feather logo and the Apache Knox Gateway project logos are trademarks of The Apache Software Foundation. All other marks mentioned may be trademarks or registered trademarks of their respective owners.

License

Apache Knox uses the standard Apache license.

Privacy Policy

Apache Knox uses the standard Apache privacy policy.
