
In this blog post we will walk through what it takes to set up a new telemetry source in Metron.  For this example we will set up a new sensor, capture the sensor logs, pipe the logs to Kafka, pick up the logs with a Metron parsing topology, parse them, and run them through the Metron stream processing pipeline.

Our example sensor will be a Squid Proxy.  Squid is a caching proxy for the Web supporting HTTP, HTTPS, FTP, and more.  Squid logs are simple to explain and easy to parse, and the velocity of traffic coming from Squid is representative of a typical network-based sensor.  Hence, we feel it's a good telemetry to use for this tutorial.

Step 1: Build the Metron code 

Prior to going through this tutorial, make sure you have Metron properly built and tested.  Please see here for full Metron installation and validation instructions.  Verify that the project has been built before creating the VM.  First, let's get Metron from Apache.

git clone https://git-wip-us.apache.org/repos/asf/incubator-metron.git

git tag -l

Now you will see a list of Metron releases.  You will see major releases, minor releases, and release candidates.  Refer to the Metron website to determine the current stable release recommended for downloading.  Once you select the Metron release, run the following commands to check it out:

cd incubator-metron

git checkout tags/[MetronReleaseVersion]

Now that we have downloaded Metron we need to build it.  For the purposes of this exercise we will build without running through Metron's unit and integration test suites.  To do so run the following command:

mvn clean package -DskipTests

Now that we have downloaded and built Metron, it's on to the next step.  Next we need to make a decision about the Metron environment and which parts of Metron we would like to deploy.  For the purpose of this exercise we will assume that we are deploying the full Metron stack on either the QuickDev image or Amazon AWS.  We will provide instructions for both.

Step 2a : Setup the QuickDev Image

If you want to take Metron for a spin locally on your laptop, you need to set up the QuickDev environment.  The QuickDev environment is primarily intended for developers and people who want to spin up Metron quickly without incurring costs on AWS.  But fair warning: this environment is not meant for production and is not performant.  It is intended merely for demonstration and development.  To spin up QuickDev, perform the following steps:

 

vagrant plugin install vagrant-hostmanager

cd metron-deployment/vagrant/quick-dev-platform

./launch_dev_image.sh

vagrant ssh

After executing the above commands a Metron VM will be built (called node1) and you will be logged in as user vagrant.  There will be 4 topologies running, but one must be stopped because the VM only has 4 Storm worker slots available.
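To free a worker slot, you can list the running topologies and kill one you don't need.  A minimal sketch using the Storm CLI (the topology name is whichever one storm list reports for the topology you choose to stop):

storm list

storm kill [topology_name]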

Step 2b : Setup a Full AWS Metron Environment

The AWS environment is intended to install Metron in the AWS cloud.  By default Metron is installed with a few demo sensors.  To build the AWS environment in full, run the following commands:

cd metron-deployment/amazon-ec2/

./run.sh

This will spin up your full AWS environment.  At the end of the install the environment summary will be displayed as follows:

ok: [localhost] => {
    "Success": [
        "Apache Metron deployed successfully",
        "  Metron @ [METRON_HOST]:5000",
        "  Ambari @ [AMBARI_HOST]:8080",
        "  Sensor Status @ [SENSOR_HOST]:2812",
        "  Topology Status @ [MONIT_HOST]:2812",
        "For additional information, see https://metron.incubator.apache.org/"
    ]
}

Step 2c : Setup a Partial AWS Metron Environment

If you don't want to set up a full Metron environment, you can deploy individual Metron modules.  To do so, you need to define the Metron inventory.  A sample inventory is provided with Metron to make custom inventories easier to define.  To get to the inventory, run the following command:

cd incubator-metron/metron-deployment/inventory/metron_example

There you will see two files: hosts and environment_vars/all.  The first thing we need to define is the set of environment variables for the Ansible scripts in the environment_vars/all file.  This file contains granular settings for the installation of each Metron component, including enabling or disabling the installation of a specific component as well as additional specifications.  The second file you need to define is the hosts file.  The hosts file defines the Metron cluster and what role individual nodes in the cluster will play.  The following roles are possible (a minimal hosts file sketch follows the list):

  • [ambari_master] - host running Ambari
  • [ambari_slaves] - all Ambari-managed hosts
  • [metron_kafka_topics] - host used to create the Kafka topics required by Metron. Requires a Kafka broker.
  • [metron_hbase_tables] - host used to create the HBase tables required by Metron. Requires an HBase client.
  • [enrichment] - host that submits the topology code to Storm. Requires a Storm client.
  • [search] - host(s) where Elasticsearch will be installed
  • [web] - host where the Metron UI and underlying services will be installed
  • [sensors] - host where network data will be collected and published to Kafka
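A minimal hosts file might look like the sketch below (the hostnames are hypothetical; adjust them to your cluster):

[ambari_master]
host1

[ambari_slaves]
host1
host2

[metron_kafka_topics]
host2

[metron_hbase_tables]
host2

[enrichment]
host1

[search]
host2

[web]
host2

[sensors]
host1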

Once you configure the hosts and environment variables, run the following commands:

cd incubator-metron/metron-deployment/playbooks

ansible-playbook -i ../inventory/project_name metron_install.yml --skip-tags="solr"

Step 2d : Setup Metron on an existing Ambari-managed Cluster (bare metal or AWS)

For this part it does not matter if you are installing core Metron components on bare metal or VMs.  However it does matter for Metron sensors, as they need to be custom-compiled to the specific environment on which they are running.  Currently we only support sensor installs on CentOS 6.7, Ansible 2.0.0.2, Java 8, and Intel x520 series of network cards.  
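Before running the install, you can confirm the tool versions on the target host (both commands simply print version information):

ansible --version

java -version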

First, we assume that the Ambari cluster already exists.  If it does not, you can deploy it by following these instructions:

https://ambari.apache.org/1.2.1/installing-hadoop-using-ambari/content/ambari-chap1.html

The sample configuration for a 12-node cluster would be as follows:

node1 - [ambari_master]
node2 - [ambari_slaves]
node3 - [ambari_slaves]
node4 - [ambari_slaves]
node5 - [ambari_slaves]
node6 - [ambari_slaves]
node7 - [ambari_slaves]
node8 - [ambari_slaves]
node9-12 - provision the OS and Java only; leave these nodes alone for now

Now we need to define a hosts file in the inventory to install Metron on top of this cluster.  First we need to define a host to provision the Metron HBase tables (if using the canned enrichments provided with Metron):

[metron_hbase_tables]
node9

Then we need to define a host to provision Metron's Kafka topics (if using the canned sensors provided with Metron):

[metron_kafka_topics]
node9

Then set up the node for your PCAP server/replay capability (if using the canned PCAP probe provided with Metron):

[pcap_server]
node9

Then define the node which will contain the Storm jars and deployment scripts for Metron's parser and enrichment topologies:

#3rd ambari_slave
[enrichment]
node1

Then define the nodes which will contain the Elasticsearch master and slave nodes:

#1 or more
[search]
node10
node11
node12

Then define the node which will contain the canned Metron sensors YAF, Bro, Snort, and PCAP (if using the canned Metron sensors):

#1 only
[sensors]
node1

Then define the node where Kibana will be installed:

#same as mysql in 12 node topology
[web]
node12

Finally, define the node where MySQL will be installed (if using the Geo enrichment):

[mysql]
node12

Then, based on your cluster definition, edit the group_vars/all file and run:

ansible-playbook -i ../inventory/project_name metron_install.yml --skip-tags="solr"

This will automatically install Metron on an Ambari-managed cluster.  For more detailed instructions please refer to: 

https://github.com/dlyle65535/incubator-metron/blob/METRON-260/metron-deployment/README.md

Step 3 : Installing a sample sensor

Log into the sensors node and install the Squid sensor.  If you are on the QuickDev platform, your VM will be called node1.  If you are in the AWS environment, your sensor node will be tagged with the [sensors] tag.  You can look through the AWS console to find which node in your cluster has this tag.

Once you log into the sensor node you can install the Squid sensor.  

sudo yum install squid

sudo service squid start 

This will run through the install, and the Squid sensor will be installed and started.  Now let's look at the Squid logs.

sudo su -

cd /var/log/squid

ls 

You will see that there are three types of logs available: access.log, cache.log, and squid.out.  We are interested in access.log, as that is the log that records the proxy usage.  Initially the log is empty.  Let's generate a few entries for it.

squidclient "http://www.aliexpress.com/af/shoes.html?ltype=wholesale&d=y&origin=n&isViewCP=y&catId=0&initiative_id=SB_20160622082445&SearchText=shoes"
squidclient "http://www.help.1and1.co.uk/domains-c40986/transfer-domains-c79878"
squidclient "http://www.pravda.ru/science/"
squidclient "https://www.google.com/maps/place/Waterford,+WI/@42.7639877,-88.2867248,12z/data=!4m5!3m4!1s0x88059e67de9a3861:0x2d24f51aad34c80b!8m2!3d42.7630722!4d-88.2142563"
squidclient "http://www.brightsideofthesun.com/2016/6/25/12027078/anatomy-of-a-deal-phoenix-suns-pick-bender-chriss"
squidclient "https://www.microsoftstore.com/store/msusa/en_US/pdp/Microsoft-Band-2-Charging-Stand/productID.329506400"
squidclient "http://www.autonews.com/article/20151115/RETAIL04/311169971/toyota-fj-cruiser-is-scarce-hot-and-high-priced"
squidclient "https://tfl.gov.uk/plan-a-journey/"
squidclient "https://www.facebook.com/Africa-Bike-Week-1550200608567001/"
squidclient "http://www.ebay.com/itm/02-Infiniti-QX4-Rear-spoiler-Air-deflector-Nissan-Pathfinder-/172240020293?fits=Make%3AInfiniti%7CModel%3AQX4&hash=item281a4e2345:g:iMkAAOSwoBtW4Iwx&vxp=mtr"
squidclient "http://www.recruit.jp/corporate/english/company/index.html"
squidclient "http://www.lada.ru/en/cars/4x4/3dv/about.html"
squidclient "http://www.help.1and1.co.uk/domains-c40986/transfer-domains-c79878"
squidclient "http://www.aliexpress.com/af/shoes.html?ltype=wholesale&d=y&origin=n&isViewCP=y&catId=0&initiative_id=SB_20160622082445&SearchText=shoes"

 

vi /var/log/squid/access.log

In production environments you would configure your users' web browsers to point to the proxy server, but for the sake of simplicity in this tutorial we will use the client that is packaged with the Squid installation.  After we use the client to simulate proxy requests, the Squid log entries look as follows:

1467011157.401 415 127.0.0.1 TCP_MISS/200 337891 GET http://www.aliexpress.com/af/shoes.html? - DIRECT/207.109.73.154 text/html
1467011158.083 671 127.0.0.1 TCP_MISS/200 41846 GET http://www.help.1and1.co.uk/domains-c40986/transfer-domains-c79878 - DIRECT/212.227.34.3 text/html
1467011159.978 1893 127.0.0.1 TCP_MISS/200 153925 GET http://www.pravda.ru/science/ - DIRECT/185.103.135.90 text/html
1467011160.044 58 127.0.0.1 TCP_MISS/302 1471 GET https://www.google.com/maps/place/Waterford,+WI/@42.7639877,-88.2867248,12z/data=!4m5!3m4!1s0x88059e67de9a3861:0x2d24f51aad34c80b!8m2!3d42.7630722!4d-88.2142563 - DIRECT/172.217.3.164 text/html
1467011160.145 155 127.0.0.1 TCP_MISS/200 133234 GET http://www.brightsideofthesun.com/2016/6/25/12027078/anatomy-of-a-deal-phoenix-suns-pick-bender-chriss - DIRECT/151.101.41.52 text/html
1467011161.224 1073 127.0.0.1 TCP_MISS/200 141323 GET https://www.microsoftstore.com/store/msusa/en_US/pdp/Microsoft-Band-2-Charging-Stand/productID.329506400 - DIRECT/2.19.142.162 text/html
1467011161.491 262 127.0.0.1 TCP_MISS/302 1955 GET http://www.autonews.com/article/20151115/RETAIL04/311169971/toyota-fj-cruiser-is-scarce-hot-and-high-priced - DIRECT/54.88.37.253 text/html
1467011162.627 1133 127.0.0.1 TCP_MISS/200 88544 GET https://tfl.gov.uk/plan-a-journey/ - DIRECT/54.171.145.187 text/html
1467011163.515 879 127.0.0.1 TCP_MISS/200 461930 GET https://www.facebook.com/Africa-Bike-Week-1550200608567001/ - DIRECT/69.171.230.68 text/html
1467011164.286 749 127.0.0.1 TCP_MISS/200 190407 GET http://www.ebay.com/itm/02-Infiniti-QX4-Rear-spoiler-Air-deflector-Nissan-Pathfinder-/172240020293? - DIRECT/23.74.62.44 text/html
1467011164.447 128 127.0.0.1 TCP_MISS/404 12920 GET http://www.recruit.jp/corporate/english/company/index.html - DIRECT/23.74.66.205 text/html
1467011166.125 1659 127.0.0.1 TCP_MISS/200 69469 GET http://www.lada.ru/en/cars/4x4/3dv/about.html - DIRECT/195.144.198.77 text/html
1467011166.543 401 127.0.0.1 TCP_MISS/200 41846 GET http://www.help.1and1.co.uk/domains-c40986/transfer-domains-c79878 - DIRECT/212.227.34.3 text/html
1467011168.519 445 127.0.0.1 TCP_MISS/200 336155 GET http://www.aliexpress.com/af/shoes.html? - DIRECT/207.109.73.154 text/html

The format of the log is: timestamp | time elapsed | remotehost | code/status | bytes | method | URL | rfc931 | peerstatus/peerhost | type
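For example, the first entry above maps onto these fields: timestamp=1467011157.401, time elapsed=415, remotehost=127.0.0.1, code/status=TCP_MISS/200, bytes=337891, method=GET, URL=http://www.aliexpress.com/af/shoes.html?, rfc931=- (unavailable), peerstatus/peerhost=DIRECT/207.109.73.154, type=text/html.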

Now that we have the sensor set up and generating logs, we need to figure out how to pipe these logs to a Kafka topic.  To do so, we first define a few environment variables and then set up a new Kafka topic for Squid.

 

Step 4 : Define Environment Variables 

export ZOOKEEPER=

export BROKERLIST=

export HDP_HOME=

export METRON_VERSION=
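The values depend on your environment.  As a purely illustrative example, on the QuickDev image (where everything runs on node1) they might look like this; the HDP version directory below is hypothetical, so substitute your own:

export ZOOKEEPER=node1

export BROKERLIST=node1

export HDP_HOME=/usr/hdp/2.4.0.0-169

export METRON_VERSION=0.1BETA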

Step 5 : Create Kafka topics and ingest sample data 

/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --zookeeper $ZOOKEEPER:2181 --create --topic squid --partitions 1 --replication-factor 1

/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --zookeeper $ZOOKEEPER:2181 --list

The commands above set up a new Kafka topic for Squid and list all topics so you can confirm it was created.  Now let's test how we can pipe the Squid logs to Kafka:

cat /var/log/squid/access.log | /usr/hdp/current/kafka-broker/bin/kafka-console-producer.sh --broker-list $BROKERLIST:6667 --topic squid

$HDP_HOME/kafka/bin/kafka-console-consumer.sh --zookeeper $ZOOKEEPER:2181 --topic squid --from-beginning

This should ingest our Squid logs into Kafka.  Now we are ready to tackle the Metron parsing topology setup.  The first thing we need to do is decide whether we will be using a Java-based parser or a Grok-based parser for the new telemetry.  In this example we will be using the Grok parser.  The Grok parser is perfect for structured or semi-structured logs that are well understood (check) and for telemetries with lower volumes of traffic (check).  Next, we define the Grok expression for our log.  Refer to the Grok documentation for additional details.  In our case the pattern is:

SQUID_DELIMITED %{NUMBER:timestamp} %{SPACE:UNWANTED}  %{INT:elapsed} %{IPV4:ip_src_addr} %{WORD:action}/%{NUMBER:code} %{NUMBER:bytes} %{WORD:method} %{NOTSPACE:url} - %{WORD:UNWANTED}\/%{IPV4:ip_dst_addr} %{WORD:UNWANTED}\/%{WORD:UNWANTED}

  • Note: this pattern is already pre-loaded under /apps/metron/patterns/squid

 

Notice that I apply the UNWANTED tag to any part of the message that I don't want included in my resulting JSON structure.  Also notice that I applied the naming convention to the IPV4 fields by referencing the list of field conventions.  The last thing I need to do is validate my Grok pattern.  For our test we will be using a free Grok validator called Grok Constructor.
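Once the pattern validates, parsing the first access.log entry above should yield a JSON structure with fields roughly like the following (a sketch; the exact field set depends on the parser configuration):

{
  "timestamp": "1467011157.401",
  "elapsed": "415",
  "ip_src_addr": "127.0.0.1",
  "action": "TCP_MISS",
  "code": "200",
  "bytes": "337891",
  "method": "GET",
  "url": "http://www.aliexpress.com/af/shoes.html?",
  "ip_dst_addr": "207.109.73.154"
}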


Now that the Grok pattern has been defined, we need to save it and move it to HDFS.  Existing Grok parsers that ship with Metron are staged under /apps/metron/patterns/.

(You don't need to do this step if the patterns are already pre-loaded.)

[root@node1 bin]# hdfs dfs -ls /apps/metron/patterns/
Found 5 items
-rw-r--r--   3 hdfs hadoop      13427 2016-04-25 07:07 /apps/metron/patterns/asa
-rw-r--r--   3 hdfs hadoop       5203 2016-04-25 07:07 /apps/metron/patterns/common
-rw-r--r--   3 hdfs hadoop        524 2016-04-25 07:07 /apps/metron/patterns/fireeye
-rw-r--r--   3 hdfs hadoop       2552 2016-04-25 07:07 /apps/metron/patterns/sourcefire
-rw-r--r--   3 hdfs hadoop        879 2016-04-25 07:07 /apps/metron/patterns/yaf

We need to move our new Squid pattern into the same directory.  Create a file containing the Grok pattern above:

touch /tmp/squid

vi /tmp/squid

Then move it to HDFS:

su - hdfs

hdfs dfs -put /tmp/squid /apps/metron/patterns/

exit
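To confirm the pattern is staged, you can read it back out of HDFS:

hdfs dfs -cat /apps/metron/patterns/squid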

Now that the Grok pattern is staged in HDFS, we need to define a parser configuration for the Metron Parsing Topology.  The configurations are kept in Zookeeper, so the sensor configuration must be uploaded there after it has been created.  A Grok parser configuration follows this format:

{
  "parserClassName": "org.apache.metron.parsers.GrokParser",
  "sensorTopic": "sensor name",
  "parserConfig": {
    "grokPath": "grok pattern",
    "patternLabel": "grok label",
    ... other optional fields
  }
}

Create a Squid Grok parser configuration file at /usr/metron/$METRON_VERSION/config/zookeeper/parsers/squid.json with the following contents:


{
  "parserClassName": "org.apache.metron.parsers.GrokParser",
  "sensorTopic": "squid",
  "parserConfig": {
    "grokPath": "/apps/metron/patterns/squid",
    "patternLabel": "SQUID_DELIMITED",
    "timestampField": "timestamp"
  },
  "fieldTransformations": [
    {
      "transformation": "MTL",
      "output": [ "full_hostname", "domain_without_subdomains" ],
      "config": {
        "full_hostname": "URL_TO_HOST(url)",
        "domain_without_subdomains": "DOMAIN_REMOVE_SUBDOMAINS(full_hostname)"
      }
    }
  ]
}

 

Notice the use of fieldTransformations in the parser configuration.  Our Grok parser is set up to extract the URL, but really we want just the domain, or even the domain without subdomains.  To do this, we can use the Metron Transformation Language field transformation.  The Metron Transformation Language is a Domain Specific Language which allows users to define extra transformations to be performed on the messages flowing through the topology.  It supports a wide range of common network and string related functions as well as function composition and list operations.  In our case, we extract the hostname from the URL via the URL_TO_HOST function and remove the subdomains with DOMAIN_REMOVE_SUBDOMAINS, thereby adding two new fields, "full_hostname" and "domain_without_subdomains", to each message.
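For example, given a message whose url field is "http://www.aliexpress.com/af/shoes.html?", these two transformations should add fields along these lines (a sketch of the expected output):

"full_hostname" : "www.aliexpress.com"
"domain_without_subdomains" : "aliexpress.com"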

A script is provided to upload configurations to Zookeeper.  Upload the new parser config to Zookeeper:

/usr/metron/$METRON_VERSION/bin/zk_load_configs.sh --mode PUSH -i /usr/metron/$METRON_VERSION/config/zookeeper -z $ZOOKEEPER:2181 
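To verify the configurations were uploaded correctly, you can dump them back out of Zookeeper (assuming the DUMP mode of this script is available in your Metron version):

/usr/metron/$METRON_VERSION/bin/zk_load_configs.sh --mode DUMP -z $ZOOKEEPER:2181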

Start the new squid parser topology:

/usr/metron/$METRON_VERSION/bin/start_parser_topology.sh -k $BROKERLIST:6667 -z $ZOOKEEPER:2181 -s squid
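You can also confirm the topology was submitted from the command line (assuming the Storm client is available on this node); the squid topology should appear as ACTIVE:

storm list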

Navigate to the squid parser topology in the Storm UI at http://node1:8744/index.html and verify the topology is up with no errors:



Now that we have a new running squid parser topology, generate some data to parse by running this command several times:

tail /var/log/squid/access.log | /usr/hdp/current/kafka-broker/bin/kafka-console-producer.sh --broker-list $BROKERLIST:6667 --topic squid

Refresh the Storm UI and it should report data being parsed:

Then navigate to Elasticsearch at http://node1:9200/_cat/indices?v and verify that a squid index has been created:

health status index                     pri rep docs.count docs.deleted store.size pri.store.size
yellow open   yaf_index_2016.04.25.15     5   1       5485            0        4mb            4mb 
yellow open   snort_index_2016.04.26.12   5   1      24452            0     14.4mb         14.4mb 
yellow open   bro_index_2016.04.25.16     5   1       1295            0      1.9mb          1.9mb
yellow open   squid_index_2016.04.26.13   5   1          1            0      7.3kb          7.3kb 
yellow open   yaf_index_2016.04.25.17     5   1      30750            0     17.4mb         17.4mb 

 

In order to verify that the messages were indexed correctly, first install the Elasticsearch Head plugin:

/usr/share/elasticsearch/bin/plugin -install mobz/elasticsearch-head/1.x

And navigate to http://node1:9200/_plugin/head/

There you will see the parsed messages plus performance timestamps.  We will discuss the performance timestamps in another blog entry.



By convention, the index where the new messages will be indexed is called squid_index_[timestamp] and the document type is squid_doc.
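As a quick sanity check, you can pull back one indexed Squid document directly from Elasticsearch (the index wildcard below assumes the naming convention above):

curl 'http://node1:9200/squid_index_*/squid_doc/_search?pretty&size=1'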

Now that we have the messages parsed and indexed, we need to set up a Kibana dashboard.  To do so, access the dashboard at http://node1:5000/#/dashboard/file/default.json

To create a new ingest histogram, we first need to set up a pinned query.  Click on the query + button and pin a query for _type:squid_doc.

Once the query is pinned, it will show up in the pinned queries bar.


Once the query is established, we can create a histogram panel.  In the panel settings, point the panel to the Squid Logs pinned query you just created, and make sure that the time field points to the field called "timestamp".


Click OK and you should get a histogram of the Squid events over time.



Now, to add a detailed telemetry table, create a new table panel and, as with the histogram panel, point it to the Squid Logs pinned query.  The resulting table will display the parsed Squid messages.


