...

The previous tutorials covering Squid produced a limited data set consisting of only a few basic requests. To make this tutorial more interesting, we need a bit more variety in the sample data.

1. Copy and paste the following set of links into a local file called `links.txt`.

    https://www.amazon.com/Cards-Against-Humanity-LLC-CAHUS/dp/B004S8F7QM/ref=zg_bs_toys-and-games_home_1?pf_rd_p=2140216822&pf_rd_s=center-1&pf_rd_t=2101&pf_rd_i=home&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=2231TS0FE044EZT85PQ4
    https://www.amazon.com/Brain-Game-Cube-Intelligence-Development/dp/B01CRXM1JU/ref=zg_bs_toys-and-games_home_2?pf_rd_p=2140216822&pf_rd_s=center-1&pf_rd_t=2101&pf_rd_i=home&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=MANXEWDTKDH2RD9Y3466
    https://www.amazon.com/Zuru-Balloons-different-colors-Seconds/dp/B00ZPW3U14/ref=zg_bs_toys-and-games_home_3?pf_rd_p=2140216822&pf_rd_s=center-1&pf_rd_t=2101&pf_rd_i=home&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=MANXEWDTKDH2RD9Y3466
    https://www.amazon.com/MAGINOVO-Bluetooth-Headphones-Wireless-Earphones/dp/B01EFKFQL8/ref=zg_bs_electronics_home_1?pf_rd_p=2140225402&pf_rd_s=center-2&pf_rd_t=2101&pf_rd_i=home&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=MANXEWDTKDH2RD9Y3466
    https://www.amazon.com/Amazon-Fire-TV-Stick-Streaming-Media-Player/dp/B00GDQ0RMG/ref=zg_bs_electronics_home_2?pf_rd_p=2140225402&pf_rd_s=center-2&pf_rd_t=2101&pf_rd_i=home&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=MANXEWDTKDH2RD9Y3466
    http://www.walmart.com/ip/All-the-Light-We-Cannot-See/26737727
    http://www.walmart.com/ip/Being-Mortal-Medicine-and-What-Matters-in-the-End/36958209
    http://www.walmart.com/ip/My-Brilliant-Friend-Book-One-Childhood-Adolescence/20527482
    http://www.walmart.com/ip/A-Game-of-Thrones/402949
    http://www.bbc.co.uk/capital/story/20160622-there-are-people-making-millions-from-your-pets-poo
    http://www.bbc.co.uk/earth/story/20160620-can-we-predict-the-time-of-our-death
    http://www.bbc.co.uk/news/uk-england-somerset-36596557
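
As a quick sanity check, counting the lines in the file should report one line per link, 12 in total:

    # links.txt from step 1 should contain 12 lines, one per link
    wc -l links.txt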

2. Run this command to choose one of the links above at random and make a request for that link through Squid. Leave this command running in a terminal so that a continual feed of data is generated as we work through the remainder of this tutorial.

    while sleep 2; do cat links.txt | shuf -n 1 | xargs -i squidclient -g 4 -v "{}"; done
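
Before moving on, you can verify that Squid is actually servicing these requests. Squid writes its access log to `/var/log/squid/access.log` (the same path used in the next step), so tailing that file should show a new entry appear roughly every two seconds:

    # Watch new Squid access log entries appear as the loop issues requests
    tail -f /var/log/squid/access.log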

3. The previous command generates log records at `/var/log/squid/access.log`. Run the following command in another terminal to extract this data and publish it to Kafka. Again, leave this command running so that the continuous feed of data is maintained. You will now have two separate terminal sessions running.


    tail -F /var/log/squid/access.log | /usr/hdp/current/kafka-broker/bin/kafka-console-producer.sh --broker-list [kafka_broker]:[kafka_port] --topic squid
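
If you want to confirm that records are reaching the topic, you can attach a console consumer in a third terminal. Treat this as a sketch: the exact flag depends on your Kafka version, and older HDP releases take `--zookeeper [zookeeper_host]:[zookeeper_port]` in place of `--bootstrap-server`.

    # Consume from the squid topic to verify data is flowing
    # [kafka_broker]:[kafka_port] are the same placeholders as in the producer command
    /usr/hdp/current/kafka-broker/bin/kafka-console-consumer.sh --bootstrap-server [kafka_broker]:[kafka_port] --topic squid --from-beginning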

4. Ensure that the parser topology for Squid is still running, following the steps outlined in the previous tutorials.
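
One way to check this, assuming the Storm CLI is available on the host as in the previous tutorials, is to list the running topologies and confirm the Squid parser appears with status ACTIVE:

    # List running Storm topologies; the squid parser topology should be ACTIVE
    storm list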

...