

  • A plugin.xml file that tells nutch about your plugin.
  • A build.xml file that tells ant how to build your plugin.
  • The source code of your plugin in the directory structure recommended/src/java/org/apache/nutch/parse/recommended/[Source_Here].


Your plugin.xml file should look like this:


To build it, change to the plugin directory where you saved the build.xml file (probably [YourCheckoutDir]/src/plugin/recommended) and run ant there.


Hopefully you'll get a long string of text, followed by a message telling you of a successful build.


We'll need to create two files for unit testing: a page to test against and a class to run the test. Again, let's assume your plugin directory is [YourCheckoutDir]/src/plugin and that your test plugin is under that directory. Create the directory recommended/data and, under it, a new file called recommended.html:


<html lang="en">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <meta name="generator" content="TextMate">
    <meta name="author" content="Ricardo J. Méndez">
    <meta name="recommended" content="recommended-content"/>
    <!-- Date: 2007-02-12 -->
</head>
<body>
    Recommended meta tag test.
</body>
</html>

This file contains the meta tag we're currently parsing for, with the value recommended-content. After that gratuitous bit of free publicity for my current favorite editor, let's move on to the testing class.
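Before wiring up the Nutch test, it can help to confirm that the page really carries the tag. The following is a minimal sketch in plain Java, not the way Nutch or the plugin parses HTML: the class name MetaTagCheck and its regular expression are hypothetical, and the pattern only handles the attribute order used in recommended.html (name first, then content).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaTagCheck {

  // Returns the content attribute of <meta name="NAME" content="...">,
  // or null if no such tag is found. Assumes name precedes content.
  static String metaContent(String html, String name) {
    Pattern p = Pattern.compile(
        "<meta\\s+name=\"" + Pattern.quote(name) + "\"\\s+content=\"([^\"]*)\"",
        Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(html);
    return m.find() ? m.group(1) : null;
  }

  public static void main(String[] args) {
    // Inline copy of the relevant part of the test page.
    String html = "<html lang=\"en\"><head>"
        + "<meta name=\"recommended\" content=\"recommended-content\"/>"
        + "</head><body>Recommended meta tag test.</body></html>";
    System.out.println(metaContent(html, "recommended")); // prints "recommended-content"
  }
}
```

A regular expression is enough for a quick check like this, but it is no substitute for the real HTML parser that the unit test exercises.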


Create a new tree structure, this time for the test code, for example recommended/src/test/org/apache/nutch/parse/recommended/[Test_Source_Here]. There you'll create a file called TestRecommendedParser.java:

package org.apache.nutch.parse.recommended;

import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseUtil;
import org.apache.nutch.protocol.Content;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import junit.framework.TestCase;

/**
 * Loads test page recommended.html and verifies that the recommended
 * meta tag has recommended-content as its value.
 */
public class TestRecommendedParser extends TestCase {

  // ASSUMPTION: the property name is missing in the source text. "test.data"
  // is assumed here as the system property pointing at the test data directory.
  private static final File testDir =
    new File(System.getProperty("test.data"));

  public void testPages() throws Exception {
    // ASSUMPTION: the URL is empty in the source text; this is a placeholder,
    // used only because the parser needs a well-formed base URL.
    pageTest(new File(testDir, "recommended.html"), "http://example.com/",
             "recommended-content");
  }

  public void pageTest(File file, String url, String recommendation)
    throws Exception {

    String contentType = "text/html";

    // Read the whole test page into memory.
    InputStream in = new FileInputStream(file);
    ByteArrayOutputStream out = new ByteArrayOutputStream((int)file.length());
    byte[] buffer = new byte[1024];
    int i;
    while ((i = in.read(buffer)) != -1) {
      out.write(buffer, 0, i);
    }
    in.close();
    byte[] bytes = out.toByteArray();
    Configuration conf = NutchConfiguration.create();

    // Parse the page with the parse-html plugin, which runs our filter.
    Content content =
      new Content(url, url, bytes, contentType, new Metadata(), conf);
    Parse parse = new ParseUtil(conf).parseByExtensionId("parse-html",content);

    Metadata metadata = parse.getData().getContentMeta();
    assertEquals(recommendation, metadata.get("Recommended"));
    // Compare string values with equals(); != would only compare references.
    assertFalse("somesillycontent".equals(metadata.get("Recommended")));
  }
}

As you can see, this code first parses the document, then looks for the Recommended item in the contentMeta object (which we saved in RecommendedParser) and verifies that it is set to the value recommended-content.
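The read loop in pageTest is the easiest part of the test to get wrong. As a rough illustration, here is the same loop as a standalone helper that uses try-with-resources so the stream is closed even if a read fails. The class name StreamBytes and the method readAll are hypothetical and not part of Nutch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StreamBytes {

  // Copies everything from the stream into a byte array, then closes the stream.
  static byte[] readAll(InputStream source) throws IOException {
    try (InputStream in = source;
         ByteArrayOutputStream out = new ByteArrayOutputStream()) {
      byte[] buffer = new byte[1024];
      int n;
      // read() returns the number of bytes read, or -1 at end of stream.
      while ((n = in.read(buffer)) != -1) {
        out.write(buffer, 0, n);
      }
      return out.toByteArray();
    }
  }

  public static void main(String[] args) throws IOException {
    byte[] data = "Recommended meta tag test.".getBytes(StandardCharsets.UTF_8);
    byte[] copy = readAll(new ByteArrayInputStream(data));
    System.out.println(copy.length == data.length);
  }
}
```

In the test class, a FileInputStream for recommended.html would take the place of the in-memory stream used here.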


Now add some lines to the build.xml file located in the [YourCheckoutDir]/src/plugin/recommended directory, so that at a minimum its contents are:

<?xml version="1.0"?>

<project name="recommended" default="jar">

  <import file="../build-plugin.xml"/>

  <!-- for junit test -->
  <mkdir dir="${build.test}/data"/>
  <copy file="data/recommended.html" todir="${build.test}/data"/>
</project>


These lines will copy the test data to the proper directory for testing.
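To make the effect of the two Ant tasks concrete, here is a rough Java equivalent of "create the directory, then copy the file into it". It is only an illustration and not part of the plugin or the build: the class name CopyTestData and the temporary paths are hypothetical, and unlike Ant's copy task, which by default skips files that are already up to date, this sketch always overwrites.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class CopyTestData {

  // Mirrors <mkdir dir="..."/> followed by <copy file="..." todir="..."/>.
  static Path copyInto(Path sourceFile, Path targetDir) throws IOException {
    // Like mkdir: creates missing parents and does not fail if it exists.
    Files.createDirectories(targetDir);
    Path target = targetDir.resolve(sourceFile.getFileName());
    return Files.copy(sourceFile, target, StandardCopyOption.REPLACE_EXISTING);
  }

  public static void main(String[] args) throws IOException {
    // Work in a throwaway directory instead of a real Nutch checkout.
    Path work = Files.createTempDirectory("recommended-demo");
    Path source = Files.write(work.resolve("recommended.html"),
        "Recommended meta tag test.".getBytes(StandardCharsets.UTF_8));
    Path copied = copyInto(source, work.resolve("build-test").resolve("data"));
    System.out.println(Files.exists(copied));
  }
}
```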
