Introduction

droids-crawler is a Droids module for web crawling. It is designed to use droids-core for queue and task management and to provide crawler-specific functions.

Note: the current (initial) codebase is not yet integrated with droids-core.

Concepts

The following diagram illustrates the high-level concepts of a Droids Crawler:

Crawler Controller

A crawler controller is the control unit of the whole crawling operation. It comprises the following components:

  • Queue
  • Task Master
  • Worker
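
The page does not spell out how the three controller components interact, so the following is only a minimal sketch under one plausible reading: the queue holds pending tasks, the task master drains it, and a worker processes each task. All class and method names here (ControllerSketch, Task, TaskMaster.run, crawl) are hypothetical and are not taken from droids-core or droids-crawler.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch: none of these names come from droids-core or droids-crawler.
class ControllerSketch {

    /** A unit of work; in this sketch, simply a URL to crawl. */
    static final class Task {
        final String url;

        Task(String url) {
            this.url = url;
        }
    }

    /** A worker processes one task; what "processing" means is left open. */
    interface Worker {
        void process(Task task);
    }

    /** The task master drains the queue and hands each task to the worker. */
    static final class TaskMaster {
        private final Queue<Task> queue;
        private final Worker worker;

        TaskMaster(Queue<Task> queue, Worker worker) {
            this.queue = queue;
            this.worker = worker;
        }

        /** Runs until the queue is empty and returns the number of tasks handled. */
        int run() {
            int handled = 0;
            Task task;
            while ((task = queue.poll()) != null) {
                worker.process(task);
                handled++;
            }
            return handled;
        }
    }

    /** Wires the three parts together; returns the URLs in processing order. */
    static List<String> crawl(List<String> seeds) {
        Queue<Task> queue = new ArrayDeque<>();
        for (String seed : seeds) {
            queue.add(new Task(seed));
        }
        List<String> processed = new ArrayList<>();
        // Assumed worker behavior: just record the URL instead of crawling it.
        Worker worker = task -> processed.add(task.url);
        new TaskMaster(queue, worker).run();
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(crawl(Arrays.asList("http://example.com/a", "http://example.com/b")));
    }
}
```

In this sketch a single worker runs in the calling thread. A real controller would more likely dispatch tasks to several workers, which is the kind of queue and task management the introduction attributes to droids-core.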

Crawler Service

A crawler service performs the low-level crawling operations, using the following components:

  • Fetcher
  • Parser
  • Extractor

The crawler service does not have to run in the same JVM as the controller. In the initial release, it is designed to run on Google App Engine and to be reached over the Internet.
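
One way to keep the controller independent of where the service runs is to hide the service behind a small interface, so that a local implementation and a remote proxy are interchangeable. The sketch below illustrates that idea with an in-process implementation that chains a fetcher, a parser, and an extractor. Every name in it (ServiceSketch, CrawlerService.crawl, LocalCrawlerService, inMemoryService) and the division of work between the three components are assumptions made for illustration, not the actual droids-crawler API. The fetcher reads from an in-memory map rather than the network.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: interface names and responsibilities are assumptions.
class ServiceSketch {

    /** Retrieves the raw content behind a URL. */
    interface Fetcher {
        String fetch(String url);
    }

    /** Turns raw content into a parsed form; here, the list of href values. */
    interface Parser {
        List<String> parse(String rawContent);
    }

    /** Picks the items of interest out of the parsed form. */
    interface Extractor {
        List<String> extract(List<String> parsed);
    }

    /** What a controller would see; an implementation may be local or a remote proxy. */
    interface CrawlerService {
        List<String> crawl(String url);
    }

    /** In-process implementation: fetch, then parse, then extract. */
    static final class LocalCrawlerService implements CrawlerService {
        private final Fetcher fetcher;
        private final Parser parser;
        private final Extractor extractor;

        LocalCrawlerService(Fetcher fetcher, Parser parser, Extractor extractor) {
            this.fetcher = fetcher;
            this.parser = parser;
            this.extractor = extractor;
        }

        @Override
        public List<String> crawl(String url) {
            return extractor.extract(parser.parse(fetcher.fetch(url)));
        }
    }

    /** Builds a service whose fetcher serves pages from a map (no network access). */
    static CrawlerService inMemoryService(Map<String, String> pages) {
        Fetcher fetcher = url -> pages.getOrDefault(url, "");
        Pattern href = Pattern.compile("href=\"([^\"]*)\"");
        Parser parser = raw -> {
            List<String> hrefs = new ArrayList<>();
            Matcher m = href.matcher(raw);
            while (m.find()) {
                hrefs.add(m.group(1));
            }
            return hrefs;
        };
        // Assumed extraction rule: keep only absolute http/https links.
        Extractor extractor = hrefs -> {
            List<String> links = new ArrayList<>();
            for (String h : hrefs) {
                if (h.startsWith("http://") || h.startsWith("https://")) {
                    links.add(h);
                }
            }
            return links;
        };
        return new LocalCrawlerService(fetcher, parser, extractor);
    }

    public static void main(String[] args) {
        Map<String, String> pages = new HashMap<>();
        pages.put("http://example.com/", "<a href=\"http://example.com/a\">a</a>");
        System.out.println(inMemoryService(pages).crawl("http://example.com/"));
    }
}
```

A remote variant would implement the same CrawlerService interface and forward crawl to the hosted service. The wire format for that call is not described on this page and is left out here.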

Other Concepts

  • Link
  • Filter
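
The page names Link and Filter without defining them. As a rough illustration only, the sketch below treats a link as a pair of source and target URIs and a filter as a predicate that decides whether a link should be followed. The names LinkFilterSketch, Link, Filter.accept, sameHost, and apply are hypothetical, and the same-host rule is just one example of a filter.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: Link and Filter are modeled here by assumption only.
class LinkFilterSketch {

    /** A discovered link: the page it was found on and the page it points to. */
    static final class Link {
        final URI from;
        final URI to;

        Link(String from, String to) {
            // URI.create throws IllegalArgumentException on malformed input.
            this.from = URI.create(from);
            this.to = URI.create(to);
        }
    }

    /** Decides whether a link should be followed. */
    interface Filter {
        boolean accept(Link link);
    }

    /** Example filter: accept only links that stay on the host they were found on. */
    static Filter sameHost() {
        return link -> link.to.getHost() != null
                && link.to.getHost().equalsIgnoreCase(link.from.getHost());
    }

    /** Returns the links accepted by the filter, preserving their order. */
    static List<Link> apply(List<Link> links, Filter filter) {
        List<Link> accepted = new ArrayList<>();
        for (Link link : links) {
            if (filter.accept(link)) {
                accepted.add(link);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        List<Link> links = Arrays.asList(
                new Link("http://example.com/", "http://example.com/a"),
                new Link("http://example.com/", "http://other.example.org/b"));
        for (Link link : apply(links, sameHost())) {
            System.out.println(link.to);
        }
    }
}
```

Keeping the filter as a single-method interface means several rules (host, depth, file type) could be combined by chaining predicates, though whether droids-crawler does so is not stated here.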