...
- CrawlBase
- Url → CrawlState
- CrawlState
- Current state fields
- CrawlHistory is a list of CrawlDatum objects ordered by reverse date
- CrawlDatum has Metadata
- CrawlList
- Url → CrawlHistory
- Separate from CrawlBase to allow multiple concurrent crawls
- FetchedContent
- Url → BytesWritable, FetchStatus
- FetchStatus would be the status of the fetch: error codes and any other fetch information. Another tool would then translate this back into the CrawlBase. FetchStatus has Metadata.
- ParsedContent
- Url → MapWritable
- MapWritable would contain Text → Writable or Writable[] and would allow the parsing of all different types of elements within the content (href, headers, etc.)
- Processing
- Processing would take the ParsedContent and translate that into multiple specific data parts. These data parts aren't used by any part of the system except Scoring.
- Processing would consist of specific functions, including updating the CrawlBase, performing analysis on ParsedContent, and integrating data from other sources.
- Some processors would translate content into formats needed by scorers.
- Processors are not constrained to specific data structures, which allows flexibility in analysis, updating, blocking or removal, and other forms of data processing. The only requirement is that scoring programs must be able to access the processing output data structures in a one-to-one relationship.
- Scoring
- Url → Field
- Url → Float
- Field is a name, value(s), and score, being Text, Text, and Float respectively.
- These Fields become the fields that are indexed, with their scores becoming field boosts.
- Scoring takes the specific data parts from analysis and outputs the above formats.
- Field needs Lucene semantics.
- Indexing
- Indexing indexes Fields for a document according to the field values and boosts. Indexing does not determine either field values or boost values.
- Indexing aggregates document boosts to create a final document score.
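
The Scoring and Indexing outline above could be sketched roughly as follows. This is only an illustration in plain Java, not Nutch or Hadoop code: the class names `Field`, `IndexedDocument` and `IndexingSketch` are hypothetical, `String` and `float` stand in for Text and Float, and combining document boosts by multiplication is an assumption, since the outline does not say how the boosts are aggregated.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the Field described above:
// a name, one or more values, and a score that becomes the field boost.
final class Field {
    final String name;
    final List<String> values;
    final float score;

    Field(String name, List<String> values, float score) {
        this.name = name;
        this.values = Collections.unmodifiableList(new ArrayList<>(values));
        this.score = score;
    }
}

// Hypothetical result of indexing one Url: its Fields plus a final document score.
final class IndexedDocument {
    final String url;
    final List<Field> fields;
    final float boost;

    IndexedDocument(String url, List<Field> fields, float boost) {
        this.url = url;
        this.fields = Collections.unmodifiableList(new ArrayList<>(fields));
        this.boost = boost;
    }
}

final class IndexingSketch {
    // Assumption: document boosts (the Url -> Float outputs of Scoring)
    // are aggregated by multiplying them together.
    static float aggregateBoost(List<Float> documentBoosts) {
        float result = 1.0f;
        for (float b : documentBoosts) {
            result *= b;
        }
        return result;
    }

    // Indexing passes field values and field scores through unchanged;
    // it only aggregates the document-level boosts.
    static IndexedDocument buildDocument(String url, List<Field> fields,
                                         List<Float> documentBoosts) {
        return new IndexedDocument(url, fields, aggregateBoost(documentBoosts));
    }
}
```

Under these assumptions, calling `IndexingSketch.buildDocument` with a single `title` Field of score 2.0 and document boosts 2.0, 0.5 and 3.0 would give a document whose boost is 3.0, while the Field keeps its score of 2.0 as its field boost.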
...