Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Migrated to Confluence 4.0

...

If this became standard and annotations annotators could depend on it, then better extraction quality would result. For example, HTML is usually converted to plain text in which boundaries between table cells are lost. If instead the table structure was represented using ElementAnnotations, then an annotator might decide that the boundaries of ElementAnnotations named "TD" (i.e. HTML cells) are actually paragraph terminators.

...

The document parser would create these FS's from the document content and/or the document container (file system, content management system, HTTP headers, HTML META elements, etc.).

This would replace the current class SourceDocumentInformation.