Overview of the 'tika-eval-app' Module

This page offers a first draft of the documentation for the tika-eval-app module, which was initially added in Tika 1.15-SNAPSHOT.

The module is intended to offer insight into the output of a single extraction tool or to enable comparisons between tools. It was designed to help with Tika development, but it could be used to evaluate other tools as well.

As part of Tika's periodic regression testing, we run this module against ~3 million files (for committers/PMC members interested in running the regression testing on our Rackspace regression vm, see TikaEvalOnVM). However, it will not scale to 100s of millions of files as it is currently designed. Patches are welcome!

...

There are many tools for extracting text from various file formats, and even within a single tool there are usually countless parameters that can be tweaked. The goal of 'tika-eval' is to allow developers to quickly compare the output of:

  1. Two different tools
  2. Two versions of the same tool ("Should we upgrade? Or are there problems with the newer version?")
  3. Two runs with the same tool but with different settings ("Does increasing the DPI for OCR improve extraction? Let's try two runs, one with 200 DPI and one with 300")
  4. Different tools against a gold standard

In addition to this "comparison mode", there is also plenty of information one can get from looking at a profile of a single run.

...

  • Exceptions – how many and of what category? Are these regular catchable exceptions, evil OOMs or permahangs?
  • Metadata – how many metadata values did we extract?
  • Embedded files/attachments – how many embedded files were found?
  • Mime detection – how many of what types of files do we have? Where do we see discrepancies between tools?
  • Content – is the content extracted by tool A better than that extracted by tool B? On which files is there a big difference in extracted content?

The tika-eval module was initially developed for text only. For those interested in evaluating structure/style components (e.g. <title/> or <b/> elements), see TikaEvalAndStructuralComponents.

...

  1. Create a directory of extract files that mirrors your input directory (see the example layout after this list). These files may be UTF-8 text files with '.txt' appended to the original file's name, or they may be the RecursiveParserWrapper's '.json' representation:
    java -jar tika-app-X.Y.jar -J -t -i input_docs -o extracts
  2. Profile the directory of extracts and create a local H2 database:
    java -jar tika-eval-X.Y.jar Profile -extracts extracts -db profiledb
  3. Write reports from the database:
    java -jar tika-eval-X.Y.jar Report -db profiledb
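
For example, if your input directory looked like the purely illustrative layout below, the extracts directory should mirror it, with '.json' (or '.txt') appended to each file name:

    input_docs/reports/annual_report.pdf    ->  extracts/reports/annual_report.pdf.json
    input_docs/slides/overview.pptx         ->  extracts/slides/overview.pptx.json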

You'll have a directory of .xlsx reports under the "reports" directory. Note: if you don't need the full tika-eval-app, you can get many of these statistics at parse time via the TikaEvalMetadataFilter (see: ModifyingContentWithHandlersAndMetadataFilters).

Comparing Output from Two Tools/Settings (Compare)

...

  1. Create two directories of extract files that mirror your input directory. These files may be UTF-8 text files with '.txt' appended to the original file's name or they may be the RecursiveParserWrapper's '.json' representation.
  2. Compare the extract directory A with extract directory B and write results to a local H2 database:
    java -jar tika-eval-X.Y.jar Compare -extractsA extractsA -extractsB extractsB -db comparisondb
  3. Write reports from the database:
    java -jar tika-eval-X.Y.jar Report -db comparisondb

You'll have a directory of .xlsx reports under the "reports" directory.
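
The same workflow can be used for use case 4 above (evaluating a tool against a gold standard). One approach, assuming your gold standard can be stored as UTF-8 '.txt' extracts that mirror the input directory, is to treat the gold-standard directory as one side of the comparison (the directory names below are only illustrative):

    java -jar tika-eval-X.Y.jar Compare -extractsA tool_output -extractsB gold_standard -db comparisondb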

...

  1. Launch the H2 localhost server:
    java -jar tika-eval-X.Y.jar StartDB – this calls java -cp ... org.h2.tools.Console -web
  2. Navigate a browser to http://localhost:8082 and enter the jdbc connector code followed by the full path to your db file:
    jdbc:h2:/C:/users/someone/mystuff/tika-eval/comparisondb
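
If you prefer a terminal to the web console, H2's Shell tool can be pointed at the same file. This is only a sketch: use whatever classpath StartDB uses, and note that the db path below is illustrative. You can then explore the schema interactively (e.g. with SHOW TABLES):

    java -cp <classpath used by StartDB> org.h2.tools.Shell -url jdbc:h2:/C:/users/someone/mystuff/tika-eval/comparisondb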

If your reaction is: "You call this a database?!", please open tickets and contribute to improving the structure.

...

Make sure that your common words have gone through the same analysis chain as specified by the Common Words analyzer in 'lucene-analyzers.json'!
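
To make concrete what "gone through the same analysis chain" means, here is a minimal Lucene sketch that tokenizes a candidate common-words entry. StandardAnalyzer is used purely for illustration; the chain tika-eval actually applies is whatever is defined in 'lucene-analyzers.json', which may differ.

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class AnalyzeCandidateCommonWords {
        public static void main(String[] args) throws IOException {
            // Illustrative analyzer only -- substitute the chain from 'lucene-analyzers.json'
            try (Analyzer analyzer = new StandardAnalyzer();
                 TokenStream ts = analyzer.tokenStream("field", "The QUICK Brown Foxes")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    // Print each token as the analyzer would index it (e.g. lowercased)
                    System.out.println(term.toString());
                }
                ts.end();
            }
        }
    }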

Reading Extracts

alterExtract

...

  1. If the other tool extracts embedded content, you'd want to concatenate all the content within Tika's .json file for a fair comparison:
    java -jar tika-eval-X.Y.jar Compare -extractsA tika_1_14 -extractsB tika_1_15 -db comparisondb -alterExtract concatenate_content
  2. If the other tool does not extract embedded content, you'd only want to look at the first metadata object (representing the container file) in the .json file (see the sketch of the .json layout after this list):
    java -jar tika-eval-X.Y.jar Compare -extractsA tika_1_14 -extractsB tika_1_15 -db comparisondb -alterExtract first_only
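
For reference, the '.json' extract produced with -J is a list of metadata objects: the first object describes the container file, and each subsequent object describes an embedded file. The sketch below is schematic only (the keys shown are abbreviated and not exhaustive); concatenate_content joins the text of all objects, while first_only keeps just the first.

    [
      { "Content-Type": "application/pdf", "X-TIKA:content": "text of the container file ..." },
      { "Content-Type": "image/png",       "X-TIKA:content": "text of an embedded image ..." }
    ]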

Min/Max Extract Size

You may find that some extracts are too big to fit in memory, in which case use -maxExtractSize <maxBytes>, or you may want to focus only on extracts that are greater than a minimum length: -minExtractSize <minBytes>.
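
For example, to profile a directory while skipping any extract larger than roughly 10 MB (the threshold is only illustrative):

    java -jar tika-eval-X.Y.jar Profile -extracts extracts -db profiledb -maxExtractSize 10000000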

...