How to Run tika-eval

...

-app on the VM

While users can run tika-eval-app on their own machines with their own documents, the Apache Tika, Apache PDFBox and Apache POI communities have gathered >1TB of documents from govdocs1 and from Common Crawl to serve as a regression testing corpus. Before a release, we run both the last release and the candidate release against this corpus and compare the output to identify potential regressions.

This page is intended for committers/PMC members with access to the VM who want to run the regression tests. The example focuses on testing a SNAPSHOT version of PDFBox, but the steps are nearly identical for the full Tika eval or for subprojects. See TikaEval for more information on the tika-eval-app module by itself. See this blog for a description of running this project on Tika's Rackspace VM.

The driver appBatchExecutor.sh, the various configuration files and the file lists for PDFs are all available here: batch-scripts.tgz.

...

  1. Clean up from any previous runs
    1. Remove tika-app-X.Y.jar from /data1/tools/tika/batch/bin, but make sure to leave the other "optional" jars in place: jai-imageio-jpeg2000-1.4.0.jar, sqlite-jdbc-3.43.2.1.jar and zstd-jni-1.5.5-6.jar
    2. Remove or rename /data1/tools/tika/batch/logs
    3. Remove or rename /data1/tools/tika/batch/nohup.out
  2. Run the current "A" version
    1. Place the "A" version of tika-app-X.Y.jar in /data1/tools/tika/batch/bin
    2. Modify appBatchExecutor.sh to
      1. put the output in a new output directory -o /data1/extracts/pdfboxA
      2. if using a file list, confirm that the correct file list is specified -fileList fileLists/ccAndBugTracker_pdfs.txt
    3. Execute: nohup ./appBatchExecutor.sh &
    4. Wait for the "A" version to complete before starting the "B" version
  3. Build and run the "B" version
    1. Update PDFBox from SVN, then run mvn clean install
    2. Update the PDFBox, Fontbox and jbig2-imageio versions in the Tika project tika-parsers/pom.xml
    3. Run mvn clean on the whole Tika project and make sure that your IDE has picked up the changes
    4. Run the PDFParser tests in tika-parsers/src/test/java/o.a.t.parser.pdf.* to make sure that at least the Tika unit tests pass.
    5. Build the entire Tika project (even though you'll only use tika-app.jar): mvn clean install
    6. On the VM, remove tika-app-A.jar from /data1/tools/tika/batch/bin, rename the existing nohup.out to nohup-A.out, and rename logs/ to logs-A/
    7. Drop the new tika-app-B.jar into (you guessed it!): /data1/tools/tika/batch/bin
    8. Modify appBatchExecutor.sh to
      1. put the output in a new output directory -o /data1/extracts/pdfboxB
      2. if using a file list, confirm that the correct file list is specified -fileList fileLists/ccAndBugTracker_pdfs.txt
    9. Execute: nohup ./appBatchExecutor.sh &
    10. Wait for the "B" version to complete before starting the comparisons and reports
  4. Make the comparisons and report
    1. In /data1/tools/tika/eval, remove or rename the existing db file pdfboxAvsB.mv.db.
    2. Execute: nohup java -jar tika-eval-app-X.Y.jar Compare -extractsA /data1/extracts/pdfboxA -extractsB /data1/extracts/pdfboxB -db pdfboxAvsB &
    3. When that completes,
      1. Remove any files left over from the last run in reports/: rm -r reports
      2. Write the reports: java -Djava.io.tmpdir=tmp -jar tika-eval-app-X.Y.jar Report -db pdfboxAvsB
         Note the -Djava.io.tmpdir=tmp: the tmp directory needs to be set to something writable by 'collab'.
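
The renaming between the "A" and "B" runs and the two tika-eval-app invocations in step 4 are easy to get wrong by hand, so a small helper may be useful. The following is only a sketch in Python, not part of batch-scripts.tgz: the function names (rotate_run_artifacts, compare_command, report_command) are hypothetical, and the command-line arguments are copied from the steps above rather than taken from the tika-eval-app documentation. The jar name tika-eval-app-X.Y.jar remains a placeholder.

```python
from pathlib import Path
from typing import List


def rotate_run_artifacts(batch_dir: Path, label: str) -> List[Path]:
    """Rename nohup.out and logs/ to nohup-<label>.out and logs-<label>/.

    Hypothetical helper mirroring the manual renaming step; returns the
    new paths of whatever was actually renamed.
    """
    renamed = []
    pairs = (("nohup.out", f"nohup-{label}.out"), ("logs", f"logs-{label}"))
    for old_name, new_name in pairs:
        old_path = batch_dir / old_name
        new_path = batch_dir / new_name
        if not old_path.exists():
            continue  # nothing left over from a previous run
        if new_path.exists():
            # Refuse to overwrite artifacts from an earlier run.
            raise FileExistsError(f"{new_path} already exists")
        old_path.rename(new_path)
        renamed.append(new_path)
    return renamed


def compare_command(eval_jar: str, extracts_a: str, extracts_b: str, db: str) -> List[str]:
    """Build the argv for the Compare step, as written on this page."""
    return ["java", "-jar", eval_jar, "Compare",
            "-extractsA", extracts_a, "-extractsB", extracts_b, "-db", db]


def report_command(eval_jar: str, db: str, tmpdir: str = "tmp") -> List[str]:
    """Build the argv for the Report step, with a writable tmp directory."""
    return ["java", f"-Djava.io.tmpdir={tmpdir}", "-jar", eval_jar, "Report", "-db", db]
```

For example, rotate_run_artifacts(Path("/data1/tools/tika/batch"), "A") would perform the renaming described for the end of the "A" run, and the lists returned by compare_command and report_command can be passed to subprocess.run from /data1/tools/tika/eval. Building the commands as lists avoids shell quoting problems, but it does not replace nohup; long runs would still need to be detached from the terminal by some other means.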

...