How to Run tika-eval-app on the VM

While users can run tika-eval-app on their own machines with their own documents, the Apache Tika, Apache PDFBox and Apache POI communities have gathered >1TB of documents from govdocs1 and from Common Crawl to serve as a regression testing corpus. Before a release, we'll run the last release against the candidate release to identify potential regressions.

This page is intended for committers/PMC members with access to the VM who want to run the regression tests. The example focuses on testing a SNAPSHOT version of PDFBox, but the steps are nearly identical for the full Tika eval or for subprojects. See TikaEval for more information on the tika-eval-app module by itself. See this blog for a description of running this project on Tika's Rackspace VM.

The driver appBatchExecutor.sh, the various configuration files and the file lists for PDFs are all available here: batch-scripts.tgz.

If you haven't already done so in your .bashrc file, make sure to run umask g+rw before running anything.
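As a rough illustration of what the umask setting does, the snippet below applies it and then creates a throwaway file to confirm that the group-write bit is set. The probe directory and file name are made up for this example.

```shell
#!/usr/bin/env bash
# Candidate .bashrc line: allow group members to read and write files you create.
umask g+rw

# Quick check in a scratch directory (hypothetical probe file, safe to delete).
probe_dir="$(mktemp -d)"
touch "$probe_dir/probe"
ls -l "$probe_dir/probe"   # the group permission column should include rw
```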

The main working directory is: /data1/tools/tika/batch

...

An Example with Apache PDFBox

Clean up from any previous runs
1. Remove tika-app-X.Y.jar from /data1/tools/tika/batch/bin, but make sure to leave in the other "optional" jars: jai-imageio-jpeg2000-1.4.0.jar, sqlite-jdbc-3.43.2.1.jar and zstd-jni-1.5.5-6.jar
2. Remove or rename /data1/tools/tika/batch/logs
3. Remove or rename /data1/tools/tika/batch/nohup.out
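The cleanup steps above could be scripted along the following lines. This is only a sketch: the function name and the timestamped rename scheme are assumptions, and it renames the old logs and nohup.out rather than deleting them. The default directory is the main working directory named on this page.

```shell
#!/usr/bin/env bash
# Hypothetical helper: clear out a previous run while keeping the optional jars.
cleanup_previous_run() {
  local batch_dir="${1:-/data1/tools/tika/batch}"
  local stamp
  stamp="$(date +%Y%m%d-%H%M%S)"

  # Remove only tika-app jars; jai-imageio, sqlite-jdbc and zstd-jni jars are left alone.
  rm -f "$batch_dir"/bin/tika-app-*.jar

  # Rename (rather than delete) the old logs directory and nohup.out.
  if [ -d "$batch_dir/logs" ]; then
    mv "$batch_dir/logs" "$batch_dir/logs-$stamp"
  fi
  if [ -f "$batch_dir/nohup.out" ]; then
    mv "$batch_dir/nohup.out" "$batch_dir/nohup-$stamp.out"
  fi
}
```

Calling cleanup_previous_run with no argument targets the working directory above; pass a different directory to try it out on a copy first.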
Run the current "A" version
1. Place the "A" version of tika-app-X.Y.jar in /data1/tools/tika/batch/bin
2. Modify appBatchExecutor.sh to
  1. put the output in a new output directory -o /data1/extracts/pdfboxA
  2. if using a file list, confirm that the correct file list is specified -fileList fileLists/ccAndBugTracker_pdfs.txt
3. Execute: nohup ./appBatchExecutor.sh &
4. Wait for the "A" version to complete before starting the "B" version
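One possible way to tell when the "A" run has finished, without watching nohup.out, is to have the launch command drop a marker file when the driver exits and then poll for it. The helper name and the marker file A.done are hypothetical; appBatchExecutor.sh itself is not modified.

```shell
#!/usr/bin/env bash
# Hypothetical helper: block until a marker file appears, polling every $2 seconds (default 60).
wait_for_marker() {
  local marker="$1" interval="${2:-60}"
  until [ -e "$marker" ]; do
    sleep "$interval"
  done
}

# Possible usage on the VM (A.done is an assumed marker name):
#   nohup sh -c './appBatchExecutor.sh; touch A.done' &
#   wait_for_marker A.done && echo "A run finished"
```

The marker is written whether or not the driver succeeded, so the logs still need to be checked before starting "B".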
Build and run the "B" version
1. Update PDFBox from SVN and run mvn clean install
2. Update the PDFBox, FontBox and jbig2-imageio versions in the Tika project tika-parsers/pom.xml
3. Run mvn clean on the whole Tika project and make sure that your IDE has picked up the changes
4. Run the PDFParser tests in tika-parsers/src/test/java/o.a.t.parsers.pdf.* to make sure that at least the Tika unit tests work.
5. Build the entire Tika project (even though you'll only use tika-app.jar): mvn clean install
6. On the VM, remove tika-app-A.jar from /data1/tools/tika/batch/bin, rename the existing nohup.out to nohup-A.out, and rename logs/ to logs-A/
7. Drop the new tika-app-B.jar into (you guessed it!): /data1/tools/tika/batch/bin
8. Modify appBatchExecutor.sh to
  1. put the output in a new output directory -o /data1/extracts/pdfboxB
  2. if using a file list, confirm that the correct file list is specified -fileList fileLists/ccAndBugTracker_pdfs.txt
9. Execute: nohup ./appBatchExecutor.sh &
10. Wait for the "B" version to complete before starting the comparisons and reports
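Before launching the "B" run, a small pre-flight check can catch the most likely mistakes in the steps above: a leftover "A" jar in bin, an output directory that already holds "A" extracts, or a missing file list. The sketch below is an assumption about what such a check might look like; the function name and argument order are illustrative, and it does not read appBatchExecutor.sh.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: preflight <bin_dir> <output_dir> <file_list>
preflight() {
  local bin_dir="$1" out_dir="$2" file_list="$3"
  local jars=0 jar

  # Exactly one tika-app jar should be in bin (the optional jars are ignored).
  for jar in "$bin_dir"/tika-app-*.jar; do
    if [ -e "$jar" ]; then
      jars=$((jars + 1))
    fi
  done
  if [ "$jars" -ne 1 ]; then
    echo "expected exactly one tika-app jar in $bin_dir, found $jars" >&2
    return 1
  fi

  # The output directory should be new, so "A" and "B" extracts never mix.
  if [ -e "$out_dir" ]; then
    echo "output directory already exists: $out_dir" >&2
    return 1
  fi

  # The file list must exist and be non-empty.
  if [ ! -s "$file_list" ]; then
    echo "file list missing or empty: $file_list" >&2
    return 1
  fi

  echo "preflight ok"
}
```

A possible invocation is preflight /data1/tools/tika/batch/bin "$OUT_DIR" "$FILE_LIST" && nohup ./appBatchExecutor.sh &, where OUT_DIR and FILE_LIST are placeholders for the values set in appBatchExecutor.sh.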
Make the comparisons and report
1. In /data1/tools/tika/eval, remove the existing db file pdfboxAvsB.mv.db if you don't want to rename it.
2. nohup java -jar tika-eval-app-X.Y.jar Compare -extractsA /data1/extracts/pdfboxA -extractsB /data1/extracts/pdfboxB -db pdfboxAvsB &
3. When that completes,
  1. Remove any files left over from the last run in reports/: rm -r reports
  2. Write the reports: java -Djava.io.tmpdir=tmp -jar tika-eval-app-X.Y.jar Report -db pdfboxAvsB. Note the -Djava.io.tmpdir=tmp: the tmp directory needs to be set to something writeable by 'collab'.
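The comparison and report steps can be chained so that the reports are only written if Compare succeeds. The following is a minimal sketch, assuming it is run from the eval directory. The function name, the timestamped rename of the old database, and the mkdir of a local tmp directory are assumptions; the Compare and Report arguments are the ones shown in the steps above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: compare_and_report <tika-eval-app jar> <extractsA> <extractsB> <db name>
compare_and_report() {
  local eval_jar="$1" extracts_a="$2" extracts_b="$3" db="$4"

  # Keep the previous H2 database under a timestamped name instead of deleting it.
  if [ -f "$db.mv.db" ]; then
    mv "$db.mv.db" "$db-$(date +%Y%m%d-%H%M%S).mv.db"
  fi

  # Step 2: build the comparison database; stop here if it fails.
  java -jar "$eval_jar" Compare \
    -extractsA "$extracts_a" -extractsB "$extracts_b" -db "$db" || return 1

  # Step 3: clear stale reports and make sure the local tmp directory exists
  # (assumption: it may not exist yet), then write the reports.
  rm -rf reports
  mkdir -p tmp
  java -Djava.io.tmpdir=tmp -jar "$eval_jar" Report -db "$db"
}
```

Because the function runs in the foreground, on the VM it would still need to be started under nohup, or from a terminal multiplexer, to survive a dropped session.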

When this process completes, you'll have all of the reports written to /data1/tools/tika/eval/reports/.

H2 to Postgresql and Reports

...

I had to modify the report SQL slightly to work with Postgresql, and I stripped out some of the reports/calculations that aren't critical to the full regression tests. The modified report SQL is available here: comparison-reports-pg.xml
