How to Run tika-eval-app on the VM
While users can run tika-eval-app on their own machines with their own documents, the Apache Tika, Apache PDFBox and Apache POI communities have gathered >1TB of documents from govdocs1 and from Common Crawl to serve as a regression testing corpus. Before a release, we'll run the last release against the candidate release to identify potential regressions.
This page is intended for committers/PMC members with access to the VM who want to run the regression tests. The example focuses on testing a SNAPSHOT version of PDFBox, but the steps are nearly identical for the full Tika eval or for subprojects. See TikaEval for more information on the tika-eval-app module by itself. See this blog for a description of running this project on Tika's Rackspace VM.
The driver appBatchExecutor.sh, the various configuration files and the file lists for PDFs are all available here: batch-scripts.tgz.
If you haven't done so in your .bashrc file, make sure to run umask g+rw before running anything.
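As a quick check of the umask step, the sketch below sets the mask in symbolic form and prints it back. The exact output depends on the mask the shell started with; the point is only that the group field should include r and w afterwards.

```shell
# Allow group read and write on files created from this shell onward.
umask g+rw
# Print the resulting mask in symbolic form; the g= field should include r and w.
umask -S
```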
The main working directory is /data1/tools/tika/batch.
An Example with Apache PDFBox
- Clean up from any previous runs
  - Remove tika-app-X.Y.jar from /data1/tools/tika/batch/bin, but make sure to leave in the other "optional" jars: jai-imageio-jpeg2000-1.4.0.jar, sqlite-jdbc-3.43.2.1.jar and zstd-jni-1.5.5-6.jar
  - Remove or rename /data1/tools/tika/batch/logs
  - Remove or rename /data1/tools/tika/batch/nohup.out
- Run the current "A" version
  - Place the "A" version of tika-app-X.Y.jar in /data1/tools/tika/batch/bin
  - Modify appBatchExecutor.sh to
    - put the output in a new output directory: -o /data1/extracts/pdfboxA
    - if using a file list, confirm that the correct file list is specified: -fileList fileLists/ccAndBugTracker_pdfs.txt
  - Execute: nohup ./appBatchExecutor.sh &
  - Wait for the "A" version to complete before starting the "B" version
- Build and run the "B" version
  - Update PDFBox from SVN, then run: mvn clean install
  - Update the PDFBox, Fontbox and jbig2-imageio versions in the Tika project's tika-parsers/pom.xml
  - Run mvn clean on the whole Tika project and make sure that your IDE has picked up the changes
  - Run the PDFParser tests in tika-parsers/src/test/java/o.a.t.parsers.pdf.* to make sure that at least the Tika unit tests work
  - Build the entire Tika project (even though you'll only use tika-app.jar): mvn clean install
  - On the VM, remove tika-app-A.jar from /data1/tools/tika/batch/bin, rename the existing nohup.out to nohup-A.out, and rename logs/ to logs-A/
  - Drop the new tika-app-B.jar into (you guessed it!): /data1/tools/tika/batch/bin
  - Modify appBatchExecutor.sh to
    - put the output in a new output directory: -o /data1/extracts/pdfboxB
    - if using a file list, confirm that the correct file list is specified: -fileList fileLists/ccAndBugTracker_pdfs.txt
  - Execute: nohup ./appBatchExecutor.sh &
  - Wait for the "B" version to complete before starting the comparisons and reports
- Make the comparisons and report
  - In /data1/tools/tika/eval, remove the existing db file pdfboxAvsB.mv.db if you don't want to rename it, then run: nohup java -jar tika-eval-app-X.Y.jar Compare -extractsA /data1/extracts/pdfboxA -extractsB /data1/extracts/pdfboxB -db pdfboxAvsB &
  - When that completes,
    - Remove any files left over from the last run in reports/: rm -r reports
    - Write the reports: java -Djava.io.tmpdir=tmp -jar tika-eval-app-X.Y.jar Report -db pdfboxAvsB
    - Note the -Djava.io.tmpdir=tmp: the tmp directory needs to be set to something writeable by 'collab'
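The comparison and report steps can be chained so that the reports are only written if Compare exits successfully. The following is a minimal sketch, not the project's own tooling: the function name compare_and_report and the variables EVAL_DIR, EVAL_JAR and JAVA are hypothetical, the jar name keeps the X.Y placeholder from this page, and creating tmp/ with mkdir is an assumption about how the writeable temp directory is provided. Unlike the commands above, it runs Compare in the foreground.

```shell
# Hypothetical settings; override them in the environment as needed.
EVAL_DIR="${EVAL_DIR:-/data1/tools/tika/eval}"
EVAL_JAR="${EVAL_JAR:-tika-eval-app-X.Y.jar}"   # X.Y is a placeholder version
JAVA="${JAVA:-java}"

# Usage: compare_and_report <extractsA dir> <extractsB dir> <db name>
# The body runs in a subshell so the caller's working directory is untouched.
compare_and_report() (
  extracts_a="$1"
  extracts_b="$2"
  db="$3"

  cd "$EVAL_DIR" || exit 1

  # Remove the db file left over from a previous comparison.
  rm -f "$db.mv.db"

  # Compare the two sets of extracts; stop here if the comparison fails.
  "$JAVA" -jar "$EVAL_JAR" Compare \
    -extractsA "$extracts_a" -extractsB "$extracts_b" -db "$db" || exit 1

  # Clear reports from the last run and make sure tmp/ exists (assumption).
  rm -rf reports
  mkdir -p tmp

  # Write the reports, pointing java.io.tmpdir at the local tmp/ directory.
  "$JAVA" -Djava.io.tmpdir=tmp -jar "$EVAL_JAR" Report -db "$db"
)
```

A call for the PDFBox example would look like compare_and_report /data1/extracts/pdfboxA /data1/extracts/pdfboxB pdfboxAvsB. For a long run, the function would need to live in a script file so that the script itself can be started with nohup.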
When this process completes, you'll have all of the reports written to /data1/tools/tika/eval/reports/.
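The cleanup between the "A" and "B" runs (removing the old tika-app jar, renaming logs and nohup.out) is easy to get wrong by hand, so it may help to wrap it in small helpers. This is a sketch under assumptions: the function names and the BATCH_DIR variable are hypothetical, and it assumes the jars live in bin/ directly under the working directory named on this page.

```shell
# Hypothetical setting; defaults to the main working directory from this page.
BATCH_DIR="${BATCH_DIR:-/data1/tools/tika/batch}"

# Remove any tika-app jar from bin/; the "optional" jars do not match
# the pattern and are left in place.
remove_tika_app_jars() {
  rm -f "$BATCH_DIR"/bin/tika-app-*.jar
}

# Rename logs/ and nohup.out with a suffix such as "A" so the next run
# starts clean. Refuses to run if logs-<suffix> already exists.
rotate_run_artifacts() {
  suffix="$1"
  if [ -e "$BATCH_DIR/logs-$suffix" ]; then
    echo "logs-$suffix already exists in $BATCH_DIR" >&2
    return 1
  fi
  if [ -d "$BATCH_DIR/logs" ]; then
    mv "$BATCH_DIR/logs" "$BATCH_DIR/logs-$suffix"
  fi
  if [ -f "$BATCH_DIR/nohup.out" ]; then
    mv "$BATCH_DIR/nohup.out" "$BATCH_DIR/nohup-$suffix.out"
  fi
}
```

After the "A" run finishes, calling remove_tika_app_jars followed by rotate_run_artifacts A would leave bin/ ready for the "B" jar and keep the "A" logs for later inspection.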
H2 to PostgreSQL and Reports
...
I had to modify the report SQL slightly to work with PostgreSQL, and I stripped out some of the reports/calculations that aren't critical to the full regression tests. The modified report SQL is available in comparison-reports-pg.xml.