1) Hands-on tika-eval module workshop, Part 1
November 9, 2021, Tuesday 11am EST/4pm UTC
The dial-in information is available to those who register via Meetup.
This workshop is designed for hands-on tech folks who can run Tika from the command line or can curl to a local tika-server.
Stay tuned for prerequisites, resources and an agenda!
The following is all a work in progress. Please check back right before the workshop!
Prerequisites:
- java >= 8
- tika-eval app and tika-app jars: https://dlcdn.apache.org/tika/2.1.0/tika-eval-app-2.1.0.jar and https://dlcdn.apache.org/tika/2.1.0/tika-app-2.1.0.jar
- JSON editor/viewer (jq should be sufficient. I like Sublime with the PrettyJSON plugin https://github.com/dzhibas/SublimePrettyJson)
- XLSX viewer (Excel or Open/LibreOffice)
Optional materials:
- tika-server-standard jar: https://dlcdn.apache.org/tika/2.1.0/tika-server-standard-2.1.0.jar
- tika-eval-core.jar: https://repo1.maven.org/maven2/org/apache/tika/tika-eval-core/2.1.0/tika-eval-core-2.1.0.jar
- If you'd like to experiment with tesseract, make sure that tesseract is installed and callable as 'tesseract' from your command line.
- Some knowledge of SQL
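A quick way to confirm that the command-line tools listed above are callable is sketched below. The helper name have_cmd is made up for illustration, and the check only tells you whether a tool is on your PATH, not whether its version is suitable.

```shell
#!/bin/sh
# Sketch: report which workshop tools are callable from the command line.
# have_cmd is a hypothetical helper name, not part of Tika.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# java is required; jq is one option for the JSON viewer; tesseract is optional.
for tool in java jq tesseract; do
  if have_cmd "$tool"; then
    echo "found:   $tool"
  else
    echo "missing: $tool"
  fi
done

# Print the first line of the Java version banner so you can confirm it is >= 8.
if have_cmd java; then
  java -version 2>&1 | head -n 1
fi
```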
Example docs, extracts and config files: tika-eval-workshop-20211109.tgz
Before the class, you should unpack tika-eval-workshop-20211109.tgz (tar -xzvf tika-eval-workshop-20211109.tgz), move the tika-app-2.1.0.jar into the tika-eval-workshop-20211109/ folder and run tika-app on the docs directory:
java -jar tika-app-2.1.0.jar -J -t -i docs -o extracts/my_extracts
Note: There's a bug in the default logging configuration for tika-app in batch mode (e.g. "No configuration found for '4b85612c' at 'null' in 'null'..."). This is fixed in the latest tika-app, and the fix will be available in the next release, 2.1.1.
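The pre-class steps above can be collected into one small function. This is only a sketch: prep_workshop is a hypothetical name, and it assumes that both downloads sit in the current directory and that the tarball unpacks into tika-eval-workshop-20211109/.

```shell
#!/bin/sh
# Sketch of the pre-class steps as one function. prep_workshop is a
# hypothetical name; the commands inside are the ones listed above.
prep_workshop() {
  tarball=$1   # e.g. tika-eval-workshop-20211109.tgz
  app_jar=$2   # e.g. tika-app-2.1.0.jar

  # Stop early with a clear message if a download is missing.
  [ -f "$tarball" ] || { echo "missing $tarball" >&2; return 1; }
  [ -f "$app_jar" ] || { echo "missing $app_jar" >&2; return 1; }

  # Unpack the workshop files and move the jar next to the docs directory.
  tar -xzvf "$tarball" || return 1
  mv "$app_jar" tika-eval-workshop-20211109/ || return 1

  # Run tika-app in batch mode over docs/, writing to extracts/my_extracts.
  # The subshell keeps the caller's working directory unchanged.
  (
    cd tika-eval-workshop-20211109 || exit 1
    java -jar "$(basename "$app_jar")" -J -t -i docs -o extracts/my_extracts
  )
}

# Usage, from the directory holding both downloads:
#   prep_workshop tika-eval-workshop-20211109.tgz tika-app-2.1.0.jar
```

The function refuses to start if either file is missing, so a failed download shows up before tar or java are run.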
2) Hands-on tika-pipes module workshop
December 2, 2021, Thursday 12pm (NOON) EST/5pm UTC
The dial-in information is available to those who register via Meetup.
I'm currently working on this, and it should be ready by 11am EST/4pm UTC – an hour before the start
Prerequisites:
- java >= 8
- curl (or postman or something similar)
- create a working directory, e.g. tika-pipes-tutorial
- In tika-pipes-tutorial/app-bin/:
  - https://dlcdn.apache.org/tika/2.1.0/tika-app-2.1.0.jar
  - https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-fs/2.1.0/tika-emitter-fs-2.1.0.jar
  - https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-solr/2.1.0/tika-emitter-solr-2.1.0.jar OR https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-opensearch/2.1.0/tika-emitter-opensearch-2.1.0.jar
  - tika-core-2.1.1-SNAPSHOT-test-jar-with-dependencies.jar
- In tika-pipes-tutorial/server-bin/:
  - tika-server-standard jar: https://dlcdn.apache.org/tika/2.1.0/tika-server-standard-2.1.0.jar
  - https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-fs/2.1.0/tika-emitter-fs-2.1.0.jar
  - https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-solr/2.1.0/tika-emitter-solr-2.1.0.jar OR https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-opensearch/2.1.0/tika-emitter-opensearch-2.1.0.jar
  - tika-core-2.1.1-SNAPSHOT-test-jar-with-dependencies.jar
- Unzip configs.zip (to be supplied later today) here: tika-pipes-tutorial/configs
- Installation of Apache Solr (~8.9.x) and/or OpenSearch (~1.x) and/or Elasticsearch (7.x)
Advanced/Optional: jq or similar
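The directory layout from the prerequisites can be created in one go, as sketched below. TUTORIAL_DIR and fetch_into are hypothetical names introduced for illustration. The downloads are left commented out so that running the script only creates directories, and only some of the jars listed above are shown.

```shell
#!/bin/sh
# Sketch: create the tutorial directory layout described above.
# TUTORIAL_DIR is a hypothetical override; it defaults to tika-pipes-tutorial.
TUTORIAL_DIR=${TUTORIAL_DIR:-tika-pipes-tutorial}
mkdir -p "$TUTORIAL_DIR/app-bin" "$TUTORIAL_DIR/server-bin" "$TUTORIAL_DIR/configs"

# fetch_into DIR URL... downloads each URL into DIR (hypothetical helper).
# curl flags: -f fail on HTTP errors, -L follow redirects, -O keep the remote name.
fetch_into() {
  dir=$1
  shift
  for url in "$@"; do
    (cd "$dir" && curl -fLO "$url") || return 1
  done
}

# Uncomment to fetch jars from the lists above.
# fetch_into "$TUTORIAL_DIR/app-bin" \
#   https://dlcdn.apache.org/tika/2.1.0/tika-app-2.1.0.jar \
#   https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-fs/2.1.0/tika-emitter-fs-2.1.0.jar
# fetch_into "$TUTORIAL_DIR/server-bin" \
#   https://dlcdn.apache.org/tika/2.1.0/tika-server-standard-2.1.0.jar \
#   https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-fs/2.1.0/tika-emitter-fs-2.1.0.jar

ls "$TUTORIAL_DIR"
```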
Exercises
- Use fetcher in traditional /tika /rmeta endpoints
  - update the <basePath> element in configs/tika-config-basic.xml to the full path to tika-pipes-tutorial-20221202/docs
  - start the server:
    java -cp "server-bin/*" org.apache.tika.server.core.TikaServerCli -c configs/tika-config-basic.xml
  - send a request:
    curl -X PUT http://localhost:9998/rmeta -H "fetcherName:fsf" -H "fetchKey:testPDF.pdf" | jq --sort-keys
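Editing the <basePath> element by hand works fine. If you would rather script it, the sketch below shows one way. set_base_path is a hypothetical helper, not a Tika tool, and the one-line sample file is made up for the demo. It rewrites every <basePath> element in the file, so if your config has more than one (for example, one for a fetcher and another for an emitter), edit by hand instead. It also assumes the path contains no | or & characters.

```shell
#!/bin/sh
# Sketch: rewrite the text inside every <basePath> element of a config file.
# set_base_path is a hypothetical helper, not a Tika tool.
set_base_path() {
  config_file=$1
  docs_dir=$2
  # Resolve the docs directory to an absolute path.
  abs_docs=$(cd "$docs_dir" && pwd) || return 1
  # Write to a temp file first; in-place sed flags differ between platforms.
  sed "s|<basePath>[^<]*</basePath>|<basePath>$abs_docs</basePath>|" \
    "$config_file" > "$config_file.tmp" && mv "$config_file.tmp" "$config_file"
}

# Demo on a throwaway file with a made-up one-line config fragment.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/docs"
echo '<basePath>/path/to/docs</basePath>' > "$demo_dir/sample.xml"
set_base_path "$demo_dir/sample.xml" "$demo_dir/docs"
cat "$demo_dir/sample.xml"
```

Against the real file, the call would look like: set_base_path configs/tika-config-basic.xml docs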
- Use /pipes handler to read from and write to a local file share
  - update the <basePath> element in configs/tika-config-basic.xml to the full path to tika-pipes-tutorial-20221202/docs
  - start the server:
    java -cp "server-bin/*" org.apache.tika.server.core.TikaServerCli -c configs/tika-config-basic.xml
  - send a request:
    curl -X POST -H "Content-Type: application/json" -d @configs/pipes-request-minimal.json http://localhost:9998/pipes
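The server takes a moment to start, so a curl sent immediately after launching it can fail with a connection error. A minimal polling sketch follows. wait_for_url is a hypothetical helper, and the usage line assumes the default port 9998 from the commands above and tika-server's /version endpoint.

```shell
#!/bin/sh
# Sketch: poll a URL until it answers, so the next curl does not race startup.
# wait_for_url is a hypothetical helper; arguments: URL, attempts, delay (s).
wait_for_url() {
  url=$1
  attempts=${2:-30}
  delay=${3:-1}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -s: quiet, -f: treat HTTP errors as failure, -o: discard the body.
    if curl -sf -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "gave up waiting for $url" >&2
  return 1
}

# Usage, after starting tika-server in another terminal or in the background:
#   wait_for_url http://localhost:9998/version && echo "server is up"
```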
- Configure metadata filters and rerun exercise 2 (the /pipes handler exercise above)
  - Copy this and paste it into configs/tika-config-basic.xml
  - Restart the server
  - Rerun the curl command and look at the output (cat extracts/testPDF.pdf.json | jq --sort-keys)
- Use /async handler, file share to file share:
  curl -X POST -H "Content-Type: application/json" -d @configs/async-request-simple.json http://localhost:9998/async
  curl -X POST -H "Content-Type: application/json" -d @configs/async-request-full.json http://localhost:9998/async
- Run the async processor via tika-app
  - Configure the basePath element in FileSystemPipesIterator in configs/tika-config-app.xml
  - java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/tika-config-app.xml
Configure Solr/OpenSearch/Elasticsearch emitter and run /pipes handler
Solr Example (fileshare to Solr)
- From the solr directory:
  bin/solr start
  bin/solr create -c tika-example && bin/solr config -c tika-example -p 8983 -action set-user-property -property update.autoCreateFields -value false
- From the tika-pipes-tutorial directory:
  - Set the schema in Solr:
    curl -F 'data=@configs/solr/solr-parent-child-schema.json' http://localhost:8983/solr/tika-example/schema
  - Configure the basePath element in FileSystemPipesIterator in configs/solr/tika-config-solr.xml
  - java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/solr/tika-config-solr.xml