1) Hands-on tika-eval module workshop, Part 1

November 9, 2021, Tuesday 11am EST/4pm UTC

The dial-in information is available to those who register via Meetup.

This workshop is designed for hands-on tech folks who can run Tika from the command line or send curl requests to a local tika-server.

Stay tuned for prerequisites, resources and an agenda!

The following is all a work in progress.  Please check back right before the workshop!

Prerequisites:
  1. java >= 8
  2. tika-eval app and tika-app jars: https://dlcdn.apache.org/tika/2.1.0/tika-eval-app-2.1.0.jar and https://dlcdn.apache.org/tika/2.1.0/tika-app-2.1.0.jar
  3. JSON editor/viewer (jq should be sufficient. I like Sublime with the PrettyJSON plugin https://github.com/dzhibas/SublimePrettyJson)
  4. XLSX viewer (Excel or Open/LibreOffice)
Optional materials:
  1. tika-server-standard jar: https://dlcdn.apache.org/tika/2.1.0/tika-server-standard-2.1.0.jar
  2. tika-eval-core.jar: https://repo1.maven.org/maven2/org/apache/tika/tika-eval-core/2.1.0/tika-eval-core-2.1.0.jar
  3. If you'd like to experiment with tesseract, make sure that tesseract is installed and callable as 'tesseract' from your command line.
  4. Some knowledge of SQL
Example docs, extracts and config files: tika-eval-workshop-20211109.tgz

Before the class, unpack tika-eval-workshop-20211109.tgz (tar -xzvf tika-eval-workshop-20211109.tgz), move tika-app-2.1.0.jar into the tika-eval-workshop-20211109/ folder, and run tika-app on the docs directory: java -jar tika-app-2.1.0.jar -J -t -i docs -o extracts/my_extracts
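The pre-class steps above can be sketched as a small shell function. This is only a sketch: it assumes the archive and the tika-app jar have already been downloaded into the current directory, that the archive unpacks into a tika-eval-workshop-20211109/ folder, and that tika-app should be run from inside that folder. The function name prepare_workshop is made up for illustration.

```shell
#!/bin/sh
# File names are taken from this page; prepare_workshop is a hypothetical helper.
WORKSHOP=tika-eval-workshop-20211109
TIKA_APP_JAR=tika-app-2.1.0.jar

prepare_workshop() {
    # Stop early if either download is missing from the current directory.
    for f in "$WORKSHOP.tgz" "$TIKA_APP_JAR"; do
        if [ ! -f "$f" ]; then
            echo "missing $f in $(pwd)" >&2
            return 1
        fi
    done
    # Unpack the workshop materials and move the jar next to the docs folder.
    tar -xzvf "$WORKSHOP.tgz" || return 1
    mv "$TIKA_APP_JAR" "$WORKSHOP/" || return 1
    # Assumption: the batch run is started from inside the workshop folder,
    # so that docs and extracts/my_extracts resolve relative to it.
    ( cd "$WORKSHOP" && java -jar "$TIKA_APP_JAR" -J -t -i docs -o extracts/my_extracts )
}
```

After sourcing the file, call prepare_workshop from the directory that holds both downloads.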


Note: The default logging configuration for tika-app in batch mode has a bug that produces messages such as "No configuration found for '4b85612c' at 'null' in 'null'...". This is fixed in the latest tika-app code and will be available in the next release, 2.1.1.


2) Hands-on tika-pipes module workshop

December 2, 2021, Thursday 12pm (NOON) EST/5pm UTC

The dial-in information is available to those who register via Meetup.

I'm currently working on the materials below; they should be ready by 11am EST/4pm UTC, an hour before the start.

Prerequisites:
  1. java >= 8
  2. curl (or Postman or something similar)
  3. create a working directory, e.g. tika-pipes-tutorial
  4. In tika-pipes-tutorial/app-bin/:
    1. https://dlcdn.apache.org/tika/2.1.0/tika-app-2.1.0.jar
    2. https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-fs/2.1.0/tika-emitter-fs-2.1.0.jar
    3. https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-solr/2.1.0/tika-emitter-solr-2.1.0.jar OR https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-opensearch/2.1.0/tika-emitter-opensearch-2.1.0.jar
    4. tika-core-2.1.1-SNAPSHOT-test-jar-with-dependencies.jar
  5. In tika-pipes-tutorial/server-bin/:
    1. tika-server-standard jar: https://dlcdn.apache.org/tika/2.1.0/tika-server-standard-2.1.0.jar
    2. https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-fs/2.1.0/tika-emitter-fs-2.1.0.jar
    3. https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-solr/2.1.0/tika-emitter-solr-2.1.0.jar OR https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-opensearch/2.1.0/tika-emitter-opensearch-2.1.0.jar
    4. tika-core-2.1.1-SNAPSHOT-test-jar-with-dependencies.jar
  6. Unzip configs.zip (to be supplied later today) here: tika-pipes-tutorial/configs
  7. Installation of Apache Solr (~8.9.x) and/or OpenSearch (~1.x) and/or Elasticsearch (7.x)
Advanced/Optional:
  1. tika-test-jar
  2. <mock>oom</mock>
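The directory layout from the prerequisites above could be set up along these lines. This is a sketch, not an official setup script: the helper names make_layout and download_jars and the EMITTER variable are invented for illustration, the URLs are copied from this page and may have moved since, and the tika-core SNAPSHOT test jar and configs.zip are left out because no download location is given for them.

```shell
#!/bin/sh
# URLs as listed on this page; they are not guaranteed to still resolve.
MAVEN=https://repo1.maven.org/maven2/org/apache/tika
DLCDN=https://dlcdn.apache.org/tika/2.1.0

# Create the working directory and its three subfolders.
# $1 = working directory, e.g. tika-pipes-tutorial
make_layout() {
    mkdir -p "$1/app-bin" "$1/server-bin" "$1/configs"
}

# Download the emitter jars into both bin folders, then the app and server jars.
# $1 = working directory; set EMITTER to solr (default) or opensearch.
download_jars() {
    emitter=${EMITTER:-solr}
    for dir in app-bin server-bin; do
        ( cd "$1/$dir" &&
          curl -fLO "$MAVEN/tika-emitter-fs/2.1.0/tika-emitter-fs-2.1.0.jar" &&
          curl -fLO "$MAVEN/tika-emitter-$emitter/2.1.0/tika-emitter-$emitter-2.1.0.jar" ) || return 1
    done
    ( cd "$1/app-bin" && curl -fLO "$DLCDN/tika-app-2.1.0.jar" ) || return 1
    ( cd "$1/server-bin" && curl -fLO "$DLCDN/tika-server-standard-2.1.0.jar" )
}
```

A typical call would be make_layout tika-pipes-tutorial followed by download_jars tika-pipes-tutorial.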

Exercises

  1. Use a fetcher with the traditional /tika and /rmeta endpoints
    1. start the server: java -cp "server-bin/*" org.apache.tika.server.core.TikaServerCli -c tika-config-basic.xml
    2. send a request: curl -X PUT http://localhost:9998/rmeta -H "fetcherName:fsf" -H "fetchKey:testPDF.pdf" | jq --sort-keys
  2. Use the /pipes handler to read from and write to a local file share
  3. Configure a metadata handler and rerun exercise 2
  4. Use the /async handler to go from file share to file share
  5. Configure a Solr/OpenSearch/Elasticsearch emitter and run the /pipes handler
  6. Run the async processor via tika-app
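For exercise 1, the curl request can be wrapped in a small function so that the fetcher name and fetch key are easy to vary. This is a minimal sketch that assumes tika-server is already running on localhost:9998 with a config that defines the named fetcher. The function name rmeta_via_fetcher and the TIKA_URL variable are made up for illustration; the header names, port, and example values come from the exercise.

```shell
#!/bin/sh
# Base URL of the local tika-server, as used in exercise 1.
TIKA_URL="http://localhost:9998"

# Ask tika-server to fetch a file through a named fetcher and return /rmeta JSON.
# $1 = fetcher name (e.g. fsf), $2 = fetch key (e.g. testPDF.pdf)
rmeta_via_fetcher() {
    curl -s -X PUT "$TIKA_URL/rmeta" \
        -H "fetcherName:$1" \
        -H "fetchKey:$2"
}
```

With the server up, rmeta_via_fetcher fsf testPDF.pdf | jq --sort-keys should be equivalent to the request in exercise 1.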



