Apache Tika Meetups

1) Hands-on tika-eval module workshop, Part 1

November 9, 2021, Tuesday 11am EST/4pm UTC

The dial-in information is available to those who register via Meetup.

This workshop is designed for hands-on tech folks who can run Tika from the command line or can curl to a local tika-server.

Stay tuned for prerequisites, resources and an agenda!

The following is all a work in progress. Please check back right before the workshop!

Prerequisites:

java >= 8
tika-eval app and tika-app jars: https://dlcdn.apache.org/tika/2.1.0/tika-eval-app-2.1.0.jar and https://dlcdn.apache.org/tika/2.1.0/tika-app-2.1.0.jar
JSON editor/viewer (jq should be sufficient; I like Sublime with the PrettyJSON plugin: https://github.com/dzhibas/SublimePrettyJson)
XLSX viewer (Excel or Open/LibreOffice)

Optional materials:

tika-server-standard jar: https://dlcdn.apache.org/tika/2.1.0/tika-server-standard-2.1.0.jar
tika-eval-core.jar: https://repo1.maven.org/maven2/org/apache/tika/tika-eval-core/2.1.0/tika-eval-core-2.1.0.jar
If you'd like to experiment with tesseract, make sure that tesseract is installed and callable as 'tesseract' from your command line.
Some knowledge of SQL

Example docs, extracts and config files: tika-eval-workshop-20211109.tgz

Before the class, extract tika-eval-workshop-20211109.tgz (tar -xzvf tika-eval-workshop-20211109.tgz), move tika-app-2.1.0.jar into the tika-eval-workshop-20211109/ folder, and from that folder run tika-app on the docs directory: java -jar tika-app-2.1.0.jar -J -t -i docs -o extracts/my_extracts

Note: There's a bug in the default logging configuration for tika-app in batch mode (e.g. "No configuration found for '4b85612c' at 'null' in 'null'..."). This is fixed in the latest tika-app and will be available in the next release 2.1.1.
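After the batch run above finishes, it can be handy to confirm that every file in docs produced an extract before the workshop starts. The following is a minimal sketch, not part of the workshop materials. It assumes that tika-app's batch mode mirrors the input directory layout and names each extract after the original file plus a ".json" suffix; the function name missing_extracts is made up for illustration.

```python
from pathlib import Path


def missing_extracts(docs_dir, extracts_dir):
    """Return relative paths of docs that have no '<name>.json' extract.

    Assumption: the extracts directory mirrors the docs directory and each
    extract is named '<original file name>.json'.
    """
    docs = Path(docs_dir)
    extracts = Path(extracts_dir)
    missing = []
    for doc in sorted(p for p in docs.rglob("*") if p.is_file()):
        rel = doc.relative_to(docs)
        expected = extracts / rel.parent / (rel.name + ".json")
        if not expected.is_file():
            missing.append(rel.as_posix())
    return missing
```

Calling missing_extracts("docs", "extracts/my_extracts") from the tika-eval-workshop-20211109/ folder returns a list of the documents without an extract; an empty list means every document has one.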

2) Hands-on tika-pipes module workshop

December 2, 2021, Thursday 12pm (NOON) EST/5pm UTC

The dial-in information is available to those who register via Meetup.

I'm currently working on this, and it should be ready by 11am EST/4pm UTC, an hour before the start.

Prerequisites:

java >= 8
curl (or postman or something similar)
create a working directory, e.g. tika-pipes-tutorial
In tika-pipes-tutorial/app-bin/:
In tika-pipes-tutorial/server-bin/:
Unzip configs.zip (to be supplied later today) here: tika-pipes-tutorial/configs
Installation of Apache Solr (~8.9.x) and/or OpenSearch (~1.x) and/or Elasticsearch (7.x)

Advanced/Optional:

jq or similar

Exercises

Use fetcher in traditional /tika /rmeta endpoints
1. Update the <basePath> element in configs/tika-config-basic.xml to the full path of your tika-pipes-tutorial-20221202/docs directory:
```
  <fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
    <params>
      <name>fsf</name>
      <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20221202/docs</basePath>
    </params>
  </fetcher>
```
2. start the server: java -cp "server-bin/*" org.apache.tika.server.core.TikaServerCli -c configs/tika-config-basic.xml
3. curl -X PUT http://localhost:9998/rmeta -H "fetcherName:fsf" -H "fetchKey:testPDF.pdf" | jq --sort-keys
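If you would rather script step 3 than use curl, the same request can be sketched with Python's standard library. This is only an illustration: the helper names build_rmeta_request and fetch_metadata are made up, and the sketch assumes the server from step 2 is listening on localhost:9998 and that /rmeta answers with a JSON list of metadata objects.

```python
import json
import urllib.request


def build_rmeta_request(fetch_key, fetcher_name="fsf",
                        base_url="http://localhost:9998"):
    """Build the PUT request that the curl command in step 3 sends.

    No body is sent; the fetcherName and fetchKey headers tell the server
    which fetcher to use and which file to fetch.
    """
    return urllib.request.Request(
        base_url + "/rmeta",
        method="PUT",
        headers={"fetcherName": fetcher_name, "fetchKey": fetch_key},
    )


def fetch_metadata(fetch_key):
    """Send the request and decode the JSON response.

    Assumption: the response body is a JSON list of metadata objects.
    """
    with urllib.request.urlopen(build_rmeta_request(fetch_key)) as resp:
        return json.load(resp)
```

With the server running, fetch_metadata("testPDF.pdf") should return the same data that the curl command pipes into jq.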
Use /pipes handler to read from and write to a local file share
1. Update the emitter's <basePath> element in configs/tika-config-basic.xml to the full path of your tika-pipes-tutorial-20221202/extracts directory:
```
  <emitters>
    <emitter class="org.apache.tika.pipes.emitter.fs.FileSystemEmitter">
      <params>
        <name>fse</name>
        <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20221202/extracts</basePath>
      </params>
    </emitter>
  </emitters>
```
2. start the server: java -cp "server-bin/*" org.apache.tika.server.core.TikaServerCli -c configs/tika-config-basic.xml
3. curl -X POST -H "Content-Type: application/json" -d @configs/pipes-request-minimal.json http://localhost:9998/pipes
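Step 1 of both exercises above is a hand edit of a <basePath> element. If you want to script it, here is a rough sketch using Python's xml.etree.ElementTree. It assumes the layout shown in the snippets above, where a <params> element holds a <name> and a <basePath>; set_base_path is a hypothetical helper, not something shipped with Tika. Note that ElementTree drops XML comments when it parses, so the rewritten config loses any comments it had.

```python
import xml.etree.ElementTree as ET


def set_base_path(config_xml, component_name, new_base_path):
    """Return config_xml with <basePath> replaced for the named component.

    A component is any <params> element whose <name> child equals
    component_name (e.g. "fsf" for the fetcher, "fse" for the emitter).
    """
    root = ET.fromstring(config_xml)
    updated = 0
    for params in root.iter("params"):
        if params.findtext("name") == component_name:
            base = params.find("basePath")
            if base is not None:
                base.text = new_base_path
                updated += 1
    if updated == 0:
        raise ValueError("no <basePath> found for " + component_name)
    return ET.tostring(root, encoding="unicode")
```

To use it, read configs/tika-config-basic.xml as text, call set_base_path(text, "fsf", "/full/path/to/docs") and then set_base_path(result, "fse", "/full/path/to/extracts"), and write the result back before starting the server.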

Configure metadata filters and rerun the /pipes exercise

Copy this and paste it into configs/tika-config-basic.xml

```
<metadataFilters>
  <!-- depending on the file format, some dates do not have a timezone. This
       filter arbitrarily assumes dates have a UTC timezone and will format all
       dates as yyyy-MM-dd'T'HH:mm:ss'Z' whether or not they actually have a timezone.
  -->
  <metadataFilter class="org.apache.tika.metadata.filter.DateNormalizingMetadataFilter"/>
  <metadataFilter class="org.apache.tika.metadata.filter.FieldNameMappingFilter">
    <params>
      <excludeUnmapped>true</excludeUnmapped>
      <mappings>
        <mapping from="X-TIKA:content" to="content_s"/>
        <mapping from="Content-Length" to="length_i"/>
        <mapping from="dc:creator" to="creators_ss"/>
        <mapping from="dc:title" to="title_s"/>
        <mapping from="Content-Type" to="mime_s"/>
        <mapping from="X-TIKA:EXCEPTION:container_exception" to="tika_exception_s"/>
      </mappings>
    </params>
  </metadataFilter>
</metadataFilters>
```

Restart the server
Rerun the curl command and look at the output (cat extracts/testPDF.pdf.json | jq --sort-keys)
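To get a feel for what to expect in the new output, here is a small Python sketch of the renaming that the FieldNameMappingFilter configuration above asks for. It is an illustration of the configured behavior only, not Tika's implementation, and map_field_names is a made-up name: mapped keys are renamed, and with excludeUnmapped set to true every other key is dropped.

```python
# Mappings copied from the <mappings> element in the config above.
MAPPINGS = {
    "X-TIKA:content": "content_s",
    "Content-Length": "length_i",
    "dc:creator": "creators_ss",
    "dc:title": "title_s",
    "Content-Type": "mime_s",
    "X-TIKA:EXCEPTION:container_exception": "tika_exception_s",
}


def map_field_names(metadata, mappings=MAPPINGS, exclude_unmapped=True):
    """Rename mapped keys; drop unmapped keys when exclude_unmapped is True."""
    mapped = {}
    for key, value in metadata.items():
        if key in mappings:
            mapped[mappings[key]] = value
        elif not exclude_unmapped:
            mapped[key] = value
    return mapped
```

So a metadata object with dc:title, Content-Type and some PDF-specific keys should come out of the filter with only title_s and mime_s, which is a quick thing to check in the jq output.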

Use /async handler file share to file share
1. curl -X POST -H "Content-Type: application/json" -d @configs/async-request-simple.json http://localhost:9998/async
2. curl -X POST -H "Content-Type: application/json" -d @configs/async-request-full.json http://localhost:9998/async

Run the async processor via tika-app

Configure the basePath element in FileSystemPipesIterator in configs/tika-config-app.xml

```
  <pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
    <params>
      <fetcherName>fsf</fetcherName>
      <emitterName>fse</emitterName>
      <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20221202/docs</basePath>
    </params>
  </pipesIterator>
```

java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/tika-config-app.xml

Configure Solr/OpenSearch/Elasticsearch emitter and run /pipes handler
