1) Hands-on tika-eval module workshop, Part 1

November 9, 2021, Tuesday 11am EST/4pm UTC

The dial-in information is available to those who register via Meetup.

This workshop is designed for hands-on tech folks who can run Tika from the command line or can curl to a local tika-server.

Stay tuned for prerequisites, resources and an agenda!

The following is all a work in progress.  Please check back right before the workshop!

Prerequisites:
  1. java >= 8
  2. tika-eval app and tika-app jars: https://dlcdn.apache.org/tika/2.1.0/tika-eval-app-2.1.0.jar and https://dlcdn.apache.org/tika/2.1.0/tika-app-2.1.0.jar
  3. JSON editor/viewer (jq should be sufficient. I like Sublime with the PrettyJSON plugin https://github.com/dzhibas/SublimePrettyJson)
  4. XLSX viewer (Excel or Open/LibreOffice)
Optional materials:
  1. tika-server-standard jar: https://dlcdn.apache.org/tika/2.1.0/tika-server-standard-2.1.0.jar
  2. tika-eval-core.jar: https://repo1.maven.org/maven2/org/apache/tika/tika-eval-core/2.1.0/tika-eval-core-2.1.0.jar
  3. If you'd like to experiment with tesseract, make sure that tesseract is installed and callable as 'tesseract' from your command line.
  4. Some knowledge of SQL
Example docs, extracts and config files: tika-eval-workshop-20211109.tgz

Before the class, extract tika-eval-workshop-20211109.tgz (tar -xzvf tika-eval-workshop-20211109.tgz), move tika-app-2.1.0.jar into the tika-eval-workshop-20211109/ folder, and, from inside that folder, run tika-app on the docs directory: java -jar tika-app-2.1.0.jar -J -t -i docs -o extracts/my_extracts
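The pre-class steps above can be collected into one small script. This is a minimal sketch, assuming both downloads sit in the current directory and that the tarball unpacks to tika-eval-workshop-20211109/ with a docs/ folder inside it; prep_workshop is a hypothetical helper name, not part of Tika.

```shell
#!/bin/sh
# Sketch of the pre-class setup; run it from the directory holding both downloads.
WORKSHOP=tika-eval-workshop-20211109
APP_JAR=tika-app-2.1.0.jar

prep_workshop() {
    # Stop early with a message if either download is missing.
    for f in "$WORKSHOP.tgz" "$APP_JAR"; do
        if [ ! -f "$f" ]; then
            echo "missing: $f" >&2
            return 1
        fi
    done
    # Unpack, move the jar next to docs/, then run tika-app in batch mode.
    tar -xzvf "$WORKSHOP.tgz" &&
    mv "$APP_JAR" "$WORKSHOP/" &&
    ( cd "$WORKSHOP" && java -jar "$APP_JAR" -J -t -i docs -o extracts/my_extracts )
}

prep_workshop || echo "setup incomplete; see the message above"
```

If the run succeeds, the extracts should land under tika-eval-workshop-20211109/extracts/my_extracts.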


Note: There's a bug in the default logging configuration for tika-app in batch mode (e.g. "No configuration found for '4b85612c' at 'null' in 'null'..."). This is fixed in the latest tika-app and will be available in the next release, 2.1.1.


2) Hands-on tika-pipes module workshop

December 2, 2021, Thursday 12pm (NOON) EST/5pm UTC

The dial-in information is available to those who register via Meetup.

I'm currently working on this, and it should be ready by 11am EST/4pm UTC, an hour before the start.

Prerequisites:

  1. java >= 8
  2. curl (or postman or something similar)
  3. create a working directory, e.g. tika-pipes-tutorial
  4. In tika-pipes-tutorial/app-bin/:
    1. https://dlcdn.apache.org/tika/2.1.0/tika-app-2.1.0.jar
    2. https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-fs/2.1.0/tika-emitter-fs-2.1.0.jar
    3. https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-solr/2.1.0/tika-emitter-solr-2.1.0.jar OR https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-opensearch/2.1.0/tika-emitter-opensearch-2.1.0.jar
    4. tika-core-2.1.1-SNAPSHOT-test-jar-with-dependencies.jar
  5. In tika-pipes-tutorial/server-bin/:
    1. tika-server-standard jar: https://dlcdn.apache.org/tika/2.1.0/tika-server-standard-2.1.0.jar
    2. https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-fs/2.1.0/tika-emitter-fs-2.1.0.jar
    3. https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-solr/2.1.0/tika-emitter-solr-2.1.0.jar OR https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-opensearch/2.1.0/tika-emitter-opensearch-2.1.0.jar
    4. tika-core-2.1.1-SNAPSHOT-test-jar-with-dependencies.jar
  6. Unzip configs.zip (to be supplied later today) here: tika-pipes-tutorial/configs
  7. Installation of Apache Solr (~8.9.x) and/or OpenSearch (~1.x) and/or Elasticsearch (7.x)
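
The directory layout and downloads from prerequisites 3 to 5 can be sketched as a script. This is a minimal sketch under stated assumptions: the DOWNLOAD switch and the fetch helper are hypothetical names introduced here, and the script only prints what it would download unless DOWNLOAD=1 is set. The tika-core-2.1.1-SNAPSHOT-test-jar-with-dependencies.jar has no URL in the list above, so it is left for you to copy in by hand.

```shell
#!/bin/sh
# Create the working directory layout from the prerequisites.
ROOT=tika-pipes-tutorial
mkdir -p "$ROOT/app-bin" "$ROOT/server-bin" "$ROOT/configs"

DLCDN=https://dlcdn.apache.org/tika/2.1.0
MAVEN=https://repo1.maven.org/maven2/org/apache/tika

# fetch <url> <target-dir>: download only when DOWNLOAD=1, otherwise dry-run.
fetch() {
    if [ "${DOWNLOAD:-0}" = 1 ]; then
        ( cd "$2" && curl -fsSLO "$1" )
    else
        echo "would download $1 into $2"
    fi
}

fetch "$DLCDN/tika-app-2.1.0.jar" "$ROOT/app-bin"
fetch "$DLCDN/tika-server-standard-2.1.0.jar" "$ROOT/server-bin"
for dir in app-bin server-bin; do
    fetch "$MAVEN/tika-emitter-fs/2.1.0/tika-emitter-fs-2.1.0.jar" "$ROOT/$dir"
    # Swap in tika-emitter-opensearch if you plan to use OpenSearch instead of Solr.
    fetch "$MAVEN/tika-emitter-solr/2.1.0/tika-emitter-solr-2.1.0.jar" "$ROOT/$dir"
done
```

Running it once without DOWNLOAD set is a cheap way to confirm the layout before pulling the jars.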

Advanced/Optional:

  1. jq or similar

Exercises

  1. Use a fetcher with the traditional /tika and /rmeta endpoints
    1. update the <basePath> element of the FileSystemFetcher in configs/tika-config-basic.xml to point to the full path of tika-pipes-tutorial-20221202/docs:

      FileSystemFetcher
        <fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
          <params>
            <name>fsf</name>
            <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20221202/docs</basePath>
          </params>
        </fetcher>
    2. start the server: java -cp "server-bin/*" org.apache.tika.server.core.TikaServerCli -c configs/tika-config-basic.xml
    3. curl -X PUT http://localhost:9998/rmeta -H "fetcherName:fsf" -H "fetchKey:testPDF.pdf" | jq --sort-keys

  2. Use the /pipes handler to read from and write to a local file share
    1. update the <basePath> element of the FileSystemEmitter in configs/tika-config-basic.xml to point to the full path of tika-pipes-tutorial-20221202/extracts:

      FileSystemEmitter
        <emitters>
          <emitter class="org.apache.tika.pipes.emitter.fs.FileSystemEmitter">
            <params>
              <name>fse</name>
              <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20221202/extracts</basePath>
            </params>
          </emitter>
        </emitters>
    2. start the server: java -cp "server-bin/*" org.apache.tika.server.core.TikaServerCli -c configs/tika-config-basic.xml
    3. curl -X POST -H "Content-Type: application/json" -d @configs/pipes-request-minimal.json http://localhost:9998/pipes

  3. Configure metadata filters and rerun 2.
    1. Copy this and paste it into configs/tika-config-basic.xml

      Metadata Filters
      <metadataFilters>
        <!-- depending on the file format, some dates do not have a timezone. This
               filter arbitrarily assumes dates have a UTC timezone and will format all
               dates as yyyy-MM-dd'T'HH:mm:ss'Z' whether or not they actually have a timezone.
          -->
        <metadataFilter class="org.apache.tika.metadata.filter.DateNormalizingMetadataFilter"/>
        <metadataFilter class="org.apache.tika.metadata.filter.FieldNameMappingFilter">
          <params>
            <excludeUnmapped>true</excludeUnmapped>
            <mappings>
              <mapping from="X-TIKA:content" to="content_s"/>
              <mapping from="Content-Length" to="length_i"/>
              <mapping from="dc:creator" to="creators_ss"/>
              <mapping from="dc:title" to="title_s"/>
              <mapping from="Content-Type" to="mime_s"/>
              <mapping from="X-TIKA:EXCEPTION:container_exception" to="tika_exception_s"/>
            </mappings>
          </params>
        </metadataFilter>
      </metadataFilters>
    2. Restart the server
    3. Rerun the curl command and look at the output (cat extracts/testPDF.pdf.json | jq --sort-keys)
  4. Use the /async handler to go from file share to file share
    1. curl -X POST -H "Content-Type: application/json" -d @configs/async-request-simple.json http://localhost:9998/async

    2. curl -X POST -H "Content-Type: application/json" -d @configs/async-request-full.json http://localhost:9998/async

  5. Run the async processor via tika-app
    1. Configure the basePath element in FileSystemPipesIterator in configs/tika-config-app.xml

      FileSystemPipesIterator
        <pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
          <params>
            <fetcherName>fsf</fetcherName>
            <emitterName>fse</emitterName>
            <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20221202/docs</basePath>
          </params>
        </pipesIterator>
    2. java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/tika-config-app.xml

  6. Configure a Solr/OpenSearch/Elasticsearch emitter and run the /pipes handler

Helpful commands
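
Two small shell helpers may save some typing during the exercises. This is a sketch only: check_base_paths, rmeta_fetch and the TIKA_SERVER variable are hypothetical names introduced here and do not ship with Tika. The first helper assumes each <basePath> element sits on a single line, as in the config snippets above; the second wraps the curl call from exercise 1.

```shell
#!/bin/sh
# Hypothetical helpers for the exercises; neither ships with Tika.
TIKA_SERVER=${TIKA_SERVER:-http://localhost:9998}

# List every <basePath> in a Tika config and report whether the directory exists.
# Assumes each <basePath>...</basePath> element sits on a single line.
check_base_paths() {  # check_base_paths <tika-config.xml>
    sed -n 's:.*<basePath>\(.*\)</basePath>.*:\1:p' "$1" |
    while IFS= read -r dir; do
        if [ -d "$dir" ]; then
            echo "ok: $dir"
        else
            echo "missing: $dir"
        fi
    done
}

# Exercise 1 as a function: ask /rmeta to fetch one key through the "fsf" fetcher.
rmeta_fetch() {  # rmeta_fetch <fetchKey>
    curl -s -X PUT "$TIKA_SERVER/rmeta" \
        -H "fetcherName:fsf" \
        -H "fetchKey:$1"
}
```

For example, run check_base_paths configs/tika-config-basic.xml before starting the server, then rmeta_fetch testPDF.pdf | jq --sort-keys once it is up.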




