1) Hands-on tika-eval module workshop, Part 1

November 9, 2021, Tuesday 11am EST/4pm UTC

The dial-in information is available to those who register via Meetup.

This workshop is designed for hands-on tech folks who can run Tika from the command line or can curl to a local tika-server.

Stay tuned for prerequisites, resources and an agenda!

The following is all a work in progress.  Please check back right before the workshop!

Prerequisites:
  1. java >= 8
  2. tika-eval app and tika-app jars: https://dlcdn.apache.org/tika/2.1.0/tika-eval-app-2.1.0.jar and https://dlcdn.apache.org/tika/2.1.0/tika-app-2.1.0.jar
  3. JSON editor/viewer (jq should be sufficient. I like Sublime with the PrettyJSON plugin https://github.com/dzhibas/SublimePrettyJson)
  4. XLSX viewer (Excel or Open/LibreOffice)
Optional materials:
  1. tika-server-standard jar: https://dlcdn.apache.org/tika/2.1.0/tika-server-standard-2.1.0.jar
  2. tika-eval-core.jar: https://repo1.maven.org/maven2/org/apache/tika/tika-eval-core/2.1.0/tika-eval-core-2.1.0.jar
  3. If you'd like to experiment with tesseract, make sure that tesseract is installed and callable as 'tesseract' from your command line.
  4. Some knowledge of SQL
Example docs, extracts and config files: tika-eval-workshop-20211109.tgz

Before the class, extract tika-eval-workshop-20211109.tgz (tar -xzvf tika-eval-workshop-20211109.tgz), move tika-app-2.1.0.jar into the tika-eval-workshop-20211109/ folder, and from inside that folder run tika-app on the docs directory: java -jar tika-app-2.1.0.jar -J -t -i docs -o extracts/my_extracts
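
Before running the batch command above, it can help to confirm that the jar and the docs directory are where the command expects them. The following is a minimal sketch, not part of Tika: the function name missing_prereqs is hypothetical, and it checks only the two paths named in the paragraph above.

```shell
# Hypothetical helper (not part of Tika): list workshop files that are
# missing from the given directory. Prints one missing name per line and
# prints nothing when everything is in place.
missing_prereqs() {
  local dir="$1" name
  for name in tika-app-2.1.0.jar docs; do
    [ -e "$dir/$name" ] || echo "$name"
  done
}

# Usage, from the directory that contains the unpacked workshop folder:
#   missing_prereqs tika-eval-workshop-20211109
```

An empty result means the tika-app command should find both its jar and its input directory.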


Note: There's a bug in the default logging configuration for tika-app in batch mode (e.g. "No configuration found for '4b85612c' at 'null' in 'null'...").  This is fixed in the latest tika-app and will be available in the next release 2.1.1.


2) Hands-on tika-pipes module workshop

December 2, 2021, Thursday 12pm (NOON) EST/5pm UTC

The dial-in information is available to those who register via Meetup.

I'm currently working on this, and it should be ready by 11am EST/4pm UTC, an hour before the start.

Useful documentation: tika-pipes

Prerequisites:

  1. java >= 8
  2. curl (or postman or something similar)
  3. Unzip tika-pipes-tutorial-20211202.zip 
  4. In tika-pipes-tutorial-20211202/app-bin/:
    1. https://dlcdn.apache.org/tika/2.1.0/tika-app-2.1.0.jar
    2. https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-fs/2.1.0/tika-emitter-fs-2.1.0.jar
    3. https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-solr/2.1.0/tika-emitter-solr-2.1.0.jar OR https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-opensearch/2.1.0/tika-emitter-opensearch-2.1.0.jar
    4. tika-core-2.1.1-SNAPSHOT-test-jar-with-dependencies.jar
  5. In tika-pipes-tutorial-20211202/server-bin/:
    1. tika-server-standard jar: https://dlcdn.apache.org/tika/2.1.0/tika-server-standard-2.1.0.jar
    2. https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-fs/2.1.0/tika-emitter-fs-2.1.0.jar
    3. https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-solr/2.1.0/tika-emitter-solr-2.1.0.jar OR https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-opensearch/2.1.0/tika-emitter-opensearch-2.1.0.jar
    4. tika-core-2.1.1-SNAPSHOT-test-jar-with-dependencies.jar
  6. Installation of Apache Solr (~8.9.x) and/or OpenSearch (~1.x) and/or Elasticsearch (7.x)

Advanced/Optional:

  1. jq or similar

Exercises

  1. Use a fetcher with the traditional /tika and /rmeta endpoints
    1. Update the <basePath> element in configs/tika-config-basic.xml to the full path to tika-pipes-tutorial-20211202/docs:

      FileSystemFetcher
        <fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
          <params>
            <name>fsf</name>
            <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20211202/docs</basePath>
          </params>
        </fetcher>
    2. start the server: java -cp "server-bin/*" org.apache.tika.server.core.TikaServerCli -c configs/tika-config-basic.xml
    3. curl -X PUT http://localhost:9998/rmeta -H "fetcherName:fsf" -H "fetchKey:testPDF.pdf" | jq --sort-keys

  2. Use the /pipes handler to read from and write to a local file share
    1. Update the FileSystemEmitter <basePath> element in configs/tika-config-basic.xml to the full path to tika-pipes-tutorial-20211202/extracts:

      FileSystemEmitter
        <emitters>
          <emitter class="org.apache.tika.pipes.emitter.fs.FileSystemEmitter">
            <params>
              <name>fse</name>
              <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20211202/extracts</basePath>
            </params>
          </emitter>
        </emitters>
    2. start the server: java -cp "server-bin/*" org.apache.tika.server.core.TikaServerCli -c configs/tika-config-basic.xml
    3. curl -X POST -H "Content-Type: application/json" -d @configs/pipes-request-minimal.json http://localhost:9998/pipes

  3. Configure metadata filters and rerun exercise 2.
    1. Copy this and paste it into configs/tika-config-basic.xml

      Metadata Filters
      <metadataFilters>
        <!-- depending on the file format, some dates do not have a timezone. This
               filter arbitrarily assumes dates have a UTC timezone and will format all
               dates as yyyy-MM-dd'T'HH:mm:ss'Z' whether or not they actually have a timezone.
          -->
        <metadataFilter class="org.apache.tika.metadata.filter.DateNormalizingMetadataFilter"/>
        <metadataFilter class="org.apache.tika.metadata.filter.FieldNameMappingFilter">
          <params>
            <excludeUnmapped>true</excludeUnmapped>
            <mappings>
              <mapping from="X-TIKA:content" to="content_s"/>
              <mapping from="Content-Length" to="length_i"/>
              <mapping from="dc:creator" to="creators_ss"/>
              <mapping from="dc:title" to="title_s"/>
              <mapping from="Content-Type" to="mime_s"/>
              <mapping from="X-TIKA:EXCEPTION:container_exception" to="tika_exception_s"/>
            </mappings>
          </params>
        </metadataFilter>
      </metadataFilters>
    2. Restart the server
    3. Rerun the curl command and look at the output (cat extracts/testPDF.pdf.json | jq --sort-keys)
  4. Use the /async handler, file share to file share
    1. curl -X POST -H "Content-Type: application/json" -d @configs/async-request-minimal.json http://localhost:9998/async

    2. curl -X POST -H "Content-Type: application/json" -d @configs/async-request-full.json http://localhost:9998/async

  5. Run the async processor via tika-app
    1. Configure the basePath element in FileSystemFetcher and FileSystemPipesIterator in configs/tika-config-app.xml

      FileSystemPipesIterator
          <fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
            <params>
              <name>fsf</name>
              <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20211202/docs</basePath>
            </params>
          </fetcher> 
        <pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
          <params>
            <fetcherName>fsf</fetcherName>
            <emitterName>fse</emitterName>
            <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20211202/docs</basePath>
          </params>
        </pipesIterator>
    2. java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/tika-config-app.xml

  6. Configure a Solr/OpenSearch/Elasticsearch emitter and run the /pipes handler
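
The curl calls in exercises 1 and 2 can be wrapped in small functions so they are easy to repeat for other files in docs. This is only a sketch: the function names are hypothetical, the server address and the fsf/fse names are taken from the steps above, and the JSON field names in the /pipes body are an assumption recalled from the Tika 2.x tika-pipes documentation. Check them against configs/pipes-request-minimal.json before relying on them.

```shell
# Hypothetical wrappers around the curl calls from exercises 1 and 2.
# Assumes tika-server is listening on localhost:9998 and that the fetcher
# and emitter are named fsf and fse, as in the configs above.

# Exercise 1: fetch a file by key and parse it via /rmeta.
rmeta_fetch() {
  curl -s -X PUT http://localhost:9998/rmeta \
    -H "fetcherName:fsf" -H "fetchKey:$1"
}

# Exercise 2: build a /pipes request body for one file.
# ASSUMPTION: field names are recalled from the Tika 2.x tika-pipes docs;
# verify against configs/pipes-request-minimal.json.
# The key is not JSON-escaped, so keep to simple file names.
pipes_request_json() {
  printf '{"fetcher":"fsf","fetchKey":"%s","emitter":"fse","emitKey":"%s"}\n' "$1" "$1"
}

# Usage (requires a running server):
#   rmeta_fetch testPDF.pdf | jq --sort-keys .
#   pipes_request_json testPDF.pdf | curl -s -X POST -H "Content-Type: application/json" -d @- http://localhost:9998/pipes
```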


Solr Example (fileshare to Solr)

  1. From the solr directory
    1. bin/solr start
    2. bin/solr create -c tika-example && bin/solr config -c tika-example -p 8983 -action set-user-property -property update.autoCreateFields -value false
  2. From the tika-pipes-tutorial directory

    1. Set the schema in Solr: curl -F 'data=@configs/solr/solr-parent-child-schema.json' http://localhost:8983/solr/tika-example/schema

    2. Configure the basePath element in FileSystemPipesIterator and FileSystemFetcher in configs/solr/tika-config-solr.xml

      FileSystemPipesIterator
          <fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
            <params>
              <name>fsf</name>
              <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20211202/docs</basePath>
            </params>
          </fetcher>
      ...
        <pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
          <params>
            <fetcherName>fsf</fetcherName>
            <emitterName>solr1</emitterName>
            <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20211202/docs</basePath>
          </params>
        </pipesIterator>
    3. java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/solr/tika-config-solr.xml
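
The basePath edits in the steps above can also be scripted instead of made by hand. The sketch below assumes every basePath in the config ends in /docs or /extracts, as in the snippets shown on this page. The function name set_base_path is hypothetical.

```shell
# Hypothetical helper: rewrite every <basePath> that ends in /docs or
# /extracts so it points under the given tutorial root.
# Reads config XML on stdin and writes the result to stdout.
# ASSUMPTION: the root path contains no '#', '&' or backslash characters,
# since those are special in the sed expression below.
set_base_path() {
  sed -E "s#<basePath>[^<]*/(docs|extracts)</basePath>#<basePath>$1/\\1</basePath>#"
}

# Usage, from the tutorial directory:
#   set_base_path "$PWD" < configs/solr/tika-config-solr.xml > /tmp/tika-config-solr.xml
```

Writing to a separate file keeps the original config untouched. Pass the new file to --config in place of the original.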

OpenSearch/Elasticsearch Example (fileshare to OpenSearch/Elasticsearch)

  1. Start opensearch via Docker:
    1. docker pull opensearchproject/opensearch:1.2.0
    2. docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" opensearchproject/opensearch:1.2.0
  2. Curl schema to opensearch: 

    curl -k -T configs/opensearch/opensearch-parent-child-mappings.json -u admin:admin -H "Content-Type:application/json" https://localhost:9200/tika-test

  3. Configure the basePath element in FileSystemPipesIterator and FileSystemFetcher in configs/opensearch/tika-config-opensearch.xml

    FileSystemPipesIterator
        <fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
          <params>
            <name>fsf</name>
            <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20211202/docs</basePath>
          </params>
        </fetcher>
    ....
      <pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
        <params>
          <fetcherName>fsf</fetcherName>
          <emitterName>ose</emitterName>
          <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20211202/docs</basePath>
        </params>
      </pipesIterator>

Helpful commands

  1. delete a collection in solr: 

    bin/solr delete -c collection_name

  2. Show all docs in OpenSearch/Elasticsearch: https://localhost:9200/tika-test/_search?pretty=true&q=*:*
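
To check how many documents were indexed without listing them all, both search engines offer a count-style query. The sketch below only builds the URLs. The function names are hypothetical, and the hosts, ports and credentials are the defaults used in the examples above.

```shell
# Hypothetical helpers that build count URLs for the collections and
# indices used in the examples above.

# Solr: a select query with rows=0 returns only the header and numFound.
solr_count_url() {
  printf 'http://localhost:8983/solr/%s/select?q=*:*&rows=0\n' "$1"
}

# OpenSearch/Elasticsearch: the _count endpoint returns a "count" field.
opensearch_count_url() {
  printf 'https://localhost:9200/%s/_count\n' "$1"
}

# Usage (requires the services to be running):
#   curl -s "$(solr_count_url tika-example)"
#   curl -s -k -u admin:admin "$(opensearch_count_url tika-test)"
```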


3) Hands-on tika-pipes module workshop

January 24, 2022, Monday 11am EST/4pm UTC

The dial-in information is available to those who register via Meetup.

I'm currently working on this, and it should be ready by 10:30am EST/3:30pm UTC, a half hour before the start.

Useful documentation: tika-pipes

Prerequisites:

  1. java >= 8
  2. curl (or postman or something similar)
  3. Unzip tika-pipes-tutorial-20220124.zip
  4. In tika-pipes-tutorial-20220124/app-bin/:
    1. https://dlcdn.apache.org/tika/2.2.1/tika-app-2.2.1.jar
    2. https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-fs/2.2.1/tika-emitter-fs-2.2.1.jar
    3. https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-solr/2.2.1/tika-emitter-solr-2.2.1.jar OR https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-opensearch/2.2.1/tika-emitter-opensearch-2.2.1.jar
    4. https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/437/org.apache.tika$tika-core/artifact/org.apache.tika/tika-core/2.2.2-20220124.115541-55/tika-core-2.2.2-20220124.115541-55-test-jar-with-dependencies.jar
  5. Optional: In tika-pipes-tutorial-20220124/server-bin/:
    1. tika-server-standard jar: https://dlcdn.apache.org/tika/2.2.1/tika-server-standard-2.2.1.jar
    2. https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-fs/2.2.1/tika-emitter-fs-2.2.1.jar
    3. https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-solr/2.2.1/tika-emitter-solr-2.2.1.jar OR https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-opensearch/2.2.1/tika-emitter-opensearch-2.2.1.jar
    4. https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/437/org.apache.tika$tika-core/artifact/org.apache.tika/tika-core/2.2.2-20220124.115541-55/tika-core-2.2.2-20220124.115541-55-test-jar-with-dependencies.jar
  6. Installation of Apache Solr (~8.9.x) and/or OpenSearch (~1.x) and/or Elasticsearch (7.x)

A) Fileshare to Fileshare warm up

  1. Run the async processor via tika-app
    1. Configure the basePath element in FileSystemFetcher and FileSystemPipesIterator in configs/tika-config-app-fs-to-fs.xml

      FileSystemPipesIterator
          <fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
            <params>
              <name>fsf</name>
              <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/docs</basePath>
            </params>
          </fetcher> 
        <pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
          <params>
            <fetcherName>fsf</fetcherName>
            <emitterName>fse</emitterName>
            <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/docs</basePath>
          </params>
        </pipesIterator>
    2. Configure the basePath element in FileSystemEmitter in configs/tika-config-app-fs-to-fs.xml

      FileSystemEmitter
        <emitters>
          <emitter class="org.apache.tika.pipes.emitter.fs.FileSystemEmitter">
            <params>
              <name>fse</name>
              <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/extracts</basePath>
            </params>
          </emitter>
        </emitters>
    3. java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/tika-config-app-fs-to-fs.xml
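
After the warm-up run finishes, a quick sanity check is to compare the number of input files with the number of extracts. This sketch assumes the file system emitter writes one output file per input file and that the extracts directory was empty before the run. Files that fail to parse may make the counts differ. The function name count_files is hypothetical.

```shell
# Hypothetical helper: count regular files under a directory, recursively.
count_files() {
  find "$1" -type f | wc -l | tr -d '[:space:]'
}

# Usage, from the tutorial directory:
#   echo "inputs:   $(count_files docs)"
#   echo "extracts: $(count_files extracts)"
```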

B) OpenSearch/Elasticsearch Parent-Child Example (fileshare to OpenSearch/Elasticsearch)

  1. Start opensearch via Docker:
    1. docker pull opensearchproject/opensearch:1.2.4
    2. docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" opensearchproject/opensearch:1.2.4
  2. Curl schema to opensearch: 

    curl -k -T configs/opensearch/opensearch-parent-child-mappings.json -u admin:admin -H "Content-Type:application/json" https://localhost:9200/tika-test-parent-child

  3. Configure the basePath element in FileSystemPipesIterator and FileSystemFetcher in configs/opensearch/tika-config-fs-to-opensearch-parent-child.xml

  4. java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/opensearch/tika-config-fs-to-opensearch-parent-child.xml

B) Solr Parent-Child Example (fileshare to Solr)

  1. From the solr directory
    1. bin/solr start
    2. bin/solr create -c tika-example-parent-child && bin/solr config -c tika-example-parent-child -p 8983 -action set-user-property -property update.autoCreateFields -value false
  2. From the tika-pipes-tutorial directory

    1. Set the schema in Solr: curl -F 'data=@configs/solr/solr-parent-child-schema.json' http://localhost:8983/solr/tika-example-parent-child/schema

    2. Configure the basePath element in FileSystemPipesIterator and FileSystemFetcher in configs/solr/tika-config-solr-parent-child.xml

      FileSystemPipesIterator
          <fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
            <params>
              <name>fsf</name>
              <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/docs</basePath>
            </params>
          </fetcher>
      ...
        <pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
          <params>
            <fetcherName>fsf</fetcherName>
            <emitterName>solr1</emitterName>
            <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/docs</basePath>
          </params>
        </pipesIterator>
    3. java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/solr/tika-config-solr-parent-child.xml

C) OpenSearch/Elasticsearch Individual Files Example (fileshare to OpenSearch/Elasticsearch)

  1. Start opensearch via Docker:
    1. docker pull opensearchproject/opensearch:1.2.4
    2. docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" opensearchproject/opensearch:1.2.4
  2. Curl schema to opensearch: 

    curl -k -T configs/opensearch/opensearch-indiv-files-mappings.json -u admin:admin -H "Content-Type:application/json" https://localhost:9200/tika-test-indiv-files

  3. Configure the basePath element in FileSystemPipesIterator and FileSystemFetcher in configs/opensearch/tika-config-fs-to-opensearch-indiv-files.xml

  4. java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/opensearch/tika-config-fs-to-opensearch-indiv-files.xml

C) Solr Individual Files Example (fileshare to Solr)

  1. From the solr directory
    1. bin/solr start
    2. bin/solr create -c tika-example-indiv-files && bin/solr config -c tika-example-indiv-files -p 8983 -action set-user-property -property update.autoCreateFields -value false
  2. From the tika-pipes-tutorial directory

    1. Set the schema in Solr: curl -F 'data=@configs/solr/solr-indiv-files-schema.json' http://localhost:8983/solr/tika-example-indiv-files/schema

    2. Configure the basePath element in FileSystemPipesIterator and FileSystemFetcher in configs/solr/tika-config-solr-indiv-files.xml

      FileSystemPipesIterator
          <fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
            <params>
              <name>fsf</name>
              <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/docs</basePath>
            </params>
          </fetcher>
      ...
        <pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
          <params>
            <fetcherName>fsf</fetcherName>
            <emitterName>solr1</emitterName>
            <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/docs</basePath>
          </params>
        </pipesIterator>
    3. java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/solr/tika-config-solr-indiv-files.xml

D) OpenSearch/Elasticsearch Legacy Example (fileshare to OpenSearch/Elasticsearch)

  1. Start opensearch via Docker:
    1. docker pull opensearchproject/opensearch:1.2.4
    2. docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" opensearchproject/opensearch:1.2.4
  2. Curl schema to opensearch: curl -k -T configs/opensearch/opensearch-legacy-mappings.json -u admin:admin -H "Content-Type:application/json" https://localhost:9200/tika-test-legacy


  3. Configure the basePath element in FileSystemPipesIterator and FileSystemFetcher in configs/opensearch/tika-config-fs-to-opensearch-legacy.xml

  4. java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/opensearch/tika-config-fs-to-opensearch-legacy.xml


D) Solr Legacy Example (fileshare to Solr)

  1. From the solr directory
    1. bin/solr start
    2. bin/solr create -c tika-example-legacy && bin/solr config -c tika-example-legacy -p 8983 -action set-user-property -property update.autoCreateFields -value false
  2. From the tika-pipes-tutorial directory

    1. Set the schema in Solr: curl -F 'data=@configs/solr/solr-legacy-schema.json' http://localhost:8983/solr/tika-example-legacy/schema

    2. Configure the basePath element in FileSystemPipesIterator and FileSystemFetcher in configs/solr/tika-config-solr-legacy.xml

      FileSystemPipesIterator
          <fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
            <params>
              <name>fsf</name>
              <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/docs</basePath>
            </params>
          </fetcher>
      ...
        <pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
          <params>
            <fetcherName>fsf</fetcherName>
            <emitterName>solr1</emitterName>
            <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/docs</basePath>
          </params>
        </pipesIterator>
    3. java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/solr/tika-config-solr-legacy.xml
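
Sections B through D repeat the same run command with a different config file for each backend and indexing style. The sketch below maps a backend and a style to the config paths used above, which makes it easier to run several variants in a row. The function name tika_config_for is hypothetical. The paths are the ones listed in the steps above.

```shell
# Hypothetical helper: print the tutorial config path for a backend
# (opensearch or solr) and a style (parent-child, indiv-files or legacy).
tika_config_for() {
  case "$1" in
    opensearch) echo "configs/opensearch/tika-config-fs-to-opensearch-$2.xml" ;;
    solr)       echo "configs/solr/tika-config-solr-$2.xml" ;;
    *)          echo "unknown backend: $1" >&2; return 1 ;;
  esac
}

# Usage, from the tutorial directory, after the schema and basePath steps:
#   for style in parent-child indiv-files legacy; do
#     java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config="$(tika_config_for solr "$style")"
#   done
```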


