
I'm currently working on this, and it should be ready by 11am EST/4pm UTC – an hour before the start

Useful documentation: tika-pipes

Prerequisites:

  1. java >= 8
  2. curl (or postman or something similar)
  3. Unzip tika-pipes-tutorial-20211202.zip to create a working directory
  4. In tika-pipes-tutorial-20211202/app-bin/:
    1. https://dlcdn.apache.org/tika/2.1.0/tika-app-2.1.0.jar
    2. https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-fs/2.1.0/tika-emitter-fs-2.1.0.jar
    3. https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-solr/2.1.0/tika-emitter-solr-2.1.0.jar OR https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-opensearch/2.1.0/tika-emitter-opensearch-2.1.0.jar
    4. tika-core-2.1.1-SNAPSHOT-test-jar-with-dependencies.jar
  5. In tika-pipes-tutorial-20211202/server-bin/:
    1. tika-server-standard jar: https://dlcdn.apache.org/tika/2.1.0/tika-server-standard-2.1.0.jar
    2. https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-fs/2.1.0/tika-emitter-fs-2.1.0.jar
    3. https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-solr/2.1.0/tika-emitter-solr-2.1.0.jar OR https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-opensearch/2.1.0/tika-emitter-opensearch-2.1.0.jar
    4. tika-core-2.1.1-SNAPSHOT-test-jar-with-dependencies.jar
    Unzip configs.zip (to be supplied later today) here: tika-pipes-tutorial/configs
  6. Installation of Apache Solr (~8.9.x) and/or OpenSearch (~1.x) and/or Elasticsearch (7.x)
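
The downloads above can also be scripted. The following is a minimal sketch, not part of the tutorial bundle: the function names `jar_urls` and `download_all` are made up for illustration, and the URL patterns are simply the ones listed in the prerequisites. The tika-core test jar has no URL in the list, so it is left out and still has to be copied by hand. If a URL returns a 404, the release may have moved to the Apache archive.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URLs copied from the prerequisites list.
DIST = "https://dlcdn.apache.org/tika"
MAVEN = "https://repo1.maven.org/maven2/org/apache/tika"


def jar_urls(version, emitters=("fs", "solr", "opensearch")):
    """Return {target_dir: [urls]} for the jars listed in the prerequisites."""
    emitter_urls = [
        f"{MAVEN}/tika-emitter-{name}/{version}/tika-emitter-{name}-{version}.jar"
        for name in emitters
    ]
    return {
        "app-bin": [f"{DIST}/{version}/tika-app-{version}.jar"] + emitter_urls,
        "server-bin": [f"{DIST}/{version}/tika-server-standard-{version}.jar"]
        + emitter_urls,
    }


def download_all(root, version):
    """Download every jar into root/app-bin and root/server-bin (needs network)."""
    for subdir, urls in jar_urls(version).items():
        target = Path(root) / subdir
        target.mkdir(parents=True, exist_ok=True)
        for url in urls:
            # Keep the file name from the last path segment of the URL.
            urlretrieve(url, str(target / url.rsplit("/", 1)[-1]))
```

Calling `download_all("tika-pipes-tutorial-20211202", "2.1.0")` would fetch both the Solr and the OpenSearch emitter; pass a shorter `emitters` tuple to `jar_urls` if only one is needed.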

Advanced/Optional:

  1. jq or similar

Exercises

  1. Use fetcher in traditional /tika /rmeta endpoints
    1. update the <basePath> element in configs/tika-config-basic.xml to the full path to tika-pipes-tutorial-20211202/docs:

      Code Block
      languagexml
      titleFileSystemFetcher
      collapsetrue
        <fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
          <params>
            <name>fsf</name>
            <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20211202/docs</basePath>
          </params>
        </fetcher>


    2. start the server: java -cp "server-bin/*" org.apache.tika.server.core.TikaServerCli -c configs/tika-config-basic.xml
    3. curl -X PUT http://localhost:9998/rmeta -H "fetcherName:fsf" -H "fetchKey:testPDF.pdf" | jq --sort-keys

  2. Use /pipes handler to read from and write to a local file share
    1. update the emitter's <basePath> element in configs/tika-config-basic.xml to the full path to tika-pipes-tutorial-20211202/extracts:

      Code Block
      languagexml
      titleFileSystemEmitter
      collapsetrue
        <emitters>
          <emitter class="org.apache.tika.pipes.emitter.fs.FileSystemEmitter">
            <params>
              <name>fse</name>
              <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20211202/extracts</basePath>
            </params>
          </emitter>
        </emitters>


    2. start the server: java -cp "server-bin/*" org.apache.tika.server.core.TikaServerCli -c configs/tika-config-basic.xml
    3. curl -X POST -H "Content-Type: application/json" -d @configs/pipes-request-minimal.json http://localhost:9998/pipes

  3. Configure metadata filters and rerun 2.
    1. Copy this and paste it into configs/tika-config-basic.xml

      Code Block
      languagexml
      titleMetadata Filters
      collapsetrue
      <metadataFilters>
        <!-- depending on the file format, some dates do not have a timezone. This
               filter arbitrarily assumes dates have a UTC timezone and will format all
               dates as yyyy-MM-dd'T'HH:mm:ss'Z' whether or not they actually have a timezone.
          -->
        <metadataFilter class="org.apache.tika.metadata.filter.DateNormalizingMetadataFilter"/>
        <metadataFilter class="org.apache.tika.metadata.filter.FieldNameMappingFilter">
          <params>
            <excludeUnmapped>true</excludeUnmapped>
            <mappings>
              <mapping from="X-TIKA:content" to="content_s"/>
              <mapping from="Content-Length" to="length_i"/>
              <mapping from="dc:creator" to="creators_ss"/>
              <mapping from="dc:title" to="title_s"/>
              <mapping from="Content-Type" to="mime_s"/>
              <mapping from="X-TIKA:EXCEPTION:container_exception" to="tika_exception_s"/>
            </mappings>
          </params>
        </metadataFilter>
      </metadataFilters>


    2. Restart the server
    3. Rerun the curl command and look at the output (cat extracts/testPDF.pdf.json | jq --sort-keys)
  4. Use /async handler file share to file share
    1. curl -X POST -H "Content-Type: application/json" -d @configs/async-request-minimal.json http://localhost:9998/async

    2. curl -X POST -H "Content-Type: application/json" -d @configs/async-request-full.json http://localhost:9998/async

  5. Run the async processor via tika-app
    1. Configure the basePath element in FileSystemFetcher and FileSystemPipesIterator in configs/tika-config-app.xml

      Code Block
      languagexml
      titleFileSystemPipesIterator
      collapsetrue
          <fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
            <params>
              <name>fsf</name>
              <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20211202/docs</basePath>
            </params>
          </fetcher> 
        <pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
          <params>
            <fetcherName>fsf</fetcherName>
            <emitterName>fse</emitterName>
            <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20211202/docs</basePath>
          </params>
        </pipesIterator>


    2. java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/tika-config-app.xml

  6. Configure Solr/OpenSearch/Elasticsearch emitter and run /pipes handler
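
To make the effect of the FieldNameMappingFilter settings from exercise 3 concrete, here is a small illustration in plain Python. It is not Tika code and only mimics the renaming behaviour as configured there: mapped fields are renamed, and with excludeUnmapped set to true everything else is dropped. The function name `map_field_names` and the sample record are hypothetical.

```python
# Illustration only: mimics the effect of the FieldNameMappingFilter
# configuration from exercise 3 on a single metadata record.
MAPPINGS = {
    "X-TIKA:content": "content_s",
    "Content-Length": "length_i",
    "dc:creator": "creators_ss",
    "dc:title": "title_s",
    "Content-Type": "mime_s",
    "X-TIKA:EXCEPTION:container_exception": "tika_exception_s",
}


def map_field_names(metadata, mappings=MAPPINGS, exclude_unmapped=True):
    """Rename mapped keys; drop unmapped keys when exclude_unmapped is True."""
    result = {}
    for key, value in metadata.items():
        if key in mappings:
            result[mappings[key]] = value
        elif not exclude_unmapped:
            result[key] = value
    return result


# Hypothetical record; real extracts contain many more fields.
sample = {
    "Content-Type": "application/pdf",
    "dc:title": "Example title",
    "pdf:PDFVersion": "1.4",
}
print(map_field_names(sample))
```

In this sample, `pdf:PDFVersion` has no mapping, so it disappears from the result; this is the kind of difference to look for when comparing extracts/testPDF.pdf.json before and after adding the filters.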


Solr Example (fileshare to Solr)

  1. From the solr directory
    1. bin/solr start
    2. bin/solr create -c tika-example && bin/solr config -c tika-example -p 8983 -action set-user-property -property update.autoCreateFields -value false
  2. From the tika-pipes-tutorial directory

    1. Set the schema in Solr: curl -F 'data=@configs/solr/solr-parent-child-schema.json' http://localhost:8983/solr/tika-example/schema

    2. Configure the basePath element in FileSystemPipesIterator and FileSystemFetcher in configs/solr/tika-config-solr.xml

      Code Block
      languagexml
      titleFileSystemPipesIterator
      collapsetrue
          <fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
            <params>
              <name>fsf</name>
              <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20211202/docs</basePath>
            </params>
          </fetcher>
      ...
        <pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
          <params>
            <fetcherName>fsf</fetcherName>
            <emitterName>solr1</emitterName>
            <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20211202/docs</basePath>
          </params>
        </pipesIterator>


    3. java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/solr/tika-config-solr.xml

OpenSearch/Elasticsearch Example (fileshare to OpenSearch/Elasticsearch)

  1. Start opensearch via Docker:
    1. docker pull opensearchproject/opensearch:1.2.0
    2. docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" opensearchproject/opensearch:1.2.0
  2. Curl schema to opensearch: 

    curl -k -T configs/opensearch/opensearch-parent-child-mappings.json -u admin:admin -H "Content-Type:application/json" https://localhost:9200/tika-test

  3. Configure the basePath element in FileSystemPipesIterator and FileSystemFetcher in configs/opensearch/tika-config-opensearch.xml

    Code Block
    languagexml
    titleFileSystemPipesIterator
    collapsetrue
        <fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
          <params>
            <name>fsf</name>
            <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20211202/docs</basePath>
          </params>
        </fetcher>
    ....
      <pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
        <params>
          <fetcherName>fsf</fetcherName>
          <emitterName>ose</emitterName>
          <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20211202/docs</basePath>
        </params>
      </pipesIterator>


Helpful commands

  1. delete a collection in solr: 

    bin/solr delete -c collection_name

  2. Show all docs in OpenSearch/Elasticsearch: https://localhost:9200/tika-test/_search?pretty=true&q=*:*
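
The search URL above can also be queried from a script. The sketch below assumes the demo setup used in this tutorial: the admin:admin credentials and the self-signed certificate that the curl commands handle with `-u admin:admin` and `-k`. The function names are made up for illustration, and skipping certificate verification is only reasonable against a local test container.

```python
import base64
import json
import ssl
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_search_request(index="tika-test", user="admin", password="admin",
                         base_url="https://localhost:9200"):
    """Build a GET request for <base_url>/<index>/_search?pretty=true&q=*:*."""
    # Keep '*' and ':' unescaped so the URL matches the one shown above.
    query = urlencode({"pretty": "true", "q": "*:*"}, safe="*:")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return Request(f"{base_url}/{index}/_search?{query}",
                   headers={"Authorization": f"Basic {token}"})


def search_all(request):
    """Send the request without certificate checks, like `curl -k` (local demo only)."""
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with urlopen(request, context=context) as response:
        return json.load(response)
```

With the container running, `search_all(build_search_request())` returns the parsed search response; pass a different `index` for the other index names used in this tutorial.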


3) Hands-on tika-pipes module workshop

January 24, 2022, Monday 11am EST/4pm UTC

The dial-in information is available to those who register via Meetup.

I'm currently working on this, and it should be ready by 10:30am EST/3:30pm UTC, a half hour before the start

Useful documentation: tika-pipes

Prerequisites:

  1. java >= 8
  2. curl (or postman or something similar)
  3. Unpack tika-pipes-tutorial-20220124.tgz
  4. In tika-pipes-tutorial-20220124/app-bin/:
    1. https://dlcdn.apache.org/tika/2.2.1/tika-app-2.2.1.jar
    2. https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-fs/2.2.1/tika-emitter-fs-2.2.1.jar
    3. https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-solr/2.2.1/tika-emitter-solr-2.2.1.jar OR https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-opensearch/2.2.1/tika-emitter-opensearch-2.2.1.jar
    4. https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/437/org.apache.tika$tika-core/artifact/org.apache.tika/tika-core/2.2.2-20220124.115541-55/tika-core-2.2.2-20220124.115541-55-test-jar-with-dependencies.jar
  5. Optional: In tika-pipes-tutorial-20220124/server-bin/:
    1. tika-server-standard jar: https://dlcdn.apache.org/tika/2.2.1/tika-server-standard-2.2.1.jar
    2. https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-fs/2.2.1/tika-emitter-fs-2.2.1.jar
    3. https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-solr/2.2.1/tika-emitter-solr-2.2.1.jar OR https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-opensearch/2.2.1/tika-emitter-opensearch-2.2.1.jar
    4. https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/437/org.apache.tika$tika-core/artifact/org.apache.tika/tika-core/2.2.2-20220124.115541-55/tika-core-2.2.2-20220124.115541-55-test-jar-with-dependencies.jar
  6. Installation of Apache Solr (~8.9.x) and/or OpenSearch (~1.x) and/or Elasticsearch (7.x)

A) Fileshare to Fileshare warm up

  1. Run the async processor via tika-app
    1. Configure the basePath element in FileSystemFetcher and FileSystemPipesIterator in configs/tika-config-app-fs-to-fs.xml

      Code Block
      languagexml
      titleFileSystemPipesIterator
      collapsetrue
          <fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
            <params>
              <name>fsf</name>
              <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/docs</basePath>
            </params>
          </fetcher> 
        <pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
          <params>
            <fetcherName>fsf</fetcherName>
            <emitterName>fse</emitterName>
            <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/docs</basePath>
          </params>
        </pipesIterator>


    2. Configure the basePath element in FileSystemEmitter in configs/tika-config-app-fs-to-fs.xml

      Code Block
      languagexml
      titleFileSystemEmitter
      collapsetrue
        <emitters>
          <emitter class="org.apache.tika.pipes.emitter.fs.FileSystemEmitter">
            <params>
              <name>fse</name>
              <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/extracts</basePath>
            </params>
          </emitter>
        </emitters>


    3. java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/tika-config-app-fs-to-fs.xml
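
After the run finishes, it can be useful to check that every input file produced an extract. The sketch below assumes that the emitter writes one file per input, named after the input's path relative to docs with ".json" appended, as with extracts/testPDF.pdf.json in the earlier workshop. That naming is an assumption about this configuration, and `missing_extracts` is a hypothetical helper, not part of Tika.

```python
from pathlib import Path


def missing_extracts(docs_dir, extracts_dir, suffix=".json"):
    """List input files (relative paths) that have no matching extract file."""
    docs = Path(docs_dir)
    extracts = Path(extracts_dir)
    missing = []
    for doc in sorted(p for p in docs.rglob("*") if p.is_file()):
        rel = doc.relative_to(docs).as_posix()
        # Assumed naming: <relative input path> + ".json" under extracts/.
        if not (extracts / (rel + suffix)).is_file():
            missing.append(rel)
    return missing
```

For example, `missing_extracts("docs", "extracts")`, run from the tutorial directory, returns an empty list when every document has an extract. Files that do show up are worth checking in the logs.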

B) OpenSearch/Elasticsearch Parent-Child Example (fileshare to OpenSearch/Elasticsearch)

  1. Start opensearch via Docker:
    1. docker pull opensearchproject/opensearch:1.2.4
    2. docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" opensearchproject/opensearch:1.2.4
  2. Curl schema to opensearch: 

    curl -k -T configs/opensearch/opensearch-parent-child-mappings.json -u admin:admin -H "Content-Type:application/json" https://localhost:9200/tika-test-parent-child

  3. Configure the basePath element in FileSystemPipesIterator and FileSystemFetcher in configs/opensearch/tika-config-fs-to-opensearch-parent-child.xml

  4. java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/opensearch/tika-config-fs-to-opensearch-parent-child.xml

B) Solr Parent-Child Example (fileshare to Solr)

  1. From the solr directory
    1. bin/solr start
    2. bin/solr create -c tika-example-parent-child && bin/solr config -c tika-example-parent-child -p 8983 -action set-user-property -property update.autoCreateFields -value false
  2. From the tika-pipes-tutorial directory

    1. Set the schema in Solr: curl -F 'data=@configs/solr/solr-parent-child-schema.json' http://localhost:8983/solr/tika-example-parent-child/schema

    2. Configure the basePath element in FileSystemPipesIterator and FileSystemFetcher in configs/solr/tika-config-solr-parent-child.xml

      Code Block
      languagexml
      titleFileSystemPipesIterator
      collapsetrue
          <fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
            <params>
              <name>fsf</name>
              <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/docs</basePath>
            </params>
          </fetcher>
      ...
        <pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
          <params>
            <fetcherName>fsf</fetcherName>
            <emitterName>solr1</emitterName>
            <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/docs</basePath>
          </params>
        </pipesIterator>


    3. java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/solr/tika-config-solr-parent-child.xml

C) OpenSearch/Elasticsearch Individual Files Example (fileshare to OpenSearch/Elasticsearch)

  1. Start opensearch via Docker:
    1. docker pull opensearchproject/opensearch:1.2.4
    2. docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" opensearchproject/opensearch:1.2.4
  2. Curl schema to opensearch: 

    curl -k -T configs/opensearch/opensearch-indiv-files-mappings.json -u admin:admin -H "Content-Type:application/json" https://localhost:9200/tika-test-indiv-files

  3. Configure the basePath element in FileSystemPipesIterator and FileSystemFetcher in configs/opensearch/tika-config-fs-to-opensearch-indiv-files.xml

  4. java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/opensearch/tika-config-fs-to-opensearch-indiv-files.xml

C) Solr Indiv Files Example (fileshare to Solr)

  1. From the solr directory
    1. bin/solr start
    2. bin/solr create -c tika-example-indiv-files && bin/solr config -c tika-example-indiv-files -p 8983 -action set-user-property -property update.autoCreateFields -value false
  2. From the tika-pipes-tutorial directory

    1. Set the schema in Solr: curl -F 'data=@configs/solr/solr-indiv-files-schema.json' http://localhost:8983/solr/tika-example-indiv-files/schema

    2. Configure the basePath element in FileSystemPipesIterator and FileSystemFetcher in configs/solr/tika-config-solr-indiv-files.xml

      Code Block
      languagexml
      titleFileSystemPipesIterator
      collapsetrue
          <fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
            <params>
              <name>fsf</name>
              <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/docs</basePath>
            </params>
          </fetcher>
      ...
        <pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
          <params>
            <fetcherName>fsf</fetcherName>
            <emitterName>solr1</emitterName>
            <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/docs</basePath>
          </params>
        </pipesIterator>


    3. java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/solr/tika-config-solr-indiv-files.xml

D) OpenSearch/Elasticsearch Legacy Example (fileshare to OpenSearch/Elasticsearch)

  1. Start opensearch via Docker:
    1. docker pull opensearchproject/opensearch:1.2.4
    2. docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" opensearchproject/opensearch:1.2.4
  2. Curl schema to opensearch: curl -k -T configs/opensearch/opensearch-legacy-mappings.json -u admin:admin -H "Content-Type:application/json" https://localhost:9200/tika-test-legacy


  3. Configure the basePath element in FileSystemPipesIterator and FileSystemFetcher in configs/opensearch/tika-config-fs-to-opensearch-legacy.xml

  4. java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/opensearch/tika-config-fs-to-opensearch-legacy.xml


D) Solr Legacy Example (fileshare to Solr)

  1. From the solr directory
    1. bin/solr start
    2. bin/solr create -c tika-example-legacy && bin/solr config -c tika-example-legacy -p 8983 -action set-user-property -property update.autoCreateFields -value false
  2. From the tika-pipes-tutorial directory

    1. Set the schema in Solr: curl -F 'data=@configs/solr/solr-legacy-schema.json' http://localhost:8983/solr/tika-example-legacy/schema

    2. Configure the basePath element in FileSystemPipesIterator and FileSystemFetcher in configs/solr/tika-config-solr-legacy.xml

      Code Block
      languagexml
      titleFileSystemPipesIterator
      collapsetrue
          <fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
            <params>
              <name>fsf</name>
              <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/docs</basePath>
            </params>
          </fetcher>
      ...
        <pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
          <params>
            <fetcherName>fsf</fetcherName>
            <emitterName>solr1</emitterName>
            <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/docs</basePath>
          </params>
        </pipesIterator>


    3. java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/solr/tika-config-solr-legacy.xml