1) Hands-on tika-eval module workshop, Part 1
November 9, 2021, Tuesday 11am EST/4pm UTC
...
Stay tuned for prerequisites, resources and an agenda!
The following is all a work in progress. Please check back right before the workshop!
Prerequisites:
- java >= 8
- tika-eval app and tika-app jars: https://dlcdn.apache.org/tika/2.1.0/tika-eval-app-2.1.0.jar and https://dlcdn.apache.org/tika/2.1.0/tika-app-2.1.0.jar
- JSON editor/viewer (jq should be sufficient. I like Sublime with the PrettyJSON plugin https://github.com/dzhibas/SublimePrettyJson)
- XLSX viewer (Excel or Open/LibreOffice)
Optional materials:
- tika-server-standard jar: https://dlcdn.apache.org/tika/2.1.0/tika-server-standard-2.1.0.jar
- tika-eval-core.jar: https://repo1.maven.org/maven2/org/apache/tika/tika-eval-core/2.1.0/tika-eval-core-2.1.0.jar
- If you'd like to experiment with tesseract, make sure that tesseract is installed and callable as 'tesseract' from your commandline.
Example input files (this file will be updated up to the start of the class): tika-eval-workshop-docs.tgz
...
- Some knowledge of SQL
Example docs, extracts and config files: tika-eval-workshop-20211109.tgz
Before the class, you should unpack tika-eval-workshop-20211109.tgz (tar -xzvf tika-eval-workshop-20211109.tgz), move tika-app-2.1.0.jar into the tika-eval-workshop-20211109/ folder and run tika-app on the docs directory:
java -jar tika-app-2.1.0.jar -J -t -i docs -o extracts/my_extracts
Note: There's a bug in the default logging configuration for tika-app in batch mode (e.g. "No configuration found for '4b85612c' at 'null' in 'null'..."). This is fixed in the latest tika-app and will be available in the next release, 2.1.1.
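Once the batch run above finishes, it can help to confirm that every input file produced an extract. The following is a minimal sketch, not part of Tika: `missing_extracts` is a hypothetical helper, and it assumes that batch mode writes one `<relative path>.json` file per input under the output directory.

```shell
# Hypothetical helper (not part of Tika): list input files that have no
# matching "<relative path>.json" under the extracts directory.
# Assumes extracts mirror the input layout with ".json" appended.
missing_extracts() {
    docs_dir=$1
    extracts_dir=$2
    (cd "$docs_dir" && find . -type f) | sed 's|^\./||' | sort |
    while IFS= read -r rel; do
        [ -f "$extracts_dir/$rel.json" ] || printf '%s\n' "$rel"
    done
}

# Usage from the tika-eval-workshop-20211109/ folder:
#   missing_extracts docs extracts/my_extracts
```

An empty result suggests every document has an extract. Any path that is printed is worth checking by hand.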
2) Hands-on tika-pipes module workshop
December 2, 2021, Thursday 12pm (NOON) EST/5pm UTC
The dial-in information is available to those who register via Meetup.
I'm currently working on this, and it should be ready by 11am EST/4pm UTC – an hour before the start
Useful documentation: tika-pipes
Prerequisites:
- java >= 8
- curl (or postman or something similar)
- Unzip tika-pipes-tutorial-20211202.zip
- In tika-pipes-tutorial-20211202/app-bin/:
  - https://dlcdn.apache.org/tika/2.1.0/tika-app-2.1.0.jar
  - https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-fs/2.1.0/tika-emitter-fs-2.1.0.jar
  - https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-solr/2.1.0/tika-emitter-solr-2.1.0.jar OR https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-opensearch/2.1.0/tika-emitter-opensearch-2.1.0.jar
  - tika-core-2.1.1-SNAPSHOT-test-jar-with-dependencies.jar
- In tika-pipes-tutorial-20211202/server-bin/:
  - tika-server-standard jar: https://dlcdn.apache.org/tika/2.1.0/tika-server-standard-2.1.0.jar
  - https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-fs/2.1.0/tika-emitter-fs-2.1.0.jar
  - https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-solr/2.1.0/tika-emitter-solr-2.1.0.jar OR https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-opensearch/2.1.0/tika-emitter-opensearch-2.1.0.jar
  - tika-core-2.1.1-SNAPSHOT-test-jar-with-dependencies.jar
- Installation of Apache Solr (~8.9.x) and/or OpenSearch (~1.x) and/or Elasticsearch (7.x)
Advanced/Optional: jq or similar
Exercises
- Use fetcher in traditional /tika and /rmeta endpoints
Update the <basePath> element in configs/tika-config-basic.xml to the full path to tika-pipes-tutorial-20211202/docs:
FileSystemFetcher:
<fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
  <params>
    <name>fsf</name>
    <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20211202/docs</basePath>
  </params>
</fetcher>
- Start the server:
java -cp "server-bin/*" org.apache.tika.server.core.TikaServerCli -c configs/tika-config-basic.xml
curl -X PUT http://localhost:9998/rmeta -H "fetcherName:fsf" -H "fetchKey:testPDF.pdf" | jq --sort-keys
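Editing each <basePath> by hand is easy to get wrong, so the path update can also be scripted. This is a minimal sketch under stated assumptions: `set_tutorial_root` is a hypothetical helper, each <basePath> element sits on its own line in the config file, and the new root contains no `|` or `&` characters.

```shell
# Hypothetical helper (not part of Tika): point every <basePath> in a config
# at a new tutorial root, keeping the last path component (docs, extracts).
# Assumes one <basePath> element per line and a root without '|' or '&'.
set_tutorial_root() {
    config=$1
    root=$2
    tmp_file=$(mktemp) || return 1
    sed "s|<basePath>.*/\([^/<]*\)</basePath>|<basePath>$root/\1</basePath>|" \
        "$config" > "$tmp_file" &&
        cat "$tmp_file" > "$config" &&
        rm -f "$tmp_file"
}

# Usage from the tika-pipes-tutorial-20211202/ folder:
#   set_tutorial_root configs/tika-config-basic.xml "$(pwd)"
```

The server reads the config at startup, so it needs a restart after any change.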
- Use /pipes handler to read from and write to a local file share
Update the emitter's <basePath> element in configs/tika-config-basic.xml to the full path to tika-pipes-tutorial-20211202/extracts:
FileSystemEmitter:
<emitters>
  <emitter class="org.apache.tika.pipes.emitter.fs.FileSystemEmitter">
    <params>
      <name>fse</name>
      <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20211202/extracts</basePath>
    </params>
  </emitter>
</emitters>
- Start the server:
java -cp "server-bin/*" org.apache.tika.server.core.TikaServerCli -c configs/tika-config-basic.xml
curl -X POST -H "Content-Type: application/json" -d @configs/pipes-request-minimal.json http://localhost:9998/pipes
- Configure metadata filters and rerun the /pipes exercise above
Copy this and paste it into configs/tika-config-basic.xml
Metadata Filters:
<metadataFilters>
  <!-- depending on the file format, some dates do not have a timezone. This
       filter arbitrarily assumes dates have a UTC timezone and will format all
       dates as yyyy-MM-dd'T'HH:mm:ss'Z' whether or not they actually have a
       timezone. -->
  <metadataFilter class="org.apache.tika.metadata.filter.DateNormalizingMetadataFilter"/>
  <metadataFilter class="org.apache.tika.metadata.filter.FieldNameMappingFilter">
    <params>
      <excludeUnmapped>true</excludeUnmapped>
      <mappings>
        <mapping from="X-TIKA:content" to="content_s"/>
        <mapping from="Content-Length" to="length_i"/>
        <mapping from="dc:creator" to="creators_ss"/>
        <mapping from="dc:title" to="title_s"/>
        <mapping from="Content-Type" to="mime_s"/>
        <mapping from="X-TIKA:EXCEPTION:container_exception" to="tika_exception_s"/>
      </mappings>
    </params>
  </metadataFilter>
</metadataFilters>
- Restart the server
- Rerun the curl command and look at the output (cat extracts/testPDF.pdf.json | jq --sort-keys)
- Use /async handler, file share to file share
curl -X POST -H "Content-Type: application/json" -d @configs/async-request-minimal.json http://localhost:9998/async
curl -X POST -H "Content-Type: application/json" -d @configs/async-request-full.json http://localhost:9998/async
- Run the async processor via tika-app
Configure the basePath element in FileSystemFetcher and FileSystemPipesIterator in configs/tika-config-app.xml
FileSystemPipesIterator:
<fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
  <params>
    <name>fsf</name>
    <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20211202/docs</basePath>
  </params>
</fetcher>
<pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
  <params>
    <fetcherName>fsf</fetcherName>
    <emitterName>fse</emitterName>
    <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20211202/docs</basePath>
  </params>
</pipesIterator>
java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/tika-config-app.xml
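The pipesIterator refers to its fetcher and emitter by name (fsf and fse above), so a typo in either name is an easy mistake to make. Here is a rough sketch of a pre-flight check. `check_pipes_names` is a hypothetical helper, not a Tika feature, and it assumes each element sits on its own line, as in the tutorial configs.

```shell
# Hypothetical helper (not part of Tika): check that the fetcherName and
# emitterName referenced by the pipesIterator are defined as <name> entries.
# Assumes each element sits on its own line, as in the tutorial configs.
check_pipes_names() {
    config=$1
    status=0
    for tag in fetcherName emitterName; do
        value=$(sed -n "s|.*<$tag>\(.*\)</$tag>.*|\1|p" "$config" | head -n 1)
        if [ -z "$value" ]; then
            echo "no <$tag> found in $config"
            status=1
        elif ! grep -qF "<name>$value</name>" "$config"; then
            echo "<$tag>$value</$tag> has no matching <name>$value</name>"
            status=1
        fi
    done
    return $status
}

# Usage from the tutorial folder:
#   check_pipes_names configs/tika-config-app.xml
```

The check is purely textual and does not validate the XML. It only confirms that each referenced name appears somewhere in the same file.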
Configure Solr/OpenSearch/ElasticSearch emitter and run /pipes handler
Solr Example (fileshare to Solr)
- From the solr directory
bin/solr start
bin/solr create -c tika-example && bin/solr config -c tika-example -p 8983 -action set-user-property -property update.autoCreateFields -value false
From the tika-pipes-tutorial directory
Set the schema in Solr:
curl -F 'data=@configs/solr/solr-parent-child-schema.json' http://localhost:8983/solr/tika-example/schema
Configure the basePath element in FileSystemPipesIterator and FileSystemFetcher in configs/solr/tika-config-solr.xml
FileSystemPipesIterator:
<fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
  <params>
    <name>fsf</name>
    <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20211202/docs</basePath>
  </params>
</fetcher>
...
<pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
  <params>
    <fetcherName>fsf</fetcherName>
    <emitterName>solr1</emitterName>
    <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20211202/docs</basePath>
  </params>
</pipesIterator>
java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/solr/tika-config-solr.xml
OpenSearch/Elasticsearch Example (fileshare to OpenSearch/ElasticSearch)
- Start opensearch via Docker:
- docker pull opensearchproject/opensearch:1.2.0
- docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" opensearchproject/opensearch:1.2.0
- Curl schema to opensearch:
curl -k -T configs/opensearch/opensearch-parent-child-mappings.json -u admin:admin -H "Content-Type:application/json" https://localhost:9200/tika-test
Configure the basePath element in FileSystemPipesIterator and FileSystemFetcher in configs/opensearch/tika-config-opensearch.xml
FileSystemPipesIterator:
<fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
  <params>
    <name>fsf</name>
    <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20211202/docs</basePath>
  </params>
</fetcher>
....
<pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
  <params>
    <fetcherName>fsf</fetcherName>
    <emitterName>ose</emitterName>
    <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20211202/docs</basePath>
  </params>
</pipesIterator>
Helpful commands
- delete a collection in solr:
bin/solr delete -c collection_name
- Show all docs in OpenSearch/Elasticsearch:
https://localhost:9200/tika-test/_search?pretty=true&q=*:*
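The "show all docs" queries can also be run from the command line. This is a small sketch: the two URL-building functions are hypothetical helpers, the `-k` and `-u admin:admin` options are reused from the schema step above, and the Solr URL assumes the standard `/select` handler on the default port.

```shell
# Hypothetical helpers: build the "show all docs" URLs used in this tutorial.
opensearch_search_url() {
    # $1 = index name, e.g. tika-test
    printf 'https://localhost:9200/%s/_search?pretty=true&q=*:*\n' "$1"
}

solr_select_url() {
    # $1 = collection name, e.g. tika-example
    printf 'http://localhost:8983/solr/%s/select?q=*:*\n' "$1"
}

# Usage (needs the servers from the examples above to be running):
#   curl -k -u admin:admin "$(opensearch_search_url tika-test)"
#   curl "$(solr_select_url tika-example)"
```

Quoting the URL matters here, since an unquoted `&` or `*` would be interpreted by the shell.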
3) Hands-on tika-pipes module workshop
January 24, 2022, Monday 11am EST/4pm UTC
The dial-in information is available to those who register via Meetup.
I'm currently working on this, and it should be ready by 10:30am EST/3:30pm UTC, a half hour before the start
Useful documentation: tika-pipes
Prerequisites:
- java >= 8
- curl (or postman or something similar)
- Unzip tika-pipes-tutorial-20220124.tgz
- In tika-pipes-tutorial-20220124/app-bin/:
  - https://dlcdn.apache.org/tika/2.2.1/tika-app-2.2.1.jar
  - https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-fs/2.2.1/tika-emitter-fs-2.2.1.jar
  - https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-solr/2.2.1/tika-emitter-solr-2.2.1.jar OR https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-opensearch/2.2.1/tika-emitter-opensearch-2.2.1.jar
  - https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/437/org.apache.tika$tika-core/artifact/org.apache.tika/tika-core/2.2.2-20220124.115541-55/tika-core-2.2.2-20220124.115541-55-test-jar-with-dependencies.jar
- Optional: In tika-pipes-tutorial-20220124/server-bin/:
  - tika-server-standard jar: https://dlcdn.apache.org/tika/2.2.1/tika-server-standard-2.2.1.jar
  - https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-fs/2.2.1/tika-emitter-fs-2.2.1.jar
  - https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-solr/2.2.1/tika-emitter-solr-2.2.1.jar OR https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-opensearch/2.2.1/tika-emitter-opensearch-2.2.1.jar
  - https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/437/org.apache.tika$tika-core/artifact/org.apache.tika/tika-core/2.2.2-20220124.115541-55/tika-core-2.2.2-20220124.115541-55-test-jar-with-dependencies.jar
- Installation of Apache Solr (~8.9.x) and/or OpenSearch (~1.x) and/or Elasticsearch (7.x)
A) Fileshare to Fileshare warm up
- Run the async processor via tika-app
Configure the basePath element in FileSystemFetcher and FileSystemPipesIterator in configs/tika-config-app-fs-to-fs.xml
FileSystemPipesIterator:
<fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
  <params>
    <name>fsf</name>
    <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/docs</basePath>
  </params>
</fetcher>
<pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
  <params>
    <fetcherName>fsf</fetcherName>
    <emitterName>fse</emitterName>
    <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/docs</basePath>
  </params>
</pipesIterator>
Configure the basePath element in FileSystemEmitter in configs/tika-config-app-fs-to-fs.xml
FileSystemEmitter:
<emitters>
  <emitter class="org.apache.tika.pipes.emitter.fs.FileSystemEmitter">
    <params>
      <name>fse</name>
      <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/extracts</basePath>
    </params>
  </emitter>
</emitters>
java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/tika-config-app-fs-to-fs.xml
B) OpenSearch/Elasticsearch Parent-Child Example (fileshare to OpenSearch/ElasticSearch)
- Start opensearch via Docker:
- docker pull opensearchproject/opensearch:1.2.4
- docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" opensearchproject/opensearch:1.2.4
- Curl schema to opensearch:
curl -k -T configs/opensearch/opensearch-parent-child-mappings.json -u admin:admin -H "Content-Type:application/json" https://localhost:9200/tika-test-parent-child
Configure the basePath element in FileSystemPipesIterator and FileSystemFetcher in configs/opensearch/tika-config-fs-to-opensearch-parent-child.xml
java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/opensearch/tika-config-fs-to-opensearch-parent-child.xml
B) Solr Parent-Child Example (fileshare to Solr)
- From the solr directory
bin/solr start
bin/solr create -c tika-example-parent-child && bin/solr config -c tika-example-parent-child -p 8983 -action set-user-property -property update.autoCreateFields -value false
From the tika-pipes-tutorial directory
Set the schema in Solr:
curl -F 'data=@configs/solr/solr-parent-child-schema.json' http://localhost:8983/solr/tika-example-parent-child/schema
Configure the basePath element in FileSystemPipesIterator and FileSystemFetcher in configs/solr/tika-config-solr-parent-child.xml
FileSystemPipesIterator:
<fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
  <params>
    <name>fsf</name>
    <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/docs</basePath>
  </params>
</fetcher>
...
<pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
  <params>
    <fetcherName>fsf</fetcherName>
    <emitterName>solr1</emitterName>
    <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/docs</basePath>
  </params>
</pipesIterator>
java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/solr/tika-config-solr-parent-child.xml
C) OpenSearch/Elasticsearch Individual Files Example (fileshare to OpenSearch/ElasticSearch)
- Start opensearch via Docker:
- docker pull opensearchproject/opensearch:1.2.4
- docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" opensearchproject/opensearch:1.2.4
- Curl schema to opensearch:
curl -k -T configs/opensearch/opensearch-indiv-files-mappings.json -u admin:admin -H "Content-Type:application/json" https://localhost:9200/tika-test-indiv-files
Configure the basePath element in FileSystemPipesIterator and FileSystemFetcher in configs/opensearch/tika-config-fs-to-opensearch-indiv-files.xml
java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/opensearch/tika-config-fs-to-opensearch-indiv-files.xml
C) Solr Indiv Files Example (fileshare to Solr)
- From the solr directory
bin/solr start
bin/solr create -c tika-example-indiv-files && bin/solr config -c tika-example-indiv-files -p 8983 -action set-user-property -property update.autoCreateFields -value false
From the tika-pipes-tutorial directory
Set the schema in Solr:
curl -F 'data=@configs/solr/solr-indiv-files-schema.json' http://localhost:8983/solr/tika-example-indiv-files/schema
Configure the basePath element in FileSystemPipesIterator and FileSystemFetcher in configs/solr/tika-config-solr-indiv-files.xml
FileSystemPipesIterator:
<fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
  <params>
    <name>fsf</name>
    <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/docs</basePath>
  </params>
</fetcher>
...
<pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
  <params>
    <fetcherName>fsf</fetcherName>
    <emitterName>solr1</emitterName>
    <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/docs</basePath>
  </params>
</pipesIterator>
java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/solr/tika-config-solr-indiv-files.xml
D) OpenSearch/Elasticsearch Legacy Example (fileshare to OpenSearch/ElasticSearch)
- Start opensearch via Docker:
- docker pull opensearchproject/opensearch:1.2.4
- docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" opensearchproject/opensearch:1.2.4
- Curl schema to opensearch:
curl -k -T configs/opensearch/opensearch-legacy-mappings.json -u admin:admin -H "Content-Type:application/json" https://localhost:9200/tika-test-legacy
Configure the basePath element in FileSystemPipesIterator and FileSystemFetcher in configs/opensearch/tika-config-fs-to-opensearch-legacy.xml
java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/opensearch/tika-config-fs-to-opensearch-legacy.xml
D) Solr Legacy Example (fileshare to Solr)
- From the solr directory
bin/solr start
bin/solr create -c tika-example-legacy && bin/solr config -c tika-example-legacy -p 8983 -action set-user-property -property update.autoCreateFields -value false
From the tika-pipes-tutorial directory
Set the schema in Solr:
curl -F 'data=@configs/solr/solr-legacy-schema.json' http://localhost:8983/solr/tika-example-legacy/schema
Configure the basePath element in FileSystemPipesIterator and FileSystemFetcher in configs/solr/tika-config-solr-legacy.xml
FileSystemPipesIterator:
<fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
  <params>
    <name>fsf</name>
    <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/docs</basePath>
  </params>
</fetcher>
...
<pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
  <params>
    <fetcherName>fsf</fetcherName>
    <emitterName>solr1</emitterName>
    <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/docs</basePath>
  </params>
</pipesIterator>
java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/solr/tika-config-solr-legacy.xml