...
- Use the fetcher with the traditional /tika and /rmeta endpoints
Update the <basePath> element in configs/tika-config-basic.xml to the full path to tika-pipes-tutorial-20211202/docs:
FileSystemFetcher
<fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
  <params>
    <name>fsf</name>
    <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20211202/docs</basePath>
  </params>
</fetcher>
- Start the server:
java -cp "server-bin/*" org.apache.tika.server.core.TikaServerCli -c configs/tika-config-basic.xml
curl -X PUT http://localhost:9998/rmeta -H "fetcherName:fsf" -H "fetchKey:testPDF.pdf" | jq --sort-keys
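The single request above can be repeated for every file in the docs directory. The following is a minimal sketch, assuming a flat docs directory whose file names double as fetch keys relative to the fetcher's basePath. The function name and the CURL and TIKA_URL variables are hypothetical hooks for illustration and are not part of Tika.

```shell
# Sketch: send every file in a docs directory through /rmeta via the "fsf" fetcher.
# CURL and TIKA_URL are hypothetical overrides (not Tika settings); they default
# to the curl binary and the server address used in this tutorial.
rmeta_fetch_all() {
  docs_dir=$1
  for path in "$docs_dir"/*; do
    [ -f "$path" ] || continue        # skip sub-directories and unmatched globs
    key=${path#"$docs_dir"/}          # docs/testPDF.pdf -> testPDF.pdf
    "${CURL:-curl}" -sS -X PUT "${TIKA_URL:-http://localhost:9998}/rmeta" \
      -H "fetcherName:fsf" -H "fetchKey:$key"
  done
}
```

With the server running, calling rmeta_fetch_all docs from the tutorial directory should issue one PUT per file; piping the result through jq is left out here so the sketch only depends on curl.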
- Use the /pipes handler to read from and write to a local file share
Update the emitter's <basePath> element in configs/tika-config-basic.xml to the full path to tika-pipes-tutorial-20211202/extracts:
FileSystemEmitter
<emitters>
  <emitter class="org.apache.tika.pipes.emitter.fs.FileSystemEmitter">
    <params>
      <name>fse</name>
      <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20211202/extracts</basePath>
    </params>
  </emitter>
</emitters>
- Start the server:
java -cp "server-bin/*" org.apache.tika.server.core.TikaServerCli -c configs/tika-config-basic.xml
curl -X POST -H "Content-Type: application/json" -d @configs/pipes-request-minimal.json http://localhost:9998/pipes
- Configure metadata filters and rerun step 2.
Copy this and paste it into configs/tika-config-basic.xml:
Metadata Filters
<metadataFilters>
  <!-- depending on the file format, some dates do not have a timezone.
       This filter arbitrarily assumes dates have a UTC timezone and will
       format all dates as yyyy-MM-dd'T'HH:mm:ss'Z' whether or not they
       actually have a timezone. -->
  <metadataFilter class="org.apache.tika.metadata.filter.DateNormalizingMetadataFilter"/>
  <metadataFilter class="org.apache.tika.metadata.filter.FieldNameMappingFilter">
    <params>
      <excludeUnmapped>true</excludeUnmapped>
      <mappings>
        <mapping from="X-TIKA:content" to="content_s"/>
        <mapping from="Content-Length" to="length_i"/>
        <mapping from="dc:creator" to="creators_ss"/>
        <mapping from="dc:title" to="title_s"/>
        <mapping from="Content-Type" to="mime_s"/>
        <mapping from="X-TIKA:EXCEPTION:container_exception" to="tika_exception_s"/>
      </mappings>
    </params>
  </metadataFilter>
</metadataFilters>
- Restart the server
- Rerun the curl command and look at the output (cat extracts/testPDF.pdf.json | jq --sort-keys)
- Use the /async handler, file share to file share:
curl -X POST -H "Content-Type: application/json" -d @configs/async-request-minimal.json http://localhost:9998/async
curl -X POST -H "Content-Type: application/json" -d @configs/async-request-full.json http://localhost:9998/async
- Run the async processor via tika-app
Configure the basePath element in FileSystemPipesIterator and FileSystemFetcher in configs/tika-config-app.xml:
FileSystemPipesIterator
<fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
  <params>
    <name>fsf</name>
    <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20211202/docs</basePath>
  </params>
</fetcher>
<pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
  <params>
    <fetcherName>fsf</fetcherName>
    <emitterName>fse</emitterName>
    <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20211202/docs</basePath>
  </params>
</pipesIterator>
java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/tika-config-app.xml
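After the async run finishes, a quick sanity check is to compare how many input files there are with how many extracts were written. This is a minimal sketch, assuming the emitter writes one .json file per input (as with extracts/testPDF.pdf.json above); files that fail to parse may legitimately leave the counts unequal. The function names are hypothetical.

```shell
# Sketch: compare the number of input files with the number of emitted .json files.
# Assumption: one <name>.json extract per input file.
count_files() {
  # $1 = directory, $2 = file name pattern (for example '*.json')
  find "$1" -type f -name "$2" | wc -l | tr -d ' '
}

report_extracts() {
  # $1 = docs directory, $2 = extracts directory
  docs=$(count_files "$1" '*')
  extracts=$(count_files "$2" '*.json')
  echo "docs=$docs extracts=$extracts"
  [ "$docs" -eq "$extracts" ]      # non-zero exit status if the counts differ
}
```

From the tutorial directory this would be called as report_extracts docs extracts.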
- Configure the Solr/OpenSearch/Elasticsearch emitter and run the /pipes handler
...
- delete a collection in solr:
bin/solr delete -c collection_name
- Show all docs in OpenSearch/Elasticsearch:
https://localhost:9200/tika-test/_search?pretty=true&q=*:*
3) Hands-on tika-pipes module workshop
January 24, 2022, Monday 11am EST/4pm UTC
The dial-in information is available to those who register via Meetup.
I'm currently working on this, and it should be ready by 10:30am EST/3:30pm UTC, a half hour before the start.
Useful documentation: tika-pipes
Prerequisites:
- java >= 8
- curl (or postman or something similar)
- Unzip tika-pipes-tutorial-20220124.tgz
- In tika-pipes-tutorial-20220124/app-bin/:
  - https://dlcdn.apache.org/tika/2.2.1/tika-app-2.2.1.jar
  - https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-fs/2.2.1/tika-emitter-fs-2.2.1.jar
  - https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-solr/2.2.1/tika-emitter-solr-2.2.1.jar OR https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-opensearch/2.2.1/tika-emitter-opensearch-2.2.1.jar
  - https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/437/org.apache.tika$tika-core/artifact/org.apache.tika/tika-core/2.2.2-20220124.115541-55/tika-core-2.2.2-20220124.115541-55-test-jar-with-dependencies.jar
- Optional: In tika-pipes-tutorial-20220124/server-bin/:
  - tika-server-standard jar: https://dlcdn.apache.org/tika/2.2.1/tika-server-standard-2.2.1.jar
  - https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-fs/2.2.1/tika-emitter-fs-2.2.1.jar
  - https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-solr/2.2.1/tika-emitter-solr-2.2.1.jar OR https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-opensearch/2.2.1/tika-emitter-opensearch-2.2.1.jar
  - https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/437/org.apache.tika$tika-core/artifact/org.apache.tika/tika-core/2.2.2-20220124.115541-55/tika-core-2.2.2-20220124.115541-55-test-jar-with-dependencies.jar
- Installation of Apache Solr (~8.9.x) and/or OpenSearch (~1.x) and/or Elasticsearch (7.x)
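Before starting, it may help to confirm that the prerequisites are in place. The following is a minimal sketch with two hypothetical helper functions: one reports commands that are not on the PATH, the other reports expected files that are missing from a directory. It does not check the Java version, only that a java command exists.

```shell
# Sketch: report prerequisites that are not in place yet.
# Both helpers print one line per missing item and print nothing when all is well.
missing_commands() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
  done
}

missing_files() {
  # $1 = directory, remaining arguments = expected file names
  dir=$1
  shift
  for name in "$@"; do
    [ -f "$dir/$name" ] || echo "$dir/$name"
  done
}
```

For this workshop the calls could be missing_commands java curl and missing_files tika-pipes-tutorial-20220124/app-bin tika-app-2.2.1.jar tika-emitter-fs-2.2.1.jar, extended with whichever emitter jar was chosen.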
A) Fileshare to Fileshare warm up
- Run the async processor via tika-app
Configure the basePath element in FileSystemPipesIterator and FileSystemFetcher in configs/tika-config-app-fs-to-fs.xml:
FileSystemPipesIterator
<fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
  <params>
    <name>fsf</name>
    <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/docs</basePath>
  </params>
</fetcher>
<pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
  <params>
    <fetcherName>fsf</fetcherName>
    <emitterName>fse</emitterName>
    <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/docs</basePath>
  </params>
</pipesIterator>
Configure the basePath element in FileSystemEmitter in configs/tika-config-app-fs-to-fs.xml:
FileSystemEmitter
<emitters>
  <emitter class="org.apache.tika.pipes.emitter.fs.FileSystemEmitter">
    <params>
      <name>fse</name>
      <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/extracts</basePath>
    </params>
  </emitter>
</emitters>
java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/tika-config-app-fs-to-fs.xml
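Every config in this workshop needs its basePath elements pointed at the local copy of the tutorial directory. Editing by hand works fine; as an alternative, here is a minimal sketch of a hypothetical helper that rewrites any basePath ending in /docs or /extracts to live under a new root. It assumes each basePath element sits on one line and that the new root contains no #, & or backslash characters.

```shell
# Sketch: point all docs/extracts basePath elements in a config at a new root.
# Writes to a temporary file and moves it into place, which avoids the
# differences between GNU and BSD "sed -i".
set_tutorial_root() {
  config=$1
  root=$2
  tmp="$config.tmp.$$"
  sed -E "s#<basePath>[^<]*/(docs|extracts)</basePath>#<basePath>${root}/\1</basePath>#g" \
    "$config" > "$tmp" && mv "$tmp" "$config"
}
```

For example, set_tutorial_root configs/tika-config-app-fs-to-fs.xml "$PWD" run from the tutorial directory would set the fetcher and iterator to $PWD/docs and the emitter to $PWD/extracts.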
B) OpenSearch/Elasticsearch Parent-Child Example (fileshare to OpenSearch/ElasticSearch)
- Start OpenSearch via Docker:
  - docker pull opensearchproject/opensearch:1.2.4
  - docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" opensearchproject/opensearch:1.2.4
- Curl the schema to OpenSearch:
curl -k -T configs/opensearch/opensearch-parent-child-mappings.json -u admin:admin -H "Content-Type:application/json" https://localhost:9200/tika-test-parent-child
Configure the basePath element in FileSystemPipesIterator and FileSystemFetcher in configs/opensearch/tika-config-fs-to-opensearch-parent-child.xml
java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/opensearch/tika-config-fs-to-opensearch-parent-child.xml
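To check that documents arrived in the index, the standard _count endpoint can be queried. This is a minimal sketch, assuming the admin:admin credentials and self-signed certificate used in the schema step above. The function name and the CURL and OPENSEARCH_URL variables are hypothetical hooks, not OpenSearch or Tika settings.

```shell
# Sketch: ask OpenSearch how many documents an index holds.
# -k skips certificate verification, as in the schema upload above.
count_opensearch_docs() {
  index=$1
  "${CURL:-curl}" -sS -k -u admin:admin \
    "${OPENSEARCH_URL:-https://localhost:9200}/$index/_count"
}
```

Calling count_opensearch_docs tika-test-parent-child should return a small JSON document whose count field is the number of indexed documents.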
B) Solr Parent-Child Example (fileshare to Solr)
- From the solr directory
bin/solr start
bin/solr create -c tika-example-parent-child && bin/solr config -c tika-example-parent-child -p 8983 -action set-user-property -property update.autoCreateFields -value false
From the tika-pipes-tutorial directory
Set the schema in Solr:
curl -F 'data=@configs/solr/solr-parent-child-schema.json' http://localhost:8983/solr/tika-example-parent-child/schema
Configure the basePath element in FileSystemPipesIterator and FileSystemFetcher in configs/solr/tika-config-solr-parent-child.xml:
FileSystemPipesIterator
<fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
  <params>
    <name>fsf</name>
    <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/docs</basePath>
  </params>
</fetcher>
...
<pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
  <params>
    <fetcherName>fsf</fetcherName>
    <emitterName>solr1</emitterName>
    <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/docs</basePath>
  </params>
</pipesIterator>
java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/solr/tika-config-solr-parent-child.xml
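To confirm that the run populated the collection, a match-all query with rows=0 returns only the hit count. This is a minimal sketch, assuming Solr is listening on the default port 8983 used above. The function name and the CURL and SOLR_URL variables are hypothetical hooks, not Solr or Tika settings.

```shell
# Sketch: ask Solr how many documents a collection holds.
# rows=0 suppresses the documents themselves; the count is in response.numFound.
count_solr_docs() {
  collection=$1
  "${CURL:-curl}" -sS \
    "${SOLR_URL:-http://localhost:8983/solr}/$collection/select?q=*:*&rows=0"
}
```

For this example the call would be count_solr_docs tika-example-parent-child.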
C) OpenSearch/Elasticsearch Individual Files Example (fileshare to OpenSearch/ElasticSearch)
- Start OpenSearch via Docker:
  - docker pull opensearchproject/opensearch:1.2.4
  - docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" opensearchproject/opensearch:1.2.4
- Curl the schema to OpenSearch:
curl -k -T configs/opensearch/opensearch-indiv-files-mappings.json -u admin:admin -H "Content-Type:application/json" https://localhost:9200/tika-test-indiv-files
Configure the basePath element in FileSystemPipesIterator and FileSystemFetcher in configs/opensearch/tika-config-fs-to-opensearch-indiv-files.xml
java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/opensearch/tika-config-fs-to-opensearch-indiv-files.xml
C) Solr Indiv Files Example (fileshare to Solr)
- From the solr directory
bin/solr start
bin/solr create -c tika-example-indiv-files && bin/solr config -c tika-example-indiv-files -p 8983 -action set-user-property -property update.autoCreateFields -value false
From the tika-pipes-tutorial directory
Set the schema in Solr:
curl -F 'data=@configs/solr/solr-indiv-files-schema.json' http://localhost:8983/solr/tika-example-indiv-files/schema
Configure the basePath element in FileSystemPipesIterator and FileSystemFetcher in configs/solr/tika-config-solr-indiv-files.xml:
FileSystemPipesIterator
<fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
  <params>
    <name>fsf</name>
    <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/docs</basePath>
  </params>
</fetcher>
...
<pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
  <params>
    <fetcherName>fsf</fetcherName>
    <emitterName>solr1</emitterName>
    <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/docs</basePath>
  </params>
</pipesIterator>
java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/solr/tika-config-solr-indiv-files.xml
D) OpenSearch/Elasticsearch Legacy Example (fileshare to OpenSearch/ElasticSearch)
- Start OpenSearch via Docker:
  - docker pull opensearchproject/opensearch:1.2.4
  - docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" opensearchproject/opensearch:1.2.4
- Curl the schema to OpenSearch:
curl -k -T configs/opensearch/opensearch-legacy-mappings.json -u admin:admin -H "Content-Type:application/json" https://localhost:9200/tika-test-legacy
Configure the basePath element in FileSystemPipesIterator and FileSystemFetcher in configs/opensearch/tika-config-fs-to-opensearch-legacy.xml
java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/opensearch/tika-config-fs-to-opensearch-legacy.xml
D) Solr Legacy Example (fileshare to Solr)
- From the solr directory
bin/solr start
bin/solr create -c tika-example-legacy && bin/solr config -c tika-example-legacy -p 8983 -action set-user-property -property update.autoCreateFields -value false
From the tika-pipes-tutorial directory
Set the schema in Solr:
curl -F 'data=@configs/solr/solr-legacy-schema.json' http://localhost:8983/solr/tika-example-legacy/schema
Configure the basePath element in FileSystemPipesIterator and FileSystemFetcher in configs/solr/tika-config-solr-legacy.xml:
FileSystemPipesIterator
<fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
  <params>
    <name>fsf</name>
    <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/docs</basePath>
  </params>
</fetcher>
...
<pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
  <params>
    <fetcherName>fsf</fetcherName>
    <emitterName>solr1</emitterName>
    <basePath>/Users/allison/Desktop/tika-pipes-tutorial-20220124/docs</basePath>
  </params>
</pipesIterator>
java -cp "app-bin/*" org.apache.tika.cli.TikaCLI -a --config=configs/solr/tika-config-solr-legacy.xml