...
- Build the Solr dist:
cd solr/
ant package
- Unzip your shiny new Solr and create a collection. TODO: add example collection here
- Place this config file in the collection (TODO: add this), from: https://github.com/tballison/tika-addons/tree/main/solr-tika-integration/src/configs
bin\solr start
- Copy the files from tika-parsers/src/test/resources/test-documents ... make sure to remove ucar files: *.nc, *.hdf, *.fb2, *.he5 – these wreak havoc with the data importer
- Navigate to the Solr admin window -> Dataimport.
- Close your eyes, cross your fingers, pray to your appropriate deit(y|ies) or not, and press Execute.
- Watch the command window to see if there were any catastrophic missing class problems
- Go to the logs to see if there are any show-stopping exceptions.
- When this completes, go to Query and check how many documents are actually indexed.
- Compare the number of documents in Solr to the number you'd get if you ran
java -jar tika-app.jar -i <input_dir> -o <output_dir>
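The count comparison in the last step can be scripted. The following is a minimal Python sketch, not part of the Tika or Solr tooling: it counts the files that tika-app wrote to the output directory and asks Solr's select handler for `numFound` with `rows=0`. The function names are made up for illustration, and the base URL and collection name you pass in are assumptions about your local setup.

```python
import json
import os
import urllib.parse
import urllib.request


def count_files(root_dir):
    """Count files under root_dir, including those in subdirectories."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root_dir):
        total += len(filenames)
    return total


def parse_num_found(select_response_text):
    """Extract response.numFound from a Solr JSON select response."""
    return json.loads(select_response_text)["response"]["numFound"]


def solr_doc_count(base_url, collection):
    """Ask Solr how many documents match *:* without fetching any rows.

    base_url and collection are supplied by the caller, for example
    "http://localhost:8983/solr" and a hypothetical "tika-test".
    """
    params = urllib.parse.urlencode({"q": "*:*", "rows": 0, "wt": "json"})
    url = f"{base_url}/{collection}/select?{params}"
    with urllib.request.urlopen(url) as resp:
        return parse_num_found(resp.read().decode("utf-8"))
```

For example, with Solr running locally, compare `solr_doc_count("http://localhost:8983/solr", "tika-test")` against `count_files("<output_dir>")`, where `tika-test` stands in for whatever collection you created.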
In addition to the Data Import Handler (DIH), the above configs are also set up to work with the ExtractingHandler.
You can run either the SolrJ client (https://github.com/tballison/tika-addons/blob/main/solr-tika-integration/src/main/java/org/tallison/indexers/SolrJIndexer.java) or the
Make sure to set the source directory appropriately and the solr-collection name correctly for your test files and Solr collection. Note that these indexers do not process files recursively.
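Because the indexers do not recurse, it can help to check beforehand which files in the source directory would be skipped. The following is a small Python sketch, independent of the indexers themselves, that separates files directly inside the source directory from files in subdirectories. The function name `split_by_depth` is made up for illustration.

```python
import os


def split_by_depth(source_dir):
    """Return (top_level, nested) lists of file paths under source_dir.

    top_level: files directly inside source_dir, which a non-recursive
    indexer would see. nested: files in subdirectories, which it would skip.
    """
    top_level, nested = [], []
    for dirpath, _dirnames, filenames in os.walk(source_dir):
        # os.walk yields source_dir itself first, exactly as it was passed in.
        bucket = top_level if dirpath == source_dir else nested
        for name in filenames:
            bucket.append(os.path.join(dirpath, name))
    return sorted(top_level), sorted(nested)
```

If the second list is non-empty, those files need to be moved up a level, or indexed in a separate run per subdirectory, before the document counts can be expected to line up.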
Phase 3: Submit a Pull Request
...