...
- Create a new branch, e.g. `jira/solr-11701`
- Run `mvn dependency:tree` on the newly released Apache Tika and MEMORIZE it
- Upgrade all dependencies in `lucene/ivy-versions.properties` – make sure that they are in alphabetical order
- Add any new licenses in `solr/licenses` – must include a `-LICENSE-XYZ.txt` and `-NOTICE.txt` file for every jar
- Update anything new in `solr/contrib/extraction/ivy.xml`
- Run `ant clean` (out of nervous habit) and then run the unit tests in `contrib/dataimporthandler-extras` and `contrib/extraction`
- Fix any problems in the source code; this can include `XLSXResponseWriter`, which relies on Apache POI
- Run `ant clean-jars jar-checksums`
- `git add` the new `.sha1` files in `solr/licenses` and `lucene/licenses`, and `git rm` the old `.sha1` files
- Run `ant precommit`
- Receive immediate errors that you missed something and go back two steps; repeat `ant precommit` as needed, waiting 15-20 minutes each time ... if you didn't break something obvious
- In my environment, `ant precommit` eventually ends in errors about broken links in HTML. This means you are successful!!!
- Run `ant test` for kicks. Something will likely break. Try to figure out whether it was caused by something you did or is just a flaky build. Bonus points if the test failure is reproducible and you report it/fix it.
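The "alphabetical order" requirement in `lucene/ivy-versions.properties` is easy to get wrong by hand, and `ant precommit` round-trips are expensive. A quick local sanity check is to ask `sort -c` whether the non-comment lines are already in order. This is a sketch, not part of the official workflow: the demo writes a tiny sample file with made-up coordinates so it can run anywhere; against a real checkout, point `FILE` at `lucene/ivy-versions.properties` and skip the sample-file step.

```shell
#!/bin/sh
# Sanity-check that an ivy-versions.properties-style file is alphabetized.
FILE="${1:-sample-ivy-versions.properties}"

# Demo setup: a small sample file with illustrative coordinates.
# (Skip this step and point FILE at the real file in an actual checkout.)
cat > "$FILE" <<'EOF'
/org.apache.pdfbox/pdfbox = 2.0.24
/org.apache.poi/poi = 4.1.2
/org.apache.tika/tika-core = 1.26
EOF

# Strip comments and blank lines, then let `sort -c` verify the order;
# it exits non-zero at the first out-of-order line.
if grep -v '^#' "$FILE" | grep -v '^$' | sort -c 2>/dev/null; then
    RESULT="ORDER OK"
else
    RESULT="ORDER BROKEN"
fi
echo "$RESULT"
```

This won't catch a wrong version, but it catches the most common `precommit`-wasting mistake before the 15-20 minute wait.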
Phase 2: Integration Testing Solr
...
- Build the Solr dist: `cd solr/` and `ant package`
- Unzip your shiny new Solr and create a collection TODO: add example collection here
- Place this config file in the collection TODO: add this – from: https://github.com/tballison/tika-addons/tree/main/solr-tika-integration/src/configs
- `bin\solr start`
- Copy the files from `tika-parsers/src/test/resources/test-documents` ... make sure to remove the ucar files: `*.nc`, `*.hdf`, `*.fb2`, `*.he5` – these wreak havoc with the data importer
- Navigate to the Solr admin window -> `Dataimport`
- Close your eyes, cross your fingers, pray to your appropriate deit(y|ies) or not, and press `Execute`
- Watch the command window to see if there were any catastrophic missing-class problems
- Go to the logs to see if there are any show-stopper exceptions
- When this completes, go to `Query` and check how many documents are actually indexed
- Compare the number of documents in Solr to the number you'd get if you ran `java -jar tika-app.jar -i <input_dir> -o <output_dir>`
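The copy-and-count steps above can be sketched as a small script: copy the test documents while skipping the ucar formats, then count what landed so you have a baseline to compare against Solr's document count. The directory names here are placeholders (the demo creates temp directories with a few dummy files so it runs anywhere); in a real run, set `SRC` to `tika-parsers/src/test/resources/test-documents` and `DEST` to wherever your importer reads from.

```shell
#!/bin/sh
# Copy test documents for indexing, skipping the ucar formats that
# wreak havoc with the data importer, then count the copied files so
# the total can be compared against Solr's numDocs.
SRC="${SRC:-$(mktemp -d)}"
DEST="${DEST:-$(mktemp -d)}"

# Demo setup: a few dummy files, two indexable and four to be skipped.
# (Skip this step when pointing SRC at a real test-documents directory.)
touch "$SRC/a.pdf" "$SRC/b.docx" "$SRC/bad.nc" "$SRC/bad.hdf" \
      "$SRC/bad.fb2" "$SRC/bad.he5"

# Copy everything except *.nc, *.hdf, *.fb2, *.he5. -maxdepth 1 matches
# the indexers' non-recursive behavior.
find "$SRC" -maxdepth 1 -type f \
     ! -name '*.nc' ! -name '*.hdf' ! -name '*.fb2' ! -name '*.he5' \
     -exec cp {} "$DEST" \;

COPIED=$(find "$DEST" -type f | wc -l | tr -d ' ')
echo "copied $COPIED files"
```

If Solr's indexed count is well below `$COPIED`, check the logs for parse failures before blaming the upgrade.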
In addition to DIH, the above configs are also set up to work with the ExtractingHandler.
You can run either the SolrJ client (https://github.com/tballison/tika-addons/blob/main/solr-tika-integration/src/main/java/org/tallison/indexers/SolrJIndexer.java) or the
Make sure to set the source directory appropriately and the solr-collection name correctly for your test files and Solr collection. Note that these indexers do not process files recursively.
Phase 3: Submit a Pull Request
- When everything looks good, commit your changes and submit a PR
Phase 4: Reflect, Rejoice, Work
- Reflect on:
- The tedium of getting the dependencies right and the risks of not getting them right
- The ever-present risk of jar hell from integrating Tika into Solr
- The seductive belief that Tika won't break Solr, when we know it eventually will, and that we should really be keeping Tika out of Solr if at all possible ... and yet maintain the awesome easy-to-get-started-ness of the current integration
- Rejoice that Tika is being refactored out of Solr in 9x.
- Work towards whatever solution allows for an easy, out-of-the-box extraction process for binary files