...
Finally, M. Caruana Galizia alerted us to the need to use maven-shade's ServicesResourceTransformer:
unless you transform the services, the services files from third-party dependencies will be overwritten when the jars are merged. See an example: here.
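For orientation, a minimal sketch of a maven-shade-plugin configuration that registers the ServicesResourceTransformer might look like the following. The plugin version and the binding to the package phase are assumptions for illustration; adapt them to your own build.

```xml
<!-- Sketch only: add inside <build><plugins> of your pom.xml. -->
<!-- The version is omitted here on purpose; pin the one your build uses. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <!-- Assumption: shading runs during the package phase -->
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <!-- Merges META-INF/services entries instead of letting one jar's file replace another's -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
```

With the transformer in place, the service provider entries from all shaded dependencies should end up concatenated in the final jar, which is what Tika's parser discovery relies on.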
...
Note: the configuration of some of these features via the config file requires a nightly build of Tika after 11/8/2016 or Tika version >= 1.15.
...
We have not carried out evaluations to determine which strategy is better. We suspect that the tried and true It Depends(TM) is operative here. We added the option to OCR a single rendered image of the page because some PDFs contain hundreds of images per page, where each image is a tiny part of the overall page, and running OCR on those fragments would be useless. However, we recognize that if the page is logically broken into sections, running OCR on the individual inline images might yield better results.
Note: These two options are independent. If you set extractInlineImages to true and select an OcrStrategy that includes OCR on the rendered page, Tika will run OCR on both the extracted inline images and the rendered page.
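As a rough illustration of that combined case, a tika-config.xml along the following lines would enable both behaviors at once. This is a sketch under assumptions: it presumes a Tika version that supports configuring the PDFParser via the config file (see the note on versions above), and it uses ocr_and_text_extraction as one example of a strategy that includes OCR on the rendered page.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: enables inline image extraction AND OCR on the rendered page. -->
<properties>
  <parsers>
    <!-- Keep the default parsers, but exclude the PDFParser so it can be configured separately -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- Inline images are extracted and passed to OCR -->
        <param name="extractInlineImages" type="bool">true</param>
        <!-- Assumed example value: also render each page and run OCR on it -->
        <param name="ocrStrategy" type="string">ocr_and_text_extraction</param>
      </params>
    </parser>
  </parsers>
</properties>
```

Because both settings are active, expect OCR to run more than once per page, which can noticeably increase processing time and may produce overlapping text.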
Option 1: Configuring OCR on Inline Images
...