...

Finally, M. Caruana Galizia alerted us to the need to use the maven-shade-plugin's ServicesResourceTransformer when building a shaded jar: unless you transform (merge) the services files, the third-party dependencies' services files will be overwritten. See an example: here.
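
As a rough illustration of the point above, a shade configuration that merges the services files might look like the following. This is a minimal sketch rather than the example linked above; the execution phase and the absence of other transformers or filters are assumptions, and a real build will likely need more configuration.

```xml
<!-- Sketch only: goes inside <build><plugins> of the pom.xml that builds the shaded jar -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <!-- Assumption: shading is bound to the package phase -->
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <!-- Merges META-INF/services entries from all dependencies
               instead of letting one jar's file overwrite another's -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
```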

...


Note: the configuration of some of these features via the config file requires a nightly build of Tika after 11/8/2016 or Tika version >= 1.15.

...

We have not carried out evaluations to determine which strategy is better. We suspect that the tried and true It Depends(TM) is operative here. We added the option to OCR a single rendered image of the page because some PDFs contain hundreds of images per page, each a tiny part of the overall page, and running OCR on those fragments would be useless. However, we recognize that if the page is logically broken into sections, running OCR on the individual inline images might yield better results.

Note: These two options are independent. If you set extractInlineImages to true and select an OcrStrategy that includes OCR on the rendered page, Tika will run OCR on both the extracted inline images and the rendered page.
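
To make the interaction in the note concrete, here is a hedged sketch of a tika-config.xml that turns on both options at once. It assumes the parameter-based parser configuration available in Tika 1.15 and later, and it assumes `ocr_only` as the strategy value; check the names and values against the version you are running.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Keep the default parsers, but configure the PDFParser separately -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- Extract inline images so they can be passed to OCR -->
        <param name="extractInlineImages" type="bool">true</param>
        <!-- Assumed value: also render each page and run OCR on the rendering -->
        <param name="ocrStrategy" type="string">ocr_only</param>
      </params>
    </parser>
  </parsers>
</properties>
```

With both settings enabled, the same visible text can be OCR'd twice, once from the inline images and once from the rendered page, so expect duplicated output and longer processing times.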

Option 1: Configuring OCR on Inline Images

...