...

  • OCR is now triggered automatically for PDFs if tesseract is on the user's path. See TikaOCR#disable-ocr for how to disable OCR.
  • Removed deprecated Metadata keys/properties and moved some commonly used keys from Metadata to TikaCoreProperties (such as TikaCoreProperties.RESOURCE_NAME_KEY) (TIKA-1974). See below for a list of changed keys.
  • We upgraded from log4j to log4j2 in tika-app, tika-server and everywhere else that previously used log4j.
  • The tika-parsers package has been split into several subpackages, including: tika-parsers-standard-package, tika-parser-scientific-package and tika-parser-sqlite3-package. You will need the tika-parsers-standard-package for complete detection of container-based formats such as .doc, .ppt, .xls, .docx, .pptx, .xlsx and others.
  • tika-app only includes the parsers in tika-parsers-standard-package; users have to add tika-parser-scientific-package and tika-parser-sqlite3-package themselves if they want those parsers.
  • tika-server is now tika-server-standard and only includes the parsers in tika-parsers-standard-package.
  • tika-server is now run in --spawnChild mode by default.
  • Removed deprecated PDFPreflightParser (TIKA-3437). 
  • Parsers are now configured via tika-config.xml on instantiation. We have moved away from configuration via .properties files because it caused confusion among users. This affects the PDFParser, the TesseractOCRParser and the StringsParser. See below for links to the specific parsers.
  • Changed the namespaces of translator implementations (e.g. org.apache.tika.language.translate.impl) to avoid a split package with tika-core.
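
Since parsers are now configured through tika-config.xml and OCR runs automatically when tesseract is found, a configuration that switches OCR off may be a useful illustration. The fragment below is a minimal sketch, not the authoritative form: it assumes the TesseractOCRParser lives at org.apache.tika.parser.ocr.TesseractOCRParser and that excluding it from the DefaultParser is sufficient for your setup. The TikaOCR page referenced above remains the reference for the supported options.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch of a tika-config.xml that keeps all default parsers
     except the Tesseract OCR parser (class name is an assumption;
     check it against the Tika version you are running). -->
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <!-- Excluding the OCR parser means PDFs and images are no longer
           sent to tesseract, even if it is on the user's path. -->
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>
```

A file like this is typically handed to Tika at startup, for example through the --config option of tika-app or by constructing a TikaConfig from the file in Java code. The exact mechanism depends on how Tika is embedded, so treat the file name and loading step as assumptions.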

...