NOTE: THIS PAGE IS IN PROGRESS.  PLEASE CHECK BACK FOR MORE DETAILS.

For now, see: https://archive.apache.org/dist/tika/2.0.0/CHANGES-2.0.0.txt

Major breaking changes

  • OCR is now triggered automatically for PDFs if tesseract is on the user's path. See https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr for how to disable OCR.
  • Removed deprecated Metadata keys/properties and moved some commonly used keys from Metadata to TikaCoreProperties (such as TikaCoreProperties.RESOURCE_NAME_KEY) (TIKA-1974).  See below for a list of changed keys.
  • We upgraded from log4j to log4j2 in tika-app, tika-server and everywhere else that previously used log4j.
  • The tika-parsers package has been split into several subpackages, including: tika-parsers-standard-package, tika-parser-scientific-package and tika-parser-sqlite3-package.
  • tika-app only includes parsers in tika-parsers-standard-package; users have to add tika-parser-scientific-package and tika-parser-sqlite3-package if desired.
  • tika-server is now tika-server-standard and only includes parsers in tika-parsers-standard-package.
  • tika-server is now run in --spawnChild mode by default.
  • Removed deprecated PDFPreflightParser (TIKA-3437).
  • Parsers are now configured via tika-config.xml on instantiation. We have moved away from configuration via .properties files because of confusion among users. This affects the PDFParser, TesseractOCRParser and the StringsParser. See below for links to the specific parsers.
  • Changed namespaces of translator implementations (e.g. org.apache.tika.language.translate.impl) to avoid split-package with tika-core.
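
Because OCR now runs whenever tesseract is on the path, it can help to check in advance whether a given environment is likely to trigger it. The following is a minimal sketch using only the Java standard library. The class and method names are hypothetical, and this is not how Tika itself detects tesseract; it only looks for an executable file named `tesseract` in the directories of a PATH-style string (on Windows the file would normally be `tesseract.exe`, which this sketch does not handle).

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Pattern;

// Hypothetical pre-flight helper; not part of Tika.
final class TesseractPathCheck {

    // Returns true if an executable regular file with the given name
    // exists in any directory listed in pathValue.
    static boolean isOnPath(String executable, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return false;
        }
        for (String dir : pathValue.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            try {
                Path candidate = Paths.get(dir, executable);
                if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                    return true;
                }
            } catch (InvalidPathException e) {
                // Skip PATH entries that are not valid paths on this platform.
            }
        }
        return false;
    }

    public static void main(String[] args) {
        boolean found = isOnPath("tesseract", System.getenv("PATH"));
        System.out.println(found
                ? "tesseract found on PATH: PDFs may be OCR'd unless OCR is disabled"
                : "tesseract not found on PATH");
    }
}
```

If the check reports that tesseract is present and OCR is not wanted, the link in the first bullet describes how to disable it.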

For more details on changes in tika-server in 2.x, please see: TikaServer in Tika 2.x.

Metadata

Breaking Metadata Key Changes Between 1.x and 2.x

These are changes in locations of keys, not in the key names that consumers/clients will see:

  • Metadata.RESOURCE_NAME_KEY has moved to TikaCoreProperties.RESOURCE_NAME_KEY.
  • TikaCoreProperties.KEYWORDS has been removed in favor of Office.KEYWORDS.

Changed Metadata Keys

There are a few other subtle changes in key names listed below:

Tika 1.x                 | Tika 2.x
X-Parsed-By              | X-TIKA:Parsed-By
X-TIKA:EXCEPTION:runtime | X-TIKA:EXCEPTION:container_exception
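
For clients that still hold records produced by Tika 1.x, the renames listed above could be applied with a small helper. This is a minimal sketch, assuming the metadata has already been flattened to a plain `Map<String, String>` (Tika metadata can hold multiple values per key, which this ignores). The class and method names are hypothetical and not part of Tika.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; not part of Tika.
final class RenamedKeys {

    // 1.x key -> 2.x key, copied from the table above.
    static final Map<String, String> RENAMES = Map.of(
            "X-Parsed-By", "X-TIKA:Parsed-By",
            "X-TIKA:EXCEPTION:runtime", "X-TIKA:EXCEPTION:container_exception");

    // Returns a copy of the input with the renamed keys applied.
    // Keys that were not renamed pass through unchanged, in their original order.
    static Map<String, String> migrate(Map<String, String> oldMetadata) {
        Map<String, String> migrated = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : oldMetadata.entrySet()) {
            String newKey = RENAMES.getOrDefault(entry.getKey(), entry.getKey());
            migrated.put(newKey, entry.getValue());
        }
        return migrated;
    }

    public static void main(String[] args) {
        Map<String, String> old = new LinkedHashMap<>();
        old.put("X-Parsed-By", "org.apache.tika.parser.DefaultParser");
        old.put("Content-Type", "application/pdf");
        System.out.println(migrate(old));
    }
}
```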

Removed duplicate/triplicate keys

...

Tika 1.x                                           | Tika 2.x
Author, meta:author, dc:creator                    | dc:creator
Last-Author, meta:last-author                      | meta:last-author
Creation-Date, date, dcterms:created               | dcterms:created
Last-Modified, modified, dcterms:modified          | dcterms:modified
Last-Save-Date, meta:save-date                     | meta:save-date
Application-Name, extended-properties:Application  | extended-properties:Application
Character Count, meta:character-count              | meta:character-count
Company, extended-properties:Company               | extended-properties:Company
Edit-Time, extended-properties:TotalTime           | extended-properties:TotalTime
Keywords, meta:keyword, dc:subject                 | meta:keyword, dc:subject
Page-Count, meta:page-count                        | meta:page-count
Revision-Number, cp:revision                       | cp:revision
subject, cp:subject, dc:subject                    | dc:subject
Template, extended-properties:Template             | extended-properties:Template
Word-Count, meta:word-count                        | meta:word-count
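
Client code that reads one of the removed 1.x aliases needs to look up the surviving 2.x key instead. The sketch below encodes the table above as a lookup from each removed alias to the key or keys that remain. Note that `Keywords` maps to two keys. The class and method names are hypothetical and not part of Tika; only the key strings come from the table.

```java
import java.util.List;
import java.util.Map;

// Hypothetical lookup helper; not part of Tika.
final class LegacyKeyLookup {

    // Removed 1.x alias -> surviving 2.x key(s), copied from the table above.
    static final Map<String, List<String>> SURVIVING = Map.ofEntries(
            Map.entry("Author", List.of("dc:creator")),
            Map.entry("meta:author", List.of("dc:creator")),
            Map.entry("Last-Author", List.of("meta:last-author")),
            Map.entry("Creation-Date", List.of("dcterms:created")),
            Map.entry("date", List.of("dcterms:created")),
            Map.entry("Last-Modified", List.of("dcterms:modified")),
            Map.entry("modified", List.of("dcterms:modified")),
            Map.entry("Last-Save-Date", List.of("meta:save-date")),
            Map.entry("Application-Name", List.of("extended-properties:Application")),
            Map.entry("Character Count", List.of("meta:character-count")),
            Map.entry("Company", List.of("extended-properties:Company")),
            Map.entry("Edit-Time", List.of("extended-properties:TotalTime")),
            Map.entry("Keywords", List.of("meta:keyword", "dc:subject")),
            Map.entry("Page-Count", List.of("meta:page-count")),
            Map.entry("Revision-Number", List.of("cp:revision")),
            Map.entry("subject", List.of("dc:subject")),
            Map.entry("cp:subject", List.of("dc:subject")),
            Map.entry("Template", List.of("extended-properties:Template")),
            Map.entry("Word-Count", List.of("meta:word-count")));

    // Returns the 2.x key(s) to read for a given 1.x key.
    // Keys that are not removed aliases are returned unchanged.
    static List<String> survivingKeys(String legacyKey) {
        return SURVIVING.getOrDefault(legacyKey, List.of(legacyKey));
    }

    public static void main(String[] args) {
        System.out.println(survivingKeys("Author"));
        System.out.println(survivingKeys("Keywords"));
        System.out.println(survivingKeys("dc:creator"));
    }
}
```

Returning a list rather than a single string keeps the `Keywords` case explicit, so callers decide which of the two surviving keys to read.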


Other small metadata key changes between 1.x and 2.x

These are changes in locations of keys, not in the key names that consumers/clients will see:

...

tika-parsers – Configuring via tika-config.xml 

...