...

Background: In early 1.x, metadata keys were created somewhat ad hoc. We later added normalized keys based on standards such as Dublin Core, or at least namespaced the keys for specific file formats. To maintain backwards compatibility, we kept the old keys alongside the new ones. This led to considerable metadata bloat, with the same information stored two or three times. In Tika 2.x, we slimmed down the metadata keys and now rely only on the standardized key (for example, Dublin Core) where one exists.


Tika 1.x | Tika 2.x
Author, meta:author, dc:creator | dc:creator
Last-Author, meta:last-author | meta:last-author
Creation-Date, date, dcterms:created | dcterms:created
Last-Modified, modified, dcterms:modified | dcterms:modified
Last-Save-Date, meta:save-date | meta:save-date
Application-Name, extended-properties:Application | extended-properties:Application
Character Count, meta:character-count | meta:character-count
Company, extended-properties:Company | extended-properties:Company
Edit-Time, extended-properties:TotalTime | extended-properties:TotalTime
Keywords, meta:keyword, dc:subject | meta:keyword, dc:subject
Page-Count, meta:page-count | meta:page-count
Revision-Number, cp:revision | cp:revision
subject, cp:subject, dc:subject | dc:subject
Template, extended-properties:Template | extended-properties:Template
Word-Count, meta:word-count | meta:word-count
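
Client code that still looks up 1.x key names may need a translation step during migration. The following is a minimal sketch, not part of Tika: the class `LegacyKeyMapper` and its method `toTika2Keys` are hypothetical names, and the map covers only a subset of the rows in the table above. It assumes Java 9 or later for `Map.ofEntries` and `List.of`.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not a Tika class): maps 1.x key names to the 2.x keys from the table. */
class LegacyKeyMapper {

    // Subset of the 1.x -> 2.x table; extend with the remaining rows as needed.
    private static final Map<String, List<String>> LEGACY_TO_2X = Map.ofEntries(
            Map.entry("Author", List.of("dc:creator")),
            Map.entry("meta:author", List.of("dc:creator")),
            Map.entry("Last-Author", List.of("meta:last-author")),
            Map.entry("Creation-Date", List.of("dcterms:created")),
            Map.entry("date", List.of("dcterms:created")),
            Map.entry("Last-Modified", List.of("dcterms:modified")),
            Map.entry("modified", List.of("dcterms:modified")),
            // The table lists two surviving 2.x keys for the Keywords row.
            Map.entry("Keywords", List.of("meta:keyword", "dc:subject")),
            Map.entry("subject", List.of("dc:subject")),
            Map.entry("cp:subject", List.of("dc:subject")));

    /** Returns the 2.x key(s) for a 1.x key; keys not in the map are returned unchanged. */
    static List<String> toTika2Keys(String key) {
        return LEGACY_TO_2X.getOrDefault(key, List.of(key));
    }
}
```

For example, `LegacyKeyMapper.toTika2Keys("Author")` yields a single-element list containing `dc:creator`, while a key that was already normalized, such as `dcterms:created`, passes through as-is.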

Changed Metadata Keys

There are a few other subtle changes in key names listed below:

Tika 1.x | Tika 2.x
X-Parsed-By | X-TIKA:Parsed-By
X-TIKA:EXCEPTION:runtime | X-TIKA:EXCEPTION:container_exception
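
A consumer that has to read metadata produced by both 1.x and 2.x (for example, during a staged rollout) could try the 2.x name first and fall back to the 1.x name. This is only a sketch under simplifying assumptions: the metadata is modeled as a plain `Map<String, String>` rather than Tika's `Metadata` class, each key is treated as single-valued, and `RenamedKeyReader` with its methods is a hypothetical name.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper (not a Tika class): reads renamed keys from either a 1.x or 2.x result. */
class RenamedKeyReader {

    /** Returns the value of the first candidate key that is present, in the order given. */
    static Optional<String> firstPresent(Map<String, String> metadata, String... candidateKeys) {
        for (String key : candidateKeys) {
            String value = metadata.get(key);
            if (value != null) {
                return Optional.of(value);
            }
        }
        return Optional.empty();
    }

    /** Parser name: 2.x key first, then the 1.x key. */
    static Optional<String> parsedBy(Map<String, String> metadata) {
        return firstPresent(metadata, "X-TIKA:Parsed-By", "X-Parsed-By");
    }

    /** Container exception: 2.x key first, then the 1.x key. */
    static Optional<String> containerException(Map<String, String> metadata) {
        return firstPresent(metadata,
                "X-TIKA:EXCEPTION:container_exception",
                "X-TIKA:EXCEPTION:runtime");
    }
}
```

Listing the 2.x key first means that if both names ever appear in the same record, the newer one wins.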


Other small metadata key changes between 1.x and 2.x

These are changes in locations of keys, not in the key names that consumers/clients will see:

  • Metadata.RESOURCE_NAME_KEY has been renamed TikaCoreProperties.RESOURCE_NAME_KEY.
  • TikaCoreProperties.KEYWORDS has been removed in favor of Office.KEYWORDS.
  • X-Parsed-By has been changed to X-TIKA:Parsed-By.
  • X-TIKA:EXCEPTION:runtime has been changed to X-TIKA:EXCEPTION:container_exception.

tika-parsers – specific parser changes

...

pom.xml from 1.27:

```xml
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.27</version>
</dependency>
```


to, e.g.

pom.xml for 2.0.0+:

```xml
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers-standard-package</artifactId>
  <version>2.1.0</version>
</dependency>
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers-scientific-module</artifactId>
  <version>2.1.0</version>
</dependency>
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers-sqlite3-module</artifactId>
  <version>2.1.0</version>
</dependency>
```

...

pom.xml 2.0.0:

```xml
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-langdetect-optimaize</artifactId>
  <version>2.1.0</version>
</dependency>
```

...