...

Tika 1.x | Tika 2.x
Author, meta:author, dc:creator | dc:creator
Last-Author, meta:last-author | meta:last-author
title, dc:title | dc:title
Creation-Date, date, dcterms:created | dcterms:created
Last-Modified, modified, dcterms:modified | dcterms:modified
Last-Save-Date, meta:save-date | meta:save-date
w:comments | w:Comments
Application-Name, extended-properties:Application | extended-properties:Application
Character Count, meta:character-count | meta:character-count
Company, extended-properties:Company | extended-properties:Company
Edit-Time, extended-properties:TotalTime | extended-properties:TotalTime
Keywords, meta:keyword, dc:subject | meta:keyword, dc:subject
Page-Count, meta:page-count | meta:page-count
Revision-Number, cp:revision | cp:revision
subject, cp:subject, dc:subject | dc:subject
Template, extended-properties:Template | extended-properties:Template
Word-Count, meta:word-count | meta:word-count
identifier | dc:identifier
publisher | dc:publisher
dc:description, subject (as in MSG files) | dc:description (dc:subject was added back in 2.4.0)
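
If you have code or stored records that still use the 1.x keys, the mappings listed above can be applied with a small lookup table. The following is a minimal sketch in plain Java, not part of Tika: the class name MetadataKeyMigration and its methods are hypothetical, metadata is simplified to one String value per key (Tika metadata can be multi-valued), and only the unambiguous one-to-one rows are included. The "subject" and "Keywords" rows are left out because they map differently depending on the file type or map to more than one key.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not a Tika class): renames legacy 1.x metadata keys
// to the 2.x keys listed in the mapping above.
class MetadataKeyMigration {

    private static final Map<String, String> LEGACY_TO_2X = new LinkedHashMap<>();

    static {
        LEGACY_TO_2X.put("Author", "dc:creator");
        LEGACY_TO_2X.put("meta:author", "dc:creator");
        LEGACY_TO_2X.put("Last-Author", "meta:last-author");
        LEGACY_TO_2X.put("title", "dc:title");
        LEGACY_TO_2X.put("Creation-Date", "dcterms:created");
        LEGACY_TO_2X.put("date", "dcterms:created");
        LEGACY_TO_2X.put("Last-Modified", "dcterms:modified");
        LEGACY_TO_2X.put("modified", "dcterms:modified");
        LEGACY_TO_2X.put("Last-Save-Date", "meta:save-date");
        LEGACY_TO_2X.put("w:comments", "w:Comments");
        LEGACY_TO_2X.put("Application-Name", "extended-properties:Application");
        LEGACY_TO_2X.put("Character Count", "meta:character-count");
        LEGACY_TO_2X.put("Company", "extended-properties:Company");
        LEGACY_TO_2X.put("Edit-Time", "extended-properties:TotalTime");
        LEGACY_TO_2X.put("Page-Count", "meta:page-count");
        LEGACY_TO_2X.put("Revision-Number", "cp:revision");
        LEGACY_TO_2X.put("cp:subject", "dc:subject");
        LEGACY_TO_2X.put("Template", "extended-properties:Template");
        LEGACY_TO_2X.put("Word-Count", "meta:word-count");
        LEGACY_TO_2X.put("identifier", "dc:identifier");
        LEGACY_TO_2X.put("publisher", "dc:publisher");
    }

    // Returns the 2.x key for a legacy key; unknown keys pass through unchanged.
    static String toTika2Key(String legacyKey) {
        return LEGACY_TO_2X.getOrDefault(legacyKey, legacyKey);
    }

    // Renames every key in a stored record. If two legacy keys collapse onto
    // the same 2.x key, the first value encountered is kept.
    static Map<String, String> migrate(Map<String, String> stored) {
        Map<String, String> migrated = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : stored.entrySet()) {
            migrated.putIfAbsent(toTika2Key(e.getKey()), e.getValue());
        }
        return migrated;
    }

    public static void main(String[] args) {
        Map<String, String> stored = new LinkedHashMap<>();
        stored.put("Author", "Jane Doe");
        stored.put("Creation-Date", "2020-01-01T00:00:00Z");
        stored.put("Content-Type", "application/pdf");
        System.out.println(migrate(stored));
    }
}
```

Keys that are not in the table, such as Content-Type in the example, are passed through untouched, so the helper can be run over whole records.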

tika-parsers – Configuring via tika-config.xml 

...

See other individual parser pages for available configurations: TikaParserNotes.  If you notice any missing parsers, please help us document configurations for all parsers.

tika-parsers module

In Tika 2.x, we separated the 1.x tika-parsers module into three modules and packages:

  1. tika-parsers-standard – the most common parsers; these should not require REST calls or native libraries (NOTE: despite the goal of this package, we do include the TesseractOCR parser, which will run Tesseract if you have it installed)
  2. tika-parsers-extended – may include native libraries and/or dependencies that not everyone wants (e.g. netcdf)
  3. tika-parsers-ml – may include heavy dependencies (e.g. dl4j) or parsers that rely on REST calls and external services

The goal is to allow users to select only the parsers (and dependencies) that they want.

When using tika-parsers in your project, you need to change the dependencies from:

pom.xml from 1.27:
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.27</version>
</dependency>


to at least tika-parsers-standard-package. If you want netcdf parsing and/or sqlite3 parsing (both of which were included in tika-parsers in 1.x), you'll also need to include tika-parser-scientific-package and/or tika-parser-sqlite3-package.

pom.xml for 2.0.0+:
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers-standard-package</artifactId>
  <version>2.17.0</version>
</dependency>
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parser-scientific-package</artifactId>
  <version>2.17.0</version>
</dependency>
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parser-sqlite3-package</artifactId>
  <version>2.17.0</version>
</dependency>

NOTE: As of Tika 2.7.0, we have added tika-parser-nlp-package to our release artifacts.

NOTE: As in Tika 1.x, if you need detection on container formats (e.g. OLE2-based .doc, .ppt, .xls; zip-based .xlsx, .pptx, .docx; or .ogg-based formats), you need to include the underlying Tika parsers that parse the container files and make the detection based on the information in the container.  In Tika 2.x, this means that you need to include tika-parsers-standard-package.
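
To illustrate why container detection needs the parsers, here is a minimal sketch using only the JDK. It is not Tika's implementation: the class ContainerSniffSketch, its methods, and the entry-name heuristic are assumptions made for illustration. It builds two tiny zip files in memory and shows that their leading magic bytes are identical, so the two can only be told apart by opening the container and looking inside.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration (not Tika code): magic bytes alone cannot
// distinguish zip-based document formats from one another.
class ContainerSniffSketch {

    // Builds an in-memory zip containing a single placeholder entry.
    static byte[] zipWithEntry(String entryName) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write("<placeholder/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    // The first four bytes, which is all a magic-byte check would see.
    static byte[] magic(byte[] data) {
        return Arrays.copyOf(data, 4);
    }

    // Simplified heuristic: guess the format from entry names inside the zip.
    static String guessByEntries(byte[] data) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.startsWith("word/")) return "docx-like";
                if (name.startsWith("xl/")) return "xlsx-like";
                if (name.startsWith("ppt/")) return "pptx-like";
            }
        }
        return "plain zip";
    }

    public static void main(String[] args) throws IOException {
        byte[] doc = zipWithEntry("word/document.xml");
        byte[] sheet = zipWithEntry("xl/workbook.xml");
        System.out.println("same magic bytes: " + Arrays.equals(magic(doc), magic(sheet)));
        System.out.println(guessByEntries(doc));
        System.out.println(guessByEntries(sheet));
    }
}
```

The entry-name check here is deliberately crude. The point is only that the distinguishing information lives inside the container, which is why the parsers that read those containers have to be on the classpath for detection to be specific.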

...