...

See the individual parser pages listed at TikaParserNotes for available configurations. If you notice any missing parsers, please help us document configurations for all parsers.

tika-parsers module

In Tika 2.x, we separated the 1.x tika-parsers module into three modules and packages:

  1. tika-parsers-standard – the most common parsers; these should require neither REST calls nor native libs. (NOTE: despite this goal, the package does include the TesseractOCR parser, which will run Tesseract if you have it installed.)
  2. tika-parsers-extended – may include native libs and/or dependencies that not everyone wants (e.g. netcdf)
  3. tika-parsers-ml – may include heavy dependencies (dl4j) or parsers that rely on REST calls and external services

The goal is to allow users to select only the parsers (and dependencies) that they want.


When using tika-parsers in your project, you need to change the dependencies from:

pom.xml from 1.27:
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.27</version>
</dependency>


to at least tika-parsers-standard-package. If you want netcdf parsing and/or sqlite parsing, both of which were included in tika-parsers in 1.x, you'll also need to include tika-parser-scientific-package and tika-parser-sqlite3-package.

pom.xml for 2.0.0+:
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers-standard-package</artifactId>
  <version>2.17.0</version>
</dependency>
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parser-scientific-package</artifactId>
  <version>2.17.0</version>
</dependency>
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parser-sqlite3-package</artifactId>
  <version>2.17.0</version>
</dependency>

NOTE: As of Tika 2.7.0, we have added tika-parser-nlp-package to our release artifacts.
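
If you want the NLP parsers, the dependency can be declared in the same way as the other packages. The fragment below is a minimal sketch rather than an authoritative recipe: it assumes the artifact is published under the same org.apache.tika groupId as the other parser packages, and the version is an assumed example (any release from 2.7.0 onward should offer the artifact; pick the one that matches your other Tika dependencies).

```xml
<!-- Sketch only: groupId and version are assumptions; align the version
     with the other Tika artifacts in your pom.xml (2.7.0 or later). -->
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parser-nlp-package</artifactId>
  <version>2.17.0</version>
</dependency>
```

Keeping every Tika artifact on one version avoids mixing parser modules from different releases.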

NOTE: As in Tika 1.x, detection of container formats (e.g. OLE2-based: .doc, .ppt, .xls; zip-based: .xlsx, .pptx, .docx; or Ogg-based) requires the underlying Tika parsers, which parse the container file and base the detection on the information inside it. In Tika 2.x, this means that you need to include tika-parsers-standard-package.

...