NOTE: THIS PAGE IS IN PROGRESS. PLEASE CHECK BACK FOR MORE DETAILS.

For now, see: https://downloads.apache.org/tika/2.0.0/CHANGES-2.0.0.txt

Metadata

Metadata.RESOURCE_NAME_KEY has been renamed TikaCoreProperties.RESOURCE_NAME_KEY.
TikaCoreProperties.KEYWORDS has been removed.
The metadata key X-Parsed-By has changed to X-TIKA:Parsed-By.
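If downstream code reads metadata that was stored or exported under 1.x key names, a small translation step may ease the transition. The following is a minimal sketch, not part of Tika: the class name LegacyMetadataKeys and the method upgrade are hypothetical, and the rename table contains only the X-Parsed-By change listed above. It uses only the JDK, so it works on plain maps rather than on Tika's Metadata class.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not a Tika class) that maps 1.x metadata key names to 2.x names. */
final class LegacyMetadataKeys {

    // Only the rename stated on this page; add further entries from the CHANGES file as needed.
    private static final Map<String, String> RENAMED =
            Map.of("X-Parsed-By", "X-TIKA:Parsed-By");

    private LegacyMetadataKeys() {
    }

    /**
     * Returns a copy of the input with renamed keys replaced by their 2.x names.
     * Keys without a known rename are copied unchanged. If the input contains both
     * the old and the new name, the entry seen later in iteration order wins.
     */
    static <V> Map<String, V> upgrade(Map<String, V> legacy) {
        Map<String, V> upgraded = new LinkedHashMap<>();
        for (Map.Entry<String, V> entry : legacy.entrySet()) {
            String key = RENAMED.getOrDefault(entry.getKey(), entry.getKey());
            upgraded.put(key, entry.getValue());
        }
        return upgraded;
    }
}
```

The value type is generic so the same helper can be applied to single-valued or multi-valued metadata maps.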

tika-parsers – specific parser changes

tika-parsers module

When using tika-parsers in your project, you need to change the dependencies from

pom.xml for 1.27

<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.27</version>
</dependency>

to

pom.xml for 2.0.0+

<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers-standard-package</artifactId>
  <version>2.0.0</version>
</dependency>
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parser-scientific-module</artifactId>
  <version>2.0.0</version>
</dependency>
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parser-sqlite3-module</artifactId>
  <version>2.0.0</version>
</dependency>

There is also a small transitive dependency conflict on jcl-over-slf4j between tika-parsers-standard-package:2.0.0 and tika-parser-scientific-module:2.0.0. If you use the Maven Enforcer plugin, you will need to resolve it by declaring the version explicitly:

pom.xml

<!-- Fix tika-parsers-standard-package 2.0.0 vs tika-parser-scientific-module:2.0.0 transitive dependency -->
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>jcl-over-slf4j</artifactId>
    <version>1.7.31</version>
</dependency>

If you are checking for CVEs (recommended), note that tika-parser-scientific-module:2.0.0 brings in a transitive dependency on quartz 2.2.0. You can fix this by excluding it and declaring a newer version:

quartz

    <dependency>
      <groupId>edu.ucar</groupId>
      <artifactId>netcdf4</artifactId>
      <version>${netcdf-java.version}</version>
      <exclusions>
        ....
        <exclusion>
          <groupId>org.quartz-scheduler</groupId>
          <artifactId>quartz</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
    <dependency>
      <groupId>org.quartz-scheduler</groupId>
      <artifactId>quartz</artifactId>
      <version>2.3.2</version>
    </dependency>

For language detection, you now need to use:

pom.xml 2.0.0

<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-langdetect-optimaize</artifactId>
  <version>2.0.0</version>
</dependency>

Also note that org.apache.tika.langdetect.OptimaizeLangDetector.getDefaultLanguageDetector has moved to org.apache.tika.langdetect.optimaize.OptimaizeLangDetector.getDefaultLanguageDetector.
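Because the class keeps its simple name and only the package changes, a stale 1.x jar left on the classpath can be easy to miss. The following is a minimal sketch of a classpath probe, not part of Tika: the class LangDetectClasspathCheck and its methods are hypothetical, and the two fully qualified class names are taken from the sentence above. It uses only the JDK and does not call any Tika API.

```java
/** Hypothetical diagnostic (not a Tika class) reporting which OptimaizeLangDetector is visible. */
final class LangDetectClasspathCheck {

    // Class names as given on this page for 1.x and 2.x respectively.
    static final String OLD_CLASS = "org.apache.tika.langdetect.OptimaizeLangDetector";
    static final String NEW_CLASS = "org.apache.tika.langdetect.optimaize.OptimaizeLangDetector";

    private LangDetectClasspathCheck() {
    }

    /** Returns true if the named class can be located by this class's class loader. */
    static boolean isPresent(String className) {
        try {
            // initialize=false: locate the class without running its static initializers.
            Class.forName(className, false, LangDetectClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    /** Turns the two probe results into a short human-readable summary. */
    static String describe(boolean oldPresent, boolean newPresent) {
        if (oldPresent && newPresent) {
            return "Both 1.x and 2.x detectors found: the classpath mixes Tika versions.";
        }
        if (newPresent) {
            return "2.x detector found.";
        }
        if (oldPresent) {
            return "Only the 1.x detector found: switch to tika-langdetect-optimaize.";
        }
        return "No Optimaize detector found on the classpath.";
    }

    public static void main(String[] args) {
        System.out.println(describe(isPresent(OLD_CLASS), isPresent(NEW_CLASS)));
    }
}
```

Running main from within the application's own classpath (for example from a test) shows whether the migration of the language detection dependency is complete.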

For OCR, TesseractOCRConfig.setTesseractPath(String) and TesseractOCRConfig.setTessdataPath(String) can no longer be used. These methods have moved to the TesseractOCRParser class.

tika-app

tika-server

General

enableFileUrl has been removed in favor of a FileSystemFetcher; see tika-pipes#FetchersInClassicServerEndpoints.

Configuration

tika-pipes

See the tika-pipes page.

tika-eval

tika-langid

In the 1.x branch, the default (hardwired) language identification component was the wrapper around Optimaize. If you used the following in 1.x:

pom.xml 1.27

<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-langdetect</artifactId>
  <version>1.27</version>
</dependency>

In 2.x, change this to:

optimaize-lang-detect

<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-langdetect-optimaize</artifactId>
  <version>2.0.x</version>
</dependency>

The original language id component that was built by Tika devs and that used to be in tika-core is now in the tika-langdetect-tika module.
