NOTE: THIS PAGE IS IN PROGRESS.  PLEASE CHECK BACK FOR MORE DETAILS.

For now, see: https://downloads.apache.org/tika/2.0.0/CHANGES-2.0.0.txt


Metadata

Removed duplicate/triplicate keys

Background: In early 1.x, we had basic metadata keys that were created somewhat ad hoc.  We then added metadata keys based on standards such as Dublin Core, or we at least tried to add namespaces to the metadata keys for specific file formats.  To maintain backwards compatibility, we kept the old keys and added new keys.  This led to quite a bit of metadata bloat, where we'd have the same information two or three times.  In Tika 2.x, we slimmed down the metadata keys and relied only on the standards-based or name-spaced keys.  In the table below, we document the mappings.  If you notice any missing, please let us know or update the wiki.


Tika 1.x                                            Tika 2.x
Author, meta:author, dc:creator                     dc:creator
Last-Author, meta:last-author                       meta:last-author
Creation-Date, date, dcterms:created                dcterms:created
Last-Modified, modified, dcterms:modified           dcterms:modified
Last-Save-Date, meta:save-date                      meta:save-date
Application-Name, extended-properties:Application   extended-properties:Application
Character Count, meta:character-count               meta:character-count
Company, extended-properties:Company                extended-properties:Company
Edit-Time, extended-properties:TotalTime            extended-properties:TotalTime
Keywords, meta:keyword, dc:subject                  meta:keyword, dc:subject
Page-Count, meta:page-count                         meta:page-count
Revision-Number, cp:revision                        cp:revision
subject, cp:subject, dc:subject                     dc:subject
Template, extended-properties:Template              extended-properties:Template
Word-Count, meta:word-count                         meta:word-count

Changed Metadata Keys

There are a few other subtle changes in key names listed below:

Tika 1.x                    Tika 2.x
X-Parsed-By                 X-TIKA:Parsed-By
X-TIKA:EXCEPTION:runtime    X-TIKA:EXCEPTION:container_exception


Other small metadata key changes between 1.x and 2.x

These are changes in locations of keys, not in the key names that consumers/clients will see:

  • Metadata.RESOURCE_NAME_KEY has been renamed TikaCoreProperties.RESOURCE_NAME_KEY.
  • TikaCoreProperties.KEYWORDS has been removed in favor of Office.KEYWORDS.

tika-parsers – moving away from *.properties

In 2.x, we're centralizing configuration in a single tika-config.xml file. Two major parsers, PDFParser and TesseractOCRParser, used to rely on *.properties files and are now configured through tika-config.xml instead.

PDFParser

PDFParser Configuration
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <!-- this is not formally necessary, but prevents loading of unnecessary parser -->
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
       <!-- these are the defaults; you only need to specify the ones you want 
            to modify -->
        <param name="allowExtractionForAccessibility" type="bool">true</param>
        <param name="averageCharTolerance" type="float">0.3</param>
        <param name="detectAngles" type="bool">false</param>
        <param name="extractAcroFormContent" type="bool">true</param>
        <param name="extractActions" type="bool">false</param>
        <param name="catchIntermediateIOExceptions" type="bool">true</param>
        <param name="dropThreshold" type="float">2.5</param>
        <param name="enableAutoSpace" type="bool">true</param>
        <param name="extractAnnotationText" type="bool">false</param>
        <param name="extractBookMarksText" type="bool">true</param>
        <param name="extractFontNames" type="bool">false</param>
        <param name="extractInlineImages" type="bool">false</param>
        <param name="extractUniqueInlineImagesOnly" type="bool">true</param>
        <param name="ifXFAExtractOnlyXFA" type="bool">false</param>
        <param name="maxMainMemoryBytes" type="long">-1</param>
        <param name="ocrDPI" type="int">300</param>
        <param name="ocrImageFormatName" type="string">png</param>
        <param name="ocrImageQuality" type="float">1.0</param>
        <param name="ocrRenderingStrategy" type="string">ALL</param>
        <param name="ocrStrategy" type="string">auto</param>
        <param name="ocrStrategyAuto" type="string">better</param>
        <param name="ocrImageType" type="string">gray</param>
        <param name="setKCMS" type="bool">false</param>
        <param name="sortByPosition" type="bool">false</param>
        <param name="spacingTolerance" type="float">0.5</param>
        <param name="suppressDuplicateOverlappingText" type="bool">false</param> 
      </params>
    </parser>
  </parsers>
</properties>
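
As the comment in the configuration above notes, only the parameters you want to change need to be listed. A minimal sketch, assuming you only want position-sorted text and inline image extraction (the two overrides are illustrative choices, not recommendations), might look like this:

```xml
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- illustrative overrides; every parameter not listed keeps its default -->
        <param name="sortByPosition" type="bool">true</param>
        <param name="extractInlineImages" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```

The parameter names and types are taken from the full list above; only the values differ from the defaults.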

TesseractOCRParser


tika-parsers module

When using tika-parsers in your project, you need to change the dependencies from

pom.xml from 1.27
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.27</version>
</dependency>


to, e.g.

pom.xml for 2.0.0+
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers-standard-package</artifactId>
  <version>2.1.0</version>
</dependency>
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parser-scientific-module</artifactId>
  <version>2.1.0</version>
</dependency>
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parser-sqlite3-module</artifactId>
  <version>2.1.0</version>
</dependency>


There is also a small transitive dependency conflict on jcl-over-slf4j between tika-parsers-standard-package 2.0.0 and tika-parser-scientific-module:2.0.0. If you are using the Maven Enforcer plugin, you will need to resolve it by adding:

pom.xml
<!-- Fix tika-parsers-standard-package 2.0.0 vs tika-parser-scientific-module:2.0.0 transitive dependency -->
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>jcl-over-slf4j</artifactId>
    <version>1.7.31</version>
</dependency>

If you are checking for CVEs (recommended), note that tika-parser-scientific-module:2.0.0 comes with a transitive dependency on quartz 2.2.0. Exclude it and declare a newer version like this:

quartz
  <dependency>
    <groupId>edu.ucar</groupId>
    <artifactId>netcdf4</artifactId>
    <version>${netcdf-java.version}</version>
    <exclusions>
      ...
      <exclusion>
        <groupId>org.quartz-scheduler</groupId>
        <artifactId>quartz</artifactId>
      </exclusion>
    </exclusions>
  </dependency>
  <dependency>
    <groupId>org.quartz-scheduler</groupId>
    <artifactId>quartz</artifactId>
    <version>2.3.2</version>
  </dependency>


When using language detection, you now need to use:

pom.xml 2.0.0
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-langdetect-optimaize</artifactId>
  <version>2.1.0</version>
</dependency>

Also note that org.apache.tika.langdetect.OptimaizeLangDetector.getDefaultLanguageDetector has moved to org.apache.tika.langdetect.optimaize.OptimaizeLangDetector.getDefaultLanguageDetector.

For OCR, you can no longer use the TesseractOCRConfig.setTesseractPath(String) and TesseractOCRConfig.setTessdataPath(String) methods. They have moved to the TesseractOCRParser class.
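
Because these two settings now live on the parser, they can also be set in tika-config.xml. The following is a sketch only: it assumes the parameter names mirror the setter names (tesseractPath, tessdataPath), and both paths are hypothetical placeholders for your own installation.

```xml
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <!-- hypothetical paths; replace with the locations on your system -->
        <param name="tesseractPath" type="string">/usr/local/bin</param>
        <param name="tessdataPath" type="string">/usr/local/share/tessdata</param>
      </params>
    </parser>
  </parsers>
</properties>
```

The layout follows the same pattern as the PDFParser configuration: exclude the parser from DefaultParser, then declare it once with its own params.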

tika-parsers-module optional dependencies

zstd

The zstd dependency includes native libs and is not packaged with the tika-parsers-module.  If you'd like to parse zstd files, include:

zstd-jni
    <dependency>
      <groupId>com.github.luben</groupId>
      <artifactId>zstd-jni</artifactId>
      <version>1.5.0-4</version>
    </dependency>

TIFF and JPEG2000

If you plan to write TIFFs with Tika (rendering of PDF pages for OCR), and if the BSD-3 with nuclear disclaimer license is acceptable to you, include:

jai-imageio-core
<dependency>
  <groupId>com.github.jai-imageio</groupId>
  <artifactId>jai-imageio-core</artifactId>
  <version>1.4.0</version>
</dependency>

If you plan on processing JPEG2000 images (most common use case would be rendering PDF pages for OCR), and if the BSD-3 with nuclear disclaimer license is acceptable to you,  include:

jpeg2000
<dependency>
  <groupId>com.github.jai-imageio</groupId>
  <artifactId>jai-imageio-jpeg2000</artifactId>
  <version>1.4.0</version>
</dependency>

Note! In 2.x, Tika will not warn you if a PDF page that you're trying to render has a JPEG2000 in it.  PDFBox will log a warning.


tika-app

tika-server

General

Configuration

tika-pipes

See the tika-pipes page.

tika-eval

tika-langid

In the 1.x branch, the default (hardwired) language identification component was the wrapper around Optimaize. If you relied on it, your 1.x dependency looked like this:

pom.xml 1.27
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-langdetect</artifactId>
  <version>1.27</version>
</dependency>

In 2.x, change this to:

optimaize-lang-detect
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-langdetect-optimaize</artifactId>
  <version>2.0.x</version>
</dependency>

The original language identification component, which was built by Tika devs and used to live in tika-core, is now in the tika-langdetect-tika module.
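
If you prefer that original component, the dependency would presumably follow the same pattern as the Optimaize one. This sketch assumes the artifactId matches the module name given above and uses the same groupId as the other modules on this page; the version is a placeholder, as in the snippet above.

```xml
<dependency>
  <groupId>org.apache.tika</groupId>
  <!-- assumption: artifactId is the same as the module name -->
  <artifactId>tika-langdetect-tika</artifactId>
  <version>2.0.x</version>
</dependency>
```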
