Background

So, you've integrated Apache Tika into your framework, tried it on a couple of thousand files and all works well. Problem solved!

...

  1. Regular catchable exceptions
  2. OutOfMemory errors, which can put the JVM in an unreliable state
  3. Permanent hangs (Tika can chew up massive amounts of resources and go forever)
  4. Security vulnerabilities (e.g. CVE-2016-6809 and CVE-2016-4434)

Please note that for item 3, permanent hangs, you cannot terminate the thread. Thread's stop(), suspend() and destroy() methods sound like they'll do the trick, but they won't. You need to kill the entire process. See TIKA-456.
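Because a hung parse can only be stopped by killing its process, one option is to run each parse in a child process and enforce a timeout from the parent. The sketch below uses only the JDK's ProcessBuilder; the class name ForkedRunner and the method runWithTimeout are hypothetical, and the command in main (java -version) is just a stand-in for whatever command launches your parser.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.OptionalInt;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: runs a command in a child process and kills it on timeout. */
class ForkedRunner {

    /**
     * Returns the child's exit code, or an empty OptionalInt if the child
     * did not finish within timeoutMillis and had to be killed.
     */
    static OptionalInt runWithTimeout(List<String> command, long timeoutMillis)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        // Discard the child's output so a full pipe buffer cannot block it.
        pb.redirectErrorStream(true);
        pb.redirectOutput(ProcessBuilder.Redirect.DISCARD);
        Process child = pb.start();

        if (child.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
            return OptionalInt.of(child.exitValue());
        }
        // Timed out: kill the whole process rather than trying to stop a thread.
        child.destroyForcibly();
        child.waitFor();
        return OptionalInt.empty();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in command: the JVM that is running this class, asked for its version.
        String javaBin = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
        OptionalInt result = runWithTimeout(List.of(javaBin, "-version"), 60_000);
        System.out.println(result.isPresent()
                ? "exit code " + result.getAsInt()
                : "killed after timeout");
    }
}
```

An empty result tells the caller that the child was killed, which a framework could record as a timeout for that document before moving on to the next one.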

As of Tika 1.15, we added a MockParser in the tika-core-tests.jar that will allow you to test your framework against items 1-3. Simply add that jar to your classpath and then include a <mock> XML file in your set of test documents, and crash, crash away.
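As a rough illustration, the following sketch writes one <mock> file for each of items 1-3 into a directory of test documents. The class name MockFileWriter and the file names are hypothetical; the elements (throw, oom, hang) are the ones MockParser documents. It assumes the tika-core-tests jar is on the classpath when your framework later parses these files, so that they are routed to MockParser.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that writes one <mock> file per failure mode. */
class MockFileWriter {

    static final String HEADER = "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n";

    /** File names are illustrative; each value is the body of a <mock> document. */
    static Map<String, String> mockDocuments() {
        Map<String, String> docs = new LinkedHashMap<>();
        // Item 1: a regular catchable exception
        docs.put("mock_exception.xml",
                "<throw class=\"java.io.IOException\">not another IOException</throw>");
        // Item 2: an OutOfMemoryError
        docs.put("mock_oom.xml", "<oom/>");
        // Item 3: a long hang that ignores thread interrupts
        docs.put("mock_hang.xml",
                "<hang millis=\"30000\" heavy=\"false\" interruptible=\"false\" />");
        return docs;
    }

    /** Writes every mock document into dir, creating the directory if needed. */
    static void writeAll(Path dir) throws IOException {
        Files.createDirectories(dir);
        for (Map.Entry<String, String> doc : mockDocuments().entrySet()) {
            String xml = HEADER + "<mock>\n    " + doc.getValue() + "\n</mock>\n";
            Files.write(dir.resolve(doc.getKey()), xml.getBytes(StandardCharsets.UTF_8));
        }
    }
}
```

Calling MockFileWriter.writeAll(Paths.get("test-documents")) would leave three small files that you can mix in with your real test documents.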

...

<throw class="my.evil.DeserializationAttack">bwahahaha</throw>

Usage

Below are several options for adding the dependency.

Including the tika-core-tests dependency in your project

<!-- groupId is Tika's own; tika.version is a property you define for the Tika release in use -->
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>${tika.version}</version>
    <type>test-jar</type>
    <scope>test</scope>
</dependency>

Tika-app

Place the tika-app.jar and the tika-core-tests.jar in a "bin" directory.

...

The following file shows every option that you can use.

<?xml version="1.0" encoding="UTF-8" ?>

<mock>
    <!-- this file offers all of the options as documentation.
    Parsing will stop at an IOException, of course
    -->

    <!-- action can be "add" or "set" -->
    <metadata action="add" name="author">Nikolai Lobachevsky</metadata>

    <!-- element is the name of the sax event to write, p=paragraph
        if the element is not specified, the default is <p> -->

    <write element="p">some content</write>

    <!-- write something to System.out -->
    <print_out>writing to System.out</print_out>

    <!-- write something to System.err -->
    <print_err>writing to System.err</print_err>

    <!-- hang
        millis: how many milliseconds to pause.  The actual hang time will probably
            be a bit longer than the value specified.        
        heavy: whether or not the hang should do something computationally expensive.
            If the value is false, this just does a Thread.sleep(millis).
            This attribute is optional, with default of heavy=false.
        pulse_millis: (required if "heavy" is true), how often to check to see
            whether the thread was interrupted or that the total hang time exceeded the millis
        interruptible: whether or not the parser will check to see if its thread
            has been interrupted; this attribute is optional with default of true
    -->
    <hang millis="100" heavy="true" pulse_millis="10" interruptible="true" />

    <!-- As of Tika 1.27/2.0, we've integrated FakeLoad (https://github.com/msigwart/fakeload)
         which enables much more precision for resource consumption than does 
         "hang"
    
         millis = milliseconds to run, cpu is an integer % of the cpu to peg 
         and mb is the integer amount of memory to consume in megabytes -->
    <fakeload millis="100" cpu="10" mb="10"/>
    
    <!-- throw an exception or error; optionally include a message or not -->
    <throw class="java.io.IOException">not another IOException</throw>
    
    <!-- perform a genuine OutOfMemoryError -->
    <oom/>

    <!-- perform a system exit...what parser would do that?!  We had one once... -->
    <system_exit/>

    <!-- interrupt the thread -->
    <thread_interrupt/>

    <!-- print to stdout -->
    <print_out>some junk</print_out>

    <!-- print to stderr -->
    <print_err>some junk</print_err>

</mock>
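To make the hang attributes more concrete, here is a minimal sketch of the behavior the comments describe for heavy="true": do expensive work, and every pulse_millis check whether the total time has exceeded millis or, if interruptible, whether the thread was interrupted. This is an illustration of those documented semantics only, not MockParser's actual code; the class name HangSketch and the method heavyHang are hypothetical.

```java
/** Illustration of the documented "heavy" hang semantics; not MockParser's code. */
class HangSketch {

    /**
     * Burns CPU for roughly millis milliseconds, checking every pulseMillis
     * whether to stop. Returns the elapsed time in milliseconds, which will
     * usually be a bit longer than millis.
     */
    static long heavyHang(long millis, long pulseMillis, boolean interruptible) {
        long start = System.nanoTime();
        long lastPulse = start;
        double sink = 0; // accumulates results so the busy work is not optimized away
        while (true) {
            for (int i = 1; i <= 10_000; i++) {
                sink += Math.sqrt(i);
            }
            long now = System.nanoTime();
            if ((now - lastPulse) / 1_000_000 >= pulseMillis) {
                lastPulse = now;
                if ((now - start) / 1_000_000 >= millis) {
                    break; // total hang time reached
                }
                if (interruptible && Thread.currentThread().isInterrupted()) {
                    break; // honor the interrupt only when interruptible
                }
            }
        }
        if (sink < 0) {
            throw new IllegalStateException("unreachable"); // keeps sink in use
        }
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

With interruptible set to false the loop ignores interrupts and always runs for the full duration, which is the case a framework cannot recover from without killing the process.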

References

  1. Tika to Ride
  2. Evaluating Text Extraction