...
Background
So, you've integrated Apache Tika into your framework, tried it on a couple of thousand files, and everything works well. Problem solved!
...
- 1. Regular catchable exceptions
- 2. OutOfMemory errors, which can put the JVM in an unreliable state
- 3. Permanent hangs (Tika can chew up massive amounts of resources and go forever)
- 4. Security vulnerabilities (e.g. CVE-2016-6809 and CVE-2016-4434)
Please note that for 3. (permanent hangs), you cannot terminate the thread. Thread's stop(), suspend() and destroy() methods sound like they'll do the trick, but they won't. You need to kill the entire process. See TIKA-456.
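Since a hung parse can only be stopped by killing its process, one option is to run the risky work in a child process and enforce a timeout from the parent. The following is a minimal sketch using only the JDK, not Tika's own mechanism: the class name `ForkedTask` and the method `runWithTimeout` are hypothetical, and the command to run is left to the caller.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: runs a command in a child process and kills it on timeout. */
public class ForkedTask {

    /**
     * Runs the command in a separate process.
     * Returns true if it exited within timeoutMillis, false if it had to be killed.
     */
    public static boolean runWithTimeout(List<String> command, long timeoutMillis)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectErrorStream(true)
                // Discard output so a chatty child cannot block on a full pipe.
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                .start();
        if (process.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
            return true;
        }
        // The child did not finish in time: kill the whole process,
        // which is the only dependable way to stop a permanent hang.
        process.destroyForcibly();
        process.waitFor();
        return false;
    }

    public static void main(String[] args) throws Exception {
        // Illustration only: "java -version" stands in for the real parsing command.
        String javaBin = System.getProperty("java.home") + "/bin/java";
        boolean finished = runWithTimeout(List.of(javaBin, "-version"), 30_000);
        System.out.println("finished in time: " + finished);
    }
}
```

In a real framework the command would launch whatever child JVM does the parsing, and a `false` return would be recorded as a timeout for that document.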
As of Tika 1.15, we added a MockParser in the tika-core-tests.jar that will allow you to test your framework against items 1-3. Simply add that jar to your classpath and then include a <mock> XML file in your set of test documents, and crash, crash away.
...
<throw class="my.evil.DeserializationAttack">bwahahaha</throw>
Usage
Below are several options for adding the dependency.
Including the tika-core-tests dependency in your project
<dependency>
<groupId>${project.groupId}</groupId>
<artifactId>tika-core</artifactId>
<version>${project.version}</version>
<type>test-jar</type>
<scope>test</scope>
</dependency>
Tika-app
Place the tika-app.jar and the tika-core-tests.jar in a "bin" directory.
...
The following file shows all of the options you can use.
<?xml version="1.0" encoding="UTF-8" ?>
<mock>
    <!-- this file offers all of the options as documentation.
         Parsing will stop at an IOException, of course -->

    <!-- action can be "add" or "set" -->
    <metadata action="add" name="author">Nikolai Lobachevsky</metadata>

    <!-- element is the name of the sax event to write, p=paragraph
         if the element is not specified, the default is <p> -->
    <write element="p">some content</write>

    <!-- write something to System.out -->
    <print_out>writing to System.out</print_out>

    <!-- write something to System.err -->
    <print_err>writing to System.err</print_err>

    <!-- hang
         millis: how many milliseconds to pause. The actual hang time will
             probably be a bit longer than the value specified.
         heavy: whether or not the hang should do something computationally
             expensive. If the value is false, this just does a
             Thread.sleep(millis). This attribute is optional, with default
             of heavy=false.
         pulse_millis: (required if "heavy" is true), how often to check to
             see whether the thread was interrupted or that the total hang
             time exceeded the millis
         interruptible: whether or not the parser will check to see if its
             thread has been interrupted; this attribute is optional with
             default of true -->
    <hang millis="100" heavy="true" pulse_millis="10" interruptible="true" />

    <!-- As of Tika 1.27/2.0, we've integrated FakeLoad
         (https://github.com/msigwart/fakeload) which enables much more
         precision for resource consumption than does "hang"
         millis = milliseconds to run, cpu is an integer % of the cpu to peg
         and mb is the integer amount of memory to consume in megabytes -->
    <fakeload millis="100" cpu="10" mb="10"/>

    <!-- throw an exception or error; optionally include a message or not -->
    <throw class="java.io.IOException">not another IOException</throw>

    <!-- perform a genuine OutOfMemoryError -->
    <oom/>

    <!-- perform a system exit...what parser would do that?! We had one once... -->
    <system_exit/>

    <!-- interrupt the thread -->
    <thread_interrupt/>

    <!-- print to stdout -->
    <print_out>some junk</print_out>

    <!-- print to stderr -->
    <print_err>some junk</print_err>
</mock>
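To cover items 1-3 in a test corpus, it can help to generate one small mock file per failure mode rather than one file with every option. The sketch below builds such files with plain string handling, using only elements documented in the file above. The class `MockFileWriter`, its method names, and the `.xml` file names are hypothetical and not part of Tika.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper that writes small <mock> files for a test corpus. */
public class MockFileWriter {

    /** Wraps the given body elements in a <mock> root element. */
    public static String mockXml(String... bodyElements) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n<mock>\n");
        for (String element : bodyElements) {
            sb.append("    ").append(element).append('\n');
        }
        sb.append("</mock>\n");
        return sb.toString();
    }

    /** A light hang; the optional attributes are left at their documented defaults. */
    public static String hang(long millis) {
        return "<hang millis=\"" + millis + "\" />";
    }

    /** An element that throws the named exception or error with a message. */
    public static String throwElement(String exceptionClass, String message) {
        return "<throw class=\"" + exceptionClass + "\">" + escape(message) + "</throw>";
    }

    /** An element that triggers a genuine OutOfMemoryError. */
    public static String oom() {
        return "<oom/>";
    }

    /** Escapes the characters that are not allowed in XML text content. */
    private static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Writes the XML to dir/fileName as UTF-8 and returns the path. */
    public static Path writeMockFile(Path dir, String fileName, String xml) throws IOException {
        return Files.write(dir.resolve(fileName), xml.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("mock-docs");
        writeMockFile(dir, "exception.xml",
                mockXml(throwElement("java.io.IOException", "not another IOException")));
        writeMockFile(dir, "oom.xml", mockXml(oom()));
        writeMockFile(dir, "hang.xml", mockXml(hang(60_000)));
        System.out.println("wrote mock files to " + dir);
    }
}
```

Keeping each failure mode in its own file makes it easier to see which one your framework fails to contain.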