Tika relies on numerous dependencies. In rare cases, Tika can go into an infinite loop or allocate surprising amounts of memory and fail with an OutOfMemoryError (OOM). If you process enough documents in the wild, you will run into these problems, and you must defend against them.

The Tika project offers some defenses against these denial of service vulnerabilities:

  1. The ForkParser – this forks a child process and will protect against OOM and infinite loops.
  2. tika-batch – if you are processing files at desktop/vm scale (not cloud scale), you can run tika-batch via tika-app:
    1.  java -jar tika-app.jar -i <input_dir> -o <output_dir>
  3. tika-server – if you are using tika-server, start the server with `-spawnChild` mode, and it will fork a child process.
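
The options above that fork a child process share one idea: the parse runs in a JVM that can be killed without taking the caller down. The following is only a minimal sketch of that idea using the JDK's `ProcessBuilder` and tika-app's `--text` option; it is not how ForkParser or tika-server are implemented. The class name `IsolatedTika`, the helper names, the 512 MB heap cap and the 120 second timeout are all assumptions chosen for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs tika-app in a child JVM so that an OOM or an
// infinite loop stays inside the child process.
class IsolatedTika {

    // Builds the child JVM command; -Xmx caps the child's heap (assumed value).
    static List<String> buildCommand(String tikaAppJar, String inputFile, int maxHeapMb) {
        return List.of("java", "-Xmx" + maxHeapMb + "m", "-jar", tikaAppJar, "--text", inputFile);
    }

    // Returns true only if the child exited with status 0 within the timeout.
    static boolean extract(List<String> command, File output, long timeoutSeconds)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectOutput(output);                         // write extracted text to a file
        pb.redirectError(ProcessBuilder.Redirect.DISCARD); // ignore the child's stderr
        Process child = pb.start();
        if (!child.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            child.destroyForcibly();                       // hung or too slow: kill the child
            return false;
        }
        return child.exitValue() == 0;                     // non-zero: the child failed, e.g. OOM
    }

    public static void main(String[] args) throws Exception {
        // args[0] = path to tika-app.jar, args[1] = file to parse
        List<String> cmd = buildCommand(args[0], args[1], 512);
        boolean ok = extract(cmd, new File(args[1] + ".txt"), 120);
        System.out.println(ok ? "parsed" : "failed or timed out");
    }
}
```

Starting a JVM per file is slow, which is one reason to prefer the options listed above when they fit; the sketch is only meant to show where the heap cap and the timeout sit.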

If you come across a file that causes catastrophic problems and you are able to share it, we will try to fix the source of the problem.

We have recently added a rudimentary fuzzing module to identify some of these vulnerabilities. 

We document our fixed vulnerabilities here: https://tika.apache.org/security.html. Finally, we offer the MockParser in the tika-core tests, which allows you to test the robustness of your system against infinite loops, OutOfMemoryErrors and other serious problems.

In short, we do what we can, but given what we've seen before and given the size of our dependencies' codebases, we can't assert that Tika is safe.  If you are processing high volumes of untrustworthy data, please, please avoid running Tika in the same process as anything that matters, such as your indexer or natural language processing code.  
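
One way to keep Tika out of your own process is to call a running tika-server over HTTP and put a timeout on every request. The sketch below uses the JDK's `java.net.http` client (Java 11+) against tika-server's `/tika` endpoint. The class name `TikaServerClient`, the `http://localhost:9998` address and the timeout values are assumptions for illustration, not recommended settings.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;

// Hypothetical client: the caller only waits on a socket, so a runaway parse
// costs the caller a timeout rather than its heap or its threads.
class TikaServerClient {

    // Builds a PUT to /tika asking for plain text; the timeout bounds the wait.
    static HttpRequest buildRequest(String baseUrl, byte[] document, Duration timeout) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/tika"))
                .timeout(timeout)
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = Files.readAllBytes(Path.of(args[0]));
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))
                .build();
        HttpRequest request = buildRequest("http://localhost:9998", bytes, Duration.ofSeconds(120));
        try {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() == 200) {
                System.out.println(response.body());
            } else {
                System.err.println("parse failed, HTTP status " + response.statusCode());
            }
        } catch (IOException e) {
            // Covers HttpTimeoutException and connection failures, for example
            // while the server's child process is being restarted.
            System.err.println("tika-server unavailable or timed out: " + e.getMessage());
        }
    }
}
```

A caller would typically record the failing file and move on, so that one bad document does not stall the rest of the pipeline.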

Please see slide 12 for more details: http://events17.linuxfoundation.org/sites/events/files/slides/ApacheConMiami2017_tallison_v2.pdf


